redis/tests/integration
Vitah Lin 31896140d1
Fix diskless replicas drop during rdb pipe test (#15131)
This PR is based on: valkey-io/valkey#3511
Close https://github.com/redis/redis/issues/14983

## Summary

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The cause of the failure is similar to
https://github.com/redis/redis/pull/14946: the socket buffer fills up
more easily.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.
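
The gating described above can be sketched as a toy model. This is not Redis code: the class `PipeFanout`, its method names, and the byte counts are hypothetical, and only the counter name mirrors `rdb_pipe_numconns_writing`. It assumes a partial write leaves the remainder pending for that replica and that pipe reads resume only once every pending write has finished.

```python
class PipeFanout:
    """Toy model of the pipe-read gating (illustrative, not Redis code)."""

    def __init__(self, replica_buffers):
        # Free space left in each replica's socket send buffer, in bytes.
        self.free = list(replica_buffers)
        # Bytes of the current chunk still waiting to be written per replica.
        self.pending = [0] * len(replica_buffers)
        self.numconns_writing = 0
        self.read_enabled = True

    def on_pipe_readable(self, chunk_len):
        """Read one chunk from the pipe and try to write it to every replica."""
        if not self.read_enabled:
            return False  # pipe read event is removed, nothing is read
        for i in range(len(self.free)):
            sent = min(chunk_len, self.free[i])
            self.free[i] -= sent
            if sent < chunk_len:
                # Partial write: a per-replica write handler takes over.
                self.pending[i] = chunk_len - sent
                self.numconns_writing += 1
        if self.numconns_writing > 0:
            # One blocked replica is enough to stop reads for everyone.
            self.read_enabled = False
        return True

    def on_replica_writable(self, i, drained):
        """Replica i's socket freed `drained` bytes; flush its pending data."""
        self.free[i] += drained
        if self.pending[i] == 0:
            return
        sent = min(self.pending[i], self.free[i])
        self.free[i] -= sent
        self.pending[i] -= sent
        if self.pending[i] == 0:
            self.numconns_writing -= 1
            if self.numconns_writing == 0:
                self.read_enabled = True  # all writers done, resume reads
```

With `PipeFanout([100, 10_000])`, the second 64-byte chunk only partially fits into replica 0's buffer, so reads stop even though replica 1 still has plenty of room. Reads resume only after replica 0 has drained enough to flush its pending bytes.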

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, this makes the full synchronization between the master and
all of its replicas very slow.
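
As a rough sanity check on the numbers above, the stall length should be close to the burst size divided by the slow replica's drain rate. The sketch below assumes decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes) and that the whole burst must drain before reads resume. Both are simplifications, and the helper name `stall_seconds` is hypothetical.

```python
def stall_seconds(burst_bytes, drain_bytes_per_sec):
    """Time for the slow replica to drain one burst at its consumption rate."""
    return burst_bytes / drain_bytes_per_sec


burst = 1.3e6   # ~1.3 MB burst reported above
drain = 290e3   # ~290 KB/s slow-replica consumption rate reported above
stall = stall_seconds(burst, drain)
print(f"estimated stall: {stall:.1f} s")  # prints "estimated stall: 4.5 s"
```

The estimate falls inside the observed 4–5 second stall, which is consistent with the fast replicas being throttled to roughly the slow replica's rate.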

## Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness in slow environments.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For the `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which, combined with the blocked-writer TCP
back-pressure, can stall the RDB-saving child indefinitely. Shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.

4. For the `timeout` subcase, replica 0 does not run with
`key-load-delay 500`, so to avoid the TCP deadlock we still reduce the
dataset somewhat, but keep it larger than in the other subcases.
Otherwise the kernel TCP send buffer can absorb the whole RDB, and we'd
miss the `repl_last_partial_write != 0` "(full sync)" timeout path and
only hit the "(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the `timeout` subcase, set `repl-timeout 2`, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout 60` so the remaining replica can finish the streamed RDB.
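
The interaction between the dataset size, `repl_last_partial_write`, and `repl-timeout` in the `timeout` subcase can be illustrated with a small predicate. This is a sketch and not the Redis source. It assumes the "(full sync)" disconnect fires only when a partial write has been recorded and more than `repl-timeout` seconds have passed since then. The function name is hypothetical.

```python
def should_disconnect_full_sync(now, repl_last_partial_write, repl_timeout):
    """Assumed shape of the '(full sync)' timeout check for one replica.

    repl_last_partial_write == 0 models "no partial write recorded", which is
    what happens if the kernel send buffer absorbs the whole RDB.
    """
    if repl_last_partial_write == 0:
        return False
    return (now - repl_last_partial_write) > repl_timeout


# Paused replica with a recorded partial write and repl-timeout 2: times out.
print(should_disconnect_full_sync(now=103, repl_last_partial_write=100, repl_timeout=2))
# Same replica after restoring repl-timeout 60: no longer at risk.
print(should_disconnect_full_sync(now=103, repl_last_partial_write=100, repl_timeout=60))
# Dataset too small, no partial write ever recorded: this path is never hit.
print(should_disconnect_full_sync(now=103, repl_last_partial_write=0, repl_timeout=2))
```

Under these assumptions the three calls print `True`, `False`, and `False`. That matches the reasoning above: the dataset has to be large enough to force a partial write, and the timeout has to be raised again so the surviving replica is not dropped.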

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>  
Co-authored-by: debing.sun <debing.sun@redis.com>
2026-05-19 18:27:33 +08:00
| File | Last commit | Date |
| --- | --- | --- |
| aof-multi-part.tcl | AOF offset info (#13773) | 2025-02-13 17:31:40 +08:00 |
| aof-race.tcl | Stabilization and improvements around aof tests (#12626) | 2023-10-02 08:20:53 +03:00 |
| aof.tcl | Fixes an issue where EXEC checks ACL during AOF loading (#14545) | 2025-11-22 11:52:31 +08:00 |
| block-repl.tcl | Attempt to solve MacOS CI issues in GH Actions (#12013) | 2023-04-12 09:19:21 +03:00 |
| convert-ziplist-hash-on-load.tcl | Replace all usage of ziplist with listpack for t_hash (#8887) | 2021-08-10 09:18:49 +03:00 |
| convert-ziplist-zset-on-load.tcl | Replace all usage of ziplist with listpack for t_zset (#9366) | 2021-09-09 18:18:53 +03:00 |
| convert-zipmap-hash-on-load.tcl | Replace all usage of ziplist with listpack for t_hash (#8887) | 2021-08-10 09:18:49 +03:00 |
| corrupt-dump-fuzzer.tcl | Hold GCRA out of the release (#15191) | 2026-05-14 16:31:25 +03:00 |
| corrupt-dump.tcl | Invalid Memory Access in Redis RESTORE Command (CVE-2026-25243) | 2026-05-14 14:10:41 +03:00 |
| dismiss-mem.tcl | Implement the new Redis Array type (#15162) | 2026-05-14 00:56:44 +08:00 |
| failover.tcl | Fix test assertion except from TSAN case (#14852) | 2026-03-06 13:11:38 +02:00 |
| logging.tcl | Hide user data from log (#13400) | 2024-07-09 18:54:18 +08:00 |
| psync2-master-restart.tcl | Handle primary/replica clients in IO threads (#14335) | 2026-01-21 16:19:12 +02:00 |
| psync2-pingoff.tcl | Fix race condition in psync2-pingoff test (#9712) | 2021-11-01 16:07:08 +02:00 |
| psync2-reg.tcl | Improve test suite to handle external servers better. (#9033) | 2021-06-09 15:13:24 +03:00 |
| psync2.tcl | Tests: Do not save an RDB by default and add a SIGTERM default AOFRW test (#12064) | 2023-04-18 16:14:26 +03:00 |
| rdb.tcl | Fix DB hash tables not expanding during RDB load (#14789) | 2026-02-14 15:18:08 +08:00 |
| redis-benchmark.tcl | Fix divide-by-zero in redis-benchmark and redis-cli (#14371) | 2026-03-18 19:57:58 +08:00 |
| redis-cli.tcl | Fix divide-by-zero in redis-benchmark and redis-cli (#14371) | 2026-03-18 19:57:58 +08:00 |
| replication-2.tcl | Attempt to solve MacOS CI issues in GH Actions (#12013) | 2023-04-12 09:19:21 +03:00 |
| replication-3.tcl | Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326) | 2025-09-10 12:52:20 +08:00 |
| replication-4.tcl | optimize spopwithcount propagation (#12082) | 2023-05-22 10:27:14 +03:00 |
| replication-buffer.tcl | Handle primary/replica clients in IO threads (#14335) | 2026-01-21 16:19:12 +02:00 |
| replication-iothreads.tcl | Handle primary/replica clients in IO threads (#14335) | 2026-01-21 16:19:12 +02:00 |
| replication-psync.tcl | Introduce flushdb option for repl-diskless-load (#14596) | 2025-12-15 11:25:53 +08:00 |
| replication-rdbchannel.tcl | Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326) | 2025-09-10 12:52:20 +08:00 |
| replication.tcl | Fix diskless replicas drop during rdb pipe test (#15131) | 2026-05-19 18:27:33 +08:00 |
| shutdown.tcl | Fix shutdown blocked client not being properly reset after shutdown cancellation (#14420) | 2025-10-15 14:13:40 +08:00 |