redis

mirror of https://github.com/redis/redis.git synced 2026-05-27 11:43:04 -04:00

Vitah Lin 31896140d1 Fix diskless replicas drop during rdb pipe test (#15131)

This PR is based on: valkey-io/valkey#3511

Close https://github.com/redis/redis/issues/14983

## Summary

During diskless replication, if any single replica cannot accept a write (TCP send buffer full / `EAGAIN`), the master stops reading the RDB pipe entirely, stalling data delivery to all replicas, including fast ones that are ready to receive data. The failure reason is similar to https://github.com/redis/redis/pull/14946: the socket buffer fills up more easily.

## Root Cause

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and writes to all replica sockets in a loop. When `connWrite` to any replica returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments `rdb_pipe_numconns_writing`
2. Removes the pipe read event via `aeDeleteFileEvent(server.el, server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when all pending write handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the slowest replica dictates the throughput for all replicas.

## Observed Behavior

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- All replicas starve for 4–5 seconds while the slow replica drains its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, this makes synchronization between the master and all of its replicas very slow.

### Changes

1. Skip the entire `diskless replicas drop during rdb pipe` test under Valgrind to avoid timing flakiness on slow environments.
2. Move `start_server` inside the `foreach all_drop` loop so each subcase gets a fresh master instead of sharing state across subcases.
3. For the `no / slow / fast / all` subcases, replica 0 runs with `key-load-delay 500`, which combined with the blocked-writer TCP back-pressure can stall the RDB-saving child indefinitely; shrink the dataset to ~40 MB so the transfer still exercises the blocked-writer path but completes in reasonable time instead of hanging on the TCP deadlock.
4. For the timeout subcase, replica 0 does not run with `key-load-delay 500`, so to avoid the TCP deadlock we still reduce the dataset somewhat, but keep it larger than in the other subcases. Otherwise the kernel TCP send buffer can absorb the whole RDB, and we'd miss the `repl_last_partial_write != 0` "(full sync)" timeout path and only hit the "(streaming sync)" path instead.
5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so the RDB child keeps generating data while both replicas are killed, ensuring the last-replica-drop path is exercised rather than racing with normal completion.
6. Move the slow-replica `pause_process()` so it happens only in the timeout subcase, not after killing replicas, so Redis observes the disconnect promptly in non-timeout flows.
7. In the timeout subcase, set `repl-timeout` to 2, wait inline for `Disconnecting timedout replica (full sync)`, then restore `repl-timeout` to 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>

2026-05-19 18:27:33 +08:00
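The gating described under Root Cause can be illustrated with a toy model. The sketch below is not Redis code: the `Replica` class, the `simulate` function, the tick-based timing, and all sizes are hypothetical, chosen only to show how a single shared "read the pipe only when nobody has a pending partial write" condition ties every replica to the slowest one.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Toy replica with a bounded socket send buffer (all values hypothetical)."""
    drain_per_tick: int   # bytes the replica consumes from its socket per tick
    sndbuf_cap: int       # capacity of the socket send buffer in bytes
    buffered: int = 0     # bytes currently sitting in the send buffer
    pending: int = 0      # bytes of the current chunk the socket has not accepted yet
    received: int = 0     # bytes the replica has consumed so far

    def write(self, n: int) -> int:
        """Accept as much as the send buffer allows; return the bytes accepted."""
        accepted = min(n, self.sndbuf_cap - self.buffered)
        self.buffered += accepted
        return accepted

    def drain(self) -> None:
        """Replica side: consume up to drain_per_tick bytes from the buffer."""
        taken = min(self.buffered, self.drain_per_tick)
        self.buffered -= taken
        self.received += taken


def simulate(replicas: list, chunk: int, ticks: int) -> int:
    """Return how many chunks were read from the pipe in `ticks` ticks."""
    pipe_reads = 0
    for _ in range(ticks):
        # Write-handler analogue: retry the unsent tail for blocked replicas.
        for r in replicas:
            if r.pending:
                r.pending -= r.write(r.pending)
        # Analogue of rdb_pipe_numconns_writing: replicas with a pending partial write.
        numconns_writing = sum(1 for r in replicas if r.pending)
        # Read-handler analogue: the pipe is read only when nobody is blocked.
        if numconns_writing == 0:
            pipe_reads += 1
            for r in replicas:
                r.pending = chunk - r.write(chunk)
        for r in replicas:
            r.drain()
    return pipe_reads


if __name__ == "__main__":
    alone = Replica(drain_per_tick=100, sndbuf_cap=50)
    print("fast replica alone, pipe reads:", simulate([alone], chunk=20, ticks=100))

    fast = Replica(drain_per_tick=100, sndbuf_cap=50)
    slow = Replica(drain_per_tick=10, sndbuf_cap=50)
    print("fast + slow, pipe reads:", simulate([fast, slow], chunk=20, ticks=100))
    print("fast received:", fast.received, "slow received:", slow.received)
```

In this model the fast replica never gets more than one send buffer plus one chunk ahead of the slow one, which mirrors the burst and stall cycle reported under Observed Behavior.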
..
aof-multi-part.tcl	AOF offset info (#13773)	2025-02-13 17:31:40 +08:00
aof-race.tcl	Stabilization and improvements around aof tests (#12626)	2023-10-02 08:20:53 +03:00
aof.tcl	Fixes an issue where EXEC checks ACL during AOF loading (#14545)	2025-11-22 11:52:31 +08:00
block-repl.tcl	Attempt to solve MacOS CI issues in GH Actions (#12013)	2023-04-12 09:19:21 +03:00
convert-ziplist-hash-on-load.tcl	Replace all usage of ziplist with listpack for t_hash (#8887)	2021-08-10 09:18:49 +03:00
convert-ziplist-zset-on-load.tcl	Replace all usage of ziplist with listpack for t_zset (#9366)	2021-09-09 18:18:53 +03:00
convert-zipmap-hash-on-load.tcl	Replace all usage of ziplist with listpack for t_hash (#8887)	2021-08-10 09:18:49 +03:00
corrupt-dump-fuzzer.tcl	Hold GCRA out of the release (#15191)	2026-05-14 16:31:25 +03:00
corrupt-dump.tcl	Invalid Memory Access in Redis RESTORE Command (CVE-2026-25243)	2026-05-14 14:10:41 +03:00
dismiss-mem.tcl	Implement the new Redis Array type (#15162)	2026-05-14 00:56:44 +08:00
failover.tcl	Fix test assertion except from TSAN case (#14852)	2026-03-06 13:11:38 +02:00
logging.tcl	Hide user data from log (#13400)	2024-07-09 18:54:18 +08:00
psync2-master-restart.tcl	Handle primary/replica clients in IO threads (#14335)	2026-01-21 16:19:12 +02:00
psync2-pingoff.tcl	Fix race condition in psync2-pingoff test (#9712)	2021-11-01 16:07:08 +02:00
psync2-reg.tcl	Improve test suite to handle external servers better. (#9033)	2021-06-09 15:13:24 +03:00
psync2.tcl	Tests: Do not save an RDB by default and add a SIGTERM default AOFRW test (#12064)	2023-04-18 16:14:26 +03:00
rdb.tcl	Fix DB hash tables not expanding during RDB load (#14789)	2026-02-14 15:18:08 +08:00
redis-benchmark.tcl	Fix divide-by-zero in redis-benchmark and redis-cli (#14371)	2026-03-18 19:57:58 +08:00
redis-cli.tcl	Fix divide-by-zero in redis-benchmark and redis-cli (#14371)	2026-03-18 19:57:58 +08:00
replication-2.tcl	Attempt to solve MacOS CI issues in GH Actions (#12013)	2023-04-12 09:19:21 +03:00
replication-3.tcl	Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326)	2025-09-10 12:52:20 +08:00
replication-4.tcl	optimize spopwithcount propagation (#12082)	2023-05-22 10:27:14 +03:00
replication-buffer.tcl	Handle primary/replica clients in IO threads (#14335)	2026-01-21 16:19:12 +02:00
replication-iothreads.tcl	Handle primary/replica clients in IO threads (#14335)	2026-01-21 16:19:12 +02:00
replication-psync.tcl	Introduce flushdb option for repl-diskless-load (#14596)	2025-12-15 11:25:53 +08:00
replication-rdbchannel.tcl	Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326)	2025-09-10 12:52:20 +08:00
replication.tcl	Fix diskless replicas drop during rdb pipe test (#15131)	2026-05-19 18:27:33 +08:00
shutdown.tcl	Fix shutdown blocked client not being properly reset after shutdown cancellation (#14420)	2025-10-15 14:13:40 +08:00