redis/tests/integration/replication-buffer.tcl

366 lines
15 KiB
Tcl
Raw Permalink Normal View History

Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
#
# Copyright (c) 2009-Present, Redis Ltd.
# All rights reserved.
#
# Copyright (c) 2024-present, Valkey contributors.
# All rights reserved.
#
# Licensed under your choice of (a) the Redis Source Available License 2.0
# (RSALv2); or (b) the Server Side Public License v1 (SSPLv1); or (c) the
# GNU Affero General Public License v3 (AGPLv3).
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
#
# Portions of this file are available under BSD3 terms; see REDISCONTRIBUTIONS for more information.
#
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
# This test group aims to test that all replicas share one global replication buffer,
# two replicas don't make replication buffer size double, and when there is no replica,
# replica buffer will shrink.
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
foreach rdbchannel {"yes" "no"} {
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
start_server {tags {"repl external:skip"}} {
start_server {} {
start_server {} {
start_server {} {
set replica1 [srv -3 client]
set replica2 [srv -2 client]
set replica3 [srv -1 client]
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
$replica1 config set repl-rdb-channel $rdbchannel
$replica2 config set repl-rdb-channel $rdbchannel
$replica3 config set repl-rdb-channel $rdbchannel
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
set master [srv 0 client]
set master_host [srv 0 host]
set master_port [srv 0 port]
$master config set save ""
$master config set repl-backlog-size 16384
Set repl-diskless-sync to yes by default, add repl-diskless-sync-max-replicas (#10092) 1. enable diskless replication by default 2. add a new config named repl-diskless-sync-max-replicas that enables replication to start before the full repl-diskless-sync-delay was reached. 3. put replica online sooner on the master (see below) 4. test suite uses repl-diskless-sync-delay of 0 to be faster 5. a few tests that use multiple replica on a pre-populated master, are now using the new repl-diskless-sync-max-replicas 6. fix possible timing issues in a few cluster tests (see below) put replica online sooner on the master ---------------------------------------------------- there were two tests that failed because they needed for the master to realize that the replica is online, but the test code was actually only waiting for the replica to realize it's online, and in diskless it could have been before the master realized it. changes include two things: 1. the tests wait on the right thing 2. issues in the master, putting the replica online in two steps. the master used to put the replica as online in 2 steps. the first step was to mark it as online, and the second step was to enable the write event (only after getting ACK), but in fact the first step didn't contains some of the tasks to put it online (like updating good slave count, and sending the module event). this meant that if a test was waiting to see that the replica is online form the point of view of the master, and then confirm that the module got an event, or that the master has enough good replicas, it could fail due to timing issues. so now the full effect of putting the replica online, happens at once, and only the part about enabling the writes is delayed till the ACK. fix cluster tests -------------------- I added some code to wait for the replica to sync and avoid race conditions. later realized the sentinel and cluster tests where using the original 5 seconds delay, so changed it to 0. this means the other changes are probably not needed, but i suppose they're still better (avoid race conditions)
2022-01-17 07:11:11 -05:00
$master config set repl-diskless-sync-delay 5
$master config set repl-diskless-sync-max-replicas 1
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
$master config set client-output-buffer-limit "replica 0 0 0"
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
$master config set repl-rdb-channel $rdbchannel
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
# Make sure replica3 is synchronized with master
$replica3 replicaof $master_host $master_port
wait_for_sync $replica3
# Generating RDB will take some 100 seconds
$master config set rdb-key-save-delay 1000000
populate 100 "" 16
# Make sure replica1 and replica2 are waiting bgsave
Set repl-diskless-sync to yes by default, add repl-diskless-sync-max-replicas (#10092) 1. enable diskless replication by default 2. add a new config named repl-diskless-sync-max-replicas that enables replication to start before the full repl-diskless-sync-delay was reached. 3. put replica online sooner on the master (see below) 4. test suite uses repl-diskless-sync-delay of 0 to be faster 5. a few tests that use multiple replica on a pre-populated master, are now using the new repl-diskless-sync-max-replicas 6. fix possible timing issues in a few cluster tests (see below) put replica online sooner on the master ---------------------------------------------------- there were two tests that failed because they needed for the master to realize that the replica is online, but the test code was actually only waiting for the replica to realize it's online, and in diskless it could have been before the master realized it. changes include two things: 1. the tests wait on the right thing 2. issues in the master, putting the replica online in two steps. the master used to put the replica as online in 2 steps. the first step was to mark it as online, and the second step was to enable the write event (only after getting ACK), but in fact the first step didn't contains some of the tasks to put it online (like updating good slave count, and sending the module event). this meant that if a test was waiting to see that the replica is online form the point of view of the master, and then confirm that the module got an event, or that the master has enough good replicas, it could fail due to timing issues. so now the full effect of putting the replica online, happens at once, and only the part about enabling the writes is delayed till the ACK. fix cluster tests -------------------- I added some code to wait for the replica to sync and avoid race conditions. later realized the sentinel and cluster tests where using the original 5 seconds delay, so changed it to 0. this means the other changes are probably not needed, but i suppose they're still better (avoid race conditions)
2022-01-17 07:11:11 -05:00
$master config set repl-diskless-sync-max-replicas 2
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
$replica1 replicaof $master_host $master_port
$replica2 replicaof $master_host $master_port
wait_for_condition 50 100 {
([s rdb_bgsave_in_progress] == 1) &&
[lindex [$replica1 role] 3] eq {sync} &&
[lindex [$replica2 role] 3] eq {sync}
} else {
fail "fail to sync with replicas"
}
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
test "All replicas share one global replication buffer rdbchannel=$rdbchannel" {
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
set before_used [s used_memory]
populate 1024 "" 1024 ; # Write extra 1M data
Handle primary/replica clients in IO threads (#14335) # Problem While introducing Async IO threads(https://github.com/redis/redis/pull/13695) primary and replica clients were left to be handled inside main thread due to data race and synchronization issues. This PR solves this issue with the additional hope it increases performance of replication. # Overview ## Moving the clients to IO threads Since clients first participate in a handshake and an RDB replication phases it was decided they are moved to IO-thread after RDB replication is done. For primary client this was trivial as the master client is created only after RDB sync (+ some additional checks one can see in `isClientMustHandledByMainThread`). Replica clients though are moved to IO threads immediately after connection (as are all clients) so currently in `unstable` replication happens while this client is in IO-thread. In this PR it was moved to main thread after receiving the first `REPLCONF` message from the replica, but it is a bit hacky and we can remove it. I didn't find issues between the two versions. ## Primary client (replica node) We have few issues here: - during `serverCron` a `replicationCron` is ran which periodically sends `REPLCONF ACK` message to the master, also checks for timed-out master. In order to prevent data races we utilize`IOThreadClientsCron`. The client is periodically sent to main thread and during `processClientsFromIOThread` it's checked if it needs to run the replication cron behaviour. - data races with main thread - specifically `lastinteraction` and `read_reploff` members of the primary client that are written to in `readQueryFromClient` could be accessed at the same time from main thread during execution of `INFO REPLICATION`(`genRedisInfoString`). To solve this the members were duplicated so if the client is in IO-thread it writes to the duplicates and they are synced with the original variables each time the client is send to main thread ( that means `INFO REPLICATION` could potentially return stale values). - During `freeClient` the primary client is fetched to main thread but when caching it(`replicationCacheMaster`) the thread id will remain the id of the IO thread it was from. This creates problems when resurrecting the master client. Here the call to `unbindClientFromIOThreadEventLoop` in `freeClient` was rewritten to call `keepClientInMainThread` which automatically fixes the problem. - During `exitScriptTimedoutMode` the master is queued for reprocessing (specifically process any pending commands ASAP after it's unblocked). We do that by putting it in the `server.unblocked_clients` list, which are processed in the next `beforeSleep` cycle in main thread. Since this will create a contention between main and IO thread, we just skip this queueing in `unblocked_clients` and just queue the client to main thread - the `processClientsFromIOThread` will process the pending commands just as main would have. ## Replica clients (primary node) We move the client after RDB replication is done and after replication backlog is fed with its first message. We do that so that the client's reference to the first replication backlog node is initialized before it's read from IO-thread, hence no contention with main thread on it. ### Shared replication buffer Currently in unstable the replication buffer is shared amongst clients. This is done via clients holding references to the nodes inside the buffer. A node from the buffer can be trimmed once each replica client has read it and send its contents. The reference is `client->ref_repl_buf_node`. The replication buffer is written to by main thread in `feedReplicationBuffer` and the refcounting is intrusive - it's inside the replication-buffer nodes themselves. Since the replica client changes the refcount (decreases the refcount of the node it has just read, and increases the refcount of the next node it starts to read) during `writeToClient` we have a data race with main thread when it feeds the replication buffer. Moreover, main thread also updates the `used` size of the node - how much it has written to it, compared to its capacity which the replica client relies on to know how much to read. Obviously replica being in IO-thread creates another data race here. To mitigate these issues a few new variables were added to the client's struct: - `io_curr_repl_node` - starting node this replica is reading from inside IO-thread - `io_bound_repl_node` - the last node in the replication buffer the replica sees before being send to IO-thread. These values are only allowed to be updated in main thread. The client keeps track of how much it has read into the buffer via the old `ref_repl_buf_node`. Generally while in IO-thread the replica client will now keep refcount of the `io_curr_repl_node` until it's processed all the nodes up to `io_bound_repl_node` - at that point its returned to main thread which can safely update the refcounts. The `io_bound_repl_node` reference is there so the replica knows when to stop reading from the repl buffer - imagine that replica reads from the last node of the replication buffer while main thread feeds data to it - we will create a data race on the `used` value (`_writeToClientSlave`(IO-thread) vs `feedReplicationBuffer`(main)). That's why this value is updated just before the replica is being send to IO thread. *NOTE*, this means that when replicas are handled by IO threads they will hold more than one node at a time (i.e `io_curr_repl_node` up to `io_bound_repl_node`) meaning trimming will happen a bit less frequently. Tests show no significant problems with that. (tnx to @ShooterIT for the `io_curr_repl_node` and `io_bound_repl_node` mechanism as my initial implementation had similar semantics but was way less clear) Example of how this works: * Replication buffer state at time N: | node 0| ... | node M, used_size K | * replica caches `io_curr_repl_node`=0, `io_bound_repl_node`=M and `io_bound_block_pos`=K * replica moves to IO thread and processes all the data it sees * Replication buffer state at time N + 1: | node 0| ... | node M, used_size Full | |node M + 1| |node M + 2, used_size L|, where Full > M * replica moves to main thread at time N + 1, at this point following happens - refcount to node 0 (io_curr_repl_node) is decreased - `ref_repl_buf_node` becomes node M(io_bound_repl_node) (we still have size-K bytes to process from there) - refcount to node M is increased (now all nodes from 0 up to M-1 including can be trimmed unless some other replica holds reference to them) - And just before the replica is send back to IO thread the following are updated: - `io_bound_repl_node` ref becomes node M+2 - `io_bound_block_pos` becomes L Note that replica client is only moved to main if it has processed all the data it knows about (i.e up to `io_bound_repl_node` + `io_bound_block_pos`) ### Replica clients kept in main as much as possible During implementation an issue arose - how fast is the replica client able to get knowledge about new data from the replication buffer and how fast can it trim it. In order for that to happen ASAP whenever a replica is moved to main it remains there until the replication buffer is fed new data. At that point its put in the pending write queue and special cased in handleClientsWithPendingWrites so that its send to IO thread ASAP to write the new data to replica. Also since each time the replica writes its whole repl data it knows about that means after it's send to main thread `processClientsFromIOThread` is able to immediately update the refcounts and trim whatever it can. ### ACK messages from primary Slave clients need to periodically read `REPLCONF ACK` messages from client. Since replica can remain in main thread indefinitely if no DB change occurs, a new atomic `pending_read` was added during `readQueryFromClient`. If a replica client has a pending read it's returned back to IO-thread in order to process the read even if there is no pending repl data to write. ### Replicas during shutdown During shutdown the main thread pauses write actions and periodically checks if all replicas have reached the same replication offset as the primary node. During `finishShutdown` that may or may not be the case. Either way a client data may be read from the replicas and even we may try to write any pending data to them inside `flushSlavesOutputBuffers`. In order to prevent races all the replicas from IO threads are moved to main via `fetchClientFromIOThread`. A cancel of the shutdown should be ok, since the mechanism employed by `handleClientsWithPendingWrites` should return the client back to IO thread when needed. ## Notes While adding new tests timing issues with Tsan tests were found and fixed. Also there is a data race issue caught by Tsan on the `last_error` member of the `client` struct. It happens when both IO-thread and main thread make a syscall using a `client` instance - this can happen only for primary and replica clients since their data can be accessed by commands send from other clients. Specific example is the `INFO REPLICATION` command. Although other such races were fixed, as described above, this once is insignificant and it was decided to be ignored in `tsan.sup`. --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2026-01-21 09:19:12 -05:00
# In case we are running with IO-threads we need to give a few cycles
# for IO-threads to start sending the cmd stream. If we don't do that
# the checks related to the repl_buf_mem will be incorrect as the buffer
# will still be full with the above 1Mb data.
set iothreads [s io_threads_active]
if {$iothreads && $rdbchannel == "yes"} {
after 1000
}
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
# New data uses 1M memory, but all replicas use only one
# replication buffer, so all replicas output memory is not
# more than double of replication buffer.
set repl_buf_mem [s mem_total_replication_buffers]
set extra_mem [expr {[s used_memory]-$before_used-1024*1024}]
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
if {$rdbchannel == "yes"} {
# master's replication buffers should not grow
assert {$extra_mem < 1024*1024}
assert {$repl_buf_mem < 1024*1024}
} else {
assert {$extra_mem < 2*$repl_buf_mem}
}
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
# Kill replica1, replication_buffer will not become smaller
catch {$replica1 shutdown nosave}
wait_for_condition 50 100 {
[s connected_slaves] eq {2}
} else {
fail "replica doesn't disconnect with master"
}
assert_equal $repl_buf_mem [s mem_total_replication_buffers]
}
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
test "Replication buffer will become smaller when no replica uses rdbchannel=$rdbchannel" {
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
# Make sure replica3 catch up with the master
wait_for_ofs_sync $master $replica3
set repl_buf_mem [s mem_total_replication_buffers]
# Kill replica2, replication_buffer will become smaller
catch {$replica2 shutdown nosave}
wait_for_condition 50 100 {
[s connected_slaves] eq {1}
} else {
fail "replica2 doesn't disconnect with master"
}
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
if {$rdbchannel == "yes"} {
# master's replication buffers should not grow
assert {1024*512 > [s mem_total_replication_buffers]}
} else {
assert {[expr $repl_buf_mem - 1024*1024] > [s mem_total_replication_buffers]}
}
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
}
}
}
}
}
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
}
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
# This test group aims to test replication backlog size can outgrow the backlog
# limit config if there is a slow replica which keep massive replication buffers,
# and replicas could use this replication buffer (beyond backlog config) for
# partial re-synchronization. Of course, replication backlog memory also can
# become smaller when master disconnects with slow replicas since output buffer
# limit is reached.
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
foreach rdbchannel {"yes" "no"} {
start_server {tags {"repl external:skip debug_defrag:skip"}} {
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
start_server {} {
start_server {} {
set replica1 [srv -2 client]
set replica1_pid [s -2 process_id]
set replica2 [srv -1 client]
set replica2_pid [s -1 process_id]
set master [srv 0 client]
set master_host [srv 0 host]
set master_port [srv 0 port]
$master config set save ""
$master config set repl-backlog-size 16384
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
$master config set repl-rdb-channel $rdbchannel
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
$master config set client-output-buffer-limit "replica 0 0 0"
# Executing 'debug digest' on master which has many keys costs much time
# (especially in valgrind), this causes that replica1 and replica2 disconnect
# with master.
$master config set repl-timeout 1000
$replica1 config set repl-timeout 1000
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
$replica1 config set repl-rdb-channel $rdbchannel
$replica1 config set client-output-buffer-limit "replica 1024 0 0"
$replica2 config set repl-timeout 1000
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
$replica2 config set client-output-buffer-limit "replica 1024 0 0"
$replica2 config set repl-rdb-channel $rdbchannel
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
$replica1 replicaof $master_host $master_port
wait_for_sync $replica1
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
test "Replication backlog size can outgrow the backlog limit config rdbchannel=$rdbchannel" {
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
# Generating RDB will take 1000 seconds
$master config set rdb-key-save-delay 1000000
populate 1000 master 10000
$replica2 replicaof $master_host $master_port
# Make sure replica2 is waiting bgsave
wait_for_condition 5000 100 {
([s rdb_bgsave_in_progress] == 1) &&
[lindex [$replica2 role] 3] eq {sync}
} else {
fail "fail to sync with replicas"
}
# Replication actual backlog grow more than backlog setting since
# the slow replica2 kept replication buffer.
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
populate 20000 master 10000
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
assert {[s repl_backlog_histlen] > [expr 10000*10000]}
}
# Wait replica1 catch up with the master
wait_for_condition 1000 100 {
[s -2 master_repl_offset] eq [s master_repl_offset]
} else {
fail "Replica offset didn't catch up with the master after too long time"
}
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
test "Replica could use replication buffer (beyond backlog config) for partial resynchronization rdbchannel=$rdbchannel" {
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
# replica1 disconnects with master
$replica1 replicaof [srv -1 host] [srv -1 port]
# Write a mass of data that exceeds repl-backlog-size
populate 10000 master 10000
# replica1 reconnects with master
$replica1 replicaof $master_host $master_port
wait_for_condition 1000 100 {
[s -2 master_repl_offset] eq [s master_repl_offset]
} else {
fail "Replica offset didn't catch up with the master after too long time"
}
# replica2 still waits for bgsave ending
assert {[s rdb_bgsave_in_progress] eq {1} && [lindex [$replica2 role] 3] eq {sync}}
# master accepted replica1 partial resync
assert_equal [s sync_partial_ok] {1}
assert_equal [$master debug digest] [$replica1 debug digest]
}
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
test "Replication backlog memory will become smaller if disconnecting with replica rdbchannel=$rdbchannel" {
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
assert {[s repl_backlog_histlen] > [expr 2*10000*10000]}
assert_equal [s connected_slaves] {2}
pause_process $replica2_pid
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
r config set client-output-buffer-limit "replica 128k 0 0"
# trigger output buffer limit check
r set key [string repeat A [expr 64*1024]]
# master will close replica2's connection since replica2's output
# buffer limit is reached, so there only is replica1.
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
# In case of rdbchannel=yes, main channel will be disconnected only.
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
wait_for_condition 100 100 {
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
[s connected_slaves] eq {1} ||
([s connected_slaves] eq {2} &&
[string match {*slave*state=wait_bgsave*} [$master info]])
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
} else {
fail "master didn't disconnect with replica2"
}
# Since we trim replication backlog inrementally, replication backlog
# memory may take time to be reclaimed.
wait_for_condition 1000 100 {
[s repl_backlog_histlen] < [expr 10000*10000]
} else {
fail "Replication backlog memory is not smaller"
}
resume_process $replica2_pid
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
}
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
# speed up termination
$master config set shutdown-timeout 0
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
}
}
}
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
}
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
foreach rdbchannel {"yes" "no"} {
test "Partial resynchronization is successful even client-output-buffer-limit is less than repl-backlog-size rdbchannel=$rdbchannel" {
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
start_server {tags {"repl external:skip"}} {
start_server {} {
r config set save ""
r config set repl-backlog-size 100mb
r config set client-output-buffer-limit "replica 512k 0 0"
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
r config set repl-rdb-channel $rdbchannel
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
set replica [srv -1 client]
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
$replica config set repl-rdb-channel $rdbchannel
Replication backlog and replicas use one global shared replication buffer (#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * | refcount = 1 | ... | refcount = 0 | ... | refcount = 2 | * +--------------+ +--------------+ +--------------+ * | / \ * | / \ * | / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. */ /* Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. */ typedef struct replBufBlock { int refcount; /* Number of replicas or repl backlog using. */ long long id; /* The unique incremental number. */ long long repl_offset; /* Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
2021-10-25 02:24:31 -04:00
$replica replicaof [srv 0 host] [srv 0 port]
wait_for_sync $replica
set big_str [string repeat A [expr 10*1024*1024]] ;# 10mb big string
r multi
r client kill type replica
r set key $big_str
r set key $big_str
r debug sleep 2 ;# wait for replica reconnecting
r exec
# When replica reconnects with master, master accepts partial resync,
# and don't close replica client even client output buffer limit is
# reached.
r set key $big_str ;# trigger output buffer limit check
wait_for_ofs_sync r $replica
# master accepted replica partial resync
assert_equal [s sync_full] {1}
assert_equal [s sync_partial_ok] {1}
r multi
r set key $big_str
r set key $big_str
r exec
# replica's reply buffer size is more than client-output-buffer-limit but
# doesn't exceed repl-backlog-size, we don't close replica client.
wait_for_condition 1000 100 {
[s -1 master_repl_offset] eq [s master_repl_offset]
} else {
fail "Replica offset didn't catch up with the master after too long time"
}
assert_equal [s sync_full] {1}
assert_equal [s sync_partial_ok] {1}
}
}
}
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
# This test was added to make sure big keys added to the backlog do not trigger psync loop.
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
test "Replica client-output-buffer size is limited to backlog_limit/16 when no replication data is pending rdbchannel=$rdbchannel" {
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
proc client_field {r type f} {
set client [$r client list type $type]
if {![regexp $f=(\[a-zA-Z0-9-\]+) $client - res]} {
error "field $f not found for in $client"
}
return $res
}
start_server {tags {"repl external:skip"}} {
start_server {} {
set replica [srv -1 client]
set replica_host [srv -1 host]
set replica_port [srv -1 port]
set master [srv 0 client]
set master_host [srv 0 host]
set master_port [srv 0 port]
$master config set maxmemory-policy allkeys-lru
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
$master config set repl-backlog-size 16384
$master config set client-output-buffer-limit "replica 32768 32768 60"
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
$master config set repl-rdb-channel $rdbchannel
$replica config set repl-rdb-channel $rdbchannel
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
# Key has has to be larger than replica client-output-buffer limit.
set keysize [expr 256*1024]
$replica replicaof $master_host $master_port
wait_for_condition 50 100 {
[lindex [$replica role] 0] eq {slave} &&
[string match {*master_link_status:up*} [$replica info replication]]
} else {
fail "Can't turn the instance into a replica"
}
# Write a big key that is gonna breach the obuf limit and cause the replica to disconnect,
# then in the same event loop, add at least 16 more keys, and enable eviction, so that the
# eviction code has a chance to call flushSlavesOutputBuffers, and then run PING to trigger the eviction code
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
set _v [prepare_value $keysize]
$master write "[format_command mset key $_v k1 1 k2 2 k3 3 k4 4 k5 5 k6 6 k7 7 k8 8 k9 9 ka a kb b kc c kd d ke e kf f kg g kh h]config set maxmemory 1\r\nping\r\n"
$master flush
$master read
$master read
$master read
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
wait_for_ofs_sync $master $replica
# Write another key to force the test to wait for another event loop iteration so that we
# give the serverCron a chance to disconnect replicas with COB size exceeding the limits
$master config set maxmemory 0
$master set key1 1
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
wait_for_ofs_sync $master $replica
assert {[status $master connected_slaves] == 1}
wait_for_condition 50 100 {
[client_field $master replica tot-mem] < $keysize
} else {
fail "replica client-output-buffer usage is higher than expected."
}
# now we expect the replica to re-connect but fail partial sync (it doesn't have large
# enough COB limit and must result in a full-sync)
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
assert {[status $master sync_partial_ok] == 0}
# Before this fix (#11905), the test would trigger an assertion in 'o->used >= c->ref_block_pos'
test {The update of replBufBlock's repl_offset is ok - Regression test for #11666} {
set rd [redis_deferring_client]
set replid [status $master master_replid]
set offset [status $master repl_backlog_first_byte_offset]
$rd psync $replid $offset
assert_equal {PONG} [$master ping] ;# Make sure the master doesn't crash.
$rd close
}
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00
}
}
}
Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
2025-01-13 07:09:52 -05:00
}
Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client out put buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to be (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation, the actual changes are just the `min(max` and the loop. 3. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>
2023-03-12 13:47:06 -04:00