redis

mirror of https://github.com/redis/redis.git synced 2026-05-28 04:02:46 -04:00

Author	SHA1	Message	Date
Mincho Paskalev	e3c38aab66	Handle primary/replica clients in IO threads (#14335) # Problem While introducing async IO threads (https://github.com/redis/redis/pull/13695), primary and replica clients were left to be handled inside the main thread due to data race and synchronization issues. This PR solves this issue, with the additional hope that it increases replication performance. # Overview ## Moving the clients to IO threads Since clients first participate in a handshake and an RDB replication phase, it was decided that they are moved to an IO thread after RDB replication is done. For the primary client this was trivial, as the master client is created only after RDB sync (plus some additional checks one can see in `isClientMustHandledByMainThread`). Replica clients, though, are moved to IO threads immediately after connection (as are all clients), so currently in `unstable` replication happens while this client is in an IO thread. In this PR it was moved to the main thread after receiving the first `REPLCONF` message from the replica, but it is a bit hacky and we can remove it. I didn't find issues between the two versions. ## Primary client (replica node) We have a few issues here: - During `serverCron`, `replicationCron` is run, which periodically sends a `REPLCONF ACK` message to the master and also checks for a timed-out master. In order to prevent data races we utilize `IOThreadClientsCron`. The client is periodically sent to the main thread, and during `processClientsFromIOThread` it's checked whether it needs to run the replication cron behaviour. - Data races with the main thread: specifically, the `lastinteraction` and `read_reploff` members of the primary client, which are written to in `readQueryFromClient`, could be accessed at the same time from the main thread during execution of `INFO REPLICATION` (`genRedisInfoString`). 
To solve this, the members were duplicated, so if the client is in an IO thread it writes to the duplicates, and they are synced with the original variables each time the client is sent to the main thread (that means `INFO REPLICATION` could potentially return stale values). - During `freeClient` the primary client is fetched to the main thread, but when caching it (`replicationCacheMaster`) the thread id will remain the id of the IO thread it came from. This creates problems when resurrecting the master client. Here the call to `unbindClientFromIOThreadEventLoop` in `freeClient` was rewritten to call `keepClientInMainThread`, which automatically fixes the problem. - During `exitScriptTimedoutMode` the master is queued for reprocessing (specifically, to process any pending commands ASAP after it's unblocked). We do that by putting it in the `server.unblocked_clients` list, which is processed in the next `beforeSleep` cycle in the main thread. Since this would create contention between the main and IO threads, we skip this queueing in `unblocked_clients` and just queue the client to the main thread; `processClientsFromIOThread` will process the pending commands just as main would have. ## Replica clients (primary node) We move the client after RDB replication is done and after the replication backlog is fed with its first message. We do that so that the client's reference to the first replication backlog node is initialized before it's read from the IO thread, hence no contention with the main thread on it. ### Shared replication buffer Currently in `unstable` the replication buffer is shared amongst clients. This is done via clients holding references to the nodes inside the buffer. A node from the buffer can be trimmed once each replica client has read it and sent its contents. The reference is `client->ref_repl_buf_node`. The replication buffer is written to by the main thread in `feedReplicationBuffer`, and the refcounting is intrusive: it's inside the replication-buffer nodes themselves. 
Since the replica client changes the refcount during `writeToClient` (it decreases the refcount of the node it has just read and increases the refcount of the next node it starts to read), we have a data race with the main thread when it feeds the replication buffer. Moreover, the main thread also updates the `used` size of the node (how much it has written to it, compared to its capacity), which the replica client relies on to know how much to read. Obviously, the replica being in an IO thread creates another data race here. To mitigate these issues a few new variables were added to the client's struct: - `io_curr_repl_node` - the starting node this replica is reading from inside the IO thread - `io_bound_repl_node` - the last node in the replication buffer the replica sees before being sent to the IO thread. These values are only allowed to be updated in the main thread. The client keeps track of how much it has read into the buffer via the old `ref_repl_buf_node`. Generally, while in an IO thread the replica client will now keep the refcount of `io_curr_repl_node` until it has processed all the nodes up to `io_bound_repl_node`; at that point it's returned to the main thread, which can safely update the refcounts. The `io_bound_repl_node` reference is there so the replica knows when to stop reading from the repl buffer. Imagine that the replica reads from the last node of the replication buffer while the main thread feeds data to it: we would create a data race on the `used` value (`_writeToClientSlave` (IO thread) vs `feedReplicationBuffer` (main)). That's why this value is updated just before the replica is sent to the IO thread. NOTE: this means that when replicas are handled by IO threads they will hold more than one node at a time (i.e. `io_curr_repl_node` up to `io_bound_repl_node`), meaning trimming will happen a bit less frequently. Tests show no significant problems with that. 
(Thanks to @ShooterIT for the `io_curr_repl_node` and `io_bound_repl_node` mechanism, as my initial implementation had similar semantics but was way less clear.) Example of how this works: * Replication buffer state at time N: | node 0 | ... | node M, used_size K | * The replica caches `io_curr_repl_node`=0, `io_bound_repl_node`=M and `io_bound_block_pos`=K * The replica moves to the IO thread and processes all the data it sees * Replication buffer state at time N + 1: | node 0 | ... | node M, used_size Full | | node M + 1 | | node M + 2, used_size L |, where Full > K * The replica moves to the main thread at time N + 1; at this point the following happens: - the refcount of node 0 (`io_curr_repl_node`) is decreased - `ref_repl_buf_node` becomes node M (`io_bound_repl_node`) (we still have size-K bytes to process from there) - the refcount of node M is increased (now all nodes from 0 up to and including M-1 can be trimmed unless some other replica holds a reference to them) - And just before the replica is sent back to the IO thread the following are updated: - the `io_bound_repl_node` ref becomes node M+2 - `io_bound_block_pos` becomes L Note that the replica client is only moved to main if it has processed all the data it knows about (i.e. up to `io_bound_repl_node` + `io_bound_block_pos`). ### Replica clients kept in main as much as possible During implementation an issue arose: how fast is the replica client able to learn about new data in the replication buffer, and how fast can it trim it? In order for that to happen ASAP, whenever a replica is moved to main it remains there until the replication buffer is fed new data. At that point it's put in the pending write queue and special-cased in `handleClientsWithPendingWrites` so that it's sent to the IO thread ASAP to write the new data to the replica. Also, since each time the replica writes all the repl data it knows about, after it's sent to the main thread `processClientsFromIOThread` is able to immediately update the refcounts and trim whatever it can. 
### ACK messages from primary Replica clients need to periodically read `REPLCONF ACK` messages from the client. Since a replica can remain in the main thread indefinitely if no DB change occurs, a new atomic `pending_read` was added during `readQueryFromClient`. If a replica client has a pending read, it's returned to the IO thread in order to process the read even if there is no pending repl data to write. ### Replicas during shutdown During shutdown the main thread pauses write actions and periodically checks whether all replicas have reached the same replication offset as the primary node. During `finishShutdown` that may or may not be the case. Either way, client data may be read from the replicas, and we may even try to write any pending data to them inside `flushSlavesOutputBuffers`. In order to prevent races, all the replicas from IO threads are moved to main via `fetchClientFromIOThread`. A cancel of the shutdown should be ok, since the mechanism employed by `handleClientsWithPendingWrites` should return the client to the IO thread when needed. ## Notes While adding new tests, timing issues with TSan tests were found and fixed. Also, there is a data race issue caught by TSan on the `last_error` member of the `client` struct. It happens when both an IO thread and the main thread make a syscall using a `client` instance. This can happen only for primary and replica clients, since their data can be accessed by commands sent from other clients. A specific example is the `INFO REPLICATION` command. Although other such races were fixed, as described above, this one is insignificant and it was decided to ignore it in `tsan.sup`. --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>	2026-01-21 16:19:12 +02:00
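The `io_curr_repl_node` / `io_bound_repl_node` hand-off described in e3c38aab66 can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: `Node`, `Replica`, `send_to_io`, `io_write_all` and `back_in_main` are hypothetical names invented here, the payload is omitted, and the real client struct and functions in Redis differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified replication buffer node (payload omitted). */
typedef struct Node {
    int refcount;        /* only ever modified by the "main thread" functions */
    size_t used;         /* bytes written so far; main thread may grow the tail's value */
    struct Node *next;
} Node;

/* Hypothetical replica bookkeeping, loosely modelled on the commit message. */
typedef struct Replica {
    Node *ref_node;      /* like ref_repl_buf_node: where sending continues */
    size_t ref_pos;      /* bytes of ref_node already sent */
    Node *io_curr;       /* node whose refcount is held while in the IO thread */
    Node *io_bound;      /* last node the replica may read in the IO thread */
    size_t io_bound_pos; /* snapshot of io_bound->used taken by the main thread */
} Replica;

/* Main thread: snapshot the visible range just before handing the replica off. */
static void send_to_io(Replica *r, Node *tail) {
    r->io_curr = r->ref_node;
    r->io_bound = tail;
    r->io_bound_pos = tail->used;
}

/* IO thread: "send" everything up to the snapshot. It never touches a refcount
 * and never reads io_bound->used, which the main thread may be changing.
 * Nodes before the bound are full, so their `used` is stable. */
static size_t io_write_all(Replica *r) {
    size_t sent = 0;
    while (r->ref_node != r->io_bound) {
        sent += r->ref_node->used - r->ref_pos;
        r->ref_node = r->ref_node->next;
        r->ref_pos = 0;
    }
    sent += r->io_bound_pos - r->ref_pos;
    r->ref_pos = r->io_bound_pos;
    return sent;
}

/* Main thread: the replica is back, so refcounts can be moved safely. */
static void back_in_main(Replica *r) {
    r->ref_node->refcount++;   /* take the new reference first */
    r->io_curr->refcount--;    /* then release the old one */
    r->io_curr = r->ref_node;
}
```

In this model the nodes between `io_curr` and `ref_node` stay pinned until `back_in_main` runs, which mirrors the note above that trimming happens a bit less frequently when replicas live in IO threads.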
debing.sun	60adba48aa	Introduce DEBUG_DEFRAG compilation option to allow running tests with activedefrag when the allocator is not jemalloc (#14326) This PR is based on https://github.com/valkey-io/valkey/pull/1303 This PR introduces a DEBUG_DEFRAG compilation option that enables activedefrag functionality even when the allocator is not jemalloc, and always forces defragmentation regardless of the amount or ratio of fragmentation. ## Using ``` make SANITIZER=address DEBUG_DEFRAG=<force|fully> ./runtest --debug-defrag ``` * DEBUG_DEFRAG=force * Ignore the threshold for defragmentation to ensure that defragmentation is always triggered. * Always reallocate pointers to probe for correctness issues in pointer reallocation. * DEBUG_DEFRAG=fully * Includes everything in the option `force`. * Additionally performs a full defrag on every defrag cycle, which is significantly slower but more accurate. --------- Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: oranagra <oran@redislabs.com>	2025-09-10 12:52:20 +08:00
Pieter Cailliau	d65102861f	Adding AGPLv3 as a license option to Redis! (#13997) Read more about [the new license option](http://redis.io/blog/agplv3/) and [the Redis 8 release](http://redis.io/blog/redis-8-ga/).	2025-05-01 14:04:22 +01:00
Ozan Tezcan	73a9b916c9	Rdb channel replication (#13732) This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 Motivation During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data that is larger than the output buffer limits, master will kill the replica connection. This may cause a replication failure. The main benefit of rdb channel replication is streaming incoming commands in parallel with the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on the replica will be receiving the replication stream while the rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to a TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding them to the replica. This is no longer necessary; the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to the replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. Implementation steps 1. When the replica connects to the master, it sends 'rdb-channel-repl' as part of the capability exchange to let master know the replica supports rdb channel. 2. When the replica lacks sufficient data for PSYNC, master sends a +RDBCHANNELSYNC reply with the replica's client id. 
As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends the given client id back to master over the rdb channel, so that master can associate these channels (the initial replica connection will be referred to as main-channel). Then, the replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver the replication stream starting at the snapshot end offset. 4. The master main process sends the replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. The replica accumulates the replication stream in a local buffer while the RDB is being loaded into memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. Some details - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as before. - On the replica, there is a limit to replication stream buffering. The replica uses a new config `replica-full-sync-buffer-limit` to limit the number of bytes to accumulate. If it is not set, the replica inherits the `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, the replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on the master side. Depending on the configured limits on master, master may kill the replica connection. API changes in INFO output: 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving the replication stream and rdb in parallel. 
``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC: `state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` API changes in CLIENT LIST In `client list` output, rdbchannel clients will have the 'C' flag in addition to the 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` Config changes: - `replica-full-sync-buffer-limit`: Controls how much replication data the replica can accumulate during rdbchannel replication. If it is not set (a value of 0), the replica will inherit the `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing, as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions; also, rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). It affects both the master (not to respond to rdb channel requests) and the replica (not to declare the capability). Internal API changes: Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. 
Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. Testing: As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. Replica state diagram
```
 *
 * Replica state machine *
 *
 * Main channel state
 * ┌───────────────────┐
 * │RECEIVE_PING_REPLY │
 * └────────┬──────────┘
 *          │ +PONG
 * ┌────────▼──────────┐
 * │SEND_HANDSHAKE     │                    RDB channel state
 * └────────┬──────────┘            ┌───────────────────────────────┐
 *          │+OK                ┌───► RDB_CH_SEND_HANDSHAKE         │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_AUTH_REPLY │        │    REPLCONF main-ch-client-id <clientid>
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │ RDB_CH_RECEIVE_AUTH_REPLY     │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_PORT_REPLY │        │                  │ +OK
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │  RDB_CH_RECEIVE_REPLCONF_REPLY│
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_IP_REPLY   │        │                  │ +OK
 * └────────┬──────────┘        │   ┌──────────────▼────────────────┐
 *          │+OK                │   │ RDB_CH_RECEIVE_FULLRESYNC     │
 * ┌────────▼──────────┐        │   └──────────────┬────────────────┘
 * │RECEIVE_CAPA_REPLY │        │                  │+FULLRESYNC
 * └────────┬──────────┘        │                  │Rdb delivery
 *          │                   │   ┌──────────────▼────────────────┐
 * ┌────────▼──────────┐        │   │ RDB_CH_RDB_LOADING            │
 * │SEND_PSYNC         │        │   └──────────────┬────────────────┘
 * └─┬─────────────────┘        │                  │ Done loading
 *   │PSYNC (use cached-master) │                  │
 * ┌─▼─────────────────┐        │                  │
 * │RECEIVE_PSYNC_REPLY│        │    ┌────────────►│ Replica streams replication
 * └─┬─────────────────┘        │    │             │ buffer into memory
 *   │                          │    │             │
 *   │+RDBCHANNELSYNC client-id │    │             │
 *   ├──────┬───────────────────┘    │             │
 *   │      │ Main channel           │             │
 *   │      │ accumulates repl data  │             │
 *   │   ┌──▼────────────────┐       │     ┌───────▼───────────┐
 *   │   │ REPL_TRANSFER     ├───────┘     │   CONNECTED       │
 *   │   └───────────────────┘             └────▲───▲──────────┘
 *   │                                          │   │
 *   │                                          │   │
 *   │  +FULLRESYNC    ┌───────────────────┐    │   │
 *   ├─────────────────► REPL_TRANSFER     ├────┘   │
 *   │                 └───────────────────┘        │
 *   │  +CONTINUE                                   │
 *   └──────────────────────────────────────────────┘
 */
```
----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>	2025-01-13 15:09:52 +03:00
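The replica-side buffering limit described in 73a9b916c9 (`replica-full-sync-buffer-limit`, falling back to the `client-output-buffer-limit <replica>` hard limit when unset) might be modelled roughly as follows. This is a hedged sketch, not the Redis implementation: `FullSyncBuffer`, `effective_limit`, `buffer_accumulate` and `buffer_drain` are hypothetical names, and treating an effective limit of 0 as "unlimited" is an assumption made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the replica's local replication-stream buffer. */
typedef struct FullSyncBuffer {
    size_t limit; /* effective limit in bytes; 0 is treated as unlimited (assumption) */
    size_t size;  /* cf. replica_full_sync_buffer_size */
    size_t peak;  /* cf. replica_full_sync_buffer_peak */
} FullSyncBuffer;

/* A configured value of 0 means "not set": inherit the COB hard limit. */
static size_t effective_limit(size_t configured, size_t cob_hard_limit) {
    return configured != 0 ? configured : cob_hard_limit;
}

/* Try to accumulate `len` bytes of replication stream.
 * Returns 1 if accepted, 0 if the replica should stop accumulating for now.
 * Refusing is not a failure: the data then piles up on the master side. */
static int buffer_accumulate(FullSyncBuffer *b, size_t len) {
    if (b->limit != 0 && b->size + len > b->limit) return 0;
    b->size += len;
    if (b->size > b->peak) b->peak = b->size;
    return 1;
}

/* After the RDB is loaded, the accumulated stream is applied to the db. */
static void buffer_drain(FullSyncBuffer *b, size_t len) {
    assert(len <= b->size);
    b->size -= len;
}
```

Keeping `peak` separate from `size` matches the two INFO fields listed above: the first reports the current backlog, the second the high-water mark over the process lifetime.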
Oran Agra	f228ec1ea5	flushSlavesOutputBuffers should not write to replicas scheduled to drop (#12242) Writing to them will increase the size of an already large COB (one that already passed the threshold for disconnection). This could also mean that we'll attempt to write that data to the socket and the replica will manage to read it, which will result in an undesired partial sync (undesired for the test).	2023-06-12 14:05:34 +03:00
Oran Agra	997fa41e99	Attempt to solve MacOS CI issues in GH Actions (#12013) The MacOS CI in github actions often hangs without any logs. GH argues that it's due to resource utilization, either running out of disk space, memory, or CPU starvation, and thus the runner is terminated. This PR contains multiple attempts to resolve this: 1. introducing pause_process instead of SIGSTOP, which waits for the process to stop before resuming the test, possibly resolving race conditions in some tests. This was a suspect since there was one test that could result in an infinite loop in that case. In practice this didn't help, but it is still a good idea to keep. 2. disable the `save` config in many tests that don't need it, specifically ones that use heavy writes and could create large files. 3. change the `populate` proc to use a short pipeline rather than an infinite one. 4. use `--clients 1` in the macos CI so that we don't risk running multiple resource-demanding tests in parallel. 5. enable `--verbose` to be repeated to elevate verbosity and print more info to stdout when a test or a server starts.	2023-04-12 09:19:21 +03:00
Binbin	7997874f4d	Fix tail->repl_offset update in feedReplicationBuffer (#11905) In #11666, we added a while loop that splits a big reply node into multiple nodes. The update of tail->repl_offset may be wrong. Before #11666, we would have created at most one new reply node; now we create multiple nodes if it is a big reply node. Because we update master_repl_offset at the beginning and then use it to update tail->repl_offset, the tail->repl_offset of all the nodes except the last one is incorrect. This would have led to an assertion during PSYNC; a test was added to validate that case. Besides that, the calculation of size was adjusted to fix tests that failed due to a combination of a very low backlog size and some thresholds that get violated because of the relatively high overhead of replBufBlock. So now if the backlog size / 16 is too small, we'll take PROTO_REPLY_CHUNK_BYTES instead. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-03-13 16:12:29 +02:00
xbasel	7be7834e65	Large blocks of replica client output buffer could lead to psync loops and unnecessary memory usage (#11666) This can happen when a key almost equal to or larger than the client output buffer limit of the replica is written. Example: 1. DB is empty 2. Backlog size is 1 MB 3. Client output buffer limit is 2 MB 4. Client writes a 3 MB key 5. The shared replication buffer will have a single node which contains the key written above, and it exceeds the backlog size. At this point the client output buffer usage calculation will report the replica buffer to be 3 MB (or more) even after sending all the data to the replica. The primary drops the replica connection for exceeding the limits, the replica reconnects and successfully executes partial sync, but the primary will drop the connection again because the buffer usage is still 3 MB. This happens over and over. To mitigate the problem, this fix limits the maximum size of a single backlog node to (repl_backlog_size/16). This way a single node can't exceed the limits of the COB (the COB has to be larger than the backlog). It also means that if the backlog has some excessive data it can't trim, it would be at most about 6% overuse. other notes: 1. a loop was added in feedReplicationBuffer which caused a massive LOC change due to indentation; the actual changes are just the `min(max` and the loop. 2. an unrelated change in an existing test to speed up a server termination which took 10 seconds. Co-authored-by: Oran Agra <oran@redislabs.com>	2023-03-12 19:47:06 +02:00
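The node-size cap from 7be7834e65, together with the small-backlog adjustment from 7997874f4d, can be sketched as a `min(max(...))` clamp. This is an illustrative sketch only: the function names are hypothetical, `CHUNK_BYTES` stands in for `PROTO_REPLY_CHUNK_BYTES` with an assumed value of 16 KiB, and the real `feedReplicationBuffer` also fills the remaining space of the existing tail node first.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed stand-in for PROTO_REPLY_CHUNK_BYTES (value is an assumption). */
#define CHUNK_BYTES ((size_t)16 * 1024)

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }
static size_t max_sz(size_t a, size_t b) { return a > b ? a : b; }

/* Size of the next node to allocate for `len` pending bytes:
 * never smaller than CHUNK_BYTES, never larger than backlog_size / 16,
 * unless backlog_size / 16 itself is below CHUNK_BYTES. */
static size_t next_node_size(size_t len, size_t backlog_size) {
    size_t limit = max_sz(backlog_size / 16, CHUNK_BYTES);
    return min_sz(max_sz(len, CHUNK_BYTES), limit);
}

/* Number of nodes a payload of `len` bytes is split into,
 * assuming it starts on an empty buffer (a simplification). */
static size_t nodes_needed(size_t len, size_t backlog_size) {
    size_t nodes = 0;
    while (len > 0) {
        size_t size = next_node_size(len, backlog_size);
        len -= min_sz(len, size);
        nodes++;
    }
    return nodes;
}
```

With the numbers from the example above (1 MB backlog, 3 MB key), each node is capped at 64 KiB, so the key is spread over 48 nodes and the untrimmable remainder is bounded by one node, i.e. 1/16 (about 6%) of the backlog size.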
Oran Agra	ae89958972	Set repl-diskless-sync to yes by default, add repl-diskless-sync-max-replicas (#10092) 1. enable diskless replication by default 2. add a new config named repl-diskless-sync-max-replicas that enables replication to start before the full repl-diskless-sync-delay was reached. 3. put replica online sooner on the master (see below) 4. test suite uses repl-diskless-sync-delay of 0 to be faster 5. a few tests that use multiple replicas on a pre-populated master are now using the new repl-diskless-sync-max-replicas 6. fix possible timing issues in a few cluster tests (see below) put replica online sooner on the master ---------------------------------------------------- there were two tests that failed because they needed the master to realize that the replica is online, but the test code was actually only waiting for the replica to realize it's online, and in diskless it could have been before the master realized it. changes include two things: 1. the tests wait on the right thing 2. issues in the master, putting the replica online in two steps. the master used to put the replica online in 2 steps. the first step was to mark it as online, and the second step was to enable the write event (only after getting ACK), but in fact the first step didn't contain some of the tasks to put it online (like updating the good slave count, and sending the module event). this meant that if a test was waiting to see that the replica is online from the point of view of the master, and then confirm that the module got an event, or that the master has enough good replicas, it could fail due to timing issues. so now the full effect of putting the replica online happens at once, and only the part about enabling the writes is delayed till the ACK. fix cluster tests -------------------- I added some code to wait for the replica to sync and avoid race conditions. later realized the sentinel and cluster tests were using the original 5 seconds delay, so changed it to 0. 
this means the other changes are probably not needed, but I suppose they're still better (avoid race conditions)	2022-01-17 14:11:11 +02:00
Wang Yuan	68886de085	Fix timing issue in replication buffer test (#9697) Introduced in #9166	2021-10-29 08:04:12 +03:00
Wang Yuan	c1718f9d86	Replication backlog and replicas use one global shared replication buffer (#9166) ## Background On a redis master, each replica uses its own copy of the replication buffer. That is a big waste of memory (more replicas, more waste), and allocating/freeing memory for every reply list also costs a lot. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect replicas and be unable to finish synchronization with them. If we set client-output-buffer-limit big, master may OOM when there are many replicas that each keep a lot of memory. Because the replication buffers of different replica clients are the same, one simple idea is that all replicas use only one replication buffer, which effectively saves memory. Since the replication backlog content is the same as the replicas' output buffer, we can now discard the replication backlog memory and use the global shared replication buffer to implement the replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains the content of the replication stream. The structure of the "replication buffer" is similar to the reply list that exists in every client. But the node of the list is `replBufBlock`, which has `id, repl_offset, refcount` fields.
```c
/* Replication buffer blocks is the list of replBufBlock.
 *
 * +--------------+       +--------------+       +--------------+
 * | refcount = 1 |  ...  | refcount = 0 |  ...  | refcount = 2 |
 * +--------------+       +--------------+       +--------------+
 *      |                                            /       \
 *      |                                           /         \
 *      |                                          /           \
 *  Repl Backlog                               Replica_A    Replica_B
 *
 * Each replica or replication backlog increments only the refcount of the
 * 'ref_repl_buf_node' which it points to. So when replica walks to the next
 * node, it should first increase the next node's refcount, and when we trim
 * the replication buffer nodes, we remove node always from the head node which
 * refcount is 0. If the refcount of the head node is not 0, we must stop
 * trimming and never iterate the next node. */

/* Similar with 'clientReplyBlock', it is used for shared buffers between
 * all replica clients and replication backlog. */
typedef struct replBufBlock {
    int refcount;           /* Number of replicas or repl backlog using. */
    long long id;           /* The unique incremental number. */
    long long repl_offset;  /* Start replication offset of the block. */
    size_t size, used;
    char buf[];
} replBufBlock;
```
So now when we feed the replication stream into the replication backlog and all replicas, we only need to feed the stream into the replication buffer (`feedReplicationBuffer`). In this function, we set some fields of the replication backlog and replicas to references to the global replication buffer blocks. We also need to check the replicas' output buffer limit and free them if they exceed `client-output-buffer-limit`, and trim the replication backlog if it exceeds `repl-backlog-size`. When sending a reply to replicas, we also need to iterate the replication buffer blocks and send their content. When one block has been sent completely to a replica, we decrease the current node's refcount and increase the next node's refcount, and then free blocks whose refcount is 0 from the head of the replication buffer blocks. Since we now use a linked list to manage the replication backlog, iterating all linked list nodes to find the corresponding replication buffer node may take a long time. So we create a rax tree to store some nodes as an index, but to avoid the rax tree occupying too much memory, I record one node per 64 nodes in the index. Currently, to make partial resynchronization possible as much as we can, we always let the replication backlog be the last reference of the replication buffer blocks. The backlog size may exceed our setting if slow replicas reference a vast number of replication buffer blocks, and this method doesn't increase memory usage since they share the replication buffer. 
To avoid freezing the server while freeing unreferenced replication buffer blocks when we need to trim the backlog for exceeding the backlog size setting, we trim the backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field to the INFO command; it is the total memory used by replication buffers. - `mem_clients_slaves`: now even if a replica is slow to replicate and its output buffer memory is not 0, this field may still be 0. Since the replication backlog and replicas share one global replication buffer, only if the replication buffer memory is more than the repl backlog setting size do we consider the excess as replicas' memory. Otherwise, we consider the replication buffer memory to be the consumption of the repl backlog. - Key eviction Since all replicas and the replication backlog share the global replication buffer, we consider only the part exceeding the backlog size to be the extra separate consumption of replicas. Because we trim the backlog incrementally in the background, the backlog size may exceed our setting if slow replicas that reference a vast number of replication buffer blocks disconnect. To avoid a massive eviction loop, we don't count the delayed-free replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas' memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients' output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then the replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers' memory. 
- Drop replication backlog after loading data if needed We always create the replication backlog if the server is a master. We need it because we put DELs in it when loading expired keys from the RDB, but if the RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, so to avoid the extra memory of the replication backlog, we drop it. - Multi IO threads Since all replicas and the replication backlog use the global replication buffer, if I/O threads are enabled, to guarantee thread-safe data access we must let the main thread handle sending the output buffer to all replicas. Before, other IO threads could handle sending the output buffer of all replicas. ## Other optimizations This solution resolves some other problems: - When replicas are disconnected from master for exceeding the output buffer limit, releasing the output buffer of the replicas could freeze the server if we set a big `client-output-buffer-limit` for replicas, but now it doesn't cause freezing. - This implementation may mitigate the reply list copy time (which also freezes the server) when one replica has a huge reply buffer and another replica can copy that buffer for full synchronization. Now we just copy reference info, which is very light. - If we set the replication backlog size big, it could also take a long time to copy the replication backlog into a replica's output buffer. This commit eliminates this problem. - Resizing the replication backlog size doesn't empty the current replication backlog content.	2021-10-25 09:24:31 +03:00
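The refcount rule from c1718f9d86 (a reader takes a reference on the next block before releasing the current one, and trimming frees blocks only from the head while their refcount is 0) can be illustrated with a reduced model. This is a minimal sketch, not the Redis code: `Block`, `BlockList`, `list_append`, `reader_advance` and `list_trim` are hypothetical names, the payload and the rax index are omitted, and the `max_blocks` parameter only mimics the incremental trimming mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, payload-free stand-in for replBufBlock. */
typedef struct Block {
    int refcount;       /* replicas or backlog currently pointing at this block */
    long long id;       /* unique incremental number */
    struct Block *next;
} Block;

typedef struct BlockList {
    Block *head, *tail;
    long long next_id;
    size_t len;
} BlockList;

/* Append a new block at the tail; returns NULL if allocation fails. */
static Block *list_append(BlockList *l) {
    Block *b = calloc(1, sizeof(*b));
    if (b == NULL) return NULL;
    b->id = l->next_id++;
    if (l->tail) l->tail->next = b; else l->head = b;
    l->tail = b;
    l->len++;
    return b;
}

/* Move a reader to the following block: take the new reference first,
 * then drop the old one. Returns 0 if there is no next block yet. */
static int reader_advance(Block **pos) {
    Block *next = (*pos)->next;
    if (next == NULL) return 0;
    next->refcount++;
    assert((*pos)->refcount > 0);
    (*pos)->refcount--;
    *pos = next;
    return 1;
}

/* Free up to max_blocks blocks from the head while their refcount is 0.
 * Stops at the first referenced block and never looks past it. */
static size_t list_trim(BlockList *l, size_t max_blocks) {
    size_t freed = 0;
    while (l->head && l->head->refcount == 0 && freed < max_blocks) {
        Block *b = l->head;
        l->head = b->next;
        if (l->head == NULL) l->tail = NULL;
        free(b);
        l->len--;
        freed++;
    }
    return freed;
}
```

Because trimming only ever inspects the head, one slow reader pins every block behind it, which is why the commit message notes that the backlog may temporarily exceed its configured size.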

11 commits