# Problem
While introducing Async IO
threads(https://github.com/redis/redis/pull/13695) primary and replica
clients were left to be handled inside main thread due to data race and
synchronization issues. This PR solves this issue with the additional
hope it increases performance of replication.
# Overview
## Moving the clients to IO threads
Since clients first participate in a handshake and an RDB replication
phases it was decided they are moved to IO-thread after RDB replication
is done. For primary client this was trivial as the master client is
created only after RDB sync (+ some additional checks one can see in
`isClientMustHandledByMainThread`). Replica clients though are moved to
IO threads immediately after connection (as are all clients) so
currently in `unstable` replication happens while this client is in
IO-thread. In this PR it was moved to main thread after receiving the
first `REPLCONF` message from the replica, but it is a bit hacky and we
can remove it. I didn't find issues between the two versions.
## Primary client (replica node)
We have few issues here:
- during `serverCron` a `replicationCron` is ran which periodically
sends `REPLCONF ACK` message to the master, also checks for timed-out
master. In order to prevent data races we utilize`IOThreadClientsCron`.
The client is periodically sent to main thread and during
`processClientsFromIOThread` it's checked if it needs to run the
replication cron behaviour.
- data races with main thread - specifically `lastinteraction` and
`read_reploff` members of the primary client that are written to in
`readQueryFromClient` could be accessed at the same time from main
thread during execution of `INFO REPLICATION`(`genRedisInfoString`). To
solve this the members were duplicated so if the client is in IO-thread
it writes to the duplicates and they are synced with the original
variables each time the client is send to main thread ( that means `INFO
REPLICATION` could potentially return stale values).
- During `freeClient` the primary client is fetched to main thread but
when caching it(`replicationCacheMaster`) the thread id will remain the
id of the IO thread it was from. This creates problems when resurrecting
the master client. Here the call to `unbindClientFromIOThreadEventLoop`
in `freeClient` was rewritten to call `keepClientInMainThread` which
automatically fixes the problem.
- During `exitScriptTimedoutMode` the master is queued for reprocessing
(specifically process any pending commands ASAP after it's unblocked).
We do that by putting it in the `server.unblocked_clients` list, which
are processed in the next `beforeSleep` cycle in main thread. Since this
will create a contention between main and IO thread, we just skip this
queueing in `unblocked_clients` and just queue the client to main thread
- the `processClientsFromIOThread` will process the pending commands
just as main would have.
## Replica clients (primary node)
We move the client after RDB replication is done and after replication
backlog is fed with its first message.
We do that so that the client's reference to the first replication
backlog node is initialized before it's read from IO-thread, hence no
contention with main thread on it.
### Shared replication buffer
Currently in unstable the replication buffer is shared amongst clients.
This is done via clients holding references to the nodes inside the
buffer. A node from the buffer can be trimmed once each replica client
has read it and send its contents. The reference is
`client->ref_repl_buf_node`. The replication buffer is written to by
main thread in `feedReplicationBuffer` and the refcounting is intrusive
- it's inside the replication-buffer nodes themselves.
Since the replica client changes the refcount (decreases the refcount of
the node it has just read, and increases the refcount of the next node
it starts to read) during `writeToClient` we have a data race with main
thread when it feeds the replication buffer. Moreover, main thread also
updates the `used` size of the node - how much it has written to it,
compared to its capacity which the replica client relies on to know how
much to read. Obviously replica being in IO-thread creates another data
race here. To mitigate these issues a few new variables were added to
the client's struct:
- `io_curr_repl_node` - starting node this replica is reading from
inside IO-thread
- `io_bound_repl_node` - the last node in the replication buffer the
replica sees before being send to IO-thread.
These values are only allowed to be updated in main thread. The client
keeps track of how much it has read into the buffer via the old
`ref_repl_buf_node`. Generally while in IO-thread the replica client
will now keep refcount of the `io_curr_repl_node` until it's processed
all the nodes up to `io_bound_repl_node` - at that point its returned to
main thread which can safely update the refcounts.
The `io_bound_repl_node` reference is there so the replica knows when to
stop reading from the repl buffer - imagine that replica reads from the
last node of the replication buffer while main thread feeds data to it -
we will create a data race on the `used` value
(`_writeToClientSlave`(IO-thread) vs `feedReplicationBuffer`(main)).
That's why this value is updated just before the replica is being send
to IO thread.
*NOTE*, this means that when replicas are handled by IO threads they
will hold more than one node at a time (i.e `io_curr_repl_node` up to
`io_bound_repl_node`) meaning trimming will happen a bit less
frequently. Tests show no significant problems with that.
(tnx to @ShooterIT for the `io_curr_repl_node` and `io_bound_repl_node`
mechanism as my initial implementation had similar semantics but was way
less clear)
Example of how this works:
* Replication buffer state at time N:
| node 0| ... | node M, used_size K |
* replica caches `io_curr_repl_node`=0, `io_bound_repl_node`=M and
`io_bound_block_pos`=K
* replica moves to IO thread and processes all the data it sees
* Replication buffer state at time N + 1:
| node 0| ... | node M, used_size Full | |node M + 1| |node M + 2,
used_size L|, where Full > M
* replica moves to main thread at time N + 1, at this point following
happens
- refcount to node 0 (io_curr_repl_node) is decreased
- `ref_repl_buf_node` becomes node M(io_bound_repl_node) (we still have
size-K bytes to process from there)
- refcount to node M is increased (now all nodes from 0 up to M-1
including can be trimmed unless some other replica holds reference to
them)
- And just before the replica is send back to IO thread the following
are updated:
- `io_bound_repl_node` ref becomes node M+2
- `io_bound_block_pos` becomes L
Note that replica client is only moved to main if it has processed all
the data it knows about (i.e up to `io_bound_repl_node` +
`io_bound_block_pos`)
### Replica clients kept in main as much as possible
During implementation an issue arose - how fast is the replica client
able to get knowledge about new data from the replication buffer and how
fast can it trim it. In order for that to happen ASAP whenever a replica
is moved to main it remains there until the replication buffer is fed
new data. At that point its put in the pending write queue and special
cased in handleClientsWithPendingWrites so that its send to IO thread
ASAP to write the new data to replica. Also since each time the replica
writes its whole repl data it knows about that means after it's send to
main thread `processClientsFromIOThread` is able to immediately update
the refcounts and trim whatever it can.
### ACK messages from primary
Slave clients need to periodically read `REPLCONF ACK` messages from
client. Since replica can remain in main thread indefinitely if no DB
change occurs, a new atomic `pending_read` was added during
`readQueryFromClient`. If a replica client has a pending read it's
returned back to IO-thread in order to process the read even if there is
no pending repl data to write.
### Replicas during shutdown
During shutdown the main thread pauses write actions and periodically
checks if all replicas have reached the same replication offset as the
primary node. During `finishShutdown` that may or may not be the case.
Either way a client data may be read from the replicas and even we may
try to write any pending data to them inside `flushSlavesOutputBuffers`.
In order to prevent races all the replicas from IO threads are moved to
main via `fetchClientFromIOThread`. A cancel of the shutdown should be
ok, since the mechanism employed by `handleClientsWithPendingWrites`
should return the client back to IO thread when needed.
## Notes
While adding new tests timing issues with Tsan tests were found and
fixed.
Also there is a data race issue caught by Tsan on the `last_error`
member of the `client` struct. It happens when both IO-thread and main
thread make a syscall using a `client` instance - this can happen only
for primary and replica clients since their data can be accessed by
commands send from other clients. Specific example is the `INFO
REPLICATION` command.
Although other such races were fixed, as described above, this once is
insignificant and it was decided to be ignored in `tsan.sup`.
---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
1) Replace fixed sleep with wait_for_condition to avoid flaky test
failures when checking master_current_sync_attempts counter.
2) Similar to https://github.com/redis/redis/pull/14674, use
assert_lessthan_equal instead of assert_lessthan to verify the idle
time.
Follow https://github.com/redis/redis/pull/14423
In https://github.com/redis/redis/pull/14423,
I thought the last lpNext operation of the iterator occurred at the end
of streamIteratorGetID.
However, I overlooked the fact that after calling
`streamIteratorGetID()`, we might still use `streamIteratorGetField()`
to continue moving within the current entry.
This means that during reverse iteration, the iterator could move back
to a previous entry position.
To fix this, in this PR I record the current position at the beginning
of streamIteratorGetID().
When we enter it again next time, we ensure that the entry position does
not exceed the previous one,
that is, during forward iteration the entry must be greater than the
last entry position,
and during reverse iteration it must be smaller than the last entry
position.
Note that the fix for https://github.com/redis/redis/pull/14423 has been
replaced by this fix.
`repl-diskless-load` feature can effectively reduce the time of full
synchronization, but maybe it is not widely used.
`swapdb` option needs double `maxmemory`, and `on-empty-db` only works
on the first full sync (the replica must have no data).
This PR introduce a new option: `flushdb` - Always flush the entire
dataset before diskless load. If the diskless load fails, the replica
will lose all existing data.
Of course, it brings the risk of data loss, but it provides a choice if
you want to reduce full sync time and accept this risk.
Fixes#14538
If the master uses diskless synchronization and the replica uses
diskless load, we can disable RDB compression to reduce full sync time.
I tested on AWS and found we could reduce time by 20-40%.
In terms of implementation, when the replica can use diskless load, the
replica will send `replconf rdb-no-compress 1` to master to deliver a
RDB without compression.
If your network is slow, please disable repl-diskless-load, and maybe
even repl-diskless-sync
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
This PR fixes an issue(#14541) where EXEC’s ACL recheck was still being
performed during AOF loading, that may cause AOF loading failed, if ACL
rules are changed and don't allow some commands in MULTI-EXEC.
Currently, it is difficult to determine whether an RDB persistence
failure is caused by a persistent issue in the process or by a temporary
error.
Tracking the number of consecutive RDB persistence failures would help
clarify this distinction. Multiple failures in a row could indicate an
ongoing problem, while isolated cases may point to transient or
environmental issues.
This PR follows https://github.com/redis/redis/pull/14058
When Redis is shut down uncleanly (e.g., due to power loss or system
crashes), invalid bytes may remain at the end of AOF files.
This PR refactors the AOF corruption handling system and introduces
automatic recovery capabilities to improve resilience and reduce
downtime. The feature only handles corruption at the end of files,
respects the configured size limit to prevent excessive data loss,
maintains all valid commands before the corruption point.
Removed aof-load-broken configuration option
Rename aof-load-broken-max-size to aof-load-corrupt-tail-max-size.
New Config: aof-load-corrupt-tail-max-size - enables recovery for
corruption up to specified size
When the `lp_count` of a stream entry is incorrect, and we are
performing a reverse iteration, we can normally move backward by
`lp_count` to locate the previous entry.
However, if `lp_count` is wrong, during forward iteration, after
obtaining the flag, ID, and other fields, we may step beyond the current
entry into the start position of the next entry(where we came from),
which could cause an infinite loop.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This issue was introduced by https://github.com/redis/redis/pull/10440
In that PR, we avoided resetting the current user during processCommand,
but overlooked the fact that this client might not be the current one,
it could be a client that was previously blocked due to shutdown.
If we don’t reset these clients, and the shutdown is canceled, then when
these clients continue executing other commands, they will trigger an
assertion.
This PR delays the operation of resetting the client to
processUnblockedClients and no longer skips SHUTDOWN_BLOCKED clients.
This PR is based on https://github.com/valkey-io/valkey/pull/1303
This PR introduces a DEBUG_DEFRAG compilation option that enables
activedefrag functionality even when the allocator is not jemalloc, and
always forces defragmentation regardless of the amount or ratio of
fragmentation.
## Using
```
make SANITIZER=address DEBUG_DEFRAG=<force|fully>
./runtest --debug-defrag
```
* DEBUG_DEFRAG=force
* Ignore the threshold for defragmentation to ensure that
defragmentation is always triggered.
* Always reallocate pointers to probe for correctness issues in pointer
reallocation.
* DEBUG_DEFRAG=fully
* Includes everything in the option `force`.
* Additionally performs a full defrag on every defrag cycle, which is
significantly slower but more accurate.
---------
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: oranagra <oran@redislabs.com>
When Redis is shut down uncleanly (e.g., due to power loss), invalid
bytes may remain at the end of the AOF file. Currently, Redis detects
such corruption only after parsing most of the AOF, leading to delayed
error detection and increased downtime. Manual recovery via
`redis-check-aof --fix` is also time-consuming.
This fix introduces two new options to improve resilience and reduce
downtime:
- `aof-load-broken`: Enables automatic detection and repair of broken
AOF tails.
- `aof-load-broken-max-size`: Sets a maximum threshold (in bytes) for
the corrupted tail size that Redis will attempt to fix automatically
without requiring user intervention.
- tests/unit/maxmemory.tcl
If multithreaded, we need to let IO threads have chance to reply output
buffer, to avoid next commands causing eviction. After eviction is
performed,
the next command becomes ready immediately in IO threads, and now we
enqueue
the client to be processed in main thread’s beforeSleep without
notification.
However, invalidation messages generated by eviction may not have been
fully
delivered by that time. As a result, executing the command in
beforeSleep of
the event loop (running eviction) can cause additional keys to be
evicted.
```
Expected '73' to be between to '200' and '300' (context: type source line 473 file
redis/tests/unit/maxmemory.tcl cmd {assert_range [r dbsize] 200 300} proc ::test)
```
the reason why CI doesn't find this issue is that we skill this test
`tsan:skip` as below
`start_server {tags {"maxmemory external:skip tsan:skip"}} `,so remove
this tag.
- tests/integration/aof.tcl
Because IO and the main thread are working in better parallelism without
notification,
the main thread may haven't write AOF buffer into file, but the IO
thread just writes
the reply, so the clients receive the reply before AOF file is changed.
We should use `appendfsync always` policy to make the command is written
into
AOF file when receiving reply.
```
Expected '0' to be equal to '54' (context: type source line 249 file
redis/tests/integration/aof.tcl cmd {assert_equal $before $after} proc ::test)
```
#13969 makes these scenarios easy to appear.
If replica detects broken connection while reading min expiration time
of hfe key, it calls exit().
Fixed it to handle the error gracefully without calling exit.
To reproduce the issue, the short-read test was modified to generate
many small hfe keys, increasing the likelihood of a connection drop
while reading min expiration time:
```tcl
for {set k 0} {$k < 50000} {incr k} {
for {set i 0} {$i < 1} {incr i} {
r hsetex "$k hfe_small" EX [expr {int(rand()*10)}] FIELDS 1 [string repeat A [expr {int(rand()*10)}]] 0[string repeat A [expr {int(rand()*10)}]]
}
}
```
We can't have the test use only hfe keys, so a few were added alongside
other data. I couldn't reproduce the issue this way but with the test's
randomization, it should hit this scenario in one of the runs.
Merge fork counters with https://github.com/redis/redis/pull/12957
repl_current_sync_attempts - Total number of attempts to connect to a
master since the last time we disconnected from a good connection (or a
configuration change). any number greater than 1 (even if the link is
currently up), indicates an issue.
repl_total_sync_attempts - Number of times in current configuration, the
replica attempted to sync to a master. (dosent reset on master
reconnect.)
repl_total_disconnect_time - Total cumulative time we've been
disconnected as a replica, visible when the link is up too.
master_link_up_since_seconds - Number of seconds since the link is down,
just maintain symmetry with master_link_down_since_seconds.
Add thread sanitizer run to daily CI.
Few tests are skipped in tsan runs for two reasons:
* Stack trace producing tests (oom, `unit/moduleapi/crash`, etc) are
tagged `tsan:skip` because redis calls `backtrace()` in signal handler
which turns out to be signal-unsafe since it might allocate memory (e.g.
glibc 2.39 does it through a call to `_dl_map_object_deps()`).
* Few tests become flaky with thread sanitizer builds and don't finish
in expected deadlines because of the additional tsan overhead. Instead
of skipping those tests, this can improved in the future by allowing
more iterations when waiting for tsan builds.
Deadlock detection is disabled for now because of tsan limitation where
max 64 locks can be taken at once.
There is one outstanding (false-positive?) race in jemalloc which is
suppressed in `tsan.sup`.
Fix few races thread sanitizer reported having to do with writes from
signal handlers. Since in multi-threaded setting signal handlers might
be called on any thread (modulo pthread_sigmask) while the main thread
is running, `volatile sig_atomic_t` type is not sufficient and atomics
are used instead.
Hi all, this PR fixes two things:
1. An assertion, that prevented the RDB loading from recovery if there
was a quantization type mismatch (with regression test).
2. Two code paths that just returned NULL without proper cleanup during
RDB loading.
The idea of packing the key (`sds`), value (`robj`) and optionally TTL
into a single struct in memory was mentioned a few times in the past by
the community in various flavors. This approach improves memory
efficiency, reduces pointer dereferences for faster lookups, and
simplifies expiration management by keeping all relevant data in one
place. This change goes along with setting keyspace's dict to
no_value=1, and saving considerable amount of memory.
Two more motivations that well aligned with this unification are:
- Prepare the groundwork for replacing EXPIRE scan based implementation
and evaluate instead new `ebuckets` data structure that was introduced
as part of [Hash Field Expiration
feature](https://redis.io/blog/hash-field-expiration-architecture-and-benchmarks/).
Using this data structure requires embedding the ExpireMeta structure
within each object.
- Consider replacing dict with a more space efficient open addressing
approach hash table that might rely on keeping a single pointer to
object.
Before this PR, I POC'ed on a variant of open addressing hash-table and
was surprised to find that dict with no_value actually could provide a
good balance between performance, memory efficiency, and simplicity.
This realization prompted the separation of the unification step from
the evaluation of a new hash table to avoid introducing too many changes
at once and to evaluate its impact independently before considering
replacement of existing hash-table. On an earlier
[commit](https://github.com/redis/redis/pull/13683) I extended dict
no_value optimization (which saves keeping dictEntry where possible) to
be relevant also for objects with even addresses in memory. Combining it
with this unification saves a considerable amount of memory for
keyspace.
# kvobj
This PR adopts Valkey’s
[packing](3eb8314be6)
layout and logic for key, value, and TTL. However, unlike Valkey
implementation, which retained a common `robj` throughout the project,
this PR distinguishes between the general-purpose, overused `robj`, and
the new `kvobj`, which embeds both the key and value and used by the
keyspace. Conceptually, `robj` serves as a base class, while `kvobj`
acts as a derived class.
Two new flags introduced into redis object, `iskvobj` and `expirable`:
```
struct redisObject {
unsigned type:4;
unsigned encoding:4;
unsigned lru:LRU_BITS;
unsigned iskvobj : 1; /* new flag */
unsigned expirable : 1; /* new flag */
unsigned refcount : 30; /* modified: 32bits->30bits */
void *ptr;
};
typedef struct redisObject robj;
typedef struct redisObject kvobj;
```
When the `iskvobj` flag is set, the object includes also the key and it
is appended to the end of the object. If the `expirable` flag is set, an
additional 8 bytes are added to the object. If the object is of type
string, and the string is rather short, then it will be embedded as
well.
As a result, all keys in the keyspace are promoted to be of type
`kvobj`. This term attempts to align with the existing Redis object,
robj, and the kvstore data structure.
# EXPIRE Implementation
As `kvobj` embeds expiration time as well, looking up expiration times
is now an O(1) operation. And the hash-table of EXPIRE is set now to be
`no_value` mode, directly referencing `kvobj` entries, and in turn,
saves memory.
Next, I plan to evaluate replacing the EXPIRE implementation with the
[ebuckets](https://github.com/redis/redis/blob/unstable/src/ebuckets.h)
data structure, which would eliminate keyspace scans for expired keys.
This requires embedding `ExpireMeta` within each `kvobj` of each key
with expiration. In such implementation, the `expirable` flag will be
shifted to indicate whether `ExpireMeta` is attached.
# Implementation notes
## Manipulating keyspace (find, modify, insert)
Initially, unifying the key and value into a single object and storing
it in dict with `no_value` optimization seemed like a quick win.
However, it (quickly) became clear that this change required deeper
modifications to how keys are manipulated. The challenge was handling
cases where a dictEntry is opt-out due to no_value optimization. In such
cases, many of the APIs that return the dictEntry from a lookup become
insufficient, as it just might be the key itself. To address this issue,
a new-old approach of returning a "link" to the looked-up key's
`dictEntry` instead of the `dictEntry` itself. The term `link` was
already somewhat available in dict API, and is well aligned with the new
dictEntLink declaration:
```
typedef dictEntry **dictEntLink;
```
This PR introduces two new function APIs to dict to leverage returned
link from the search:
```
dictEntLink dictFindLink(dict *d, const void *key, dictEntLink *bucket);
void dictSetKeyAtLink(dict *d, void *key, dictEntLink *link, int newItem);
```
After calling `link = dictFindLink(...)`, any necessary updates must be
performed immediately after by calling `dictSetKeyAtLink()` without any
intervening operations on given dict. Otherwise, `dictEntLink` may
become invalid. Example:
```
/* replace existing key */
link = dictFindLink(d, key, &bucket, 0);
// ... Do something, but don't modify the dict ...
// assert(link != NULL);
dictSetKeyAtLink(d, kv, &link, 0);
/* Add new value (If no space for the new key, dict will be expanded and
bucket will be looked up again.) */
link = dictFindLink(d, key, &bucket);
// ... Do something, but don't modify the dict ...
// assert(link == NULL);
dictSetKeyAtLink(d, kv, &bucket, 1);
```
## dict.h
- The dict API has became cluttered with many unused functions. I have
removed these from dict.h.
- Additionally, APIs specifically related to hash maps (no_value=0),
primarily those handling key-value access, have been gathered and
isolated.
- Removed entirely internal functions ending with “*ByHash()” that were
originally added for optimization and not required any more.
- Few other legacy dict functions were adapted at API level to work with
the term dictEntLink as well.
- Simplified and generalized an optimization that related to comparison
of length of keys of type strings.
## Hash Field Expiration
Until now each hash object with expiration on fields needed to maintain
a reference to its key-name (of the hash object), such that in case it
will be active-expired, then it will be possible to resolve the key-name
for the notification sake. Now there is no need anymore.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Now we have RDB channel in https://github.com/redis/redis/pull/13732,
child process can transfer RDB in a background method, instead of
handled by main thread. So when redis-cli gets RDB from server, we can
adopt this way to reduce the main thread load.
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
This PR adds support for REDISMODULE_OPTIONS_HANDLE_IO_ERRORS.
and tests for short read and corrupted RESTORE payload.
Please: note that I also removed the comment about async loading support
since we should be already covered. No manipulation of global data
structures in Vector Sets, if not for the unique ID used to create new
vector sets with different IDs.
from the master's perspective, the replica can become online before it's
actually done loading the rdb file.
this was always like that, in disk-based repl, and thus ok with diskless
and rdb channel.
in this test, because all the keys are added before the backlog is
created, the replication offset is 0, so the test proceeds and could get
a LOADING error when trying to run the function.
Test fails time to time:
```
*** [err]: Slave is able to detect timeout during handshake in tests/integration/replication.tcl
Replica is not able to detect timeout
```
Depending on the timing, "debug sleep" may occur during rdbchannel
handshake and required log statement won't be printed to the log in that
case. Better to wait after rdbchannel handshake.
With RDB channel replication, we introduced parallel replication stream
and RDB delivery to the replica during a full sync. Currently, after the
replica loads the RDB and begins streaming the accumulated buffer to the
database, it does not read from the master connection during this
period. Although streaming the local buffer is generally a fast
operation, it can take some time if the buffer is large. This PR
introduces buffering during the streaming of the local buffer. One
important consideration is ensuring that we consume more than we read
during this operation; otherwise, it could take indefinitely. To
guarantee that it will eventually complete, we limit the read to at most
half of what we consume, e.g. read at most 1 mb once we consume at least
2 mb.
**Additional changes**
**Bug fix**
- Currently, when replica starts draining accumulated buffer, we call
protectClient() for the master client as we occasionally yield back to
event loop via processEventsWhileBlocked(). So, it prevents freeing the
master client. While we are in this loop, if replica receives "replicaof
newmaster" command, we call replicaSetMaster() which expects to free the
master client and trigger a new connection attempt. As the client object
is protected, its destruction will happen asynchronously. Though, a new
connection attempt to new master will be made immediately. Later, when
the replication buffer is drained, we realize master client was marked
as CLOSE_ASAP, and freeing master client triggers another connection
attempt to the new master. In most cases, we realize something is wrong
in the replication state machine and abort the second attempt later. So,
the bug may go undetected. Fix is not calling protectClient() for the
master client. Instead, trying to detect if master client is
disconnected during processEventsWhileBlocked() and if so, breaking the
loop immediately.
**Related improvement:**
- Currently, the replication buffer is a linked list of buffers, each of
which is 1 MB in size. While consuming the buffer, we process one buffer
at a time and check if we need to yield back to
`processEventsWhileBlocked()`. However, if
`loading-process-events-interval-bytes` is set to less than 1 MB, this
approach doesn't handle it well. To improve this, I've modified the code
to process 16KB at a time and check
`loading-process-events-interval-bytes` more frequently. This way,
depending on the configuration, we may yield back to networking more
often.
- In replication.c, `disklessLoadingRio` will be set before a call to
`emptyData()`. This change should not introduce any behavioral change
but it is logically more correct as emptyData() may yield to networking
and we may need to call rioAbort() on disklessLoadingRio. Otherwise,
failure of main channel may go undetected until a failure on rdb channel
on a corner case.
**Config changes**
- The default value for the `loading-process-events-interval-bytes`
configuration is being lowered from 2MB to 512KB. This configuration
primarily used for testing and controls the frequency of networking
during the loading phase, specifically when loading the RDB or applying
accumulated buffers during a full sync on the replica side.
Before the introduction of RDB channel replication, the 2MB value was
sufficient for occasionally yielding to networking, mainly to reply
-loading to the clients. However, with RDB channel replication, during a
full sync on the replica side (either while loading the RDB or applying
the accumulated buffer), we need to yield back to networking more
frequently to continue accumulating the replication stream. If this
doesn’t happen often enough, the replication stream can accumulate on
the master side, which is undesirable.
To address this, we’ve decided to lower the default value to 512KB. One
concern with frequent yielding to networking is the potential
performance impact, as each call to processEventsWhileBlocked() involves
4 syscalls, which could slow down the RDB loading phase. However,
benchmarking with various configuration values has shown that using
512KB or higher does not negatively impact RDB loading performance.
Based on these results, 512KB is now selected as the default value.
**Test changes**
- Added improved version of a replication test which checks memory usage
on master during full sync.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Before https://github.com/redis/redis/pull/13732, replicas were brought
online immediately after master wrote the last bytes of the RDB file to
the socket. This behavior remains unchanged if rdbchannel replication is
not used. However, with rdbchannel replication, the replica is brought
online after receiving the first ack which is sent by replica after rdb
is loaded.
To align the behavior, reverting this change to put replica online once
bgsave is done.
Additonal changes:
- INFO field `mem_total_replication_buffers` will also contain
`server.repl_full_sync_buffer.mem_used` which shows accumulated
replication stream during rdbchannel replication on replica side.
- Deleted debug level logging from some replication tests. These tests
generate thousands of keys and it may cause per key logging on some
cases.
When the diskless load configuration is set to on-empty-db, we retain a
pointer to the function library context. When emptyData() is called, it
frees this function library context pointer, leading to a use-after-free
situation.
I refactored code to ensure that emptyData() is called first, followed
by retrieving the valid pointer to the function library context.
Refactored code should not introduce any runtime implications.
Bug introduced by https://github.com/redis/redis/pull/13495 (Redis 8.0)
Co-authored-by: Oran Agra <oran@redislabs.com>
Recently encountered some errors as bellow,
HGETEX/HSETEX with PXAT/EXAT options, after getting ttl, we calculate
current time by `[clock seconds]` that may have a delay that causes
results greater than expected.
Dismiss memory test error, now we introduced rdb-channel replication,
the full synchronization might finish before the child process exits. So
we may fail if calling `bgsave` immediately after full sync.
### Background
AOF is often used as an effective data recovery method, but now if we
have two AOFs from different nodes, it is hard to learn which one has
latest data. Generally, we determine whose data is more up-to-date by
reading the latest modification time of the AOF file, but because of
replication delay, even if both master and replica write to the AOF at
the same time, the data in the master is more up-to-date (there are
commands that didn't arrive at the replica yet, or a large number of
commands have accumulated on replica side ), so we may make wrong
decision.
### Solution
The replication offset always increments when AOF is enabled even if
there is no replica, we think replication offset is better method to
determine which one has more up-to-date data, whoever has a larger
offset will have newer data, so we add the start replication offset info
for AOF, as bellow.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224
```
And if we close gracefully the AOF file, not a crash, such as
`shutdown`, `kill signal 15` or `config set appendonly no`, we will add
the end replication offset, as bellow.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 endoffset 532
```
#### Things to pay attention to
- For BASE AOF, we do not add `startoffset` and `endoffset` info, since
we could not know the start replication replication of data, and it is
useless to help us to determine which one has more up-to-date data.
- For AOFs from old version, we also don't add `startoffset` and
`endoffset` info, since we also don't know start replication replication
of them. If we add the start offset from 0, we might make the judgment
even less accurate. For example, if the master has just rewritten the
AOF, its INCR AOF will inevitably be very small. However, if the replica
has not rewritten AOF for a long time, its INCR AOF might be much
larger. By applying the following method, we might make incorrect
decisions, so we still just check timestamp instead of adding offset
info
- If the last INCR AOF has `startoffset` or `endoffset`, we need to
restore `server.master_repl_offset` according to them to avoid the
rollback of the `startoffset` of next INCR AOF. If it has `endoffset`,
we just use this value as `server.master_repl_offset`, and a very
important thing is to remove this information from the manifest file to
avoid the next time we load the manifest file with wrong `endoffset`. If
it only has `startoffset`, we calculate `server.master_repl_offset` by
the `startoffset` plus the file size.
### How to determine which one has more up-to-date data
If AOF has a larger replication offset, it will have more up-to-date
data. The following is how to get AOF offset:
Read the AOF manifest file to obtain information about **the last INCR
AOF**
1. If the last INCR AOF has `endoffset` field, we can directly use the
`endoffset` to present the replication offset of AOF
2. If there is no `endoffset`(such as redis crashes abnormally), but
there is `startoffset` filed of the last INCR AOF, we can get the
replication offset of AOF by `startoffset` plus the file size
3. Finally, if the AOF doesn’t have both `startoffset` and `endoffset`,
maybe from old version, and new version redis has not rewritten AOF yet,
we still need to check the modification timestamp of the last INCR AOF
### TODO
Fix ping causing inconsistency between AOF size and replication
offset in the future PR. Because we increment the replication offset
when sending PING/REPLCONF to the replica but do not write data to the
AOF file, this might cause the starting offset of the AOF file plus its
size to be inconsistent with the actual replication offset.
During fullsync, before loading RDB on the replica, we stop aof child to
prevent copy-on-write disaster.
Once rdb is loaded, aof is started again and it will trigger aof
rewrite. With https://github.com/redis/redis/pull/13732 , for rdbchannel
replication, this behavior was changed. Currently, we start aof after
replication buffer is streamed to db. This PR changes it back to start
aof just after rdb is loaded (before repl buffer is streamed)
Both approaches may have pros and cons. If we start aof before streaming
repl buffers, we may still face with copy-on-write issues as repl
buffers potentially include large amount of changes. If we wait until
replication buffer drained, it means we are delaying starting aof
persistence.
Additional changes are introduced as part of this PR:
- Interface change:
Added `mem_replica_full_sync_buffer` field to the `INFO MEMORY` command
reply. During full sync, it shows total memory consumed by accumulated
replication stream buffer on replica. Added same metric to `MEMORY
STATS` command reply as `replica.fullsync.buffer` field.
- Fixes:
- Count repl stream buffer size of replica as part of 'memory overhead'
calculation for fields in "INFO MEMORY" and "MEMORY STATS" outputs.
Before this PR, repl buffer was not counted as part of memory overhead
calculation, causing misreports for fields like `used_memory_overhead`
and `used_memory_dataset` in "INFO STATS" and for `overhead.total` field
in "MEMORY STATS" command reply.
- Dismiss replication stream buffers memory of replica in the fork to
reduce COW impact during a fork.
- Fixed a few time sensitive flaky tests, deleted a noop statement,
fixed some comments and fail messages in rdbchannel tests.
This PR is based on:
https://github.com/redis/redis/pull/12109https://github.com/valkey-io/valkey/pull/60
Closes: https://github.com/redis/redis/issues/11678
**Motivation**
During a full sync, when master is delivering RDB to the replica,
incoming write commands are kept in a replication buffer in order to be
sent to the replica once RDB delivery is completed. If RDB delivery
takes a long time, it might create memory pressure on master. Also, once
a replica connection accumulates replication data which is larger than
output buffer limits, master will kill replica connection. This may
cause a replication failure.
The main benefit of the rdb channel replication is streaming incoming
commands in parallel to the RDB delivery. This approach shifts
replication stream buffering to the replica and reduces load on master.
We do this by opening another connection for RDB delivery. The main
channel on replica will be receiving replication stream while rdb
channel is receiving the RDB.
This feature also helps to reduce master's main process CPU load. By
opening a dedicated connection for the RDB transfer, the bgsave process
has access to the new connection and it will stream RDB directly to the
replicas. Before this change, due to TLS connection restriction, the
bgsave process was writing RDB bytes to a pipe and the main process was
forwarding
it to the replica. This is no longer necessary, the main process can
avoid these expensive socket read/write syscalls. It also means RDB
delivery to replica will be faster as it avoids this step.
In summary, replication will be faster and master's performance during
full syncs will improve.
**Implementation steps**
1. When replica connects to the master, it sends 'rdb-channel-repl' as
part of capability exchange to let master to know replica supports rdb
channel.
2. When replica lacks sufficient data for PSYNC, master sends
+RDBCHANNELSYNC reply with replica's client id. As the next step, the
replica opens a new connection (rdb-channel) and configures it against
the master with the appropriate capabilities and requirements. It also
sends given client id back to master over rdbchannel, so that master can
associate these channels. (initial replica connection will be referred
as main-channel) Then, replica requests fullsync using the RDB channel.
3. Prior to forking, master attaches the replica's main channel to the
replication backlog to deliver replication stream starting at the
snapshot end offset.
4. The master main process sends replication stream via the main
channel, while the bgsave process sends the RDB directly to the replica
via the rdb-channel. Replica accumulates replication stream in a local
buffer, while the RDB is being loaded into the memory.
5. Once the replica completes loading the rdb, it drops the rdb channel
and streams the accumulated replication stream into the db. Sync is
completed.
**Some details**
- Currently, rdbchannel replication is supported only if
`repl-diskless-sync` is enabled on master. Otherwise, replication will
happen over a single connection as in before.
- On replica, there is a limit to replication stream buffering. Replica
uses a new config `replica-full-sync-buffer-limit` to limit number of
bytes to accumulate. If it is not set, replica inherits
`client-output-buffer-limit <replica>` hard limit config. If we reach
this limit, replica stops accumulating. This is not a failure scenario
though. Further accumulation will happen on master side. Depending on
the configured limits on master, master may kill the replica connection.
**API changes in INFO output:**
1. New replica state: `send_bulk_and_stream`. Indicates full sync is
still in progress for this replica. It is receiving replication stream
and rdb in parallel.
```
slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0
```
Replica state changes in steps:
- First, replica sends psync and receives +RDBCHANNELSYNC
:`state=wait_bgsave`
- After replica connects with rdbchannel and delivery starts:
`state=send_bulk_and_stream`
- After full sync: `state=online`
2. On replica side, replication stream buffering metrics:
- replica_full_sync_buffer_size: Currently accumulated replication
stream data in bytes.
- replica_full_sync_buffer_peak: Peak number of bytes that this instance
accumulated in the lifetime of the process.
```
replica_full_sync_buffer_size:20485
replica_full_sync_buffer_peak:1048560
```
**API changes in CLIENT LIST**
In `client list` output, rdbchannel clients will have 'C' flag in
addition to 'S' replica flag:
```
id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0
```
**Config changes:**
- `replica-full-sync-buffer-limit`: Controls how much replication data
replica can accumulate during rdbchannel replication. If it is not set,
a value of 0 means replica will inherit `client-output-buffer-limit
<replica>` hard limit config to limit accumulated data.
- `repl-rdb-channel` config is added as a hidden config. This is mostly
for testing as we need to support both rdbchannel replication and the
older single connection replication (to keep compatibility with older
versions and rdbchannel replication will not be enabled if
repl-diskless-sync is not enabled). it affects both the master (not to
respond to rdb channel requests), and the replica (not to declare
capability)
**Internal API changes:**
Changes that were introduced to Redis replication:
- New replication capability is added to replconf command: `capa
rdb-channel-repl`. Indicates replica is capable of rdb channel
replication. Replica sends it when it connects to master along with
other capabilities.
- If replica needs fullsync, master replies `+RDBCHANNELSYNC
<client-id>` to the replica's PSYNC request.
- When replica opens rdbchannel connection, as part of replconf command,
it sends `rdb-channel 1` to let master know this is rdb channel. Also,
it sends `main-ch-client-id <client-id>` as part of replconf command so
master can associate channels.
**Testing:**
As rdbchannel replication is enabled by default, we run whole test suite
with it. Though, as we need to support both rdbchannel and single
connection replication, we'll be running some tests twice with
`repl-rdb-channel yes/no` config.
**Replica state diagram**
```
* * Replica state machine *
*
* Main channel state
* ┌───────────────────┐
* │RECEIVE_PING_REPLY │
* └────────┬──────────┘
* │ +PONG
* ┌────────▼──────────┐
* │SEND_HANDSHAKE │ RDB channel state
* └────────┬──────────┘ ┌───────────────────────────────┐
* │+OK ┌───► RDB_CH_SEND_HANDSHAKE │
* ┌────────▼──────────┐ │ └──────────────┬────────────────┘
* │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid>
* └────────┬──────────┘ │ ┌──────────────▼────────────────┐
* │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │
* ┌────────▼──────────┐ │ └──────────────┬────────────────┘
* │RECEIVE_PORT_REPLY │ │ │ +OK
* └────────┬──────────┘ │ ┌──────────────▼────────────────┐
* │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│
* ┌────────▼──────────┐ │ └──────────────┬────────────────┘
* │RECEIVE_IP_REPLY │ │ │ +OK
* └────────┬──────────┘ │ ┌──────────────▼────────────────┐
* │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │
* ┌────────▼──────────┐ │ └──────────────┬────────────────┘
* │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC
* └────────┬──────────┘ │ │Rdb delivery
* │ │ ┌──────────────▼────────────────┐
* ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │
* │SEND_PSYNC │ │ └──────────────┬────────────────┘
* └─┬─────────────────┘ │ │ Done loading
* │PSYNC (use cached-master) │ │
* ┌─▼─────────────────┐ │ │
* │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication
* └─┬─────────────────┘ │ │ │ buffer into memory
* │ │ │ │
* │+RDBCHANNELSYNC client-id │ │ │
* ├──────┬───────────────────┘ │ │
* │ │ Main channel │ │
* │ │ accumulates repl data │ │
* │ ┌──▼────────────────┐ │ ┌───────▼───────────┐
* │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │
* │ └───────────────────┘ └────▲───▲──────────┘
* │ │ │
* │ │ │
* │ +FULLRESYNC ┌───────────────────┐ │ │
* ├────────────────► REPL_TRANSFER ├────┘ │
* │ └───────────────────┘ │
* │ +CONTINUE │
* └──────────────────────────────────────────────┘
*/
```
-----
This PR also contains changes and ideas from:
https://github.com/valkey-io/valkey/pull/837https://github.com/valkey-io/valkey/pull/1173https://github.com/valkey-io/valkey/pull/804https://github.com/valkey-io/valkey/pull/945https://github.com/valkey-io/valkey/pull/989
---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Moti Cohen <moticless@gmail.com>
Co-authored-by: naglera <anagler123@gmail.com>
Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com>
Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com>
## Introduction
Redis introduced IO Thread in 6.0, allowing IO threads to handle client
request reading, command parsing and reply writing, thereby improving
performance. The current IO thread implementation has a few drawbacks.
- The main thread is blocked during IO thread read/write operations and
must wait for all IO threads to complete their current tasks before it
can continue execution. In other words, the entire process is
synchronous. This prevents the efficient utilization of multi-core CPUs
for parallel processing.
- When the number of clients and requests increases moderately, it
causes all IO threads to reach full CPU utilization due to the busy wait
mechanism used by the IO threads. This makes it challenging for us to
determine which part of Redis has reached its bottleneck.
- When IO threads are enabled with TLS and io-threads-do-reads, a
disconnection of a connection with pending data may result in it being
assigned to multiple IO threads simultaneously. This can cause race
conditions and trigger assertion failures. Related issue:
redis#12540
Therefore, we designed an asynchronous IO threads solution. The IO
threads adopt an event-driven model, with the main thread dedicated to
command processing, meanwhile, the IO threads handle client read and
write operations in parallel.
## Implementation
### Overall
As before, we did not change the fact that all client commands must be
executed on the main thread, because Redis was originally designed to be
single-threaded, and processing commands in a multi-threaded manner
would inevitably introduce numerous race and synchronization issues. But
now each IO thread has independent event loop, therefore, IO threads can
use a multiplexing approach to handle client read and write operations,
eliminating the CPU overhead caused by busy-waiting.
the execution process can be briefly described as follows:
the main thread assigns clients to IO threads after accepting
connections, IO threads will notify the main thread when clients
finish reading and parsing queries, then the main thread processes
queries from IO threads and generates replies, IO threads handle
writing reply to clients after receiving clients list from main thread,
and then continue to handle client read and write events.
### Each IO thread has independent event loop
We now assign each IO thread its own event loop. This approach
eliminates the need for the main thread to perform the costly
`epoll_wait` operation for handling connections (except for specific
ones). Instead, the main thread processes requests from the IO threads
and hands them back once completed, fully offloading read and write
events to the IO threads.
Additionally, all TLS operations, including handling pending data, have
been moved entirely to the IO threads. This resolves the issue where
io-threads-do-reads could not be used with TLS.
### Event-notified client queue
To facilitate communication between the IO threads and the main thread,
we designed an event-notified client queue. Each IO thread and the main
thread have two such queues to store clients waiting to be processed.
These queues are also integrated with the event loop to enable handling.
We use pthread_mutex to ensure the safety of queue operations, as well
as data visibility and ordering, and race conditions are minimized, as
each IO thread and the main thread operate on independent queues,
avoiding thread suspension due to lock contention. And we implemented an
event notifier based on `eventfd` or `pipe` to support event-driven
handling.
### Thread safety
Since the main thread and IO threads can execute in parallel, we must
handle data race issues carefully.
**client->flags**
The primary tasks of IO threads are reading and writing, i.e.
`readQueryFromClient` and `writeToClient`. However, IO threads and the
main thread may concurrently modify or access `client->flags`, leading
to potential race conditions. To address this, we introduced an io-flags
variable to record operations performed by IO threads, thereby avoiding
race conditions on `client->flags`.
**Pause IO thread**
In the main thread, we may want to operate data of IO threads, maybe
uninstall event handler, access or operate query/output buffer or resize
event loop, we need a clean and safe context to do that. We pause IO
thread in `IOThreadBeforeSleep`, do some jobs and then resume it. To
avoid thread suspended, we use busy waiting to confirm the target
status. Besides we use atomic variable to make sure memory visibility
and ordering. We introduce these functions to pause/resume IO Threads as
below.
```
pauseIOThread, resumeIOThread
pauseAllIOThreads, resumeAllIOThreads
pauseIOThreadsRange, resumeIOThreadsRange
```
Testing has shown that `pauseIOThread` is highly efficient, allowing the
main thread to execute nearly 200,000 operations per second during
stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can
handle up to nearly 56,000 operations per second. But operations
performed between pausing and resuming IO threads must be quick;
otherwise, they could cause the IO threads to reach full CPU
utilization.
**freeClient and freeClientAsync**
The main thread may need to terminate a client currently running on an
IO thread, for example, due to ACL rule changes, reaching the output
buffer limit, or evicting a client. In such cases, we need to pause the
IO thread to safely operate on the client.
**maxclients and maxmemory-clients updating**
When adjusting `maxclients`, we need to resize the event loop for all IO
threads. Similarly, when modifying `maxmemory-clients`, we need to
traverse all clients to calculate their memory usage. To ensure safe
operations, we pause all IO threads during these adjustments.
**Client info reading**
The main thread may need to read a client’s fields to generate a
descriptive string, such as for the `CLIENT LIST` command or logging
purposes. In such cases, we need to pause the IO thread handling that
client. If information for all clients needs to be displayed, all IO
threads must be paused.
**Tracking redirect**
Redis supports the tracking feature and can even send invalidation
messages to a connection with a specified ID. But the target client may
be running on IO thread, directly manipulating the client’s output
buffer is not thread-safe, and the IO thread may not be aware that the
client requires a response. In such cases, we pause the IO thread
handling the client, modify the output buffer, and install a write event
handler to ensure proper handling.
**clientsCron**
In the `clientsCron` function, the main thread needs to traverse all
clients to perform operations such as timeout checks, verifying whether
they have reached the soft output buffer limit, resizing the
output/query buffer, or updating memory usage. To safely operate on a
client, the IO thread handling that client must be paused.
If we were to pause the IO thread for each client individually, the
efficiency would be very low. Conversely, pausing all IO threads
simultaneously would be costly, especially when there are many IO
threads, as clientsCron is invoked relatively frequently.
To address this, we adopted a batched approach for pausing IO threads.
At most, 8 IO threads are paused at a time. The operations mentioned
above are only performed on clients running in the paused IO threads,
significantly reducing overhead while maintaining safety.
### Observability
In the current design, the main thread always assigns clients to the IO
thread with the least clients. To clearly observe the number of clients
handled by each IO thread, we added the new section in INFO output. The
`INFO THREADS` section can show the client count for each IO thread.
```
# Threads
io_thread_0:clients=0
io_thread_1:clients=2
io_thread_2:clients=2
```
Additionally, in the `CLIENT LIST` output, we also added a field to
indicate the thread to which each client is assigned.
`id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name=
lib-ver= io-thread=1`
## Trade-off
### Special Clients
For certain special types of clients, keeping them running on IO threads
would result in severe race issues that are difficult to resolve.
Therefore, we chose not to offload these clients to the IO threads.
For replica, monitor, subscribe, and tracking clients, main thread may
directly write them a reply when conditions are met. Race issues are
difficult to resolve, so we have them processed in the main thread. This
includes the Lua debug clients as well, since we may operate connection
directly.
For blocking client, after the IO thread reads and parses a command and
hands it over to the main thread, if the client is identified as a
blocking type, it will be remained in the main thread. Once the blocking
operation completes and the reply is generated, the client is
transferred back to the IO thread to send the reply and wait for event
triggers.
### Clients Eviction
To support client eviction, it is necessary to update each client’s
memory usage promptly during operations such as read, write, or command
execution. However, when a client operates on an IO thread, it is not
feasible to update the memory usage immediately due to the risk of data
races. As a result, memory usage can only be updated either in the main
thread while processing commands or in the `ClientsCron` periodically.
The downside of this approach is that updates might experience a delay
of up to one second, which could impact the precision of memory
management for eviction.
To avoid incorrectly evicting clients. We adopted a best-effort
compensation solution, when we decide to eviction a client, we update
its memory usage again before evicting, if the memory used by the client
does not decrease or memory usage bucket is not changed, then we will
evict it, otherwise, not evict it.
However, we have not completely solved this problem. Due to the delay in
memory usage updates, it may lead us to make incorrect decisions about
the need to evict clients.
### Defragment
In the majority of cases we do NOT use the data from argv directly in
the db.
1. key names
We store a copy that we allocate in the main thread, see `sdsdup()` in
`dbAdd()`.
2. hash key and value
We store key as hfield and store value as sds, see `hfieldNew()` and
`sdsdup()` in `hashTypeSet()`.
3. other datatypes
They don't even use SDS, so there is no reference issues.
But in some cases client the data from argv may be retain by the main
thread.
As a result, during fragmentation cleanup, we need to move allocations
from the IO thread’s arena to the main thread’s arena. We always
allocate new memory in the main thread’s arena, but the memory released
by IO threads may not yet have been reclaimed. This ultimately causes
the fragmentation rate to be higher compared to creating and allocating
entirely within a single thread.
The following cases below will lead to memory allocated by the IO thread
being kept by the main thread.
1. string related command: `append`, `getset`, `mset` and `set`.
If `tryObjectEncoding()` does not change argv, we will keep it directly
in the main thread, see the code in `tryObjectEncoding()`(specifically
`trimStringObjectIfNeeded()`)
2. block related command.
the key names will be kept in `c->db->blocking_keys`.
3. watch command
the key names will be kept in `c->db->watched_keys`.
4. [s]subscribe command
channel name will be kept in `serverPubSubChannels`.
5. script load command
script will be kept in `server.lua_scripts`.
7. some module API: `RM_RetainString`, `RM_HoldString`
Those issues will be handled in other PRs.
## Testing
### Functional Testing
The commit with enabling IO Threads has passed all TCL tests, but we did
some changes:
**Client query buffer**: In the original code, when using a reusable
query buffer, ownership of the query buffer would be released after the
command was processed. However, with IO threads enabled, the client
transitions from an IO thread to the main thread for processing. This
causes the ownership release to occur earlier than the command
execution. As a result, when IO threads are enabled, the client's
information will never indicate that a shared query buffer is in use.
Therefore, we skip the corresponding query buffer tests in this case.
**Defragment**: Add a new defragmentation test to verify the effect of
io threads on defragmentation.
**Command delay**: For deferred clients in TCL tests, due to clients
being assigned to different threads for execution, delays may occur. To
address this, we introduced conditional waiting: the process proceeds to
the next step only when the `client list` contains the corresponding
commands.
### Sanitizer Testing
The commit passed all TCL tests and reported no errors when compiled
with the `fsanitizer=thread` and `fsanitizer=address` options enabled.
But we made the following modifications: we suppressed the sanitizer
warnings for clients with watched keys when updating `client->flags`, we
think IO threads read `client->flags`, but never modify it or read the
`CLIENT_DIRTY_CAS` bit, main thread just only modifies this bit, so
there is no actual data race.
## Others
### IO thread number
In the new multi-threaded design, the main thread is primarily focused
on command processing to improve performance. Typically, the main thread
does not handle regular client I/O operations but is responsible for
clients such as replication and tracking clients. To avoid breaking
changes, we still consider the main thread as the first IO thread.
When the io-threads configuration is set to a low value (e.g., 2),
performance does not show a significant improvement compared to a
single-threaded setup for simple commands (such as SET or GET), as the
main thread does not consume much CPU for these simple operations. This
results in underutilized multi-core capacity. However, for more complex
commands, having a low number of IO threads may still be beneficial.
Therefore, it’s important to adjust the `io-threads` based on your own
performance tests.
Additionally, you can clearly monitor the CPU utilization of the main
thread and IO threads using `top -H -p $redis_pid`. This allows you to
easily identify where the bottleneck is. If the IO thread is the
bottleneck, increasing the `io-threads` will improve performance. If the
main thread is the bottleneck, the overall performance can only be
scaled by increasing the number of shards or replicas.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: oranagra <oran@redislabs.com>
After running test in local, there will be a file named
`.rediscli_history_test`, and it is not in `.gitignore` file, so this is
considered to have changed the code base. It is a little annoying, this
commit just clean up the temporary file.
We should delete `.rediscli_history_test` in the end since the second
server tests also write somethings into it, to make it corresponding, i
put `set ::env(REDISCLI_HISTFILE) ".rediscli_history_test"` at the
beginning.
Maybe we also can add this file into `.gitignore`
Similar to #13530 , applied to HSCAN and ZSCAN in case of listpack
encoding.
**Preliminary benchmark results showcase an improvement of 108% on the
achievable ops/sec for ZSCAN and 65% for HSCAN**.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
On SSCAN, in case of listpack and intset encoding we actually reply the
entire set, and always reply with the cursor 0.
For those cases, we don't need to accumulate the replies in a list and
can completely avoid the overhead of list appending and then iterating
over the list again -- meaning we do N iterations instead of 2N
iterations over the SET and save intermediate memory as well.
Preliminary benchmarks, `SSCAN set:100 0`, showcased an improvement of
60% as visible bellow on a SET with 100 string elements (listpack
encoded).
After https://github.com/redis/redis/pull/13499, If the length set by
`addReplySetLen()` does not match the actual number of elements in the
reply, it will cause protocol broken and result in the client hanging.
On a full sync, replica starts discarding existing db. If the existing
db is huge and flush is happening synchronously, replica may become
unresponsive.
Adding a change to yield back to event loop while flushing db on
a replica. Replica will reply -LOADING in this case. Note that while
replica is loading the new rdb, it may get an error and start flushing
the partial db. This step may take a long time as well. Similarly,
replica will reply -LOADING in this case.
To call processEventsWhileBlocked() and reply -LOADING, we need to do:
- Set connSetReadHandler() null not to process further data from the master
- Set server.loading flag
- Call blockingOperationStarts()
rdbload() already does these steps and calls processEventsWhileBlocked()
while loading the rdb. Added a new call rdbLoadWithEmptyFunc() which
accepts callback to flush db before loading rdb or when an error
happens while loading.
For diskless replication, doing something similar and calling emptyData()
after setting required flags.
Additional changes:
- Allow `appendonly` config change during loading.
Config can be changed while loading data on startup or on replication
when slave is loading RDB. We allow config change command to update
`server.aof_enabled` and then lazily apply config change after loading
operation is completed.
- Added a test for `replica-lazy-flush` config
When the server restarts while the CLI is connecting, the reconnection
does not automatically select the previous db.
This may lead users to believe they are still in the previous db, in
fact, they are in db0.
This PR will automatically reset the current dbnum and `cliSelect()`
again when reconnecting.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
1. Fix fuzzer test failure when the key was deleted due to expiration
before sending random traffic for the key.
After HFE, when all fields in a hash are expired, the hash might be
deleted due to expiration.
If the key was expired in the mid of `RESTORE` command and sending rand
trafic, `fuzzer` test will fail in the following code because the 'TYPE
key' will return `none` and then throw an exception because it cannot be
found in `$commands`
94b9072e44/tests/support/util.tcl (L712-L713)
This PR adds a `None` check for the reply of `KEY TYPE` command and adds
a print of `err` to avoid false positives in the future.
failed CI:
https://github.com/redis/redis/actions/runs/10127334121/job/28004985388
2. Fix the issue where key was deleted due to expiration before the
`scan.scan_key` command could execute, caused by premature enabling of
`set-active-expire`.
failed CI:
https://github.com/redis/redis/actions/runs/10153722715/job/28077610552
---------
Co-authored-by: oranagra <oran@redislabs.com>
* some tests didn't wait for replication offset sync
* tests that used deferring client, didn't wait for it to get blocked.
an in some cases, the replication offset sync ended before the deferring
client finished, so the digest match failed.
* some tests used deferring clients excessively
* the tests didn't read the client response
* the tests didn't close the client (fd leak)
[exception]: Executing test client: ERR FAILOVER target replica is not
online.. ERR FAILOVER target replica is not online.
while executing
"$node_0 failover to $node_1_host $node_1_port"
("uplevel" body line 16)
invoked from within
"uplevel 1 $code"
(procedure "test" line 58)
invoked from within
"test {failover command to specific replica works} {
[err]: client evicted due to percentage of maxmemory in
tests/unit/client-eviction.tcl
Expected 33622 >= 220200 && 33622 < 440401 (context: type eval line 17
cmd {assert {$tot_mem >= $n && $tot_mem < $maxmemory_clients_actual}}
proc ::test)
test failure:
```
[err]: Interactive CLI: should find second search result if user presses ctrl+s in tests/integration/redis-cli.tcl
Expected '1' to be equal to '0' (context: type eval line 10 cmd {assert_equal 1 [regexp {\(i-search\): \x1B\[0mk\x1B\[1mey\x1B\[0ms one} $result]} proc ::test)
```
this test (introduced in #12543) depends on the local history file, so
it can fail if there's some match there.
the fix is to use a different history file, and delete it before each
run.
This PR is based on the commits from PR #11747.
In the event of an assertion failure, hide command arguments from the
operator.
In some cases, private client information can be voluntarily exposed
when a redis instance crashes due to an assertion failure.
This commit prevent וnintentional client info exposure.
Operators can still access the hidden data, but they must actively
request it.
Any of the client info commands remains the unchanged.
### Config
Add a new config `hide-user-data-from-log` to turn this feature on and
off, default off.
---------
Co-authored-by: naglera <anagler123@gmail.com>
Co-authored-by: naglera <58042354+naglera@users.noreply.github.com>
* INFO command : rename `hashes_with_expiry_fields` to `subexpiry`
* INFO command : rename `expired_hash_fields` to `expired_subkeys`
* Fix statistic of `expired_subkeys` to count also lazy expired
* Remove TODOs comments leftover in TCL
* Fix potential flaky test of rdb load of hash-field-expiration
Reserve 2 bits out of hash-field expiration time (`EB_EXPIRE_TIME_MAX`)
for possible future lightweight indexing/categorizing of fields. It can
be achieved by hacking HFE as follows:
```
HPEXPIREAT key [ 2^47 + USER_INDEX ] FIELDS numfields field [field …]
```
Redis will also need to expose kind of `HEXPIRESCAN` and `HEXPIRECOUNT`
for this idea. Yet to be better defined.
`HFE_MAX_ABS_TIME_MSEC` constraint must be enforced only at API level.
Internally, the expiration time can be up to `EB_EXPIRE_TIME_MAX` for
future readiness.
## Background
This PR introduces support for field-level expiration in Redis hashes. Previously, Redis supported expiration only at the key level, but this enhancement allows setting expiration times for individual fields within a hash.
## New commands
* HEXPIRE
* HEXPIREAT
* HEXPIRETIME
* HPERSIST
* HPEXPIRE
* HPEXPIREAT
* HPEXPIRETIME
* HPTTL
* HTTL
## Short example
from @moticless
```sh
127.0.0.1:6379> hset myhash f1 v1 f2 v2 f3 v3
(integer) 3
127.0.0.1:6379> hpexpire myhash 10000 NX fields 2 f2 f3
1) (integer) 1
2) (integer) 1
127.0.0.1:6379> hpttl myhash fields 3 f1 f2 f3
1) (integer) -1
2) (integer) 9997
3) (integer) 9997
127.0.0.1:6379> hgetall myhash
1) "f3"
2) "v3"
3) "f2"
4) "v2"
5) "f1"
6) "v1"
... after 10 seconds ...
127.0.0.1:6379> hgetall myhash
1) "f1"
2) "v1"
127.0.0.1:6379>
```
## Expiration strategy
1. Integrate active
Redis periodically performs active expiration and deletion of hash keys that contain expired fields, with a maximum attempt limit.
3. Lazy expiration
When a client touches fields within a hash, Redis checks if the fields are expired. If a field is expired, it will be deleted. However, we do not delete expired fields during a traversal, we implicitly skip over them.
## RDB changes
Add two new rdb type s`RDB_TYPE_HASH_METADATA` and `RDB_TYPE_HASH_LISTPACK_EX`.
## Notification
1. Add `hpersist` notification for `HPERSIST` command.
5. Add `hexpire` notification for `HEXPIRE`, `HEXPIREAT`, `HPEXPIRE` and `HPEXPIREAT` commands.
## Internal
1. Add new data structure `ebuckets`, which is used to store TTL and keys, enabling quick retrieval of keys based on TTL.
2. Add new data structure `mstr` like sds, which is used to store a string with TTL.
This work was done by @moticless, @tezc, @ronen-kalish, @sundb, I just release it.