This PR is based on: valkey-io/valkey#3511
Close https://github.com/redis/redis/issues/14983
## Summary
During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.
The failure reason is similar to
https://github.com/redis/redis/pull/14946, the socket buffer is more
easy to fill.
## Root Cause
In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:
1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads
The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.
## Observed Behavior
With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):
- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall
Ultimately, it leads to a very slow synchronization process of the
entire master and replica.
### Changes
1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.
2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.
3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.
5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.
6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.
7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.
---------
Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
After introducing GCRA algorithm into redis
https://github.com/redis/redis/pull/14826 and subsequent introduction of
new RATE_LIMIT object type - https://github.com/redis/redis/pull/14905.
It was internally decided not to introduce GCRA into the new release.
As still no decision is made on whether it will be kept or not in the
future, this PR only makes the code related to GCRA dead - commands are
inaccessible and AOF/RDB load+save is disabled.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Close#15177
Follow [Fix use-after-free when fullsync happens while replica is
running a timed out script
(CVE-2026-23631)](0cca172a17)
Remove the `repl-diskless-load yes` test configuration because this
option exists only in the Redis fork and is not available in Redis OSS.
(cherry picked from commit 5033e15143)
Fullsync triggers emptyData and scriptingReset which free the scripting/function engine. If a timed out script is still running on the replica, this causes a use-after-free. Delay fullsync processing in readSyncBulkPayload until the script finishes.
# Redis Array
For years, Redis has been missing a real indexed data structure for the
use cases where the index and the spatial relationship of elements are
semantic. Hashes give you random lookups, but you have to store an index
as a key, and have no range visibility. Lists give you appending and
trimming, but what is in the middle remains hard to access. Streams give
you append-only events, which is another (useful, indeed) beast. None of
these is what you want when the *position itself* has business meaning —
slot 37, step 4, row 18552, day from 2934 to 2949, file line 11, 12, 15
and so forth. And, all those types, for different reasons, are all
suboptimal when you want a **ring buffer** able to store the latest N
observed samples of something.
Up to now, users found ways (they always do \o/) using the fact that the
data structures that are obvious in this universe are also extremely
powerful, if well implemented. But this forces compromises. Arrays
handle these index-first requirements natively, and usually with much
better memory and CPU usage than the workarounds. If the use case is the
right one, Arrays often provide much better space, time and usability at
the same time.
## Internal encoding
1. When dense, an Array is essentially a more fancy C array. You don't
pay anything for storing the index.
2. Yet, instead of going really flat, arrays are sliced into
4096-element slices, and each slice, when it contains just a few
elements, uses a special sparse encoding. When a slice is empty it's
just a `NULL` stored in the directory.
3. Small ints, floats, and short strings are pointer-tagged, so they
cost zero additional memory beyond the pointer slot itself.
4. When very sparse, a super-directory of windowed directories is used.
This allows the data type to be safe, instead of exhibiting pathological
space or time behavior. This representation is only triggered when there
are more than 8 million elements or very high indexes set.
## Use cases
Arrays are mostly stateless if not for the fact that each array
remembers the index of the latest added item, allowing `ARINSERT` and
`ARRING` to work properly. Otherwise it is a set/get at this index game,
with solid support for both setting / getting ranges, server-side
scanning, returning only populated elements in a time which is
proportional not to the range size, but to the population size.
A few concrete examples, that may work as mental models for the set of
problems that are similar to them (from the POV of the data modeling).
**Thermometer.** A sensor reporting once per minute, with gaps:
```
ARSET temp:room12:day7 123 22.3
ARGETRANGE temp:room12:day7 600 660 # the 10:00–11:00 window, with NULLs
ARSCAN temp:room12:day7 600 660 # only populated elements
AROP temp:room12:day7 0 1439 MAX # peak of the day, server-side
```
Missing minutes cost little to nothing. Numeric aggregation runs inside
Redis. Telemetry, IoT, meter readings, KPI rollups.
**Calendar.** A clinic with 96 fifteen-minute slots per day:
```
ARSET sched:room12:day 32 booking:991
ARSCAN sched:room12:day 0 95 # only occupied slots
ARGETRANGE sched:room12:day 48 63 # the afternoon full view to render
```
The slot number is the business key in this case. Room booking, parking
spaces, warehouse bins, lockers, ...
**Ring buffer.** ARRING replaces the classic LPUSH+LTRIM pattern.
Imagine remote `dmesg`.
```
ARRING machine:123 200 "[141087.430123]: arm_cpu_init(): cpu 14 online" # Capped to 200 entries
ARLASTITEMS machine:123 50 REV # 50 newest first
```
Faster than LPUSH+LTRIM, keep indexed access to past elements. Last-N
alarms, recent fraud scores, access history, remote logs, device events.
Ok here the use cases are mainly the ones of the old pattern: it is just
a better fit and allows to access random items in the middle, aggregate
server-side, and so forth.
**Workflow.** Step number is the index, value is the status. Gaps are
meaningful:
```
ARSET claim:99172 0 received
ARSET claim:99172 3 waiting:reviewer42
ARSET claim:99172 5 approved
ARGETRANGE claim:99172 0 5 # full workflow view, with NULLs for missing steps
ARSCAN claim:99172 0 5 # only steps that have a state
ARCOUNT claim:99172 # number of recorded steps
ARLEN claim:99172 # highest reached step + 1
```
**Skills knowledge base for agents.** Arrays are good at representing /
grepping into Markdown files:
```
ARSET skill:metal_gpu 0 "...."
ARSET skill:metal_gpu 1 "...."
ARSET skill:metal_gpu 2 "...."
ARGREP skill:metal_gpu - + RE "M3|M4" WITHVALUES
```
ARGREP has EXACT, MATCH, GLOB, RE, you can have multiple predicates, can
select AND or OR behavior.
**Bulk import results.** Sparse row annotations over millions of rows /
CSV / ...:
```
ARSET import:job551 18552 ERR:bad_email
ARSCAN import:job551 0 1000000 # Provides only rows that have something
```
## TLDR
If the position is part of the meaning, use an Array. If you want to
aggregate or grep remotely, use an Array. Feedback welcome :)
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Shubham S Taple <155555100+ShubhamTaple@users.noreply.github.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
Co-authored-by: Marc Gravell <marc.gravell@gmail.com>
## Summary
The stream RDB loader and listpack integrity validator had two gaps that
allowed corrupted payloads to be silently accepted, potentially leading
to crashes or incorrect behaviour at query time.
**1. `deleted_count` in the listpack header was trusted without
verification (`t_stream.c`)**
`streamValidateListpackIntegrity` already walks every entry in a stream
listpack during deep validation, inspecting each entry's flags. However,
the `deleted_count` value stored in the listpack header was never
cross-checked against the actual number of entries carrying
`STREAM_ITEM_FLAG_DELETED`. A corrupted or crafted payload could set an
arbitrary `deleted_count`, causing the `entry_count` (live) and
`deleted_count` to be inconsistent with the real data. This PR adds a
running `actual_deleted` counter during the entry walk and rejects the
listpack when it disagrees with the header.
**2. `s->length` was only loosely validated against the rax (`rdb.c`)**
The old check (`s->length && !raxSize(s->rax)`) only caught the
degenerate case where the stream claimed a non-zero length but had zero
rax nodes. It did not detect a mismatch where the stream's `length`
field differed from the sum of live (non-tombstone) entries across all
listpacks. A corrupted payload could, for example, report `length = 2`
while every listpack's live-entry count adds up to only 1, bypassing the
check entirely and causing incorrect results from `XLEN`, `XRANGE`,
`XREAD`, etc.
This PR accumulates `live_entries` from each listpack's first element
(the live-entry count) during the rax-loading loop and then performs an
exact equality check (`s->length != live_entries`).
The live-entry count itself is also validated against the in-memory
invariant maintained by `streamIteratorRemoveEntry`: when the last live
entry of a listpack is deleted, the whole rax node is removed, so a node
in the rax must have `lp_live >= 1`. The loader enforces this by
rejecting `lp_live <= 0` (not just `< 0`). Without this stricter bound,
a payload where every listpack has `lp_live = 0` and `s->length = 0`
would pass the equality check (`0 == 0`) and load into an inconsistent
state: non-empty rax containing only-tombstone listpacks, with `XLEN`
reporting 0.
**3. Tests**
Three new `corrupt-dump.tcl` tests exercise the new rejection paths:
- **stream listpack with wrong deleted count in header** — a crafted
payload where the header says `deleted_count = 1` but the only entry is
live, caught by the new `actual_deleted` check in
`streamValidateListpackIntegrity`. Requires `sanitize-dump-payload yes`
since the check runs only during deep validation.
- **stream length inconsistent with live entries** — a crafted payload
where the listpack reports `lp_live = 1` (so the `lp_live <= 0` guard
passes) but `s->length = 2`, caught by the `s->length != live_entries`
check in `rdbLoadObject`.
- **stream all-tombstone listpack with zero length** — a crafted payload
where the listpack header reports `lp_live = 0` and `s->length = 0`.
This case slips past the equality check (`0 == 0`) and is uniquely
caught by the tightened `lp_live <= 0` rejection at the listpack header.
Optimize SET key value GET propagation rewriting in setGenericCommand()
by removing GET arguments in-place with rewriteClientCommandArgument().
This avoids the overhead of allocating a new argv vector and
incrementing reference counts for every retained argument.
The optimization is scoped to the no-expire SET ... GET rewrite path.
It also adds test coverage for cases with repeated GET tokens to
ensure robust string semantics and consistent replication behavior.
Changes:
- Use rewriteClientCommandArgument(c, j, NULL) for in-place removal.
- Eliminate redundant argv allocations and refcount increments.
- Improve performance of SET GET in high-throughput write streams.
**Summary**
Detects and rejects corrupt stream RDB payloads where the same NACK
(pending entry) is referenced by more than one consumer, which violates
a stream data-structure.
**Changes**
- **`rdbLoadObject` (stream consumer PEL loading)**: Added a guard that
checks `nack->consumer != NULL` before assigning the consumer pointer.
When a second consumer's PEL references a NACK that was already claimed
by a prior consumer, the loader now reports a corrupt RDB error and
aborts instead of silently overwriting the pointer. Without this check,
two consumers share the same `streamNACK`, and freeing the first
consumer's PEL leaves the second with a dangling pointer.
- **`corrupt-dump.tcl`**: Added a regression test that crafts a stream
with two consumers (`consumerA`, `consumerB`) whose PELs both reference
the same entry (`1-0`). The `RESTORE` command is expected to fail with
`"Bad data format"`, and the server must remain responsive (`PING`
succeeds).
**Benefits**
- **Fail-fast on corrupt data**: The invariant violation is caught at
load time with a clear diagnostic message rather than manifesting as a
crash later during normal operation.
- **Regression coverage**: The crafted payload in the test ensures this
class of corruption is permanently guarded against.
### Problem
While the new type `OBJ_GCRA` was added, several related code paths were
not updated accordingly, leading to failures in the
`reply-schemas-validator` CI job and `corrupt-dump-fuzzer.tcl`
##### reply-schemas-validator
Failed CI:
https://github.com/redis/redis/actions/runs/24485248057/job/71558533290#step:10:903
```shell
Traceback (most recent call last):
File "/home/runner/work/redis/redis/./utils/req-res-log-validator.py", line 238, in process_file
jsonschema.validate(instance=res.json, schema=req.schema, cls=schema_validator)
File "/home/runner/.local/lib/python3.12/site-packages/jsonschema/validators.py", line 1121, in validate
raise error
jsonschema.exceptions.ValidationError: 'rate_limit' is not valid under any of the given schemas
Failed validating 'oneOf' in schema['patternProperties']['^.*$']['properties']['group']:
{'description': 'the functional group to which the command belongs',
'oneOf': [{'const': 'bitmap'},
{'const': 'cluster'},
{'const': 'connection'},
{'const': 'generic'},
{'const': 'geo'},
{'const': 'hash'},
{'const': 'hyperloglog'},
{'const': 'list'},
{'const': 'module'},
{'const': 'pubsub'},
{'const': 'scripting'},
{'const': 'sentinel'},
{'const': 'server'},
{'const': 'set'},
{'const': 'sorted-set'},
{'const': 'stream'},
{'const': 'string'},
{'const': 'transactions'}]}
On instance['gcrasetvalue']['group']:
'rate_limit'
```
##### `corrupt-dump-fuzzer.tcl`
Also fixed `: Fuzzer corrupt restore payloads - sanitize_dump: yes in
tests/integration/corrupt-dump-fuzzer.tcl`
Failed daily test :
https://github.com/redis/redis/actions/runs/24485248057/job/71558533312#step:6:8652
```shell
Server crashed (by signal: 0, err: key "gcra" not known in dictionary), with payload: "\x1C\x0A\x02\x5F\x37\xC0\x06\xC0\x00\x02\x5F\x39\xC0\x08\x02\x5F\x33\x02\x5F\x35\x02\x5F\x31\xC0\x02\xC0\x04\x0E\x00\xA9\x71\xBF\xEE\x6F\x46\xEF\xA6"
violating commands:
Done 1434 cycles in 600 seconds.
RESTORE: successful: 601, rejected: 833
Total commands sent in traffic: 1194776, crashes during traffic: 1 (0 by signal).
[: Fuzzer corrupt restore payloads - sanitize_dump: yes in tests/integration/corrupt-dump-fuzzer.tcl
Expected '1' to be equal to '0' (context: type eval line 155 cmd {assert_equal $stat_terminated_in_traffic 0} proc ::test)
[147/147 done]: integration/corrupt-dump-fuzzer (1201 seconds)
```
### Changed
This change completes the necessary updates across all relevant
components to ensure consistent handling of the rate_limit group and
restores CI stability.
## Summary
This PR fixes two issues when processing corrupt data in
rdbLoadCheckModuleValue():
1. When handling `RDB_MODULE_OPCODE_STRING` opcode,
rdbGenericLoadStringObject() can return NULL on a corrupt payload. The
code called decrRefCount(o) unconditionally without a NULL check,
resulting in a NULL pointer dereference crash.
2. The while loop condition was `!= RDB_MODULE_OPCODE_EOF`, which means
a truncated payload (causing rdbLoadLen to return RDB_LENERR) would
never exit the loop, since `RDB_LENERR != RDB_MODULE_OPCODE_EOF` is
always true, potentially causing an infinite hang.
During RDB saving and AOF rewriting, the fork child already dismisses
(madvise(MADV_DONTNEED)) individual key-value objects after serializing them.
However, the hash table bucket arrays of each dict were never dismissed,
leaving large contiguous allocations subject to CoW when the parent
modifies them.
This PR extends the dismiss mechanism to cover dict bucket arrays,
reducing CoW memory overhead.
- **Expires kvstore** — dismissed upfront before saving starts, since the
child never accesses expires directly, after embeding expire time in the key object.
- **Slot dicts** (cluster mode) — dismissed per-slot as the iterator moves
to the next slot during RDB saving or AOF rewriting.
- **DB keys kvstore** (standalone mode) — dismissed per-DB after each DB is
fully serialized during RDB saving or AOF rewriting.
Fixes#14838
## Summary
Fix a crash in `prefetchCommands()` that occurs during replica full sync
when the replica has existing data that needs to be emptied.
## Problem Description
During `emptyData()` → `kvstoreEmpty()` → `dictEmpty()` →
`_dictClear()`, the first hash table is cleared and `d->ht_table[0]` is
set to NULL via `_dictReset`. Then while clearing the second hash table,
every 65536 buckets it invokes `replicationEmptyDbCallback()` →
`processEventsWhileBlocked()` → `readQueryFromClient()` →
`prefetchCommands()`.
At this point, `dictSize() > 0` still holds (because the second hash
table isn't fully cleared yet), but `ht_table[0]` is already NULL. The
prefetch code assumed `ht_table[0]` is always valid when `dictSize() >
0`, leading to a crash.
## Solution
1. **Skip prefetch during loading**: Added a `server.loading` check at
the top of `prefetchCommands()` to return early. During RDB loading, the
main dictionary is being rebuilt, so prefetching keys from it is useless
anyway.
2. **Add defensive assertion**: Added
`serverAssert(batch->current_dicts[i]->ht_table[0])` in
`initBatchInfo()` to catch any future cases where `ht_table[0]` is NULL
while `dictSize() > 0` (which should only happen mid-`dictEmpty` via
`_dictReset`).
---------
Co-authored-by: kairosci <kairosci@users.noreply.github.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
Resolve https://github.com/redis/redis/issues/14358 and
https://github.com/redis/redis/issues/14359. Fix division by zero in
redis-benchmark and redis-cli when histograms are empty or elapsed time
is zero. Guard all affected divisions in showLatencyReport(),
showThroughput(), and displayKeyStatsSizeDist(). Added integration tests
for both cases.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
When loading RDB_TYPE_HASH_ZIPMAP objects, a listpack is allocated for
the zipmap-to-listpack conversion. If the conversion loop encounters a
duplicate key, allocation failure, or oversized listpack, the error
handler frees the dict, field, encoded buffer, and object — but not the
listpack. Add the missing lpFree(lp) call and a regression test that
RESTOREs a zipmap with duplicate keys to exercise this path.
Fix#14835
Executed `failover` with `force` argument in slow environments (TSAN, IO
threads, etc) can force a full resync instead of a partial one.
e3c38aab6 adds TSAN-only workaround in integration/failover test, test
failure occurs in `test‑ubuntu‑io‑threads` job, drop TSAN-only condition
and always check the sum of partial+full syncs will fix it.
Steps to reproduce:
My setup: Hardware: MacBook Pro (Apple chip)
1. Build docker image with Dockerfile below and run the container:
```
FROM ubuntu:22.04
# avoid interactive prompts
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
tcl8.6 \
tclx \
tcl-tls \
pkg-config \
ca-certificates \
git \
bash \
vim \
wget \
curl \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /opt/redis
# if you prefer to copy the source during build uncomment the next line
# COPY . /opt/redis
# configure a non-root user if desired; we stay root for simplicity
# default command keeps you in a shell
CMD ["/bin/bash"]
```
Execute the command:
```
cd /path/to/redis
docker build -f Dockerfile.io-threads -t redis-test-io .
docker run --rm -it --cpuset-cpus=0 --cpus=0.1 -v $(pwd):/opt/redis redis-test-io /bin/bash
```
2. Inside the container you can then run the same commands as the CI
job:
```
make REDIS_CFLAGS='-Werror'
./runtest --single integration/failover --config io-threads 4 --tags "failover" --verbose --accurate --loop --stop
```
If the container still seems too fast, lower `--cpus` further (e.g.
0.05), or run the whole driver under `cpulimit`:
```
apt-get update && apt-get install -y cpulimit
cpulimit -l 1 -- ./runtest --single integration/failover --config io-threads 4 --tags "failover" --verbose --accurate --loop --stop
```
## Problem
In `corrupt-dump-fuzzer.tcl`, after a successful `RESTORE` with a
corrupted payload, the test did a bare `r ping` to check whether the
server was still alive:
If the server crashed at this point, the uncaught exception would
**silently terminate the entire test** without printing the payload that
caused the crash, making the issue extremely hard to reproduce and
debug.
## Fix
Wrap `r ping` with catch to ensure failures trigger the same reporting
and restart logic as `RESTORE`. This ensures offending payloads are
logged for reproduction.
In standalone mode, we create dicts on demand. by default, there is no
dicts, and RESIZEDB hint doesn't expand dict actually. so we should
create dict first when expanding if the dict does not exist.
# Problem
While introducing Async IO
threads(https://github.com/redis/redis/pull/13695) primary and replica
clients were left to be handled inside main thread due to data race and
synchronization issues. This PR solves this issue with the additional
hope it increases performance of replication.
# Overview
## Moving the clients to IO threads
Since clients first participate in a handshake and an RDB replication
phases it was decided they are moved to IO-thread after RDB replication
is done. For primary client this was trivial as the master client is
created only after RDB sync (+ some additional checks one can see in
`isClientMustHandledByMainThread`). Replica clients though are moved to
IO threads immediately after connection (as are all clients) so
currently in `unstable` replication happens while this client is in
IO-thread. In this PR it was moved to main thread after receiving the
first `REPLCONF` message from the replica, but it is a bit hacky and we
can remove it. I didn't find issues between the two versions.
## Primary client (replica node)
We have few issues here:
- during `serverCron` a `replicationCron` is ran which periodically
sends `REPLCONF ACK` message to the master, also checks for timed-out
master. In order to prevent data races we utilize`IOThreadClientsCron`.
The client is periodically sent to main thread and during
`processClientsFromIOThread` it's checked if it needs to run the
replication cron behaviour.
- data races with main thread - specifically `lastinteraction` and
`read_reploff` members of the primary client that are written to in
`readQueryFromClient` could be accessed at the same time from main
thread during execution of `INFO REPLICATION`(`genRedisInfoString`). To
solve this the members were duplicated so if the client is in IO-thread
it writes to the duplicates and they are synced with the original
variables each time the client is send to main thread ( that means `INFO
REPLICATION` could potentially return stale values).
- During `freeClient` the primary client is fetched to main thread but
when caching it(`replicationCacheMaster`) the thread id will remain the
id of the IO thread it was from. This creates problems when resurrecting
the master client. Here the call to `unbindClientFromIOThreadEventLoop`
in `freeClient` was rewritten to call `keepClientInMainThread` which
automatically fixes the problem.
- During `exitScriptTimedoutMode` the master is queued for reprocessing
(specifically process any pending commands ASAP after it's unblocked).
We do that by putting it in the `server.unblocked_clients` list, which
are processed in the next `beforeSleep` cycle in main thread. Since this
will create a contention between main and IO thread, we just skip this
queueing in `unblocked_clients` and just queue the client to main thread
- the `processClientsFromIOThread` will process the pending commands
just as main would have.
## Replica clients (primary node)
We move the client after RDB replication is done and after replication
backlog is fed with its first message.
We do that so that the client's reference to the first replication
backlog node is initialized before it's read from IO-thread, hence no
contention with main thread on it.
### Shared replication buffer
Currently in unstable the replication buffer is shared amongst clients.
This is done via clients holding references to the nodes inside the
buffer. A node from the buffer can be trimmed once each replica client
has read it and send its contents. The reference is
`client->ref_repl_buf_node`. The replication buffer is written to by
main thread in `feedReplicationBuffer` and the refcounting is intrusive
- it's inside the replication-buffer nodes themselves.
Since the replica client changes the refcount (decreases the refcount of
the node it has just read, and increases the refcount of the next node
it starts to read) during `writeToClient` we have a data race with main
thread when it feeds the replication buffer. Moreover, main thread also
updates the `used` size of the node - how much it has written to it,
compared to its capacity which the replica client relies on to know how
much to read. Obviously replica being in IO-thread creates another data
race here. To mitigate these issues a few new variables were added to
the client's struct:
- `io_curr_repl_node` - starting node this replica is reading from
inside IO-thread
- `io_bound_repl_node` - the last node in the replication buffer the
replica sees before being send to IO-thread.
These values are only allowed to be updated in main thread. The client
keeps track of how much it has read into the buffer via the old
`ref_repl_buf_node`. Generally while in IO-thread the replica client
will now keep refcount of the `io_curr_repl_node` until it's processed
all the nodes up to `io_bound_repl_node` - at that point its returned to
main thread which can safely update the refcounts.
The `io_bound_repl_node` reference is there so the replica knows when to
stop reading from the repl buffer - imagine that replica reads from the
last node of the replication buffer while main thread feeds data to it -
we will create a data race on the `used` value
(`_writeToClientSlave`(IO-thread) vs `feedReplicationBuffer`(main)).
That's why this value is updated just before the replica is being send
to IO thread.
*NOTE*, this means that when replicas are handled by IO threads they
will hold more than one node at a time (i.e `io_curr_repl_node` up to
`io_bound_repl_node`) meaning trimming will happen a bit less
frequently. Tests show no significant problems with that.
(tnx to @ShooterIT for the `io_curr_repl_node` and `io_bound_repl_node`
mechanism as my initial implementation had similar semantics but was way
less clear)
Example of how this works:
* Replication buffer state at time N:
| node 0| ... | node M, used_size K |
* replica caches `io_curr_repl_node`=0, `io_bound_repl_node`=M and
`io_bound_block_pos`=K
* replica moves to IO thread and processes all the data it sees
* Replication buffer state at time N + 1:
| node 0| ... | node M, used_size Full | |node M + 1| |node M + 2,
used_size L|, where Full > M
* replica moves to main thread at time N + 1, at this point following
happens
- refcount to node 0 (io_curr_repl_node) is decreased
- `ref_repl_buf_node` becomes node M(io_bound_repl_node) (we still have
size-K bytes to process from there)
- refcount to node M is increased (now all nodes from 0 up to M-1
including can be trimmed unless some other replica holds reference to
them)
- And just before the replica is send back to IO thread the following
are updated:
- `io_bound_repl_node` ref becomes node M+2
- `io_bound_block_pos` becomes L
Note that replica client is only moved to main if it has processed all
the data it knows about (i.e up to `io_bound_repl_node` +
`io_bound_block_pos`)
### Replica clients kept in main as much as possible
During implementation an issue arose - how fast is the replica client
able to get knowledge about new data from the replication buffer and how
fast can it trim it. In order for that to happen ASAP whenever a replica
is moved to main it remains there until the replication buffer is fed
new data. At that point its put in the pending write queue and special
cased in handleClientsWithPendingWrites so that its send to IO thread
ASAP to write the new data to replica. Also since each time the replica
writes its whole repl data it knows about that means after it's send to
main thread `processClientsFromIOThread` is able to immediately update
the refcounts and trim whatever it can.
### ACK messages from primary
Slave clients need to periodically read `REPLCONF ACK` messages from
client. Since replica can remain in main thread indefinitely if no DB
change occurs, a new atomic `pending_read` was added during
`readQueryFromClient`. If a replica client has a pending read it's
returned back to IO-thread in order to process the read even if there is
no pending repl data to write.
### Replicas during shutdown
During shutdown the main thread pauses write actions and periodically
checks if all replicas have reached the same replication offset as the
primary node. During `finishShutdown` that may or may not be the case.
Either way a client data may be read from the replicas and even we may
try to write any pending data to them inside `flushSlavesOutputBuffers`.
In order to prevent races all the replicas from IO threads are moved to
main via `fetchClientFromIOThread`. A cancel of the shutdown should be
ok, since the mechanism employed by `handleClientsWithPendingWrites`
should return the client back to IO thread when needed.
## Notes
While adding new tests timing issues with Tsan tests were found and
fixed.
Also there is a data race issue caught by Tsan on the `last_error`
member of the `client` struct. It happens when both IO-thread and main
thread make a syscall using a `client` instance - this can happen only
for primary and replica clients since their data can be accessed by
commands send from other clients. Specific example is the `INFO
REPLICATION` command.
Although other such races were fixed, as described above, this once is
insignificant and it was decided to be ignored in `tsan.sup`.
---------
Co-authored-by: Yuan Wang <wangyuancode@163.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
1) Replace fixed sleep with wait_for_condition to avoid flaky test
failures when checking master_current_sync_attempts counter.
2) Similar to https://github.com/redis/redis/pull/14674, use
assert_lessthan_equal instead of assert_lessthan to verify the idle
time.
Follow https://github.com/redis/redis/pull/14423
In https://github.com/redis/redis/pull/14423,
I thought the last lpNext operation of the iterator occurred at the end
of streamIteratorGetID.
However, I overlooked the fact that after calling
`streamIteratorGetID()`, we might still use `streamIteratorGetField()`
to continue moving within the current entry.
This means that during reverse iteration, the iterator could move back
to a previous entry position.
To fix this, in this PR I record the current position at the beginning
of streamIteratorGetID().
When we enter it again next time, we ensure that the entry position does
not exceed the previous one,
that is, during forward iteration the entry must be greater than the
last entry position,
and during reverse iteration it must be smaller than the last entry
position.
Note that the fix for https://github.com/redis/redis/pull/14423 has been
replaced by this fix.
`repl-diskless-load` feature can effectively reduce the time of full
synchronization, but maybe it is not widely used.
`swapdb` option needs double `maxmemory`, and `on-empty-db` only works
on the first full sync (the replica must have no data).
This PR introduce a new option: `flushdb` - Always flush the entire
dataset before diskless load. If the diskless load fails, the replica
will lose all existing data.
Of course, it brings the risk of data loss, but it provides a choice if
you want to reduce full sync time and accept this risk.
Fixes#14538
If the master uses diskless synchronization and the replica uses
diskless load, we can disable RDB compression to reduce full sync time.
I tested on AWS and found we could reduce time by 20-40%.
In terms of implementation, when the replica can use diskless load, the
replica will send `replconf rdb-no-compress 1` to master to deliver a
RDB without compression.
If your network is slow, please disable repl-diskless-load, and maybe
even repl-diskless-sync
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
This PR fixes an issue(#14541) where EXEC’s ACL recheck was still being
performed during AOF loading, that may cause AOF loading failed, if ACL
rules are changed and don't allow some commands in MULTI-EXEC.
Currently, it is difficult to determine whether an RDB persistence
failure is caused by a persistent issue in the process or by a temporary
error.
Tracking the number of consecutive RDB persistence failures would help
clarify this distinction. Multiple failures in a row could indicate an
ongoing problem, while isolated cases may point to transient or
environmental issues.
This PR follows https://github.com/redis/redis/pull/14058
When Redis is shut down uncleanly (e.g., due to power loss or system
crashes), invalid bytes may remain at the end of AOF files.
This PR refactors the AOF corruption handling system and introduces
automatic recovery capabilities to improve resilience and reduce
downtime. The feature only handles corruption at the end of files,
respects the configured size limit to prevent excessive data loss,
maintains all valid commands before the corruption point.
Removed aof-load-broken configuration option
Rename aof-load-broken-max-size to aof-load-corrupt-tail-max-size.
New Config: aof-load-corrupt-tail-max-size - enables recovery for
corruption up to specified size
When the `lp_count` of a stream entry is incorrect, and we are
performing a reverse iteration, we can normally move backward by
`lp_count` to locate the previous entry.
However, if `lp_count` is wrong, during forward iteration, after
obtaining the flag, ID, and other fields, we may step beyond the current
entry into the start position of the next entry(where we came from),
which could cause an infinite loop.
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This issue was introduced by https://github.com/redis/redis/pull/10440
In that PR, we avoided resetting the current user during processCommand,
but overlooked the fact that this client might not be the current one,
it could be a client that was previously blocked due to shutdown.
If we don’t reset these clients, and the shutdown is canceled, then when
these clients continue executing other commands, they will trigger an
assertion.
This PR delays the operation of resetting the client to
processUnblockedClients and no longer skips SHUTDOWN_BLOCKED clients.
This PR is based on https://github.com/valkey-io/valkey/pull/1303
This PR introduces a DEBUG_DEFRAG compilation option that enables
activedefrag functionality even when the allocator is not jemalloc, and
always forces defragmentation regardless of the amount or ratio of
fragmentation.
## Using
```
make SANITIZER=address DEBUG_DEFRAG=<force|fully>
./runtest --debug-defrag
```
* DEBUG_DEFRAG=force
* Ignore the threshold for defragmentation to ensure that
defragmentation is always triggered.
* Always reallocate pointers to probe for correctness issues in pointer
reallocation.
* DEBUG_DEFRAG=fully
* Includes everything in the option `force`.
* Additionally performs a full defrag on every defrag cycle, which is
significantly slower but more accurate.
---------
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: oranagra <oran@redislabs.com>
When Redis is shut down uncleanly (e.g., due to power loss), invalid
bytes may remain at the end of the AOF file. Currently, Redis detects
such corruption only after parsing most of the AOF, leading to delayed
error detection and increased downtime. Manual recovery via
`redis-check-aof --fix` is also time-consuming.
This fix introduces two new options to improve resilience and reduce
downtime:
- `aof-load-broken`: Enables automatic detection and repair of broken
AOF tails.
- `aof-load-broken-max-size`: Sets a maximum threshold (in bytes) for
the corrupted tail size that Redis will attempt to fix automatically
without requiring user intervention.
- tests/unit/maxmemory.tcl
If multithreaded, we need to let IO threads have chance to reply output
buffer, to avoid next commands causing eviction. After eviction is
performed,
the next command becomes ready immediately in IO threads, and now we
enqueue
the client to be processed in main thread’s beforeSleep without
notification.
However, invalidation messages generated by eviction may not have been
fully
delivered by that time. As a result, executing the command in
beforeSleep of
the event loop (running eviction) can cause additional keys to be
evicted.
```
Expected '73' to be between to '200' and '300' (context: type source line 473 file
redis/tests/unit/maxmemory.tcl cmd {assert_range [r dbsize] 200 300} proc ::test)
```
the reason why CI doesn't find this issue is that we skill this test
`tsan:skip` as below
`start_server {tags {"maxmemory external:skip tsan:skip"}} `,so remove
this tag.
- tests/integration/aof.tcl
Because IO and the main thread are working in better parallelism without
notification,
the main thread may haven't write AOF buffer into file, but the IO
thread just writes
the reply, so the clients receive the reply before AOF file is changed.
We should use `appendfsync always` policy to make the command is written
into
AOF file when receiving reply.
```
Expected '0' to be equal to '54' (context: type source line 249 file
redis/tests/integration/aof.tcl cmd {assert_equal $before $after} proc ::test)
```
#13969 makes these scenarios easy to appear.
If replica detects broken connection while reading min expiration time
of hfe key, it calls exit().
Fixed it to handle the error gracefully without calling exit.
To reproduce the issue, the short-read test was modified to generate
many small hfe keys, increasing the likelihood of a connection drop
while reading min expiration time:
```tcl
for {set k 0} {$k < 50000} {incr k} {
for {set i 0} {$i < 1} {incr i} {
r hsetex "$k hfe_small" EX [expr {int(rand()*10)}] FIELDS 1 [string repeat A [expr {int(rand()*10)}]] 0[string repeat A [expr {int(rand()*10)}]]
}
}
```
We can't have the test use only hfe keys, so a few were added alongside
other data. I couldn't reproduce the issue this way but with the test's
randomization, it should hit this scenario in one of the runs.
Merge fork counters with https://github.com/redis/redis/pull/12957
repl_current_sync_attempts - Total number of attempts to connect to a
master since the last time we disconnected from a good connection (or a
configuration change). any number greater than 1 (even if the link is
currently up), indicates an issue.
repl_total_sync_attempts - Number of times in current configuration, the
replica attempted to sync to a master. (dosent reset on master
reconnect.)
repl_total_disconnect_time - Total cumulative time we've been
disconnected as a replica, visible when the link is up too.
master_link_up_since_seconds - Number of seconds since the link is down,
just maintain symmetry with master_link_down_since_seconds.
Add thread sanitizer run to daily CI.
Few tests are skipped in tsan runs for two reasons:
* Stack trace producing tests (oom, `unit/moduleapi/crash`, etc) are
tagged `tsan:skip` because redis calls `backtrace()` in signal handler
which turns out to be signal-unsafe since it might allocate memory (e.g.
glibc 2.39 does it through a call to `_dl_map_object_deps()`).
* Few tests become flaky with thread sanitizer builds and don't finish
in expected deadlines because of the additional tsan overhead. Instead
of skipping those tests, this can improved in the future by allowing
more iterations when waiting for tsan builds.
Deadlock detection is disabled for now because of tsan limitation where
max 64 locks can be taken at once.
There is one outstanding (false-positive?) race in jemalloc which is
suppressed in `tsan.sup`.
Fix few races thread sanitizer reported having to do with writes from
signal handlers. Since in multi-threaded setting signal handlers might
be called on any thread (modulo pthread_sigmask) while the main thread
is running, `volatile sig_atomic_t` type is not sufficient and atomics
are used instead.
Hi all, this PR fixes two things:
1. An assertion, that prevented the RDB loading from recovery if there
was a quantization type mismatch (with regression test).
2. Two code paths that just returned NULL without proper cleanup during
RDB loading.
The idea of packing the key (`sds`), value (`robj`) and optionally TTL
into a single struct in memory was mentioned a few times in the past by
the community in various flavors. This approach improves memory
efficiency, reduces pointer dereferences for faster lookups, and
simplifies expiration management by keeping all relevant data in one
place. This change goes along with setting keyspace's dict to
no_value=1, and saving considerable amount of memory.
Two more motivations that well aligned with this unification are:
- Prepare the groundwork for replacing EXPIRE scan based implementation
and evaluate instead new `ebuckets` data structure that was introduced
as part of [Hash Field Expiration
feature](https://redis.io/blog/hash-field-expiration-architecture-and-benchmarks/).
Using this data structure requires embedding the ExpireMeta structure
within each object.
- Consider replacing dict with a more space efficient open addressing
approach hash table that might rely on keeping a single pointer to
object.
Before this PR, I POC'ed on a variant of open addressing hash-table and
was surprised to find that dict with no_value actually could provide a
good balance between performance, memory efficiency, and simplicity.
This realization prompted the separation of the unification step from
the evaluation of a new hash table to avoid introducing too many changes
at once and to evaluate its impact independently before considering
replacement of existing hash-table. On an earlier
[commit](https://github.com/redis/redis/pull/13683) I extended dict
no_value optimization (which saves keeping dictEntry where possible) to
be relevant also for objects with even addresses in memory. Combining it
with this unification saves a considerable amount of memory for
keyspace.
# kvobj
This PR adopts Valkey’s
[packing](3eb8314be6)
layout and logic for key, value, and TTL. However, unlike Valkey
implementation, which retained a common `robj` throughout the project,
this PR distinguishes between the general-purpose, overused `robj`, and
the new `kvobj`, which embeds both the key and value and used by the
keyspace. Conceptually, `robj` serves as a base class, while `kvobj`
acts as a derived class.
Two new flags introduced into redis object, `iskvobj` and `expirable`:
```
struct redisObject {
unsigned type:4;
unsigned encoding:4;
unsigned lru:LRU_BITS;
unsigned iskvobj : 1; /* new flag */
unsigned expirable : 1; /* new flag */
unsigned refcount : 30; /* modified: 32bits->30bits */
void *ptr;
};
typedef struct redisObject robj;
typedef struct redisObject kvobj;
```
When the `iskvobj` flag is set, the object includes also the key and it
is appended to the end of the object. If the `expirable` flag is set, an
additional 8 bytes are added to the object. If the object is of type
string, and the string is rather short, then it will be embedded as
well.
As a result, all keys in the keyspace are promoted to be of type
`kvobj`. This term attempts to align with the existing Redis object,
robj, and the kvstore data structure.
# EXPIRE Implementation
As `kvobj` embeds expiration time as well, looking up expiration times
is now an O(1) operation. And the hash-table of EXPIRE is set now to be
`no_value` mode, directly referencing `kvobj` entries, and in turn,
saves memory.
Next, I plan to evaluate replacing the EXPIRE implementation with the
[ebuckets](https://github.com/redis/redis/blob/unstable/src/ebuckets.h)
data structure, which would eliminate keyspace scans for expired keys.
This requires embedding `ExpireMeta` within each `kvobj` of each key
with expiration. In such implementation, the `expirable` flag will be
shifted to indicate whether `ExpireMeta` is attached.
# Implementation notes
## Manipulating keyspace (find, modify, insert)
Initially, unifying the key and value into a single object and storing
it in dict with `no_value` optimization seemed like a quick win.
However, it (quickly) became clear that this change required deeper
modifications to how keys are manipulated. The challenge was handling
cases where a dictEntry is opt-out due to no_value optimization. In such
cases, many of the APIs that return the dictEntry from a lookup become
insufficient, as it just might be the key itself. To address this issue,
a new-old approach of returning a "link" to the looked-up key's
`dictEntry` instead of the `dictEntry` itself. The term `link` was
already somewhat available in dict API, and is well aligned with the new
dictEntLink declaration:
```
typedef dictEntry **dictEntLink;
```
This PR introduces two new function APIs to dict to leverage returned
link from the search:
```
dictEntLink dictFindLink(dict *d, const void *key, dictEntLink *bucket);
void dictSetKeyAtLink(dict *d, void *key, dictEntLink *link, int newItem);
```
After calling `link = dictFindLink(...)`, any necessary updates must be
performed immediately after by calling `dictSetKeyAtLink()` without any
intervening operations on given dict. Otherwise, `dictEntLink` may
become invalid. Example:
```
/* replace existing key */
link = dictFindLink(d, key, &bucket, 0);
// ... Do something, but don't modify the dict ...
// assert(link != NULL);
dictSetKeyAtLink(d, kv, &link, 0);
/* Add new value (If no space for the new key, dict will be expanded and
bucket will be looked up again.) */
link = dictFindLink(d, key, &bucket);
// ... Do something, but don't modify the dict ...
// assert(link == NULL);
dictSetKeyAtLink(d, kv, &bucket, 1);
```
## dict.h
- The dict API has became cluttered with many unused functions. I have
removed these from dict.h.
- Additionally, APIs specifically related to hash maps (no_value=0),
primarily those handling key-value access, have been gathered and
isolated.
- Removed entirely internal functions ending with “*ByHash()” that were
originally added for optimization and not required any more.
- Few other legacy dict functions were adapted at API level to work with
the term dictEntLink as well.
- Simplified and generalized an optimization that related to comparison
of length of keys of type strings.
## Hash Field Expiration
Until now each hash object with expiration on fields needed to maintain
a reference to its key-name (of the hash object), such that in case it
will be active-expired, then it will be possible to resolve the key-name
for the notification sake. Now there is no need anymore.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Now we have RDB channel in https://github.com/redis/redis/pull/13732,
child process can transfer RDB in a background method, instead of
handled by main thread. So when redis-cli gets RDB from server, we can
adopt this way to reduce the main thread load.
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
This PR adds support for REDISMODULE_OPTIONS_HANDLE_IO_ERRORS.
and tests for short read and corrupted RESTORE payload.
Please: note that I also removed the comment about async loading support
since we should be already covered. No manipulation of global data
structures in Vector Sets, if not for the unique ID used to create new
vector sets with different IDs.
from the master's perspective, the replica can become online before it's
actually done loading the rdb file.
this was always like that, in disk-based repl, and thus ok with diskless
and rdb channel.
in this test, because all the keys are added before the backlog is
created, the replication offset is 0, so the test proceeds and could get
a LOADING error when trying to run the function.
Test fails time to time:
```
*** [err]: Slave is able to detect timeout during handshake in tests/integration/replication.tcl
Replica is not able to detect timeout
```
Depending on the timing, "debug sleep" may occur during rdbchannel
handshake and required log statement won't be printed to the log in that
case. Better to wait after rdbchannel handshake.
With RDB channel replication, we introduced parallel replication stream
and RDB delivery to the replica during a full sync. Currently, after the
replica loads the RDB and begins streaming the accumulated buffer to the
database, it does not read from the master connection during this
period. Although streaming the local buffer is generally a fast
operation, it can take some time if the buffer is large. This PR
introduces buffering during the streaming of the local buffer. One
important consideration is ensuring that we consume more than we read
during this operation; otherwise, it could take indefinitely. To
guarantee that it will eventually complete, we limit the read to at most
half of what we consume, e.g. read at most 1 mb once we consume at least
2 mb.
**Additional changes**
**Bug fix**
- Currently, when replica starts draining accumulated buffer, we call
protectClient() for the master client as we occasionally yield back to
event loop via processEventsWhileBlocked(). So, it prevents freeing the
master client. While we are in this loop, if replica receives "replicaof
newmaster" command, we call replicaSetMaster() which expects to free the
master client and trigger a new connection attempt. As the client object
is protected, its destruction will happen asynchronously. Though, a new
connection attempt to new master will be made immediately. Later, when
the replication buffer is drained, we realize master client was marked
as CLOSE_ASAP, and freeing master client triggers another connection
attempt to the new master. In most cases, we realize something is wrong
in the replication state machine and abort the second attempt later. So,
the bug may go undetected. Fix is not calling protectClient() for the
master client. Instead, trying to detect if master client is
disconnected during processEventsWhileBlocked() and if so, breaking the
loop immediately.
**Related improvement:**
- Currently, the replication buffer is a linked list of buffers, each of
which is 1 MB in size. While consuming the buffer, we process one buffer
at a time and check if we need to yield back to
`processEventsWhileBlocked()`. However, if
`loading-process-events-interval-bytes` is set to less than 1 MB, this
approach doesn't handle it well. To improve this, I've modified the code
to process 16KB at a time and check
`loading-process-events-interval-bytes` more frequently. This way,
depending on the configuration, we may yield back to networking more
often.
- In replication.c, `disklessLoadingRio` will be set before a call to
`emptyData()`. This change should not introduce any behavioral change
but it is logically more correct as emptyData() may yield to networking
and we may need to call rioAbort() on disklessLoadingRio. Otherwise,
failure of main channel may go undetected until a failure on rdb channel
on a corner case.
**Config changes**
- The default value for the `loading-process-events-interval-bytes`
configuration is being lowered from 2MB to 512KB. This configuration
primarily used for testing and controls the frequency of networking
during the loading phase, specifically when loading the RDB or applying
accumulated buffers during a full sync on the replica side.
Before the introduction of RDB channel replication, the 2MB value was
sufficient for occasionally yielding to networking, mainly to reply
-loading to the clients. However, with RDB channel replication, during a
full sync on the replica side (either while loading the RDB or applying
the accumulated buffer), we need to yield back to networking more
frequently to continue accumulating the replication stream. If this
doesn’t happen often enough, the replication stream can accumulate on
the master side, which is undesirable.
To address this, we’ve decided to lower the default value to 512KB. One
concern with frequent yielding to networking is the potential
performance impact, as each call to processEventsWhileBlocked() involves
4 syscalls, which could slow down the RDB loading phase. However,
benchmarking with various configuration values has shown that using
512KB or higher does not negatively impact RDB loading performance.
Based on these results, 512KB is now selected as the default value.
**Test changes**
- Added improved version of a replication test which checks memory usage
on master during full sync.
---------
Co-authored-by: Oran Agra <oran@redislabs.com>
Before https://github.com/redis/redis/pull/13732, replicas were brought
online immediately after master wrote the last bytes of the RDB file to
the socket. This behavior remains unchanged if rdbchannel replication is
not used. However, with rdbchannel replication, the replica is brought
online after receiving the first ack which is sent by replica after rdb
is loaded.
To align the behavior, reverting this change to put replica online once
bgsave is done.
Additonal changes:
- INFO field `mem_total_replication_buffers` will also contain
`server.repl_full_sync_buffer.mem_used` which shows accumulated
replication stream during rdbchannel replication on replica side.
- Deleted debug level logging from some replication tests. These tests
generate thousands of keys and it may cause per key logging on some
cases.
When the diskless load configuration is set to on-empty-db, we retain a
pointer to the function library context. When emptyData() is called, it
frees this function library context pointer, leading to a use-after-free
situation.
I refactored code to ensure that emptyData() is called first, followed
by retrieving the valid pointer to the function library context.
Refactored code should not introduce any runtime implications.
Bug introduced by https://github.com/redis/redis/pull/13495 (Redis 8.0)
Co-authored-by: Oran Agra <oran@redislabs.com>
Recently encountered some errors as bellow,
HGETEX/HSETEX with PXAT/EXAT options, after getting ttl, we calculate
current time by `[clock seconds]` that may have a delay that causes
results greater than expected.
Dismiss memory test error, now we introduced rdb-channel replication,
the full synchronization might finish before the child process exits. So
we may fail if calling `bgsave` immediately after full sync.
### Background
AOF is often used as an effective data recovery method, but now if we
have two AOFs from different nodes, it is hard to learn which one has
latest data. Generally, we determine whose data is more up-to-date by
reading the latest modification time of the AOF file, but because of
replication delay, even if both master and replica write to the AOF at
the same time, the data in the master is more up-to-date (there are
commands that didn't arrive at the replica yet, or a large number of
commands have accumulated on replica side ), so we may make wrong
decision.
### Solution
The replication offset always increments when AOF is enabled even if
there is no replica, we think replication offset is better method to
determine which one has more up-to-date data, whoever has a larger
offset will have newer data, so we add the start replication offset info
for AOF, as bellow.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224
```
And if we close gracefully the AOF file, not a crash, such as
`shutdown`, `kill signal 15` or `config set appendonly no`, we will add
the end replication offset, as bellow.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 endoffset 532
```
#### Things to pay attention to
- For BASE AOF, we do not add `startoffset` and `endoffset` info, since
we could not know the start replication replication of data, and it is
useless to help us to determine which one has more up-to-date data.
- For AOFs from old version, we also don't add `startoffset` and
`endoffset` info, since we also don't know start replication replication
of them. If we add the start offset from 0, we might make the judgment
even less accurate. For example, if the master has just rewritten the
AOF, its INCR AOF will inevitably be very small. However, if the replica
has not rewritten AOF for a long time, its INCR AOF might be much
larger. By applying the following method, we might make incorrect
decisions, so we still just check timestamp instead of adding offset
info
- If the last INCR AOF has `startoffset` or `endoffset`, we need to
restore `server.master_repl_offset` according to them to avoid the
rollback of the `startoffset` of next INCR AOF. If it has `endoffset`,
we just use this value as `server.master_repl_offset`, and a very
important thing is to remove this information from the manifest file to
avoid the next time we load the manifest file with wrong `endoffset`. If
it only has `startoffset`, we calculate `server.master_repl_offset` by
the `startoffset` plus the file size.
### How to determine which one has more up-to-date data
If AOF has a larger replication offset, it will have more up-to-date
data. The following is how to get AOF offset:
Read the AOF manifest file to obtain information about **the last INCR
AOF**
1. If the last INCR AOF has `endoffset` field, we can directly use the
`endoffset` to present the replication offset of AOF
2. If there is no `endoffset`(such as redis crashes abnormally), but
there is `startoffset` filed of the last INCR AOF, we can get the
replication offset of AOF by `startoffset` plus the file size
3. Finally, if the AOF doesn’t have both `startoffset` and `endoffset`,
maybe from old version, and new version redis has not rewritten AOF yet,
we still need to check the modification timestamp of the last INCR AOF
### TODO
Fix ping causing inconsistency between AOF size and replication
offset in the future PR. Because we increment the replication offset
when sending PING/REPLCONF to the replica but do not write data to the
AOF file, this might cause the starting offset of the AOF file plus its
size to be inconsistent with the actual replication offset.