Fixes #15269.
This keeps `VADD ... CAS SETATTR` in sync with the normal insert path by
incrementing `numattribs` when a CAS insert actually stores an
attribute.
Before this change, the CAS path attached the attribute to the node but
left the set-wide counter unchanged. That made `VINFO` under-report
`attributes-count`, and if the set-wide count stayed at zero, RDB save
skipped attribute serialization for the whole vector set.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Fixes #15250
## Summary
When `redis-cli --cluster rebalance` is invoked with `--user <username>`
but without `-a <password>`, the MIGRATE command constructed by
redis-cli contains an extra unfilled argv slot that gets serialized as
an empty string. The server interprets this empty string as a key with
slot 0, triggering a CROSSSLOT error when it conflicts with the actual
keys' slot.
## Root Cause
In `src/redis-cli.c`, function `clusterManagerMigrateKeysInReply`:
```c
int c = (replace ? 8 : 7);
if (config.conn_info.auth) c += 2;
if (config.conn_info.user) c += 1; // BUG: adds 1 for user even when auth is NULL
size_t argc = c + reply->elements;
```
When `config.conn_info.user` is set but `config.conn_info.auth` is NULL,
`c` is incremented by 1 for the user parameter. However, the argv
filling logic later only sets the user inside the `if
(config.conn_info.auth)` block (using AUTH2 with both user and
password). This mismatch causes:
- `argc` is 1 larger than the actual number of argv entries filled
- The unfilled argv slot is serialized as `$0\r\n\r\n` (empty string)
- Server's `migrateGetKeys` treats the empty string as a key →
`keyHashSlot("",0)` returns slot 0 → CROSSSLOT
## Fix
Consolidate the two separate increments into one:
```diff
- if (config.conn_info.auth) c += 2;
- if (config.conn_info.user) c += 1;
+ if (config.conn_info.auth)
+ c += config.conn_info.user ? 3 : 2;
```
This is consistent with the argv filling logic where both AUTH and AUTH2
cases are handled inside a single `if (config.conn_info.auth)` block.
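A minimal sketch of the invariant behind the fix: the slot count and the fill logic branch on the same condition, so they cannot drift apart. The function names `migrate_auth_argc` and `migrate_auth_fill` are hypothetical and are not taken from `redis-cli.c`.

```c
#include <assert.h>
#include <stddef.h>

/* Number of argv slots the AUTH/AUTH2 part of MIGRATE occupies.
 * Without a password nothing is emitted, even if a user is set. */
int migrate_auth_argc(const char *user, const char *auth) {
    if (auth == NULL) return 0;
    return user ? 3 : 2; /* AUTH2 <user> <pass> : AUTH <pass> */
}

/* Fill the slots and return how many were written. */
int migrate_auth_fill(const char **argv, const char *user, const char *auth) {
    int n = 0;
    if (auth) {
        if (user) {
            argv[n++] = "AUTH2";
            argv[n++] = user;
            argv[n++] = auth;
        } else {
            argv[n++] = "AUTH";
            argv[n++] = auth;
        }
    }
    /* The counted and the filled slots must always agree. */
    assert(n == migrate_auth_argc(user, auth));
    return n;
}
```

With `--user` set and no password, both functions report zero slots, so no empty argument reaches the server.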
---------
Co-authored-by: 2030XiaoGe <2030XiaoGe@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@owtffssent.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Avoid zmalloc_size() in kvobjAllocSize() and use an approximation instead.
The ongoing key allocation histograms work (#14695) needs to call
kvobjAllocSize() more often on hot paths, where zmalloc_size() would
cause unnecessary performance overhead.
## Summary
The cluster bus PING/PONG/MEET packet parser validated extension padding
and total length but never checked that string-carrying extensions are
properly null-terminated, allowing a crafted packet to trigger
out-of-bounds reads when the payload is later consumed as a C string.
1. **Null-termination check for hostname and human-nodename extensions
(`cluster_legacy.c`)**
Added a check inside the existing extension-validation loop in
`clusterProcessPacket`: for `CLUSTERMSG_EXT_TYPE_HOSTNAME` and
`CLUSTERMSG_EXT_TYPE_HUMAN_NODENAME` extension types, it verifies that
the data portion is non-empty (`datalen > 0`) and that the last byte is
`'\0'`. Packets failing this check are rejected with a warning log and
an early return, the same way other malformed-extension cases are
handled.
2. **Test (`hostnames.tcl`)**
A new test exercises the rejection path by constructing a raw
cluster-bus PING packet with a 32-byte hostname extension that contains
no `'\0'`, sending it directly to a node's bus port, and verifying the
packet is dropped (warning logged, hostname not updated in `CLUSTER
NODES`). Two helper procs (`build_cluster_bus_ping` and
`build_hostname_extension`) build the binary packet from scratch in Tcl,
allowing fine-grained control over extension contents without needing a
modified Redis sender.
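The check described in item 1 can be sketched as a standalone predicate. `ext_string_is_terminated` is a hypothetical name; the real check lives inline in the extension-validation loop.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when a string-carrying extension payload is safe to read as a
 * C string: it must be non-empty and its last byte must be '\0'. */
int ext_string_is_terminated(const char *data, size_t datalen) {
    assert(data != NULL || datalen == 0);
    return datalen > 0 && data[datalen - 1] == '\0';
}
```

A 32-byte payload with no `'\0'`, like the one the new test sends, fails this predicate and the packet would be rejected.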
## Problem
On Linux LoongArch64, the SIGSEGV/SIGBUS signal handler in `src/debug.c`
never prints the `Crashed running the instruction at: ...` line, and the
dumped register block reads `Dumping of registers not supported for this
OS/arch`. The crash report is significantly less useful than on other
supported architectures.
This also causes the "Test module crash when info crashes with a
segfault" case in `tests/unit/moduleapi/crash.tcl` to fail on
loongarch64-linux.
## Cause
In `getAndSetMcontextEip` (`src/debug.c`), the Linux `#if … #elif …`
chain handles `x86/x86_64/ia64/riscv/arm/aarch64` but has no LoongArch64
branch, so it falls through to `NOT_SUPPORTED()` and returns `NULL`.
`logRegisters` has the same gap and hits its own `NOT_SUPPORTED()` for
LoongArch64.
## Fix
Add LoongArch branches to both functions:
- `getAndSetMcontextEip`: read/write `uc->uc_mcontext.__pc`.
- `logRegisters`: dump `r1..r31` and `pc`, then feed `__gregs[3]` (sp)
to `logStackContent`.
Register names follow the [LoongArch Procedure Call
Standard](https://github.com/loongson/la-abi-specs/blob/release/lapcs.adoc);
field names follow the glibc `<sys/ucontext.h>` for loongarch64-linux.
`r0` (hardwired zero) and `r21` (reserved, non-allocatable) are
intentionally omitted.
## Summary
- Refreshes the per-OS **Build from source** sections in `README.md` to
align with the OSes Redis currently tests against.
- Drops Ubuntu 20.04 (Focal) and the macOS 15 "Support and instructions
will be provided at a later date" placeholder.
- Adds new sections for Ubuntu 26.04 (Resolute), AlmaLinux/Rocky 10.1+,
and Alpine 3.23+.
- Renames Debian 11/12 → 12/13 (Bookworm/Trixie) and AlmaLinux/Rocky 9.5
→ 9.7+, and bumps the tested Docker images in each section.
- Updates the macOS heading to cover macOS 14 (Sonoma), 15 (Sequoia),
and 26 (Tahoe), and notes that the instructions apply to both Intel and
Apple Silicon (ARM) Macs. Adds `LTO=0` to the build step, bumps the Rust
pin from 1.80.1 → 1.94.0, and removes GNU libtool from the front of
`PATH` so `RediSearch` builds cleanly — see "Upstream issues surfaced
during validation" below.
To keep maintenance manageable, versions of the same OS family with
identical steps are grouped (Debian 12+13 in one section;
AlmaLinux/Rocky 8.10, 9.7+, 10.1+ each kept as a single section).
## Upstream issues surfaced during validation
The notes inside the new and updated sections document workarounds (and
remaining gaps) discovered while validating these instructions
end-to-end with `BUILD_WITH_MODULES=yes` against the `unstable` HEAD.
They look like issues in the `RediSearch` module's build/Rust
integration rather than Redis core, so they're called out here so
reviewers can decide whether to fix them at the source or keep the
workarounds in the README:
- **Ubuntu 26.04**: clang/LLVM 21 + CMake 4.x from `apt` are not
compatible with the modules build. The section pins CMake to 3.31.6 via
`pip3` in a venv and adds `lld`, `llvm`, and `libcrypt-dev`. With those,
all four modules build cleanly.
- **AlmaLinux/Rocky 10.1+**: The system clang (LLVM 20) lags the Rust
toolchain (LLVM 21) installed via `INSTALL_RUST_TOOLCHAIN=yes`, which
trips `RediSearch`'s cross-language LTO check. This still needs an
upstream fix in `RediSearch` and is the only remaining gap preventing
`redisearch.so` from building on AlmaLinux/Rocky 10.
- **Alpine 3.23**: the section builds `redis-server`, `redisbloom.so`,
`rejson.so`, and `redistimeseries.so`. `redisearch.so` does not build on
Alpine 3.23 because `RediSearch` source uses Rust 1.94 stabilized
features (e.g. `Box::new_zeroed_slice`) while Alpine 3.23 ships Rust
1.91; expected to build once Alpine bumps Rust to ≥ 1.94 (1.95 is
already available on `alpine:edge`). The section deliberately omits
`INSTALL_RUST_TOOLCHAIN=yes` and uses Alpine's packaged
dynamically-linked `rust`/`cargo`, because the official rust-lang.org
musl toolchain is fully statically linked and prevents `bindgen` from
`dlopen`-ing `libclang.so` for `RedisJSON`.
- **macOS 14/15/26**: `RediSearch`'s modules build has three
macOS-specific issues. The section now contains the workarounds; the
underlying issues should probably be fixed in `RediSearch`:
1. `LTO=1` is the default, but `RediSearch`'s build script aborts on
non-Linux with `Error: LTO is only supported on Linux`. The section sets
`LTO=0`.
2. `RediSearch` source uses Rust edition 2024 and 1.94-stabilized
features (same root cause as the Alpine 3.23 finding above). The
section's Rust pin was bumped from `1.80.1` to `1.94.0`; older Rust
fails with `feature edition2024 is required`.
3. `RediSearch`'s CMake calls `libtool -static` (BSD libtool syntax) to
bundle a unified static archive. The section's `PATH` no longer prepends
`$HOMEBREW_PREFIX/opt/libtool/libexec/gnubin`, so Apple's
`/usr/bin/libtool` wins for that step.
## Test plan
Validated end-to-end via Docker against `unstable` HEAD:
- [x] `BUILD_WITH_MODULES=yes` on `ubuntu:26.04` — produces
`redis-server` + all four module `.so`s (`redisbloom`, `redisearch`,
`rejson`, `redistimeseries`).
- [x] `BUILD_WITH_MODULES=yes` on `almalinux:10.1` — produces
`redis-server` + three module `.so`s (`redisbloom`, `rejson`,
`redistimeseries`); `redisearch.so` blocked on the upstream issue noted
above.
- [x] `BUILD_WITH_MODULES=yes` on `alpine:3.23` — produces
`redis-server` + three module `.so`s (`redisbloom`, `rejson`,
`redistimeseries`); `redisearch.so` blocked on Alpine's Rust version
lagging the `RediSearch` source's Rust 1.94 requirement.
- [x] Smoke-test (dependency install) on `ubuntu:26.04`,
`debian:trixie`, `almalinux:9.7`, `almalinux:10.1`, `alpine:3.23` — all
package install commands succeed.
- [x] `BUILD_WITH_MODULES=yes` on macOS 26 (Tahoe), arm64 (mac2.metal
EC2) — produces `redis-server` + all four module `.so`s. Validation
surfaced the three macOS issues now documented above; with the section's
new `LTO=0`, Rust 1.94 pin, and PATH adjustment, the build is end-to-end
clean. macOS 14/15 not separately re-validated (package list and step
structure unchanged across the three versions).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
**Summary**
The stream RDB/RESTORE loader read the first element of each stream
listpack node (the "valid entries" count) and immediately decoded it as
an integer, but under shallow validation (`sanitize-dump-payload no`)
only the listpack header had been checked — never the first entry
itself. A crafted payload whose first entry declares an oversized string
encoding (e.g. `LP_ENCODING_32BIT_STR` claiming `0x7FFFFFFF` bytes)
caused `lpGetIntegerValue()` to read encoding-dependent bytes past the
end of the listpack, triggering an out-of-bounds read / crash.
**Changes**
1. **First-entry validation before integer decode (`rdb.c`)**
In `rdbLoadObject`, the stream-loading path now obtains the first
element via `lpValidateFirst()` instead of `lpFirst()`, and then
validates that entry with `lpValidateNext()` before
`lpGetIntegerValue()` decodes it. If the entry is malformed, the load is
rejected cleanly via `rdbReportCorruptRDB("Stream listpack integrity
check failed.")` with proper cleanup (`sdsfree(nodekey)`,
`decrRefCount(o)`, `zfree(lp)`) and an early `NULL` return, matching how
other corrupt-payload cases are handled. This closes the gap where only
the listpack header — not its first entry — was checked under
`sanitize-dump-payload no`.
2. **Test (`corrupt-dump.tcl`)**
A new test (`corrupt payload: stream listpack entry with corrupt
encoding crashes lpFirst`) exercises the rejection path. It disables
payload sanitization and checksum validation, then issues a `RESTORE`
with a hand-built stream payload whose listpack has a valid header but a
first entry encoded as `LP_ENCODING_32BIT_STR` (`0xF0`) declaring a
`0x7FFFFFFF`-byte string that runs past the end of the listpack. The
test asserts the command fails with `*Bad data format*`, verifies the
`*Stream listpack integrity check failed*` warning is logged, and
confirms the server survives (`r ping`).
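To illustrate why the first entry must be bounds-checked before it is decoded, here is a simplified sketch that handles only the 32-bit string encoding. It is not `lpValidateNext()`: it ignores the entry's back-length and every other encoding, and it assumes the 4-byte little-endian length layout that follows the `0xF0` encoding byte. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ENC_32BIT_STR 0xF0 /* same value as LP_ENCODING_32BIT_STR */

/* Returns 1 when the 32-bit string entry starting at buf[off] lies entirely
 * inside buf[0..buflen), 0 otherwise. */
int str32_entry_fits(const unsigned char *buf, size_t buflen, size_t off) {
    assert(buf != NULL);
    if (off >= buflen) return 0;
    if (buf[off] != SKETCH_ENC_32BIT_STR) return 0; /* sketch: one encoding only */

    size_t avail = buflen - off;
    if (avail < 5) return 0; /* encoding byte + 4 length bytes */

    uint32_t len = ((uint32_t)buf[off + 1]) |
                   ((uint32_t)buf[off + 2] << 8) |
                   ((uint32_t)buf[off + 3] << 16) |
                   ((uint32_t)buf[off + 4] << 24);

    /* The declared payload must fit in what remains after the 5-byte prefix. */
    return len <= avail - 5;
}
```

An entry declaring `0x7FFFFFFF` bytes inside a buffer of a few bytes fails the final comparison, so a decoder guarded by such a check never reads past the end.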
## Summary
Add overflow checks for attacker-controlled `uint32_t` length fields in
`clusterProcessPacket()` before adding them to `explen` (`uint32_t`).
Without these checks, crafted PUBLISH/PUBLISHSHARD and MODULE cluster
bus messages can wrap `explen` to a small value, bypassing the `totlen
!= explen` validation and causing heap out-of-bounds reads or OOM aborts
in the processing path.
## Problem
In `clusterProcessPacket()`, the expected packet length for PUBLISH and
MODULE messages is computed by summing struct overhead with
variable-length fields read from the packet header via `ntohl()`:
```c
// PUBLISH (line 2841-2846)
explen += sizeof(clusterMsgDataPublish) - 8 +
ntohl(hdr->data.publish.msg.channel_len) +
ntohl(hdr->data.publish.msg.message_len);
// MODULE (line 2855-2858)
explen += sizeof(clusterMsgModule) - 3 + ntohl(hdr->data.module.msg.len);
```
Both `channel_len + message_len` (PUBLISH) and `len` (MODULE) are
`uint32_t` values from the network packet. Their addition to `explen`
can overflow `uint32_t`, wrapping to a small value that matches
`totlen`, passing the size validation at line 2864.
**Example (PUBLISH):** With `channel_len = 0x80000000` and `message_len
= 0x80000000`, their sum overflows to `0`, making `explen` equal to just
the base struct size. An attacker sets `totlen` to match. The validation
passes, and the processing path at line 3273 calls
`createStringObject()` with the original 2GB lengths, reading far past
the received buffer.
**Attack vector:** Reachable via the cluster bus port (default: data
port + 10000), which does not require authentication by default.
## Fix
Extract `ntohl` values into local variables and check each addition
against `UINT32_MAX` before performing it. Reject the packet with a
warning log if overflow is detected. The computed `explen` is identical
to the original for all non-overflowing (legitimate) inputs.
---------
Signed-off-by: Sebastien Tardif <sebtardif@ncf.ca>
Co-authored-by: debing.sun <debing.sun@redis.com>
Add `raxFindLink()` + `raxInsertAt()`: a two-step commit API that
lets a caller walk the rax once to find a key, then commit an insert
at the recorded position without re-walking. Today's
"lookup then insert" pattern (`raxFind` + `raxInsert` on miss) walks
the tree twice; this change collapses the worst case to a single
walk.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
### Issue
The module datatype defrag test sends 20k commands through a deferred
client before reading any replies. On slower CI environments this can
cause replies to accumulate and fill TCP/socket buffers, leading to
flaky `I/O error reading reply` failures.
### Change
Fix by batching deferred writes and reply drains, following the same
approach used in #14886.
Partial fix #11085
## Bug
`parseRedisUri()` called `percentDecode(curr, 0)` for URIs of the form
`redis://:password@host` and stored the resulting **empty-but-non-NULL**
sds into `connInfo->user`.
`cliAuth()` only checks `user == NULL` when deciding between legacy
`AUTH <pass>` and ACL `AUTH <user> <pass>`, so the empty username was
sent as `AUTH "" <password>`, which the server rejects with `WRONGPASS`.
## Fix
In parseRedisUri(), treat explicitly empty username or password
components as NULL rather than empty SDS strings. This allows cliAuth()
to fall back to legacy single-argument AUTH when the username is empty,
and to skip AUTH entirely when the password is empty.
Before replacing URI-parsed credentials, existing connInfo->user and
connInfo->auth values are freed to avoid leaks and preserve the expected
"later arguments override earlier ones" behavior.
As a related cleanup, -a/--pass and --user now also free previously
assigned values before reassignment, fixing the same leak pattern when
those options are specified multiple times.
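The credential normalization can be sketched as follows. `empty_to_null` and `format_auth` are hypothetical helpers operating on plain C strings rather than sds, and the command is rendered as text only to make the three outcomes visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Treat an explicitly empty credential as absent. */
const char *empty_to_null(const char *s) {
    return (s != NULL && s[0] == '\0') ? NULL : s;
}

/* Renders the AUTH command that would be sent and returns its argument
 * count: 0 = skip AUTH, 2 = AUTH <pass>, 3 = AUTH <user> <pass>. */
int format_auth(char *out, size_t outlen, const char *user, const char *pass) {
    assert(out != NULL && outlen > 0);
    user = empty_to_null(user);
    pass = empty_to_null(pass);
    out[0] = '\0';
    if (pass == NULL) return 0;
    if (user == NULL) {
        snprintf(out, outlen, "AUTH %s", pass);
        return 2;
    }
    snprintf(out, outlen, "AUTH %s %s", user, pass);
    return 3;
}
```

For `redis://:password@host` the user component is empty, so the sketch yields the legacy two-argument form rather than `AUTH "" <password>`.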
Fixes: test failure in
https://github.com/redis/redis/actions/runs/26698983853/job/78688240089
While investigating the failure, I added temporary debug output:
```tcl
puts "cpu_time_array=$cpu_time_array"
puts "num_returned_cpu=$num_returned_cpu"
puts "res=$res"
```
The failing SAMPLE 1000 run produced:
```
cpu_time_array=key_010 3 key_000 3 key_015 2 key_005 2 key_013 2 key_056 2 key_054 2 key_046 1 key_044 1 key_049 1
num_returned_cpu=10
res=5
```
The test becomes statistically unstable at SAMPLE 1000. With only ~50
sampled operations out of 50k requests, the accumulated CPU times
collapse into a very narrow range (observed: 1–3 µs), causing frequent
ties between hot and cold keys near the Top-K cutoff. The resulting
failures are driven by sampling variance rather than HOTKEYS
correctness.
Since the test already covers sampling behavior with ratios 1, 100, and
500, removing the 1000 case preserves coverage while eliminating a
configuration that is too sparse to produce stable, reproducible
assertions.
asmSyncWithSource() built the error with sdscatfmt() and a "%.40s"
specifier. sdscatfmt() is not printf: it parses only the single byte
after '%' and ignores width/precision, so "%.40s" emits a literal '.',
consumes no argument, and task->source is never printed. The message
rendered as "Source node .40s was not found", dropping the node name.
task->source is a CLUSTER_NAMELEN (40) byte, non-NUL-terminated buffer
(filled via memcpy and always read with an explicit length elsewhere),
so simply switching to sdscatfmt's "%s" would strlen() past the buffer.
Use sdscatprintf(), which honors the "%.40s" precision and bounds the
read to 40 bytes -- matching the sibling error paths in this function
that already use sdscatprintf().
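The bounded read can be demonstrated with plain `snprintf()`, which honors precision the same way any printf-family function does. `SKETCH_NAMELEN` stands in for `CLUSTER_NAMELEN`, and `format_source_not_found` is a hypothetical wrapper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define SKETCH_NAMELEN 40 /* stands in for CLUSTER_NAMELEN */

/* Formats the error for a node name that is NOT NUL-terminated.
 * The ".40" precision caps the read at 40 bytes of `name`. */
int format_source_not_found(char *out, size_t outlen,
                            const char name[SKETCH_NAMELEN]) {
    assert(out != NULL && outlen > 0);
    return snprintf(out, outlen, "Source node %.40s was not found", name);
}
```

Because the precision limits how many bytes are consumed, whatever follows the 40-byte buffer in memory never appears in the message.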
## Issue
The vector set Python tests intentionally use two clients:
- the default client (`self.redis`) for the existing RESP2-oriented test
expectations
- `self.redis3` for RESP3-specific coverage.
However, the default client did not explicitly set a protocol, so it
depended on redis-py's default behavior. With newer redis-py versions,
RESP3 is now the default
protocol (https://github.com/redis/redis-py/pull/4052). In particular,
vector set replies such as `VSIM ... WITHSCORES` may be parsed into
map/dict-like structures instead of the RESP2 flat-array shape assumed
by existing tests.
## Changes
Explicitly create the default primary and replica Redis clients with
`protocol=2`.
`self.redis3` is left unchanged and continues to use `protocol=3` for
RESP3-specific test coverage.
## Summary
In the reply copy-avoidance path (bulkStrRef), the RESP bulk-string
prefix `$<len>\r\n` was formatted eagerly on the main thread inside
`_addBulkStrRefToBufferOrList()`. This PR defers that formatting to
write time via a new idempotent helper `formatBulkStrRefPrefix()`, so
the work happens on the IO thread for clients served by IO threads.
---------
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
### Issue
The `CLUSTER SLOT-STATS network-bytes-in, in-line buffer processing` test
can fail because it sends an inline SET through a deferring client and
immediately checks slot stats from another client.
`$rd flush` only guarantees the request was written to the socket; it does
not guarantee Redis has processed the command and updated
network-bytes-in.
### Change
Wait for the inline SET reply before reading CLUSTER SLOT-STATS.
### Issue
This refines the fix from #15119.
The previous change increased the generic `read_cli` empty-read retry
threshold from 5 to 100 to reduce flakiness in the redis-cli reverse
search no-result test. While effective, that made every interactive CLI
read potentially wait longer.
### Changes
This restores the original generic `read_cli` behavior and uses targeted
pattern-based waiting only where needed.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
`stat_io_reads_processed[]` and `stat_io_writes_processed[]` were
per-IO-thread arrays inside `struct redisServer` that suffer from false
sharing. This PR moves the two stat counters into the IOThread struct,
which is already `__attribute__((aligned(CACHE_LINE_SIZE)))`. Each IO
thread's counters now sit on a separate cache line, eliminating the
cross-thread contention.
- Added io_reads_processed and io_writes_processed fields to IOThread struct
- Removed stat_io_reads_processed[] and stat_io_writes_processed[] from struct redisServer
- Made IOThreads[] non-static with extern declaration in server.h
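A minimal sketch of the layout change, assuming a 64-byte cache line and a GCC/Clang-style `aligned` attribute. The type and variable names are hypothetical, and the real counters may use atomics, which this sketch omits.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE 64 /* assumed cache line size */
#define SKETCH_MAX_THREADS 4

/* Aligning the struct pads each element to a full cache line, so two
 * threads never update counters that share a line. */
typedef struct __attribute__((aligned(SKETCH_CACHE_LINE))) {
    long long io_reads_processed;
    long long io_writes_processed;
} SketchIOThread;

SketchIOThread sketch_threads[SKETCH_MAX_THREADS];

/* Distance in bytes between the counters of two neighbouring threads. */
size_t sketch_counter_stride(void) {
    return (size_t)((char *)&sketch_threads[1] - (char *)&sketch_threads[0]);
}

/* Each thread bumps only its own slot. */
void sketch_count_read(int tid) {
    assert(tid >= 0 && tid < SKETCH_MAX_THREADS);
    sketch_threads[tid].io_reads_processed++;
}
```

By contrast, a plain `long long stat[SKETCH_MAX_THREADS]` array packs eight counters into one 64-byte line, which is the false-sharing pattern the PR removes.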
## Problem
`strtod()` handles some `nan(n-char-sequence)` inputs differently across
libc implementations. For example, `nan(ab!c)` and `nan(ab c)` may be
accepted on some platforms but rejected on others.
The existing test treated these inputs as fixed invalid cases, which can
fail on platforms whose libc accepts them.
## Changes
- Move libc-dependent `nan(...)` cases out of the fixed invalid test
list.
- Add a helper to verify `fast_float_strtod()` matches the platform
`strtod()` behavior for these cases, including value, `endptr`, and
success/failure status.
- Keep the existing parser behavior unchanged.
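The comparison helper could look roughly like this. The signature of `fast_float_strtod()` is not shown in this description, so the sketch takes any parser with a `strtod`-compatible signature; `same_as_strtod` and `sketch_reject_all` are hypothetical names, and `errno` is not compared here.

```c
#include <assert.h>
#include <stdlib.h>

typedef double (*strtod_like)(const char *s, char **endptr);

/* Returns 1 when `parse` agrees with the platform strtod() on `s`:
 * same end pointer (which also covers "no conversion") and same value,
 * with any NaN considered equal to any other NaN. */
int same_as_strtod(strtod_like parse, const char *s) {
    assert(parse != NULL && s != NULL);
    char *end_ref = NULL, *end_got = NULL;
    double ref = strtod(s, &end_ref);
    double got = parse(s, &end_got);
    if (end_ref != end_got) return 0;
    int ref_nan = (ref != ref), got_nan = (got != got);
    if (ref_nan || got_nan) return ref_nan && got_nan;
    return ref == got;
}

/* A deliberately wrong parser, used only to show a mismatch. */
double sketch_reject_all(const char *s, char **endptr) {
    if (endptr) *endptr = (char *)s;
    return 0.0;
}
```

Because the expectation comes from the platform's own `strtod()`, inputs such as `nan(ab!c)` pass on every libc regardless of whether that libc accepts them.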
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
## Problem
The `redis-cli` reverse-search test for the no-result case can be flaky
in slower CI environments.
`read_cli` may return too early when CLI output is fragmented or
delayed. It currently gives up after only 5 consecutive empty reads,
with a 10ms delay between reads, which can make the test assert before
the expected `(empty array)` output is printed.
## Changes
Increase the `read_cli` consecutive empty-read threshold from `5` to
`100`.
This keeps the existing read behavior unchanged when data is available,
but allows the helper to wait longer for delayed/fragmented CLI output
before giving up.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Add new module API `RM_GetClusterNodeSlotRanges` that allows modules
to query slot ranges for any cluster node by its node ID, not just the
local node:
```c
RedisModuleSlotRangeArray *RM_GetClusterNodeSlotRanges(RedisModuleCtx *ctx, const char *nodeid)
```
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
This is a follow-up to
[redis/redis#14938](https://github.com/redis/redis/pull/14938), which
upgraded GitHub Actions to newer stable versions for the upcoming
Node.js 20 deprecation on GitHub Actions runners.
That PR missed two remaining action updates in `daily.yml`:
- `cross-platform-actions/action`
- `py-actions/py-dependency-install`
### Why replace `py-actions/py-dependency-install`
`py-actions/py-dependency-install` is no longer an actively maintained
dependency installation action, so keeping it in CI increases
maintenance and supply-chain risk over time.
The replacement uses GitHub's official `actions/setup-python` action,
which is actively maintained and supports built-in `pip` dependency
caching. Installing dependencies with `python -m pip install -r
./utils/req-res-validator/requirements.txt` also makes the workflow
behavior explicit and easier to debug.
On macOS, running `make test` often fails with "too many open files" due
to the low default limit (usually 256).
This PR increases the limit by adding `ulimit -n 4096` so that the tests
have enough file descriptors for concurrent connections.
Follow https://github.com/redis/redis/issues/15045
## Summary
Simplify INCREX's out-of-bounds policy:
The original INCREX shipped with three out-of-bounds policies — OVERFLOW
FAIL, OVERFLOW SAT, OVERFLOW REJECT — but FAIL and REJECT are
functionally redundant: both leave the key untouched when the result is
out of bounds. They differ only in how the caller is notified (error
reply vs. [current_value, 0] array reply), which forces the user to make
a stylistic choice with no real semantic difference.
This PR collapses the three policies into one clear behavior:
* Default: the operation is rejected; the key value and TTL are left
unchanged, and the reply is [current_value, 0]. Callers detect
non-application by checking the applied-increment field; no
error-handling branch is required.
* SATURATE: the result is saturated to UBOUND / LBOUND, or to the type
limits (LLONG_MAX/MIN for BYINT, ±LDBL_MAX for BYFLOAT) when no explicit
bound is given.
New syntax:
INCREX <key> [BYFLOAT increment | BYINT increment]
[LBOUND lowerbound] [UBOUND upperbound] [SATURATE]
[EX seconds | PX milliseconds | EXAT seconds-timestamp | PXAT
milliseconds-timestamp | PERSIST] [ENX]
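The two behaviors for the BYINT case can be sketched as below. This is an illustration of the described semantics, not the command's implementation: the names are hypothetical, the sketch assumes the current value already lies within `[lbound, ubound]`, and reporting the clamped delta as the applied increment under SATURATE is an assumption. Callers without explicit bounds would pass `LLONG_MIN` / `LLONG_MAX`.

```c
#include <assert.h>
#include <limits.h>

typedef struct {
    long long value;   /* value stored after the call */
    long long applied; /* increment actually applied; 0 when rejected */
} increx_result;

increx_result increx_int(long long cur, long long incr,
                         long long lbound, long long ubound, int saturate) {
    assert(lbound <= cur && cur <= ubound); /* assumption of this sketch */
    increx_result r = { cur, 0 };

    /* Unsigned subtraction gives the exact headroom without overflow. */
    unsigned long long room_up =
        (unsigned long long)ubound - (unsigned long long)cur;
    unsigned long long room_down =
        (unsigned long long)cur - (unsigned long long)lbound;

    if (incr > 0 && (unsigned long long)incr > room_up) {
        if (saturate) { r.value = ubound; r.applied = (long long)room_up; }
        return r; /* default: rejected, reply is [current_value, 0] */
    }
    if (incr < 0 && 0ULL - (unsigned long long)incr > room_down) {
        if (saturate) { r.value = lbound; r.applied = -(long long)room_down; }
        return r;
    }
    r.value = cur + incr; /* in range, so this cannot overflow */
    r.applied = incr;
    return r;
}
```

A caller detects non-application by checking `applied == 0` (for a non-zero increment) instead of handling an error reply.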
---------
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
This PR is based on: valkey-io/valkey#3511
Close https://github.com/redis/redis/issues/14983
## Summary
During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.
The failure mode is similar to
https://github.com/redis/redis/pull/14946: the socket buffer fills up
more easily.
## Root Cause
In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:
1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads
The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.
## Observed Behavior
With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):
- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall
Ultimately, this makes synchronization between the master and all of its
replicas very slow.
### Changes
1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.
2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.
3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
4. For the `timeout` subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
`repl_last_partial_write != 0` "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.
5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.
6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.
7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.
---------
Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
Fixes:
1. After #15096, we pass -flto to jemalloc. On Azure Linux, the
resulting jemalloc library cannot be handled at link time and the build
fails. This PR adds -ffat-lto-objects so the compiler also emits regular
object code that the linker can fall back to when it cannot handle the
LTO-compiled library.
2. Fixed a warning about `path` being NULL in
`moduleLoadInternalModules()`.
3. Fixed compile warnings on older GCC versions introduced by #15162
(reported on Ubuntu 20.04)
Co-authored-by: debing.sun <debing.sun@redis.com>
Enabling memory tracking at runtime is forbidden when it is currently
disabled. In non-clustered mode, however, the checks were incorrect, so
this PR enforces the correct behavior in non-clustered environments.
Ensure backward compatibility and consistent behavior across different
architectures by explicitly setting the default value.
Fixes #15175
Co-authored-by: ofiryanai <ofiryanai1@gmail.com>
The GCRA algorithm was introduced into Redis in
https://github.com/redis/redis/pull/14826, followed by the new
RATE_LIMIT object type in https://github.com/redis/redis/pull/14905.
It was then decided internally not to include GCRA in the new release.
Since no decision has been made yet on whether it will be kept in the
future, this PR only makes the GCRA-related code dead: commands are
inaccessible and AOF/RDB load+save is disabled.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Close #15177
Follow [Fix use-after-free when fullsync happens while replica is
running a timed out script
(CVE-2026-23631)](0cca172a17)
Remove the `repl-diskless-load yes` test configuration because this
option exists only in the Redis fork and is not available in Redis OSS.
(cherry picked from commit 5033e15143)
Fixes [#15183](https://github.com/redis/redis/issues/15183).
## Motivation
Commit
[cf668f2c2](cf668f2c2c)
tightened cluster-announce-ip validation to require a valid IPv4 or IPv6
address, which is a regression for users that legitimately announce a
hostname.
## Changes
* isValidClusterAnnounceIp() now accepts either:
* A valid IPv4/IPv6 address
* A valid hostname — same character rules as cluster-announce-hostname,
length-bounded by NET_IP_STR_LEN to match the storage buffer.
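A rough sketch of the relaxed validation, under stated assumptions: the hostname character set (alphanumerics, `-`, `.`) and the 46-byte limit standing in for `NET_IP_STR_LEN` are assumptions, the name `is_valid_announce_ip` is hypothetical, and the sketch rejects the empty string, which the real function may treat differently.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <ctype.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#define SKETCH_IP_STR_LEN 46 /* assumed stand-in for NET_IP_STR_LEN */

/* Returns 1 when val is an IPv4/IPv6 literal or a hostname made only of
 * alphanumerics, '-' and '.', and fits the storage buffer. */
int is_valid_announce_ip(const char *val) {
    assert(val != NULL);
    unsigned char addr[sizeof(struct in6_addr)];
    size_t len = strlen(val);

    if (len == 0 || len >= SKETCH_IP_STR_LEN) return 0;
    if (inet_pton(AF_INET, val, addr) == 1) return 1;
    if (inet_pton(AF_INET6, val, addr) == 1) return 1;

    /* Not an IP literal: fall back to the hostname character rules. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)val[i];
        if (!isalnum(c) && c != '-' && c != '.') return 0;
    }
    return 1;
}
```

Trying the IP parsers first keeps IPv6 literals valid even though `:` is not a hostname character.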
(cherry picked from commit 21f2569f9b)
* Limit VADD REDUCE dim to not exceed original dim
Enforce VADD key [REDUCE dim] to reject dim that is bigger than the HNSW original dim, as dimension reduction makes no sense for reduce_dim > original_dim.
This also avoids OOM and possible heap overflow on later allocations using reduce_dim.
This should be backported to Redis version 8.0, 8.2 and 8.4.
Fullsync triggers emptyData and scriptingReset which free the scripting/function engine. If a timed out script is still running on the replica, this causes a use-after-free. Delay fullsync processing in readSyncBulkPayload until the script finishes.
Add the DENYOOM flag to SUBSCRIBE, PSUBSCRIBE, and SSUBSCRIBE commands
to bring their memory protection behavior in line with other Redis commands.
Problem:
Currently, subscribe commands lack memory protection when Redis reaches its
memory limit. This becomes problematic in two specific scenarios:
1. When the eviction policy doesn't allow eviction (e.g., noeviction)
2. When there are no evictable keys remaining in the database
In these cases, memory usage from pub/sub subscribers can keep growing
unchecked, potentially causing the Redis server to run out of memory. This
behavior is inconsistent with other Redis commands, which are protected by
the DENYOOM flag.
Solution:
Add the DENYOOM flag to all subscribe commands. When memory limits are
reached, these commands will be rejected, preventing uncontrolled memory
growth and aligning their behavior with other Redis commands.