Commit graph

465 commits

Author SHA1 Message Date
debing.sun
95040d61d5
Replace INCREX out-of-bounds policy to a single SATURATE option (#15237)
Follow https://github.com/redis/redis/issues/15045

## Summary

Simplify INCREX's out-of-bounds policy:

The original INCREX shipped with three out-of-bounds policies — OVERFLOW
FAIL, OVERFLOW SAT, OVERFLOW REJECT — but FAIL and REJECT are
functionally redundant: both leave the key untouched when the result is
out of bounds. They differ only in how the caller is notified (error
reply vs. [current_value, 0] array reply), which forces the user to make
a stylistic choice with no real semantic difference.

This PR collapses the three policies into one clear behavior:

* Default: the operation is rejected; the key value and TTL are left
unchanged, and the reply is [current_value, 0]. Callers detect
non-application by checking the applied-increment field; no
error-handling branch is required.
* SATURATE: the result is saturated to UBOUND / LBOUND, or to the type
limits (LLONG_MAX/MIN for BYINT, ±LDBL_MAX for BYFLOAT) when no explicit
bound is given.

New syntax:

    INCREX <key> [BYFLOAT increment | BYINT increment]
                 [LBOUND lowerbound] [UBOUND upperbound] [SATURATE]
[EX seconds | PX milliseconds | EXAT seconds-timestamp | PXAT
milliseconds-timestamp | PERSIST] [ENX]

---------

Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
2026-05-20 18:07:14 +08:00
Salvatore Sanfilippo
0d9576435f
Implement the new Redis Array type (#15162)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
# Redis Array

For years, Redis has been missing a real indexed data structure for the
use cases where the index and the spatial relationship of elements are
semantic. Hashes give you random lookups, but you have to store an index
as a key, and have no range visibility. Lists give you appending and
trimming, but what is in the middle remains hard to access. Streams give
you append-only events, which is another (useful, indeed) beast. None of
these is what you want when the *position itself* has business meaning —
slot 37, step 4, row 18552, day from 2934 to 2949, file line 11, 12, 15
and so forth. And, all those types, for different reasons, are all
suboptimal when you want a **ring buffer** able to store the latest N
observed samples of something.

Up to now, users found ways (they always do \o/) using the fact that the
data structures that are obvious in this universe are also extremely
powerful, if well implemented. But this forces compromises. Arrays
handle these index-first requirements natively, and usually with much
better memory and CPU usage than the workarounds. If the use case is the
right one, Arrays often provide much better space, time and usability at
the same time.

## Internal encoding

1. When dense, an Array is essentially a more fancy C array. You don't
pay anything for storing the index.
2. Yet, instead of going really flat, arrays are sliced into
4096-element slices, and each slice, when it contains just a few
elements, uses a special sparse encoding. When a slice is empty it's
just a `NULL` stored in the directory.
3. Small ints, floats, and short strings are pointer-tagged, so they
cost zero additional memory beyond the pointer slot itself.
4. When very sparse, a super-directory of windowed directories is used.
This allows the data type to be safe, instead of exhibiting pathological
space or time behavior. This representation is only triggered when there
are more than 8 million elements or very high indexes set.

## Use cases

Arrays are mostly stateless if not for the fact that each array
remembers the index of the latest added item, allowing `ARINSERT` and
`ARRING` to work properly. Otherwise it is a set/get at this index game,
with solid support for both setting / getting ranges, server-side
scanning, returning only populated elements in a time which is
proportional not to the range size, but to the population size.

A few concrete examples, that may work as mental models for the set of
problems that are similar to them (from the POV of the data modeling).

**Thermometer.** A sensor reporting once per minute, with gaps:

```
ARSET       temp:room12:day7 123 22.3
ARGETRANGE  temp:room12:day7 600 660     # the 10:00–11:00 window, with NULLs
ARSCAN      temp:room12:day7 600 660     # only populated elements
AROP        temp:room12:day7 0 1439 MAX  # peak of the day, server-side
```

Missing minutes cost little to nothing. Numeric aggregation runs inside
Redis. Telemetry, IoT, meter readings, KPI rollups.

**Calendar.** A clinic with 96 fifteen-minute slots per day:

```
ARSET       sched:room12:day 32 booking:991
ARSCAN      sched:room12:day 0 95      # only occupied slots
ARGETRANGE  sched:room12:day 48 63     # the afternoon full view to render
```

The slot number is the business key in this case. Room booking, parking
spaces, warehouse bins, lockers, ...

**Ring buffer.** ARRING replaces the classic LPUSH+LTRIM pattern.
Imagine remote `dmesg`.

```
ARRING       machine:123 200 "[141087.430123]: arm_cpu_init(): cpu 14 online" # Capped to 200 entries
ARLASTITEMS  machine:123 50 REV       # 50 newest first
```

Faster than LPUSH+LTRIM, keep indexed access to past elements. Last-N
alarms, recent fraud scores, access history, remote logs, device events.
Ok here the use cases are mainly the ones of the old pattern: it is just
a better fit and allows to access random items in the middle, aggregate
server-side, and so forth.

**Workflow.** Step number is the index, value is the status. Gaps are
meaningful:

```
ARSET       claim:99172 0 received
ARSET       claim:99172 3 waiting:reviewer42
ARSET       claim:99172 5 approved
ARGETRANGE  claim:99172 0 5     # full workflow view, with NULLs for missing steps
ARSCAN      claim:99172 0 5     # only steps that have a state
ARCOUNT     claim:99172         # number of recorded steps
ARLEN       claim:99172         # highest reached step + 1
```

**Skills knowledge base for agents.** Arrays are good at representing /
grepping into Markdown files:

```
ARSET   skill:metal_gpu 0 "...."
ARSET   skill:metal_gpu 1 "...."
ARSET   skill:metal_gpu 2 "...."
ARGREP  skill:metal_gpu - + RE "M3|M4" WITHVALUES
```

ARGREP has EXACT, MATCH, GLOB, RE, you can have multiple predicates, can
select AND or OR behavior.

**Bulk import results.** Sparse row annotations over millions of rows /
CSV / ...:

```
ARSET  import:job551 18552 ERR:bad_email
ARSCAN import:job551 0 1000000        # Provides only rows that have something
```

## TLDR

If the position is part of the meaning, use an Array. If you want to
aggregate or grep remotely, use an Array. Feedback welcome :)

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Shubham S Taple <155555100+ShubhamTaple@users.noreply.github.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
Co-authored-by: Marc Gravell <marc.gravell@gmail.com>
2026-05-14 00:56:44 +08:00
raffertyyu
9c1ecd044e
Add INCREX command for atomic increment with TTL and bounds (#15045)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
Close #14278 

## Overview

Rate limiters, sliding windows, request counters, and numerous other
network-facing patterns share a common primitive: **atomically increment
a counter and set its expiration**. Achieving this in Redis requires
either multiple round-trips or a Lua script that bundles `INCR` /
`INCRBY` / `INCRBYFLOAT` with `EXPIRE` / `PEXPIRE`.

We propose a new command, **`INCREX`**, that collapses this two-step
pattern into a single, native, O(1) command. `INCREX` atomically:

1. Increments (or decrements) a key's numeric value — by integer or
float.
2. Optionally enforces lower and/or upper bounds, with a configurable
overflow policy (error out, saturate, or no-op), enabling built-in cap
enforcement (e.g., max request count) without additional client logic.
3. Optionally sets or removes the key's expiration.
4. Returns both the **new value** and the **actual increment applied**,
giving the caller immediate feedback on whether the operation was
saturated or skipped.

## Use Cases

### Basic Usage

```
# Increment by 1 (default) and set a 60-second TTL.
> SET mykey 10
> INCREX mykey EX 60
1) (integer) 11        # new value
2) (integer) 1         # actual increment

# Use 0 as initial value if the key doesn't exist.
> DEL mykey
> INCREX mykey
1) (integer) 1         # new value
2) (integer) 1         # actual increment

# Default policy (OVERFLOW FAIL): exceeding a bound returns an error.
> SET mykey 5
> INCREX mykey BYINT 20 UBOUND 10
(error) value is out of bounds

# Opt into saturation with OVERFLOW SAT.
> INCREX mykey BYINT 20 UBOUND 10 OVERFLOW SAT
1) (integer) 10     # saturated to upper bound
2) (integer) 5      # only 5 was actually applied

# Skip the operation with OVERFLOW REJECT — the key and its TTL are
# untouched, and the reply reports the current value with a zero delta.
> SET mykey 5
> INCREX mykey BYINT 20 UBOUND 10 OVERFLOW REJECT
1) (integer) 5      # current (unchanged) value
2) (integer) 0      # nothing was applied

# Increment by a float
> SET mykey 1
> INCREX mykey BYFLOAT 0.5
1) "1.5"
2) "0.5"
```

### Use Case: Rate Limiter

**Before (Lua script):**
```lua
-- KEYS[1] = rate limit key, ARGV[1] = limit, ARGV[2] = window in seconds
local current = redis.call('INCR', KEYS[1])
if current > tonumber(ARGV[1]) then
    return 0  -- rejected
end
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
return 1  -- allowed
```

Client invocation:

```python
result = redis.eval(LUA_SCRIPT, 1, f"ratelimit:{user_id}", 100, 60)
if result == 0:
    reject_request()
```

**After (INCREX):**

```python
new_val, actual_incr = redis.execute_command(
    "INCREX", f"ratelimit:{user_id}", "UBOUND", 100, "OVERFLOW", "REJECT", "EX", 60, "ENX"
)
if actual_incr == 0:
    # Rate limit exceeded — key left unchanged.
    reject_request()
```

`ENX` means: set expiration only if the key doesn't already have an
expiration. This ensures the sliding window's TTL is only set on the
first request.

### Use Case: Token Bucket Refill

Refill tokens periodically up to a capacity ceiling, saturating at the
cap instead of erroring:

```
> INCREX tokens:user123 BYINT 10 UBOUND 100 OVERFLOW SAT EX 3600 ENX
1) (integer) 10
2) (integer) 10
```

Tokens cannot exceed 100, and the key auto-expires after inactivity.

### Use Case: Countdown / Resource Consumption

Decrement a resource counter down to zero, saturating at the floor:

```
> SET credits:user123 50
> INCREX credits:user123 BYINT -1 LBOUND 0 OVERFLOW SAT
1) (integer) 49
2) (integer) -1
```

When credits are exhausted, `OVERFLOW SAT` prevents negative balances
without client-side checks.

## Parameter Reference

### Syntax

```
INCREX key
      [BYFLOAT increment | BYINT increment]
      [LBOUND lowerbound] [UBOUND upperbound] [OVERFLOW <FAIL | SAT | REJECT>]
      [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | PERSIST] [ENX]
```

### Parameters

| Parameter | Description |
|-----------|-------------|
| `key` | The key to increment. Created with value `0` if it does not
exist. |
| `BYFLOAT increment` | Increment the value by the given long-double
float. |
| `BYINT increment` | Increment the value by the given 64-bit signed
integer. |
| `LBOUND lowerbound` | Set lower bound for the increment result.
Defaults to `LLONG_MIN` (integer) or `-LDBL_MAX` (float). |
| `UBOUND upperbound` | Set upper bound for the increment result.
Defaults to `LLONG_MAX` (integer) or `LDBL_MAX` (float). |
| `OVERFLOW <FAIL \| SAT \| REJECT>` | Set the overflow policy when the
result would be out of bounds. `FAIL` rejects the operation with an
error (default). `SAT` saturates the result to the bound. `REJECT`
leaves the key and its TTL untouched and replies with the current value
and a zero delta. |
| `EX seconds` | Set the key's TTL to `seconds` seconds. |
| `PX milliseconds` | Set the key's TTL to `milliseconds` milliseconds.
|
| `EXAT unix-time-seconds` | Set the key's expiration to the absolute
Unix timestamp in seconds. |
| `PXAT unix-time-milliseconds` | Set the key's expiration to the
absolute Unix timestamp in milliseconds. |
| `PERSIST` | Remove the key's existing TTL. |
| `ENX` | Set the key's TTL/expiration if it has No eXpiration |

If neither `BYINT` nor `BYFLOAT` is specified, the increment defaults to
integer `1`.

### Return Value

An **array of two elements**:

1. **New value** — the value of the key after the increment (or the
unchanged current value under `OVERFLOW REJECT`).
2. **Actual increment** — the increment that was actually applied. May
differ from the requested increment when `OVERFLOW SAT` saturates the
result to a bound, and is always `0` when `OVERFLOW REJECT` skipped the
operation.

- In integer mode (default or `BYINT`): both elements are **integers**.
- In float mode (`BYFLOAT`): both elements are **bulk strings**
representing the float values on RESP2, and **RESP3 Doubles** on RESP3.

### Overflow Policy (FAIL vs. SAT vs. REJECT)

Controlled by the optional `OVERFLOW` argument. A bound violation
includes both exceeding an explicit `LBOUND`/`UBOUND` and overflowing
the type limits when no explicit bound is given.

- **`OVERFLOW FAIL` (default)**: if the computed result would violate a
bound, the command returns an error and the key is left unchanged. This
matches the existing semantics of `INCRBY` / `INCRBYFLOAT` on overflow.
- **`OVERFLOW SAT`**: the result is silently capped at `UBOUND` /
floored at `LBOUND` (or saturated to the type limits when no explicit
bound is given). The second element of the reply reflects the saturated
delta. If the delta cannot be represented as a 64-bit signed
integer(default or `BYINT`), or would produce Infinity(`BYFLOAT`), an
error is returned.
- **`OVERFLOW REJECT`**: the operation is silently skipped — the key
value and its TTL are left unchanged, no keyspace notification is fired,
and nothing is replicated. The reply is `[current_value, 0]`, allowing
the caller to detect the rejection without handling an error.

### Notes
- If **no expiration option** is given, the key's existing TTL is
preserved (like `INCR`).
- `ENX` requires one of `EX`/`PX`/`EXAT`/`PXAT`.
- If the result is saturated by `OVERFLOW SAT`, the expiration is still
applied as specified.
- Under `OVERFLOW REJECT` the expiration option is ignored on the
rejected branch — TTL is preserved exactly as it was before the call.
- **`BYINT` requires an integer-typed existing value; `BYFLOAT` accepts
both.**
Integers can be promoted to floats losslessly, but a stored float (e.g.
`"1.5"`) cannot be parsed back as an integer. This is consistent with
`INCR`/`INCRBY` (integer-only) and `INCRBYFLOAT` (accepts both).

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Moti Cohen <moti.cohen@redis.com>
Co-authored-by: oranagra <oran@redislabs.com>
2026-05-11 18:22:27 +08:00
Filipe Oliveira (Redis)
62551a7b12
Batched MGET/MSET dict prefetch with dictType-driven payload hints (#15133)
Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses
across the keys of a single multi-key command. Builds on the cross-command
batched prefetch framework introduced in #14017 and the dict-prefetch state
machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the
state machine into two new `dictType` callbacks so the same machinery can
be reused for other dict-encoded types later (hash hashtable, sets, sorted
sets) without paying for `kvobj`-specific code paths in the core loop.

Bundles the work originally proposed in #14899 (MGET prefetch framework,
by @mpozniak95) and #15043 (MSET batch prefetch).

## Design

Two new optional callbacks on `dictType`:

```c
typedef struct dictType {
    ...
    /* Bring the entry's key payload into cache before keyCompare runs.
     * Returns the address to prefetch, or NULL if the entry alone is enough. */
    void *(*prefetchEntryKey)(const dictEntry *de);

    /* Called only after a key match. Returns the value-side payload to
     * prefetch (or NULL). */
    void *(*prefetchEntryValue)(const dictEntry *de);
} dictType;
```

`dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()`
shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` /
`OBJ_ENCODING_RAW` values — now lives in two small helpers in
`server.c`:

```c
static void *dbDictPrefetchEntryKey(const dictEntry *de) {
    return dictEntryIsKey(de) ? NULL : dictGetKey(de);
}

static void *dbDictPrefetchEntryValue(const dictEntry *de) {
    kvobj *kv = dictGetKey(de);
    return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW)
            ? kv->ptr : NULL;
}
```

The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data`
parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the
dict's own type drives both ends. This also removes the foot-gun where
callers (like `mgetCommand`) had to remember whether to pass
`prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer
references `kvobj`, `kvobjGetKey`, or any specific value layout.

## State machine

Two file-local types in `memory_prefetch.c`:

| Type | Role |
|---|---|
| `dictPrefetchLookup` | Per-key snapshot of an in-flight,
software-pipelined `dictFind` (mirrors the locals a synchronous
`dictFind` would carry across one bucket walk). |
| `dictPrefetcher` | Driver that advances a batch of
`dictPrefetchLookup`s through the FSM, yielding to the next in-flight
lookup each time a prefetch is issued. |

Five-stage lifecycle for each lookup, driven by the prefetcher:

```text
                                                           │
                                                         start
                                                           │
                                                  ┌────────▼─────────┐
                                       ┌─────────►│  PREFETCH_BUCKET ├────►────────┐
                                       │          └────────┬─────────┘            no more tables
                                       │             bucket│found                  │
                                       │                   │                       │
        entry not found - goto next table         ┌────────▼────────┐              │
                                       └────◄─────┤ PREFETCH_ENTRY  │              ▼
                                    ┌────────────►└────────┬────────┘              │
                                    │                 entry│found                  │
                                    │                      │                       │
                                    │          ┌───────────▼─────────────┐         │
                                    │          │   PREFETCH_ENTRY_KEY    │ ◄── dictType->prefetchEntryKey(de)
                                    │          └───────────┬─────────────┘         │
                                    │                      │                       │
        key mismatch - goto next entry                     │                       │
                                    │          ┌───────────▼─────────────┐         │
                                    └──────◄───│   PREFETCH_ENTRY_VALUE  │ ◄── keyCompare; on match,
                                               └───────────┬─────────────┘     dictType->prefetchEntryValue(de)
                                                           │                       │
                                                 ┌─────────▼─────────────┐         │
                                                 │     PREFETCH_DONE     │◄────────┘
                                                 └───────────────────────┘
```

`PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]`
if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no
more tables remain.

`memory_prefetch.c` exposes a small lifecycle that any caller can drive:

```c
dictPrefetcherInit(p, max_keys);                  /* one-shot heap alloc of lookups[] */
dictPrefetcherReset(p, dicts, keys, nkeys);       /* configure for one batch */
dictPrefetcherRun(p);                             /* drive FSM until all PREFETCH_DONE */
dictPrefetcherFree(p);                            /* release */
```

Each FSM stage is a named static function (`dictPrefetchBucket`,
`dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`),
so the `dictPrefetcherRun` driver is a four-line `switch` over the
state.

The state machine is dict-pure: no `kvobj` field on
`dictPrefetchLookup`,
no `kvobjGetKey` reach-through. Round-robin advance semantics — a state
only advances the cursor if a prefetch was actually issued — are
preserved, so the embedded-kvobj fast path
(`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the
extra prefetch and falls straight into the compare on the next loop
iteration.

The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`)
embeds a `dictPrefetcher` initialized once at startup and reset per
batch, so cross-command prefetching no longer allocates per call.

## Intra-command API

```c
void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys);
```

A single multi-key command (e.g. MGET) can prefetch dict data for a
batch of its own keys, reusing the same state machine that the
cross-command path uses. Single-key calls (`nkeys <= 1`) early-return —
nothing to interleave with. The implementation stack-allocates a
fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no
VLA, predictable stack usage), so the intra-command path doesn't touch
the heap.

## Notes on the call sites

A shared helper picks the next prefetch batch and warms it via
`dictPrefetchKeys`:

```c
/* Pick the next prefetch batch starting at argv[start] and warm it via
 * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for
 * key/value pairs (MSET). Returns the chosen batch size in items. */
static int prefetchKeysBatch(client *c, int slot, int start, int stride);
```

Adaptive batch sizing inside the helper: if at least two full batches
(`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch
(`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one
call. This generalizes the small-request fast path so the trailing
batch of a large request also gets the single-call benefit.

- **MGET (`mgetCommand`)** — gated by
`do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`,
with `already_prefetched = c->current_pending_cmd &&
(c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`.
  When `do_prefetch` is set, each iteration calls
  `prefetchKeysBatch(c, slot, j, 1)` and then sequentially
  `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is
  clear (cross-command path already warmed the keys, or batch
  prefetching is off), the loop takes all remaining items in one go
  and skips the prefetch.

- **MSET / MSETNX (`msetGenericCommand`)** — same `do_prefetch` gate as
  MGET with `stride = 2`. For the NX flag the NX-check loop runs
  `lookupKeyWrite` (which already warmed everything via
  `prefetchKeysBatch`); the SET loop then disables further prefetch
  (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass.
  Going through the full state machine (rather than bucket-only) means
  `dbDictType`'s `prefetchEntryValue` callback runs on a key match —
  warming the old kvobj's payload, which `setKey -> dbReplaceValue ->
  updateKeysizesHist(oldlen, newlen)` then reads to compute the
  histogram delta. The slot dict is re-fetched per batch — in cluster
  mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS`
  + `expireIfNeeded`), so a cached pointer would otherwise dangle.

- **Cross-command batch path (`addCommandToBatch`)** — sets
  `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch,
  even on partial-batch overflow (was: only when ALL keys fit). The
  intra-command path then uniformly skips supplemental prefetching for
  any command the batch touched. Rationale: running both paths
  (cross-command warm + intra-command supplement) caused a measured
  −9.6 % regression on x86 with pipeline-10, and the partial cross-
  command warmup is sufficient for the head of the keyset; the cold
  tail goes through normal lookup, which is still cheaper than running
  the FSM a second time on already-warm keys.

- **Future types**: each dict's `dictType` can register its own
  `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash
  encoding, the field-sds and value-sds payloads), without touching
  `memory_prefetch.c`.

## Benchmark validation

On x86, performance improvements are significant for larger batch sizes:
  - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44%
  - 5Mkeys-string-mget-100B-100keys: +37.33%
  - 5Mkeys-string-mget-100B-30keys: +22.40%
On ARM (Graviton4), the gains are even more pronounced:
  - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34%
  - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76%
Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions.

---------

Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2026-05-11 10:22:25 +08:00
Moti Cohen
5a05863e97
t_string: rewrite SET GET propagation in place (#15114)
Optimize SET key value GET propagation rewriting in setGenericCommand() 
by removing GET arguments in-place with rewriteClientCommandArgument(). 
This avoids the overhead of allocating a new argv vector and 
incrementing reference counts for every retained argument.

The optimization is scoped to the no-expire SET ... GET rewrite path. 
It also adds test coverage for cases with repeated GET tokens to 
ensure robust string semantics and consistent replication behavior.

Changes:
- Use rewriteClientCommandArgument(c, j, NULL) for in-place removal.
- Eliminate redundant argv allocations and refcount increments.
- Improve performance of SET GET in high-throughput write streams.
2026-04-28 10:05:59 +03:00
sggeorgiev
63f02e7876
Fix double ERR prefix in XNACK error replies (#15091)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Several `addReplyError` and `addReplyErrorFormat` calls in
`xnackCommand` included a redundant `"ERR "` prefix in the message
string. Since `addReplyErrorLength` already prepends `-ERR ` to the RESP
reply, clients received `ERR ERR ...` for these error paths.

This PR removes the redundant prefix from all five affected calls and
tightens the corresponding test patterns to match from the beginning of
the error message (`"ERR ..."` instead of `"*...*"`), so any future
double-prefix regression will be caught.
2026-04-22 09:12:04 +03:00
Filipe Oliveira (Redis)
0fa78fd8fd
perf: widen fast_float_strtod fast path to 17-19 digit mantissas (#15061)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
## Root cause

Roughly 50% of random double scores generated by the ZADD listpack
workload have 17-19 significant digits, which exceed
`MAX_MANTISSA_FAST_PATH` (`2^53`). These inputs fall through to the
`strtod()` fallback:

```c
char static_buf[128];
memcpy(buf, nptr, len);           /* memcpy back! */
buf[len] = '\0';                   /* null-term */
double result = strtod(buf, ...);  /* glibc strtod — ~10× slower on ARM */
```

The original C++ `fast_float` library handled the same 17-19 digit
inputs with Eisel-Lemire / bigint arithmetic without falling back to
`strtod()`. That is what the pure-C replacement lost.

## Fix

Compute `mantissa * 10^exponent` in 128-bit integer arithmetic using
`__uint128_t`, then convert to double with a single IEEE
round-to-nearest-even cast. Supported for `|exp| in [0, 19]` where
`10^|exp|` fits in `uint64`; cases outside that range (or otherwise
outside the fast path's preconditions) still fall through to `strtod()`.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2026-04-20 20:45:49 +08:00
Vitah Lin
8aeea8c210
Increase threshold for HPEXPIRETIME persists after RDB reload test (#15047) 2026-04-17 16:36:02 +08:00
Moti Cohen
fa6d4c3d63
Fix SIGABRT in HSETEX when a field appears twice in the FIELDS list (#14956)
HSETEX crashed on assert() with a SIGABRT when the same field appeared
more than once in the FIELDS list and an expiry time was given
(EX/PX/EXAT/PXAT).

Root cause: hfieldPersist() and the KEEP_TTL path in hashTypeSet() both
asserted that dictExpireMeta->expireMeta.trash == 0, meaning the hash
must be globally registered in the HFE DS. This is incorrect during
HSETEX execution because hashTypeSetExDone(), which registers the hash
globally and clears trash, called only at the end of flow. The private
per-field ebuckets are fully valid regardless of the global registration state.

Fix: Remove both incorrect assertions. The operations on the private
ebuckets (ebRemove in hfieldPersist, ebAdd in the KEEP_TTL path) are
correct and do not require the hash to be globally registered.

Tests: Added two regression tests covering the crash scenarios:
- HSETEX EX with a duplicate field (existing field, expiry given)
- HSETEX FNX EX with a duplicate field (no prior field, FNX condition
passes)
2026-04-16 13:16:52 +03:00
Sergei Georgiev
80f1ebda88
Add AGGREGATE COUNT option to ZUNION, ZINTER, ZUNIONSTORE, and ZINTERSTORE (#14892)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
### Overview

This PR adds a new `COUNT` aggregation mode to the `ZUNIONSTORE`,
`ZINTERSTORE`, `ZUNION`, and `ZINTER` sorted set commands. When
`AGGREGATE COUNT` is specified, the resulting score for each element
reflects how many input sets contain it (optionally scaled by
`WEIGHTS`), rather than combining the actual scores of the elements.
This enables a common use case — counting set membership frequency —
directly at the command level, without application-side workarounds.

### Problem Statement

For developers who need to know **how many input sorted sets contain
each element**, there is no single-command solution today.

**Example:** given several game leaderboards, find how many leaderboards
each player appears in.

The existing aggregation modes (`SUM`, `MIN`, `MAX`) all operate on the
elements' scores. To ignore scores and just count set membership, you'd
currently need to copy each sorted set with all scores set to 1, then
run `ZUNIONSTORE`/`ZINTERSTORE` with `SUM` — requiring multiple round
trips, temporary keys, and application-level locking to avoid races.

A `COUNT` aggregation mode solves this directly.

### Solution

Introduces `AGGREGATE COUNT` as a fourth aggregation mode:

- `ZINTER numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE
<SUM | MIN | MAX | COUNT>] [WITHSCORES]`
- `ZINTERSTORE destination numkeys key [key ...] [WEIGHTS weight [weight
...]] [AGGREGATE <SUM | MIN | MAX | COUNT>]`
- `ZUNION numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE
<SUM | MIN | MAX | COUNT>] [WITHSCORES]`
- `ZUNIONSTORE destination numkeys key [key ...] [WEIGHTS weight [weight
...]] [AGGREGATE <SUM | MIN | MAX | COUNT>]`

When `COUNT` is specified, **the scores in the input sets are ignored**.
Note that `WEIGHTS` is **not** ignored — each set contributes its weight
(default 1) per element, and the contributions are summed.

**Implementation details:**

A new helper function `zuiWeightedScore()` computes the per-set
contribution:

```c
inline static double zuiWeightedScore(double score, double weight, int aggregate) {
    return (aggregate == REDIS_AGGR_COUNT) ? weight : weight * score;
}
```

The `zunionInterAggregate()` function treats `COUNT` identically to
`SUM` — it adds the per-set contributions. All four call sites where
`weight * score` was previously computed inline are updated to use
`zuiWeightedScore()`.

### Examples

```
> ZADD s1 1 foo 1 bar
> ZADD s2 2 foo 2 bar
> ZADD s3 3 foo
```

**With `SUM` (existing behavior, for comparison):**

```
> ZINTERSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE SUM
(integer) 1
> ZRANGE t1 0 -1 WITHSCORES
1) "foo"
2) "29"

> ZUNIONSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE SUM
(integer) 2
> ZRANGE t1 0 -1 WITHSCORES
1) "bar"
2) "20"
3) "foo"
4) "29"
```

**With `COUNT` and `WEIGHTS`:**

```
> ZINTERSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE COUNT
(integer) 1
> ZRANGE t1 0 -1 WITHSCORES
1) "foo"
2) "18"

> ZUNIONSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE COUNT
(integer) 2
> ZRANGE t1 0 -1 WITHSCORES
1) "bar"
2) "15"
3) "foo"
4) "18"
```

**With `COUNT` and no specified `WEIGHTS`** — resulting score equals the
number of input sorted sets containing the element:

```
> ZINTERSTORE t1 3 s1 s2 s3 AGGREGATE COUNT
(integer) 1
> ZRANGE t1 0 -1 WITHSCORES
1) "foo"
2) "3"

> ZUNIONSTORE t1 3 s1 s2 s3 AGGREGATE COUNT
(integer) 2
> ZRANGE t1 0 -1 WITHSCORES
1) "bar"
2) "2"
3) "foo"
4) "3"
```

### Backward Compatibility

This is a fully additive change. The new `COUNT` keyword is only
recognized after the `AGGREGATE` token in the four affected commands.
Existing commands, arguments, and default behavior (`AGGREGATE SUM`) are
completely unchanged. No new command is introduced, and no existing
response format is modified.
2026-04-14 09:21:53 +03:00
Moti Cohen
e1d35aca01
Fix HEXPIRE numfields overflow (#15021)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Validate HEXPIRE-family field counts without parser overflow
keep flexible option order; only require fields fit in argv
add tests for INT_MAX numfields across HEXPIRE/HPEXPIRE/HEXPIREAT/HPEXPIREAT
2026-04-13 09:46:46 +03:00
h.o.t. neglected
e8da0e5b47
Fix brittle assert_match patterns for unexpected slowlog fields (#14948) 2026-04-13 14:45:14 +08:00
Sergei Georgiev
0be39e5032
Fix missing consumer propagation on empty XREADGROUP (#14963)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
## Summary

Fixes consumer replication inconsistency when `XREADGROUP` is called for
a new consumer but no `XCLAIM` commands are propagated to the replica.
Previously, consumer creation was only propagated to replicas when
`noack=true`, relying on `XCLAIM` propagation to implicitly create the
consumer in the non-NOACK path. However, if no messages exist to read,
no `XCLAIM` is generated, and the consumer is silently lost on the
replica.

This is a follow-up to the original fix in
[redis/redis#7140](https://github.com/redis/redis/issues/7140) /
[redis/redis#7526](https://github.com/redis/redis/pull/7526), which
introduced `XGROUP CREATECONSUMER` propagation but only for the `NOACK`
case.

## Changes

- **`xreadgroupCommand` (src/t_stream.c):** Replaced the `if (noack)`
guard around the `streamPropagateConsumerCreation()` call with a
deferred check after `streamReplyWithRange()`. Consumer creation is now
propagated when `noack || propCount == 0` — that is, only when no
`XCLAIM` commands were generated. This avoids redundant propagation in
the common case where `XCLAIM` already implicitly creates the consumer
on the replica, while correctly handling both the NOACK path (where
PEL/XCLAIM is skipped entirely) and the no-messages path (where there is
nothing to XCLAIM).
- **Test (tests/unit/type/stream-cgroups.tcl):** Added replication test
`"XREADGROUP propagates new consumer to replica"` that sets up a
master-replica pair and verifies consumer propagation in two cases: (1)
without NOACK when no messages are available to deliver, and (2) with
NOACK when messages are delivered but XCLAIM is skipped.

## Benefits

- **Master-replica consistency:** Consumers created by `XREADGROUP` are
now visible on replicas whenever no `XCLAIM` would otherwise create them
— covering both the NOACK path and the empty-stream path.
- **No redundant propagation:** The noack || propCount == 0 condition
avoids emitting a superfluous XGROUP CREATECONSUMER when XCLAIM commands
are already propagated and would implicitly create the consumer on the
replica.
2026-04-08 14:59:22 +03:00
Sergei Georgiev
747dfe578e
Add XNACK command for releasing stream messages back to the group (#14797)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
### Overview

This PR enhances Redis Streams consumer groups by adding a new `XNACK`
command that allows consumers to explicitly release pending messages
back to the group without acknowledging them. Released (NACKed) entries
become immediately available for re-delivery to other consumers,
eliminating the idle-timeout delay currently required for message
recovery. The command supports three modes — SILENT, FAIL, and FATAL —
giving consumers fine-grained control over delivery counter semantics to
handle graceful shutdowns, transient failures, and poison messages
respectively.

### Problem Statement

For developers using Redis Streams with consumer groups, there are
several common scenarios where a consumer needs to release a message it
has claimed without acknowledging it:

1. **Transient internal failures**: A consumer may fail to process a
message because of problems unrelated to the message itself — for
example, it cannot connect to an external service to fetch required
context. The message is perfectly valid and should be retried promptly
by another consumer.

2. **Resource pressure**: A consumer under resource stress (low CPU, low
memory) may be unable to handle a specific message (e.g., a complex or
large message) within acceptable QoS. It should leave the opportunity to
other consumers in the group, with minimal delay.

3. **Graceful shutdown**: A consumer about to shut down would like to
immediately release all unprocessed messages it has claimed, so they can
be picked up by remaining consumers without waiting for idle timeouts.

4. **Poison / malicious messages**: A consumer may detect or suspect
that a claimed message is invalid or malicious and wants to mark it as
permanently failed (for dead-letter queue processing when available).

**Currently, a consumer cannot NACK a message.** It can either:

- **XACK** it — marks it as "processed" and removes it from the PEL
entirely, losing the ability to redeliver it
- **Leave it pending** — requires other consumers to discover it via
`XPENDING` and claim it via `XCLAIM`/`XAUTOCLAIM` or `XREADGROUP CLAIM`
after the idle timeout expires, introducing a long, unnecessary delay

In all these cases, the logic that applications must implement
introduces **message handling delays**, **implementation complexity**,
and **code duplication** across consumer implementations.

### Solution

Introduces a new `XNACK` (Negative ACKnowledge) command that explicitly
releases pending messages from their owning consumer back to the group's
PEL, making them immediately claimable via `XCLAIM` and `XAUTOCLAIM`,
and prioritized for re-delivery in `XREADGROUP CLAIM`:

```
XNACK key group <SILENT|FAIL|FATAL> IDS numids id [id ...] [RETRYCOUNT count] [FORCE]
```

When executed, the command:

1. **Disassociates** the entry from its owning consumer (`consumer =
NULL`)
2. **Repositions** the entry to the head of the PEL time-ordered list
(`delivery_time = 0`), making it immediately claimable with any
`min-idle-time` threshold
3. **Adjusts the delivery counter** based on the specified mode, giving
consumers fine-grained control over retry semantics
4. **Returns** the count of successfully NACKed entries

**Mode** controls the delivery counter adjustment and communicates the
reason for the NACK:

| Mode | Delivery Counter Behavior | Use Case |

|----------|---------------------------------------------------|---------------------------------------------|
| `SILENT` | Decrement by 1 (undo the delivery increment) | Consumer
shutdown / transient internal error — the delivery "didn't count" |
| `FAIL` | No change (keep the incremented value) | Message too complex
for this consumer, but may work for others — count this as an attempt |
| `FATAL` | Set to `LLONG_MAX` | Invalid / suspected malicious message —
mark as permanently failed |

The three modes map directly to the real-world scenarios above:

- **SILENT** for graceful shutdown or transient failures unrelated to
the message
- **FAIL** for resource-constrained consumers that cannot handle a
specific message
- **FATAL** for poison message detection and dead-letter queue
integration

**Optional parameters:**

- **`RETRYCOUNT count`**: Directly sets `delivery_count` to the
specified value, overriding the mode-based adjustment
- **`FORCE`**: Creates new unowned PEL entries for IDs that are not
already in the group PEL (the entry must exist in the stream). When
`FORCE` creates an entry, the delivery counter is set to `0` (or to
`RETRYCOUNT` if specified, or to `LLONG_MAX` if mode is `FATAL`). This
is used internally for AOF rewrite and replication.

### Response Format

The command returns an integer — the number of messages successfully
NACKed (released back to the group PEL):

```
127.0.0.1:6379> XADD mystream 1-0 f v1
"1-0"
127.0.0.1:6379> XADD mystream 2-0 f v2
"2-0"
127.0.0.1:6379> XGROUP CREATE mystream grp 0
OK
127.0.0.1:6379> XREADGROUP GROUP grp c1 STREAMS mystream >
1) 1) "mystream"
   2) 1) 1) "1-0"
         2) 1) "f"
            2) "v1"
      2) 1) "2-0"
         2) 1) "f"
            2) "v2"
127.0.0.1:6379> XNACK mystream grp FAIL IDS 2 1-0 2-0
(integer) 2
```

After XNACK, the entries appear with an empty consumer in XPENDING
output:

```
127.0.0.1:6379> XPENDING mystream grp - + 10
1) 1) "1-0"
   2) ""
   3) (integer) -1
   4) (integer) 1
2) 1) "2-0"
   2) ""
   3) (integer) -1
   4) (integer) 1
```

### NACK Zone: Data Structure Extension

To support unowned PEL entries and ensure they are prioritized for
re-delivery, a **NACK zone** is introduced at the head of the existing
PEL time-ordered doubly-linked list. A new `pel_nack_tail` pointer is
added to the `streamCG` structure:

**PEL ordering:**

```
[pel_time_head] <-> ... <-> [pel_nack_tail] <-> [owned entries...] <-> [pel_time_tail]
|_____________ NACK zone ______________|   |_______ normal PEL ________|
```

The head of the PEL contains all NACKed messages (FIFO-ordered),
followed by all delivered messages that were not NACKed (same order as
today). This ensures NACKed messages are always prioritized over idle
pending messages.

The delivery order for `XREADGROUP` is therefore:
1. If `CLAIM` was specified: first deliver NACKed messages, then deliver
due pending messages (current behavior)
2. Deliver new entries after the group's last-delivered-id (current
behavior)

**Structure Design:**

- NACKed entries occupy positions from `pel_time_head` to
`pel_nack_tail` in the time-ordered list
- Their `delivery_time` is set to `0`, ensuring they always appear
"oldest" and are immediately claimable
- Their `consumer` pointer is set to `NULL`, marking them as unowned
- `pel_nack_tail` is `NULL` when no NACKed entries exist

**Key Properties:**

- **O(1) insertion**: New NACKed entries are inserted right after
`pel_nack_tail` (or at the list head if the zone is empty)
- **FIFO ordering** among NACKed entries: entries are NACKed in the
order they are released
- **Immediate claimability**: Since `delivery_time = 0`, NACKed entries
have maximum idle time and satisfy any `min-idle-time` threshold in
`XCLAIM` and `XAUTOCLAIM`, In `XREADGROUP CLAIM`, NACKed entries are
also prioritized over other pending entries due to their position at the
head of the PEL.
- **Zone integrity**: The `pelListInsertSorted` function is updated to
stop scanning at the `pel_nack_tail` boundary, ensuring owned entries
are never placed inside the NACK zone

### Impact on Existing Commands

All commands that interact with the PEL are updated to handle unowned
(`consumer = NULL`) entries:

- **XPENDING**: Shows NACKed entries with an empty consumer name
- **XCLAIM / XAUTOCLAIM**: Can claim NACKed entries (they satisfy any
min-idle-time since `delivery_time = 0`)
- **XREADGROUP CLAIM**: NACKed entries are picked up by the claim phase
- **XACK**: Works correctly on NACKed entries (removes from group PEL)
- **XINFO STREAM FULL**: Displays NACKed entries with an empty consumer
name
- **XGROUP DELCONSUMER**: Unaffected — NACKed entries are not in any
consumer's PEL

Propagation is also updated: when `XCLAIM` or `XAUTOCLAIM` encounters a
deleted stream entry for an unowned NACK, it propagates `XACK` (instead
of `XCLAIM`) to replicas and AOF, since there is no source consumer to
reference.

### Persistence

**RDB:**

- A new RDB type `RDB_TYPE_STREAM_LISTPACKS_5` (type 27) is introduced
- After saving consumer PEL entries, the NACK zone stream IDs are saved
separately (count + encoded IDs)
- On load, NACK zone entries are reconstructed by looking them up in the
group PEL, unlinking from their sorted position, and re-inserting into
the NACK zone via `pelListInsertNacked`
- Backward compatibility is preserved: old RDB types continue to load
with the existing validation (all entries must have consumers)

**AOF:**

- AOF rewrite emits `XNACK <key> <group> FAIL IDS <n> <id...> RETRYCOUNT
<cnt> FORCE` commands for entries in the NACK zone
- Consecutive entries with the same `delivery_count` are batched into a
single command (up to `AOF_REWRITE_ITEMS_PER_CMD` IDs per command)

### Defragmentation

The defragmentation logic is restructured to handle unowned entries:

- **`defragStreamCGPendingEntry`** (new): Walks the group-level PEL rax,
defragments each NACK, updates the doubly-linked list pointers
(`pel_prev`, `pel_next`), `pel_time_head`, `pel_time_tail`,
`pel_nack_tail`, and the consumer PEL back-pointer for owned entries
- **`defragStreamConsumerPendingEntry`** (simplified): Only fixes up
back-pointers to the possibly-relocated consumer and CG, since actual
defragmentation is now done at the group-level walk. Unowned (NACK zone)
entries have no consumer PEL walk, so the group-level pass is their only
chance

### Key Benefits

- **Immediate re-delivery**: NACKed entries are instantly claimable by
other consumers via `XCLAIM` and `XAUTOCLAIM` (since `delivery_time = 0`
satisfies any `min-idle-time`), and prioritized for re-delivery in
`XREADGROUP CLAIM`, eliminating idle-time delays that can range from
seconds to minutes
- **Explicit release semantics**: Consumers can release messages
intentionally, with fine-grained control over retry behavior — a
capability that exists in competing systems like RabbitMQ
- **Flexible retry control**: Three modes (SILENT, FAIL, FATAL) plus
RETRYCOUNT cover the full spectrum of failure handling strategies, from
graceful shutdown to poison message detection
- **Reduced application complexity**: Eliminates the need for
application-level workarounds involving XPENDING polling, arbitrary idle
timeouts, and manual XCLAIM orchestration
- **Dead-letter queue readiness**: FATAL mode + delivery count enables
straightforward poison message detection and future DLQ integration
- **Backward compatibility**: Fully optional new command with zero
breaking changes to existing behavior
2026-04-07 14:17:53 +03:00
debing.sun
a6a27f56f2
Test tcp deadlock fixes (#14946)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
This fix follows #14667 and #14886

Several tests pipelined large numbers of commands on deferring clients
without draining replies. That can fill buffers and stall progress.

Fix by draining replies every 500 pipelined requests to avoid TCP
stalls.

---------

Co-authored-by: oranagra <oran@redislabs.com>
2026-03-30 19:50:47 +08:00
Guimu
2ba0194fbe
Fix memory leak in ZDIFF algorithm 2 on early exit (#14932)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
## Problem

`zdiffAlgorithm2()` can break out early once the destination cardinality
reaches zero. In that path, a temporary SDS created by
`zuiSdsFromValue()` is left dirty and never released, because the
cleanup normally happens on the next `zuiNext()` call which is skipped
due to the early `break`.

`zuiClearIterator()` called after the loop does **not** clean up dirty
values — only `zuiNext()` or explicit `zuiDiscardDirtyValue()` does.

## Fix

Add `zuiDiscardDirtyValue(&zval)` before the early `break` to ensure the
temporary SDS is freed on all exit paths.
2026-03-27 22:09:57 +08:00
Moti Cohen
f4d176b3b7
KEYSIZES/ASM: simplify histograms, fix background trim, and refactor debug assertions (#14877)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
- Simplify KEYSIZES tracking and saving memory by removing per‑slot histogram state in kvstoreDictMetadata and routing all updates through kvsUpdateHistogram + updateKeysizesHist(db, type, ...) (per‑DB only).
- Fix KEYSIZES consistency during ASM background trim by introducing asmTrimCtx, passing it through emptyDbDataAsync/BIO lazy‑free, computing histogram deltas in the background, and applying them on completion; add bg_trim_running to coordinate with validation.
- Refactor and relax debug assertions into a unified dbg_assert_flags bitmask and dbgRunAssertions(db), and skip KEYSIZES/ALLOCSIZE checks during nested execution, RDB load, ASM import, and ASM background trim.
- Update commands, module APIs, tests (including UNLINK async deletion coverage), and daily CI workflow to reflect the new histogram behavior and re‑enable ASM/cluster tests.
- Revise the daily CI “debug-assert-keyspace” workflow to run ASM and slot-stats unit tests again

Potential edge case with histogram accuracy:
The histogram could become inaccurate if the database is flushed while ASM trim is running in the background. I considered adding a generation counter to detect this, but decided against it since this is purely an INFO/diagnostic feature and the edge case is quite rare. The target_kvstore pointer check prevents applying stale deltas to the wrong kvstore, and if the histogram does become incorrect we have debugServerAssert() to catch negative values - it won't cause crashes in production. This is a known limitation we can document and revisit if it becomes a real issue in practice.
2026-03-22 15:26:06 +02:00
Cong Chen
9accf8bd24
Refine error message for HSETEX command with PERSIST (#14880)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
Fixes #14879 

> PERSIST can only be given for HGETEX and KEEPTTL for HSETEX but the
code doesn’t verify it.

**Behavior currently:**
 ```
127.0.0.1:5555> HSETEX h ex 100 persist fields 1 k v
(error) ERR Only one of EX, PX, EXAT, PXAT or KEEPTTL arguments can be
specified
```

**With this PR, behavior would be:**

```
127.0.0.1:5555> HSETEX h ex 100 persist fields 1 k v
(error) ERR unknown argument: persist
```
2026-03-20 10:19:00 +08:00
Sergei Georgiev
462e603a1f
Fix stream_idmp_keys missing from database lifecycle ops (#14897)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
**Summary**

Ensures `db->stream_idmp_keys` is managed consistently with `keys`,
`expires`, and `subexpires` across every database lifecycle operation —
flush, swap, temp-DB init/discard, lazy-free, and cluster slot
migration.

Without this fix, the dict was absent from these code paths, causing
three classes of bugs: a SIGSEGV during diskless replication in `swapdb`
mode (NULL pointer dereference in `initTempDb`), silently lost IDMP
tracking after `SWAPDB` and diskless replication (pointers never
swapped), and stale IDMP entries surviving `FLUSHDB` (dict never
cleared).

**Changes**

- **`emptyDbStructure`** (`src/db.c`) — Clear `stream_idmp_keys` on
flush so stale entries don't persist across `FLUSHDB`/`FLUSHALL`.
- **`initTempDb` / `discardTempDb`** (`src/db.c`) — Create and release
`stream_idmp_keys` for temp databases used during diskless replication
RDB load. Fixes the SIGSEGV when `rdbLoadRioWithLoadingCtx` calls
`dictAddRaw` on a NULL pointer.
- **`dbSwapDatabases`** (`src/db.c`) — Swap `stream_idmp_keys` alongside
the other per-DB dicts so IDMP tracking follows the data during
`SWAPDB`.
- **`swapMainDbWithTempDb`** (`src/db.c`) — Swap `stream_idmp_keys` when
promoting a temp database after diskless replication, so the cron can
discover and expire IDMP entries on replicas.
- **`streamMoveIdmpKeys`** (`src/db.c`) — New helper that migrates IDMP
entries by slot from one dict to another, used during cluster slot
migration.
- **`emptyDbAsync` / `emptyDbDataAsync`** (`src/lazyfree.c`) — Replace
the dict with a fresh one and hand the old one to the background
lazy-free job.
- **`lazyfreeFreeDatabase`** (`src/lazyfree.c`) — Free
`stream_idmp_keys` in the background job (now takes 4 args instead of
3).
- **`asmTriggerBackgroundTrim`** (`src/cluster_asm.c`) — Move matching
IDMP entries by slot into a temporary dict during background trim, then
pass it to `emptyDbDataAsync` for async cleanup.
- **`server.h`** — Updated `emptyDbDataAsync` signature; added
`streamMoveIdmpKeys` declaration.
- **Tests** (`tests/unit/type/stream.tcl`) — Four new integration tests
covering IDMP expiry after `SAVE` + restart, tracking survival across
`SWAPDB`, cleanup on `FLUSHDB`, and diskless replication in `swapdb`
mode (both `rdbchannel=yes` and `rdbchannel=no`).

**What this fixes**

- **No crash on diskless replication** — `initTempDb` now initializes
the dict, eliminating the NULL dereference during RDB load.
- **Tracking preserved across swaps** — Both `SWAPDB` and diskless
replication correctly transfer the dict, so the cron keeps expiring
entries in the right database.
- **Clean state after flush** — `FLUSHDB`/`FLUSHALL` clear the dict,
preventing ghost entries from interfering with new streams.
- **Correct cluster migration cleanup** — IDMP entries for migrated
slots are moved and freed alongside the key data.
2026-03-19 14:37:19 +02:00
Cong Chen
1b615c774d
Fix FIELDS argument validation in HSETEX/HGETEX (#14883)
Fixes #14879 

Improve validation of the FIELDS argument in HSETEX and HGETEX to ensure
exactly one field is provided, rejecting both missing and multiple
fields with consistent and accurate error messages.

Align behavior across both commands.
2026-03-19 19:54:22 +08:00
Sergei Georgiev
ee376cdc3c
Fix IDMP cron expiration not working after RDB load (#14869)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
**Summary**

Registers streams with IDMP producers in `db->stream_idmp_keys` during
RDB load, so that the periodic cron job can expire stale IDMP entries
after a server restart. Without this fix, streams loaded from RDB were
never registered, causing IDMP entries to accumulate indefinitely and
never be cleaned up by the cron.

**Changes**

- **`rdbLoadRioWithLoadingCtx` (src/rdb.c):** After loading each
key-value pair, added a check for `OBJ_STREAM` type with a non-NULL
`idmp_producers` rax. When found, creates a string object for the key
and inserts it into `db->stream_idmp_keys` via `dictAddRaw`. If the key
already exists (duplicate insert), the reference is decremented to avoid
a leak.
- **Test `XADD IDMP cron expiration works after RDB load`
(tests/unit/type/stream.tcl):** Added an integration test that creates a
stream with IDMP producers, sets a short `IDMP-DURATION`, saves and
restarts the server, then verifies that the cron eventually expires the
entries (pids-tracked and iids-tracked drop to 0) and that
previously-expired IIDs can be re-added as new entries.

**Benefits**

- **Cron expiration works after restart:** Streams with IDMP producers
are now discoverable by the periodic cleanup cron after an RDB load,
ensuring expired entries are reaped as expected.
- **No memory leak on stale entries:** Without registration, IDMP
entries that outlived their duration would persist in memory forever
after a restart, growing unboundedly. This fix ensures they are cleaned
up.
- **Consistency between runtime and post-load behavior:** IDMP
expiration now behaves identically whether the stream was created at
runtime or restored from an RDB snapshot.
2026-03-13 14:00:28 +02:00
Vitah Lin
62059a2438
Chore complete Tcl 9 support and fix regressions in test suite (#14845)
## Problem

PR https://github.com/redis/redis/pull/14787 introduced **Tcl 9 support
for the test suite**, but it still fails on my machine (**macOS 26.3,
Tcl 9.0.3**). Some tests fail and the runner may hang.

Example:

```bash
$ tclsh <<<'puts $tcl_version'
9.0.3

$ make test
[err]: BITCOUNT against test vector #2
Expected [r bitcount str] == 4
```

This is caused by **behavior changes in Tcl 9**, including:

- `string length` returning **character count** instead of **byte
count**
- binary sockets rejecting characters with code points **>255**
- differences in `string is wideinteger`

Parts of the Redis Tcl test framework rely on **byte-oriented
behavior**, which breaks under Tcl 9.



## Changes

### 1. Fix IPC payload encoding in test runner

`tests/unit/memefficiency.tcl` contains a **non-ASCII quote character**:


fe16003e66/tests/unit/memefficiency.tcl (L826)

Under Tcl 9 this can corrupt the IPC stream when the test runner
serializes Tcl code blocks, causing the runner to hang.

Instead of fixing only this character, the IPC payload is now explicitly
encoded using:

```tcl
encoding convertto utf-8
```

This makes the protocol robust against future non-ASCII characters.



### 2. Avoid Tcl 9 glob backtracking issue

Replaced:

```tcl
string match "*[^\u0000-\u00ff]*" $a
```

with:

```tcl
regexp {[^\u0000-\u00ff]} $a
```

This avoids a **catastrophic backtracking issue in Tcl 9's glob
matcher** while preserving the same behavior.


### 3. Update DIGEST validation

The previous check relied on:`string is wideinteger` which behaves
differently in Tcl 9.

The assertion now validates the **expected DIGEST format directly**.
2026-03-06 13:27:56 +08:00
Yuan Wang
dd81afa2c8
Do not add a key into db if it has a past expiration time (#14784)
For **RESTORE** and **SET** command, if the expiration time is already
elapsed, we skip adding it to the DB. However, we still increment the
`expiredkeys` counter. From stats perspective we behave as if we
inserted a new key (possibly an overwrite) and later expired it, but
from the per-key KSN observability, we reflect what we've actually done
in the db (deletion of old key, and no insertion of new one), so we
don't confuse modules.

**Changes**:
- **RESTORE**: Increment the `expiredkeys` counter if the key has an
expiration time in the past.
- **SET**: If the key has an expiration time in the past, do not add it
to the main DB, increment the `expiredkeys` counter instead. And we also
delete the old key if it exists.

For reference, **EXPIREAT** with TTL in the past, which implicitly
deletes the key and return success. Now **SET** command has the same
behavior.
2026-02-25 20:44:13 +08:00
Sergei Georgiev
1f7bcecc94
Add XIDMPRECORD command and AOFRW emission to restore stream IDMP state (#14794)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
## Summary

- Introduce a new XIDMPRECORD <key> <pid> <iid> <streamID> command that
attaches IDMP (idempotency) metadata — a producer ID and an idempotency
ID — to an existing stream message. This is an internal command used to
replay IDMP state during AOF loading.
- Modify rewriteStreamObject() in AOF rewrite (AOFRW) to emit
XIDMPRECORD commands after the XADD and consumer-group commands,
ensuring IDMP deduplication state is fully preserved across AOF rewrites
and server restarts.
- Additionally emit XCFGSET during AOFRW when per-stream IDMP
configuration (IDMP-DURATION, IDMP-MAXSIZE) differs from server
defaults, so custom settings survive rewrites.

## New command: XIDMPRECORD
Syntax: `XIDMPRECORD <key> <pid> <iid> <streamID>`

- Validates that the key exists, is a stream, the stream ID refers to an
existing (non-deleted) entry, and both pid and iid are non-empty.
- If the (pid, iid) pair already maps to the same stream ID, the command
is idempotent and returns OK.
- If the (pid, iid) pair already maps to a different stream ID, an error
is returned.

## AOF rewrite changes (src/aof.c)

- New helper rioWriteStreamIdmpEntry() writes a single XIDMPRECORD bulk
command to the AOF rio.
- After consumer groups are emitted, iterates all producers in
s->idmp_producers and emits an XIDMPRECORD for every IDMP entry (in
linked-list insertion order).
- Emits XCFGSET for per-stream IDMP config when it diverges from server
defaults.
2026-02-24 11:51:48 +02:00
Martin Dimitrov
98328ae2ec
Pause dict auto-resize during multi-field deletion (#14783)
The idea comes directly from ValKey:
https://github.com/valkey-io/valkey/pull/3144

Deleting many fields from a hash/zset/set stored as a dict can trigger
repeated shrink/rehash work during the loop.

---------

Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
2026-02-14 14:45:54 +08:00
debing.sun
18538461d1
Add separate statistics for active expiration of keys and hash fields (#14727)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
### Summary

Adds `expired_keys_active` and `expired_subkeys_active` counters to
track keys and hash fields expired by the active expiration cycle,
distinguishing them from lazy expirations.
These new metrics are exposed in INFO stats output.

### Motivation

Currently, Redis tracks the total number of expired keys (expired_keys)
and expired hash fields (expired_subkeys), but there's no way to
differentiate between expirations triggered by active expire and lazy
expire.

---------

Co-authored-by: Moti Cohen <moti.cohen@redis.com>
2026-01-22 22:30:25 +08:00
Sergei Georgiev
221409788a
Add idempotency support to XADD via IDMPAUTO and IDMP parameters (#14615)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
# Overview

This PR introduces idempotency support to Redis Streams' XADD command,
enabling automatic deduplication of duplicate message submissions
through optional IDMPAUTO and IDMP parameters with producer
identification. This enables reliable at-least-once delivery while
preventing duplicate entries in streams.

## Problem Statement

Current Redis Streams implementations lack built-in idempotency
mechanisms, making reliable at-least-once delivery impossible without
accepting duplicates:

- **Application-level tracking**: Developers must maintain separate data
structures to track submitted messages
- **Race conditions**: Network failures and retries can result in
duplicate stream entries
- **Complexity overhead**: Each producer must implement custom
deduplication logic
- **Memory inefficiency**: External deduplication systems duplicate
Redis's storage capabilities

This lack of native idempotency support creates reliability challenges
in distributed systems where at-least-once delivery semantics are
required but exactly-once processing is desired.

## Solution

Extends XADD with optional idempotency parameters that include producer
identification:

```
XADD key [NOMKSTREAM] [KEEPREF | DELREF | ACKED] [IDMPAUTO pid | IDMP pid iid] [MAXLEN | MINID [= | ~] threshold [LIMIT count]] <* | id> field value [field value ...]
```

### Producer ID (pid)

- **pid** (producer id): A unique identifier for each producer
- Must be unique per producer instance
- Producers must use the same pid after restart to access their
persisted idempotency tracking
- Enables per-producer idempotency tracking, isolating duplicate
detection between different producers

**Format**: Binary or string, recommended max 36 bytes

**Generation**: 
- **Recommended**: UUID v4 for globally unique identification
- **Alternative**: `hostname:process_id` or application-assigned IDs

### Idempotency Modes

**IDMPAUTO pid (Automatic Idempotency)**:

- Producer specifies its pid, Redis automatically calculates a unique
idempotent ID (iid) based on entry content
- Hash calculation combines XXH128 hashing of individual field-value
pairs using an order-independent Sum + XOR approach with rotation (each
pair: `XXH128(field || field_length || value)`)
- 16-byte binary iid with extremely low accidental collision probability
- XXH128 is a non-cryptographic hash function: fast and
well-distributed, but does NOT prevent intentional collision attacks
- For protection against adversarial collision crafting, use IDMP mode
with cryptographically-signed idempotent IDs
- Order-independent: field ordering does not affect the calculated iid
- If (pid, iid) pair exists in producer's IDMP map: returns existing
entry ID without creating duplicate entry
- Generally slower than manual mode due to hash calculation overhead

**IDMP pid iid (Manual Idempotency)**:

- Caller provides explicit producer id (pid) and idempotent ID (iid) for
deduplication
- iid must be unique per message (either globally or per pid)
- Faster processing than IDMPAUTO (no hash calculation overhead)
- Enables shorter iids for reduced memory footprint
- If (pid, iid) pair exists in producer's IDMP map: returns existing
entry ID without comparing field contents
- Caller responsible for iid uniqueness and consistency across retries

Both modes can only be specified when entry ID is `*` (auto-generated).

### Deduplication Logic

When XADD is called with idempotency parameters:

1. Redis checks if the message was recently added to the stream based on
the (pid, iid) pair
2. If the (pid, iid) pair matches a recently-seen pair for that
producer, the message is assumed to be identical
3. No duplicate message is added to the stream; the existing entry ID is
returned
4. With **IDMP pid iid**: Redis does not compare the specified fields
and their values—two messages with the same (pid, iid) are assumed
identical
5. With **IDMPAUTO pid**: Redis calculates the iid from message content
and checks for duplicates

## IDMP Map: Per-Producer Time and Capacity-Based Expiration

Each producer with idempotency enabled maintains its own isolated IDMP
map (iid → entry_id) with dual expiration criteria:

**Time-based expiration (duration)**:

- Each iid expires automatically after duration seconds from insertion
- Provides operational guarantee: Redis will not forget an iid before
duration elapses (unless capacity reached)
- Configurable per-stream via XCFGSET

**Capacity-based expiration (maxsize)**:

- Each producer's map enforces maximum capacity of maxsize entries
- When capacity reached, oldest iids for that producer are evicted
regardless of remaining duration
- Prevents unbounded memory growth during extended usage

### Configuration Commands

**XINFO STREAM**: View current configuration and metrics

Use `XINFO STREAM key` to retrieve idempotency configuration
(idmp-duration, idmp-maxsize) along with tracking metrics.

**XCFGSET**: Configure expiration parameters

```
XCFGSET key [IDMP-DURATION duration] [IDMP-MAXSIZE maxsize]
```

- **duration**: Seconds to retain each iid (range: 1- 86400 seconds)
- **maxsize**: Maximum iids to track per producer (range: 1-10,000
entries)
- Calling XCFGSET clears all existing producer IDMP maps for the stream

**Default Configuration** (when XCFGSET not called):

- Duration: 100 seconds
- Maxsize: 100 iids per producer
- Runtime configurable via: `stream-idmp-duration` and
`stream-idmp-maxsize`

## Response Behavior

**On first submission** (pid, iid) pair not in producer's map:

- Entry added to stream with generated entry ID
- (pid, iid) pair stored in producer's IDMP map with current timestamp
- Returns new entry ID

**On duplicate submission** (pid, iid) pair exists in producer's map:

- No entry added to stream
- Returns existing entry ID from producer's IDMP map
- Identical response to original submission (client cannot distinguish)

## Stream Metadata

XINFO STREAM extended with idempotency metrics and configuration:

- **idmp-duration**: The duration value (in seconds) configured for the
stream's IDMP map
- **idmp-maxsize**: The maxsize value configured for the stream's IDMP
map
- **pids-tracked**: Current number of producers with active IDMP maps
- **iids-tracked**: Current total number of iids across all producers'
IDMP maps (reflects active iids that haven't expired or been evicted)
- **iids-added**: Lifetime cumulative count of entries added with
idempotency parameters
- **iids-duplicates**: Lifetime cumulative count of duplicate iids
detected across all producers

## Persistence and Restart Behavior

**IDMP maps are fully persisted and restored across Redis restarts**:

- **RDB/AOF**: All pid-iid pairs, timestamps, and configuration are
included in snapshots and AOF logs
- **Recovery**: On restart, all tracked (pid, iid) pairs remain valid
and operational
- **Producer Requirement**: Producers must reuse the same pid after
restart to access their persisted IDMP map
- **Configuration**: Stream-level settings (duration, maxsize) persist
across restarts
- **Important**: Calling XCFGSET after restart clears restored IDMP maps
(same behavior as during runtime)

## Key Benefits

- **Enables At-most-once Producer Semantics**: Makes it possible to
safely retry message submissions without creating duplicates
- **Automatic Retry Safety**: Network failures and retries cannot create
duplicate entries
- **Producer Isolation**: Each producer maintains independent
idempotency tracking
- **Memory Efficient**: Time and capacity-based expiration per producer
prevents unbounded growth
- **Flexible Implementation**: Choose automatic (IDMPAUTO) or manual
(IDMP) based on performance needs
- **Backward Compatible**: Fully optional parameters with zero impact on
existing XADD behavior
- **Collision Resistant**: XXH128 with Sum + XOR combination and
field-length separators provides high-quality non-cryptographic hashing
for IDMPAUTO with extremely low collision probability and prevents
ambiguous concatenation attacks
2026-01-15 21:58:44 +08:00
Stav-Levi
73249497d4
Fix ACL key-pattern bypass in MSETEX command (#14659)
MSETEX doesn't properly check ACL key permissions for all keys - only
the first key is validated.

MSETEX arguments look like: MSETEX <numkeys> key1 val1 key2 val2 ... EX
seconds

Keys are at every 2nd position (step=2). When Redis extracts keys for
ACL checking, it calculates where the last key is:

last = first + numkeys - 1;        => calculation ignores step
last = first + (numkeys-1) * step; 
With 2 keys starting at position 2:

Bug: last = 2 + 2 - 1 = 3 → only checks position 2
Fix: last = 2 + (2-1)*2 = 4 → checks positions 2 and 4

Fixes #14657
2026-01-08 08:41:55 +02:00
debing.sun
9ca860be9e
Fix XTRIM/XADD with approx not deletes entries for DELREF/ACKED strategies (#14623)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
This bug was introduced by #14130 and found by guybe7 

When using XTRIM/XADD with approx mode (~) and DELREF/ACKED delete
strategies, if a node was eligible for removal but couldn't be removed
directly (because consumer group references need to be checked), the
code would incorrectly break out of the loop instead of continuing to
process entries within the node. This fix allows the per-entry deletion
logic to execute for eligible nodes when using non-KEEPREF strategies.
2026-01-05 21:17:36 +08:00
Stav-Levi
23aca15c8c
Fix the flexibility of argument positions in the Redis API's (#14416)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
This PR implements flexible keyword-based argument parsing for all 12
hash field expiration commands, allowing users to specify arguments in
any logical order rather than being constrained by rigid positional
requirements.
This enhancement follows Redis's modern design of keyword-based flexible
argument ordering and significantly improves user experience.

Commands with Flexible Parsing
HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETEX, HSETEX

some examples: 
HEXPIRE: 
* All these are equivalent and valid:
HEXPIRE key EX 60 NX FIELDS 2 f1 f2
HEXPIRE key NX EX 60 FIELDS 2 f1 f2  
HEXPIRE key FIELDS 2 f1 f2 EX 60 NX
HEXPIRE key FIELDS 2 f1 f2 NX EX 60
HEXPIRE key NX FIELDS 2 f1 f2 EX 60

HGETEX:
* All these are equivalent and valid:
HGETEX key EX 60 FIELDS 2 f1 f2
HGETEX key FIELDS 2 f1 f2 EX 60

HSETEX:
* All these are equivalent and valid:
HSETEX key FNX EX 60 FIELDS 2 f1 v1 f2 v2
HSETEX key EX 60 FNX FIELDS 2 f1 v1 f2 v2
HSETEX key FIELDS 2 f1 v1 f2 v2 FNX EX 60
HSETEX key FIELDS 2 f1 v1 f2 v2 EX 60 FNX
HSETEX key FNX FIELDS 2 f1 v1 f2 v2 EX 60
2025-12-14 09:35:12 +02:00
debing.sun
bb6389e823
Fix min_cgroup_last_id cache not updated when destroying consumer group (#14552)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
## Problem

When destroying a consumer group with `XGROUP DESTROY`, the cached
`min_cgroup_last_id` was not being invalidated. This caused incorrect
behavior when using `XDELEX` with the `ACKED` option, as the cache still
referenced the destroyed group's `last_id`.

## Solution

Invalidate the `min_cgroup_last_id` cache when the destroyed group's
`last_id` equals the cached minimum. The cache will be recalculated on
the next call to `streamEntryIsReferenced()`.

---------

Co-authored-by: guybe7 <guy.benoish@redislabs.com>
2025-11-21 22:37:17 +08:00
Oran Agra
0a6eacff1f
Add variable key-spec flags to SET IF* and DELEX (#14529)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
These commands behave as DEL and SET (blindly Remove or Overwrite) when
they don't get IF* flags, and require the value of the key when they do
run with these flags.

Making sure they have the VARIABLE_FLAGS flag, and getKeysProc that can
provide the right flags depending on the arguments used. (the plain
flags when arguments are unknown are the common denominator ones)

Move lookupKey call in DELEX to avoid double lookup, which also means
(some, namely arity) syntax errors are checked (and reported) before
checking the existence of the key.
2025-11-12 11:36:10 +02:00
Sergei Georgiev
90ba7ba4dc
Fix XREADGROUP CLAIM to return delivery metadata as integers (#14524)
### Problem
The XREADGROUP command with CLAIM parameter incorrectly returns delivery
metadata (idle time and delivery count) as strings instead of integers,
contradicting the Redis specification.

### Solution
Updated the XREADGROUP CLAIM implementation to return delivery metadata
fields as integers, aligning with the documented specification and
maintaining consistency with Redis response conventions.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-11-11 19:05:22 +08:00
Moti Cohen
d25e582a17
Fix flaky test of hfe persist rdb reload (#14525)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
So far occured once on daily in the test-sanitizer-address job
2025-11-10 17:15:37 +02:00
Moti Cohen
189b7609f5
Add hfe rdb load test (#14511)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
Verify that following RDB load fields keep their expiration time.
Verify that hashes that had HFEs not counted following rdb load in
subexpiry (by command `info keyspace`)
2025-11-09 09:49:54 +02:00
debing.sun
7f1bafc922 Fix XACKDEL stack overflow when IDs exceed STREAMID_STATIC_VECTOR_LEN (CVE-2025-62507)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
This issue was introduced by redis/redis#14130.
The problem is that when the number of IDs exceeds STREAMID_STATIC_VECTOR_LEN (8), the code forgot to reallocate memory for the IDs array, which causes a stack overflow.
2025-11-05 15:33:34 +02:00
sggeorgiev
3e2003ee0f Fix HGETEX out-of-bounds read when FIELDS option missing numfields argument
When the HGETEX command is used with the FIELDS option but without the required
numfields argument, the server would attempt to access an out-of-bounds argv index.

This PR adds a check to ensure numfields is present before accessing it,
returning an error if it is missing. Also includes a test case to cover this scenario.
2025-11-05 15:33:34 +02:00
debing.sun
e436a0e548
Enforce 16-char hex digest length and case-insensitive comparison for IFDEQ/IFDNE (#14502)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
Fix https://github.com/redis/redis/issues/14496

This PR makes the following changes:
- DIGEST: Always return 16 hex characters with leading zeros
  Example: "00006c38adf31777" instead of "6c38adf31777"

- IFDEQ/IFDNE: Validate the digest must be exactly 16 characters

- IFDEQ/IFDNE: Use strcasecmp for case-insensitive hex comparison
  Both uppercase and lowercase hex digits now work identically

---------

Co-authored-by: Marc Gravell <marc.gravell@gmail.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2025-11-03 16:59:50 +08:00
debing.sun
379fec1426
Use fixed position keys parameter for MSETEX command (#14470)
In PR https://github.com/redis/redis/pull/14434, we made the keys
parameter flexible, meaning it could appear anywhere among the command
arguments. However, this also made key parsing more complex, since we
could no longer determine the fixed position of key arguments.
Therefore, in this PR, we reverted it back to using fixed positions for
the keys.

And also fix this
[comment](https://github.com/redis/redis/pull/14434#discussion_r2459282563).

---------

Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2025-10-27 17:20:29 +08:00
Stav-Levi
52ea47b792
Add MSETEX command (#14434)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Introduce a new command MSETEX to set multiple string keys with a shared
expiration in a single atomic operation. Also with flexible argument
parsing.

Syntax:
MSETEX KEYS numkeys key value [key value …] [XX | NX] [EX seconds | PX
milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds |
KEEPTTL]

Sets the given keys to their respective values.
This command is an extension of the MSETNX that adds expiration and XX
options.

Options: 
EX seconds - Set the specified expiration time, in seconds
PX milliseconds - Set the specified expiration time, in milliseconds
EXAT timestamp-seconds - Set the specified Unix time at which the keys
will expire, in seconds
PXAT timestamp-milliseconds - Set the specified Unix time at which the
keys will expire, in milliseconds
KEEPTTL - Retain the time to live associated with the keys
XX - Only set the keys and their expiration if all already exist
NX - Only set the keys and their expiration if none exist

Flexible Argument Parsing examples:
  - MSETEX EX 10 KEYS 2 k1 v1 k2 v2
  - MSETEX KEYS 2 k1 v1 k2 v2 NX PX 5000
  - MSETEX NX EX 10 KEYS 2 k1 v1 k2 v2
  
Return Values:
Integer reply: 1 - All keys were set successfully
Integer reply: 0 - No keys were set (due to NX/XX conditions)
Error reply - Syntax error or invalid arguments
2025-10-23 19:12:02 +03:00
sggeorgiev
090ca801ea
Add CLAIM parameter to XREADGROUP for automatic pending entry claiming (#14402)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
## Overview
This PR enhances Redis Streams consumer groups by adding an optional
CLAIM parameter to the `XREADGROUP` command, enabling automatic claiming
of idle pending entries alongside normal message consumption in a single
operation.
## Problem Statement
Current Redis Streams consumer group implementations require developers
to manually orchestrate multiple commands to handle both pending and new
entries:

- `XPENDING` to discover idle pending entries
- `XCLAIM/XAUTOCLAIM` to claim idle entries
- `XREADGROUP` to consume new entries

This multi-command approach creates:
- **Performance overhead** from multiple round trips to Redis
- **Implementation complexity**, particularly when working with multiple
streams
- **Code duplication** across consumer implementations

## Solution
Extends XREADGROUP with a new optional CLAIM parameter:
`XREADGROUP GROUP group consumer [COUNT count] [BLOCK milliseconds]
[NOACK] [CLAIM min-idle-time] STREAMS key [key ...] id [id ...]`

When CLAIM min-idle-time is specified, the command operates in two
phases:

1. **Claim Phase:** Automatically claims pending entries idle for ≥
min-idle-time milliseconds
2. **Read Phase:** Processes new entries if the COUNT limit hasn't been
reached

## Response Format Changes
When the CLAIM option is used, the response format is extended to
include delivery metadata for each entry:
**Standard XREADGROUP response (without CLAIM):**
```
127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 STREAMS mystream >
1) 1) "mystream"
   2) 1) 1) "1609459200000-0"
         2) 1) "field1"
            2) "value1"
```
**XREADGROUP response with CLAIM:**
```
127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 CLAIM 30000 STREAMS mystream >
1) 1) "mystream"
   2) 1) 1) "1609459200000-0"
         2) 1) "field1"
            2) "value1"
         3) 15000
         4) 3
```
**Response structure with CLAIM:**

- **Field 1:** Stream entry ID (unchanged)
- **Field 2:** Field-value pairs (unchanged)
- **Field 3:** Idle time in milliseconds - the number of milliseconds
elapsed since this entry was last delivered to a consumer
- **Field 4:** Delivery count - the number of times this entry has been
delivered:
  
  - `0` for new messages that haven't been delivered before
  - `1+` for claimed messages (previously unacknowledged entries)

**Purpose of the new fields:**
These fields enable intelligent client-side processing decisions:

- Idle time enables time-based escalation strategies, detection of stuck
messages, and priority processing for critically delayed work
- Delivery count enables retry limits, dead-letter queue logic, poison
message detection, and alternative processing strategies based on
failure history

Together, these fields provide the visibility needed to build robust,
self-healing consumer systems without requiring additional XPENDING
queries.

**Note:** If the ID parameter is not `>`, the command returns entries
that are pending for the consumer, and the CLAIM option is ignored. In
this case, the response follows the standard format without the
additional delivery metadata fields.
## Key Benefits

- **Reduced Complexity:** Eliminates manual PEL management and
multi-command orchestration
- **Improved Performance:** Reduces round trips by 50-70% for workloads
processing both pending and new entries
- **Backward Compatibility:** Fully optional parameter with zero
breaking changes to existing behavior
- **Multi-Stream Support:** Works seamlessly across multiple streams in
a single command
- **Flexible Consumer Patterns:** Enables mixed consumer types within
the same group:
  -   Consumers without CLAIM that only handle new messages
  -   Consumers with CLAIM that process both pending and new entries

## Impact on Existing Commands
The XCLAIM and XAUTOCLAIM commands may potentially benefit from the new
pel_by_time index for improved performance, such optimizations require
further investigation and testing. Enhancements to XCLAIM and XAUTOCLAIM
are postponed for future work.

## Performance Benchmarks
### Latency Performance
Comprehensive performance testing demonstrates significant improvements
over the traditional XAUTOCLAIM approach:
**Test Methodology**
Two identical test scenarios were executed to compare XAUTOCLAIM against
XREADGROUP with CLAIM:
**Test Setup:**
1. Insert 20,000 messages into a stream
2. Read all messages with XREADGROUP to populate the pending entries
list (PEL)
3. Set IDLE time to 1100ms on 1,000 randomly selected pending messages
using XCLAIM
4. Set IDLE time to 50ms on all remaining 19,000 pending messages using
XCLAIM
5. Execute the target command with min-idle-time=1000ms and COUNT=1000
to claim the eligible messages
6. Repeat steps 3-5 for 1,000 iterations

**Test 1 - XAUTOCLAIM (Traditional Approach):**
```
XAUTOCLAIM Performance:
  Average:    54.671ms
  Median:     53.582ms
  Min:        3.738ms
  Max:        71.596ms
  P95:        62.536ms
  P99:        68.800ms
```
**Test 2 - XREADGROUP with CLAIM (New Approach):**
```
XREADGROUP CLAIM Performance:
  Average:    2.426ms
  Median:     2.571ms
  Min:        1.287ms
  Max:        4.653ms
  P95:        3.370ms
  P99:        4.212ms
```
**Performance Analysis**
The new XREADGROUP CLAIM implementation delivers **22.5x faster average
performance** compared to XAUTOCLAIM:
- **Average latency reduction:** 95.6% (54.671ms → 2.426ms)
- **Median latency reduction:** 95.2% (53.582ms → 2.571ms)
- **P95 latency reduction:** 94.6% (62.536ms → 3.370ms)
- **P99 latency reduction:** 93.9% (68.800ms → 4.212ms)

This performance improvement is achieved through the time-ordered PEL
index (pel_by_time), which enables O(log n + k) retrieval of idle
entries versus XAUTOCLAIM's less efficient scanning approach.

### Memory Performance
To evaluate the memory overhead of the pel_by_time index, comprehensive
memory testing was conducted comparing Redis with and without the index
under realistic workload conditions.

**Test Methodology:**
- Insert 200,000 new messages into a stream
- Read messages in blocks of 100 using XREADGROUP (populating the PEL
with 200,000 pending entries)
- Wait 5ms after each read block (simulating realistic processing delays
that affect rax tree compression)
- Measure memory usage before and after the reading phase

**Test Results - Without pel_by_time Index:**
```
Initial memory (used):                                   926.10 KB
After insertion (used):                                    6.80 MB
After reading (used):                                     41.53 MB
Memory increase from data:                                 5.90 MB
Memory increase from reading:                             34.72 MB
Total memory increase:                                    40.62 MB
```
**Test Results - With pel_by_time Index:**
```
Initial memory (used):                                   927.44 KB
After insertion (used):                                    6.81 MB
After reading (used):                                     45.07 MB
Memory increase from data:                                 5.90 MB
Memory increase from reading:                             38.27 MB
Total memory increase:                                    44.17 MB
```
**Memory Performance Analysis:**
The pel_by_time index introduces a measurable but reasonable memory
overhead:

**Used Memory Impact:**
- Memory increase from pel_by_time index: **3.55 MB** (38.27 MB - 34.72
MB)
- Per-entry overhead: **18.6 bytes** (3.55 MB / 200,000 entries)
- Percentage overhead: **8.7%** increase in total memory usage

**Per-Entry Memory Breakdown:**
The theoretical minimum for the pel_by_time index is 32 bytes per entry
(composite key only, no node values). The observed 18.6 bytes per entry
overhead is lower than the theoretical maximum, suggesting effective rax
tree compression is occurring despite the 5ms delays between reads.

## Technical Implementation

### New Data Structure: Time-Ordered PEL Index (`pel_by_time`)
To efficiently identify and claim idle pending entries, this PR
introduces a new rax tree structure to the consumer group
implementation:
**Structure Design:**

- Tree Type: Rax tree named pel_by_time added to each consumer group
- Key Composition: 32-byte composite key consisting of:
  - `delivery_time` (timestamp when entry was last delivered)
  - `streamId` (stream entry ID)

**Key Format:** `delivery_time` + `streamId` (concatenated)
**Node Value:** None - all necessary information is encoded in the key
itself for memory efficiency
**Key Properties:**
_Uniqueness Guarantee:_ While multiple pending entries may share the
same `delivery_time`, the `streamId` component ensures each key is
globally unique within the tree.
_Lexicographical Ordering:_ The rax tree naturally orders nodes
lexicographically by key. Since `delivery_time` forms the prefix of each
key, entries are automatically sorted by delivery time, with oldest
entries appearing first in the tree.
_Efficient Range Operations:_ This time-based ordering enables highly
efficient range searches. To find all entries idle for at least
`min-idle-time` milliseconds, we simply perform a range query from the
tree's beginning up to `current_time - min-idle-time`.
**Fast Retrieval:** 
Once idle entries are identified via the `pel_by_time` index, the
embedded `streamId` in each key is used to quickly retrieve the full
pending message data structure for the subsequent `XREADGROUP` claim
operation.
**Performance Characteristics:**
- **Insertion:** O(log n) when adding entries to PEL
- **Range Search:** O(log n + k) where k is the number of idle entries
found
- **Memory Overhead:** 32 bytes per pending entry for the index key (no
additional node values stored)

This dual-index approach (existing PEL structures plus the new
time-ordered index) allows XREADGROUP with CLAIM to efficiently identify
claimable entries without scanning the entire PEL, making the operation
suitable for consumer groups with large pending entry lists.

### COUNT Behavior with CLAIM
When the `COUNT` option is used in conjunction with `CLAIM`, the command
follows a two-phase execution strategy to maximize the specified count
limit:
**Phase 1:** Claim Idle Pending Entries

- Retrieve claimable pending entries (idle for ≥ min-idle-time) up to
the COUNT limit
- These entries are claimed and returned to the consumer

**Phase 2:** Fetch New Messages (if needed)

- If the `COUNT` limit has not been satisfied by claimed pending
entries, the command proceeds to read new messages from the stream
- New messages are fetched up to the remaining available count:
`remaining_count = COUNT - claimed_entries`

This prioritization ensures that idle pending entries are always
processed first, preventing indefinite message stalling while still
allowing consumers to process new messages efficiently when pending
entries are scarce.


### BLOCK Behavior with CLAIM
When the CLAIM option is used in conjunction with the BLOCK option, the
command exhibits sophisticated blocking behavior that responds to both
new messages and pending entries becoming claimable:

**Blocking State Management:**
If there are no immediately claimable pending entries and no new
messages available in the stream, the `XREADGROUP` command enters a
blocking state for the specified duration. However, the implementation
must handle a critical scenario: pending entries that become idle (and
thus claimable) while the command is blocked must trigger an early
wakeup to serve those entries.

**Implementation: `stream_claim_pending_keys` Dictionary**
To enable this reactive blocking behavior, a new
`stream_claim_pending_keys` dictionary is introduced to the `redisDb`
structure:
- **Key:** Stream key being watched
- **Value:** The minimum timestamp when the next pending entry in this
stream will become claimable (i.e., will satisfy the min-idle-time
requirement)

**Multi-Client Coordination:**
When multiple XREADGROUP commands with BLOCK and CLAIM are executed
concurrently on the same stream, the dictionary value stores the
shortest claimable time across all waiting clients. This ensures the
earliest possible wakeup when any pending entry becomes available for
claiming.

**Wakeup Mechanism: `handleClaimableStreamEntries`**
The `handleClaimableStreamEntries` function is invoked regularly from
`blockedBeforeSleep` to monitor and react to claimable entries:
1. **Scan Phase:** Iterates through all entries in the
`stream_claim_pending_keys` dictionary
2. **Time Check:** Compares each entry's claimable timestamp against the
current time
3. **Signal Phase:** When `claimable_time ≤ current_time`, calls
`signalKeyAsReady` to wake up all clients blocked on that stream
4. **Client Processing:** Awakened clients attempt to claim and process
the newly available pending entries

**Resource Contention Handling:**
When the number of claimable entries is insufficient to satisfy all
awakened clients:

- Clients that successfully claim entries complete their operations
- Remaining clients recalculate the next minimum claimable time based on
remaining pending entries
- These clients update the `stream_claim_pending_keys` dictionary with
the new timestamp
- They re-enter the blocking state to wait for the next batch of
claimable entries

This design ensures fair resource distribution and prevents busy-waiting
while maintaining responsiveness to both new messages and aging pending
entries.
2025-10-21 20:35:43 +08:00
Mincho Paskalev
aed879ad0a
Optimistic locking for string objects - compare-and-set and compare-and-delete (#14435)
# Description

Add optimistic locking for string objects via compare-and-set and
compare-and-delete mechanism.

## What's changed

Introduction of new DIGEST command for string objects calculated via
XXH3 hash.

Extend SET command with new parameters supporting optimistic locking.
The new value is set only if checks against a given (old) value or a
given string digest pass.

Introduction of new DELEX command to support conditionally deleting a
key. Conditions are also checks against string value or string digest.

## Motivation

For developers who need to to implement a compare-and-set and
compare-and-delete single-key optimistic concurrency control this PR
provides single-command based implementation.

Compare-and-set and compare-and-delete are mostly used for [Optimistic
concurrency
control](https://en.wikipedia.org/wiki/Optimistic_concurrency_control):
a client (1) fetches the value, keeps the old value (or its digest, for
a large string) in memory, (2) manipulates a local copy of the value,
(3) applies the local changes to the server, but only if the server’s
value hasn’t been changed (still equal to the old value).

Note that compare-and-set [can also be
implemented](https://redis.io/docs/latest/develop/using-commands/transactions/#optimistic-locking-using-check-and-set)
with WATCH … MULTI … EXEC and Lua scripts. The new SET optional
arguments and the DELEX command do not enable new functionality,
however, they are much simpler and faster to use for the very common use
case of single-key optimistic concurrency control.

## Related issues and PRs

https://github.com/redis/redis/issues/12485
https://github.com/redis/redis/pull/8361
https://github.com/redis/redis/pull/4258

## Description of the new commands

### DIGEST

```
DIGEST key
```

Get the hash digest of the value stored in key, as an hex string.

Reply:
- Null if key does not exist
- error if key exists but holds a value which is not a string
- (bulk string) the XXH3 digest of the value stored in key, as an hex
string

### SET

```
SET key value [NX | XX | IFEQ match-value | IFNE match-value | IFDEQ match-digest | IFDNE match-digest] [GET] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
```

`IFEQ match-value` - Set the key’s value and expiration only if its
current value is equal to match-value. If key doesn’t exist - it won’t
be created.
`IFNE match-value` - Set the key’s value and expiration only if its
current value is not equal to match-value. If key doesn’t exist - it
will be created.
`IFDEQ match-digest` - Set the key’s value and expiration only if the
digest of its current value is equal to match-digest. If key doesn’t
exist - it won’t be created.
`IFDNE match-digest` - Set the key’s value and expiration only if the
digest of its current value is not equal to match-digest. If key doesn’t
exist - it will be created.

Reply update:
- If GET was not specified:
   - Nil reply if either
- the key doesn’t exist and XX/IFEQ/IFDEQ was specified. The key was not
created.
- the key exists, and NX was specified or a specified
IFEQ/IFNE/IFDEQ/IFDNE condition is false. The key was not set.
   - Simple string reply: OK: The key was set.
- If GET was specified, any of the following:
- Nil reply: The key didn't exist before this command (whether the key
was created or not).
- Bulk string reply: The previous value of the key (whether the key was
set or not).

### DELEX

```
DELEX key [IFEQ match-value | IFNE match-value | IFDEQ match-digest | IFDNE match-digest]
```

Conditionally removes the specified key. A key is ignored if it does not
exist.

`IFEQ match-value` - Delete the key only if its value is equal to
match-value
`IFNE match-value` - Delete the key only if its value is not equal to
match-value
`IFDEQ match-digest` - Delete the key only if the digest of its value is
equal to match-digest
`IFDNE match-digest` - Delete the key only if the digest of its value is
not equal to match-digest

Reply: 
- error if key exists but holds a value that is not a string and
IFEQ/IFNE/IFDEQ/IFDNE is specified.
- (integer) 0 if not deleted (the key does not exist or a specified
IFEQ/IFNE/IFDEQ/IFDNE condition is false), or 1 if deleted.

### Notes

Added copy of xxhash repo to deps -
[version](c961fbe61a)

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Yuan Wang <wangyuancode@163.com>
2025-10-21 10:32:49 +03:00
Moti Cohen
5b49119236
Fix crash in lookupKey() when executing_client is NULL (#14415)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
This PR is based on:
https://github.com/valkey-io/valkey/pull/2347

This was introduced in https://github.com/redis/redis/pull/13512

The server crashes with a null pointer dereference when lookupKey() is
called from handleClientsBlockedOnKey(). The crash occurs because
server.executing_client is NULL, but the code attempts to access
server.executing_client->cmd->proc without checking.

**Crash scenario:**
Client 1 enables CLIENT NO-TOUCH
Client 2 blocks on BRPOP mylist 0
Client 1 executes RPUSH mylist elem
When unblocking Client 2, lookupKey() dereferences NULL
server.executing_client → crash

**Solution**
Added proper null checks before dereferencing server.executing_client:
Check if LOOKUP_NOTOUCH flag is already set before attempting to modify
it
Verify both server.current_client and server.executing_client are not
NULL before accessing their members
Maintain the TOUCH command exception for scripts

**Testing**
Added regression test in tests/unit/type/list.tcl that reproduces and
verifies the fix for this crash scenario.

This fix is based on valkey-io/valkey#2347

Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
2025-10-13 12:12:38 +03:00
张宇杭
083f38ef5a
Fix issues with server.allow_access_expired (#14262)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
Close https://github.com/redis/redis/issues/14214

1. When the server.allow_access_expired flag is set to 1, it allows
access to expired keys that have not yet been evicted. All places
involving access to expired keys should consider the impact of this
parameter.
2. The modifications involve five methods: hfieldIsExpired,
hashTypeNext, hashTypeLength, keyIsExpired, and hashTypeIsExpired. When
the server.allow_access_expired flag is set to 1, these methods will not
skip expired keys, otherwise they follow the normal logic execution.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-10-12 11:02:03 +08:00
Moti Cohen
9b63e99d05
Refactor HFE: Introduce Per-Slot Expiration Store (estore) (#14294)
Hash field expiration is managed with two levels of data structures.

1. At the DB level, an ebuckets structure maintains the set of all
hashes that contain fields with expiration.
2. At the per-hash level, an ebuckets structure tracks fields with
expiration.

This pull request refactors the 1st level to operate per slot instead,
and introduces a new API called estore (expiration store). Its design
aligns closely with the existing kvstore API, ensuring consistency and
simplifying usage. The terminology at that level has been updated from
“HFE” or “hexpire” to “subexpiry”, reflecting a broader scope that can
later support other data types.
2025-09-11 16:45:17 +03:00
debing.sun
60adba48aa
Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326)
This PR is based on https://github.com/valkey-io/valkey/pull/1303

This PR introduces a DEBUG_DEFRAG compilation option that enables
activedefrag functionality even when the allocator is not jemalloc, and
always forces defragmentation regardless of the amount or ratio of
fragmentation.

## Using
```
make SANITIZER=address DEBUG_DEFRAG=<force|fully>
./runtest --debug-defrag
```

* DEBUG_DEFRAG=force
   * Ignore the threshold for defragmentation to ensure that
defragmentation is always triggered.
   * Always reallocate pointers to probe for correctness issues in pointer
reallocation.

* DEBUG_DEFRAG=fully
   * Includes everything in the option `force`.
   * Additionally performs a full defrag on every defrag cycle, which is
significantly slower but more accurate.

---------

Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: oranagra <oran@redislabs.com>
2025-09-10 12:52:20 +08:00
Giuseppe Coco
5f8e7852f4
Fix: Validate ENTRIESREAD in XGROUP command (#14259)
Fixes #14257

The XGROUP CREATE and SETID subcommands allowed setting an ENTRIESREAD
value greater than the stream's total `entries_added` counter. This
could lead to a logically inconsistent state.

This commit adds a check to ensure the provided ENTRIESREAD value is not
greater than the number of entries ever added to the stream. If
ENTRIESREAD is too large, it gets set to the total number of entries in
the stream, i.e. `s->entries_added`.
2025-09-01 08:36:38 +08:00
Moti Cohen
e6c261f3fb
Fix MEMORY USAGE command (#14288)
After the key-value unification (kvobj), the MEMORY USAGE command may no
longer account for the embedded key length stored within the kvobj. To
fix this, replace sizeof(*o) with zmalloc_size((void *)o) to ensure the
full allocated size is measured.

In this context, the function objectComputeSize() was renamed and
modified to kvobjComputeSize(). From computing only the value size to
compute the key and its value.
2025-08-20 13:54:45 +03:00
debing.sun
b9d9d4000b
Prevent crash when cgroups_ref is null in streamEntryIsReferenced() after reload (#14276)
This bug was introduced by https://github.com/redis/redis/pull/14130
found by @oranagra

### Summary

Because `s->cgroup_ref` is created at runtime the first time a consumer
group is linked with a message, but it is not released when all
references are removed.

However, after `debug reload` or restart, if the PEL is empty (meaning
no consumer group is referencing any message), `s->cgroup_ref` will not
be recreated.

As a result, when executing XADD or XTRIM with `ACKED` option and
checking whether a message that is being read but has not been ACKed can
be deleted, the cgroup_ref being NULL will cause a crash.

### Code Path
```
xaddCommand -> streamTrim -> streamEntryIsReferenced
```

### Solution

Check if `s->cgroup_ref` is NULL in streamEntryIsReferenced().
2025-08-15 15:15:16 +08:00
debing.sun
bec644aab1
Fix missing kvobj reassignment after reallocation in MOVE command (#14233)
Introduced by https://github.com/redis/redis/issues/13806

Fixed a crash in the MOVE command when moving hash objects that have
both key expiration and field expiration.

The issue occurred in the following scenario:
1. A hash has both key expiration and field expiration.
2. During MOVE command, `setExpireByLink()` is called to set the
expiration time for the target hash, which may reallocate the kvobj of
hash.
3. Since the hash has field expiration, `hashTypeAddToExpires()` is
called to update the minimum field expiration time

Issue:
However, the kvobj pointer wasn't updated with the return value from
`setExpireByLink()`, causing `hashTypeAddToExpires()` to use freed
memory.
2025-07-30 22:24:56 +08:00