mirror of
https://github.com/redis/redis.git
synced 2026-05-28 04:02:46 -04:00
465 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
95040d61d5
|
Replace INCREX out-of-bounds policy to a single SATURATE option (#15237)
Follow https://github.com/redis/redis/issues/15045 ## Summary Simplify INCREX's out-of-bounds policy: The original INCREX shipped with three out-of-bounds policies — OVERFLOW FAIL, OVERFLOW SAT, OVERFLOW REJECT — but FAIL and REJECT are functionally redundant: both leave the key untouched when the result is out of bounds. They differ only in how the caller is notified (error reply vs. [current_value, 0] array reply), which forces the user to make a stylistic choice with no real semantic difference. This PR collapses the three policies into one clear behavior: * Default: the operation is rejected; the key value and TTL are left unchanged, and the reply is [current_value, 0]. Callers detect non-application by checking the applied-increment field; no error-handling branch is required. * SATURATE: the result is saturated to UBOUND / LBOUND, or to the type limits (LLONG_MAX/MIN for BYINT, ±LDBL_MAX for BYFLOAT) when no explicit bound is given. New syntax: INCREX <key> [BYFLOAT increment | BYINT increment] [LBOUND lowerbound] [UBOUND upperbound] [SATURATE] [EX seconds | PX milliseconds | EXAT seconds-timestamp | PXAT milliseconds-timestamp | PERSIST] [ENX] --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> |
||
|
|
0d9576435f
|
Implement the new Redis Array type (#15162)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
# Redis Array For years, Redis has been missing a real indexed data structure for the use cases where the index and the spatial relationship of elements are semantic. Hashes give you random lookups, but you have to store an index as a key, and have no range visibility. Lists give you appending and trimming, but what is in the middle remains hard to access. Streams give you append-only events, which is another (useful, indeed) beast. None of these is what you want when the *position itself* has business meaning — slot 37, step 4, row 18552, day from 2934 to 2949, file line 11, 12, 15 and so forth. And, all those types, for different reasons, are all suboptimal when you want a **ring buffer** able to store the latest N observed samples of something. Up to now, users found ways (they always do \o/) using the fact that the data structures that are obvious in this universe are also extremely powerful, if well implemented. But this forces compromises. Arrays handle these index-first requirements natively, and usually with much better memory and CPU usage than the workarounds. If the use case is the right one, Arrays often provide much better space, time and usability at the same time. ## Internal encoding 1. When dense, an Array is essentially a more fancy C array. You don't pay anything for storing the index. 2. Yet, instead of going really flat, arrays are sliced into 4096-element slices, and each slice, when it contains just a few elements, uses a special sparse encoding. When a slice is empty it's just a `NULL` stored in the directory. 3. Small ints, floats, and short strings are pointer-tagged, so they cost zero additional memory beyond the pointer slot itself. 4. When very sparse, a super-directory of windowed directories is used. This allows the data type to be safe, instead of exhibiting pathological space or time behavior. This representation is only triggered when there are more than 8 million elements or very high indexes set. ## Use cases Arrays are mostly stateless if not for the fact that each array remembers the index of the latest added item, allowing `ARINSERT` and `ARRING` to work properly. Otherwise it is a set/get at this index game, with solid support for both setting / getting ranges, server-side scanning, returning only populated elements in a time which is proportional not to the range size, but to the population size. A few concrete examples, that may work as mental models for the set of problems that are similar to them (from the POV of the data modeling). **Thermometer.** A sensor reporting once per minute, with gaps: ``` ARSET temp:room12:day7 123 22.3 ARGETRANGE temp:room12:day7 600 660 # the 10:00–11:00 window, with NULLs ARSCAN temp:room12:day7 600 660 # only populated elements AROP temp:room12:day7 0 1439 MAX # peak of the day, server-side ``` Missing minutes cost little to nothing. Numeric aggregation runs inside Redis. Telemetry, IoT, meter readings, KPI rollups. **Calendar.** A clinic with 96 fifteen-minute slots per day: ``` ARSET sched:room12:day 32 booking:991 ARSCAN sched:room12:day 0 95 # only occupied slots ARGETRANGE sched:room12:day 48 63 # the afternoon full view to render ``` The slot number is the business key in this case. Room booking, parking spaces, warehouse bins, lockers, ... **Ring buffer.** ARRING replaces the classic LPUSH+LTRIM pattern. Imagine remote `dmesg`. ``` ARRING machine:123 200 "[141087.430123]: arm_cpu_init(): cpu 14 online" # Capped to 200 entries ARLASTITEMS machine:123 50 REV # 50 newest first ``` Faster than LPUSH+LTRIM, keep indexed access to past elements. Last-N alarms, recent fraud scores, access history, remote logs, device events. Ok here the use cases are mainly the ones of the old pattern: it is just a better fit and allows to access random items in the middle, aggregate server-side, and so forth. **Workflow.** Step number is the index, value is the status. Gaps are meaningful: ``` ARSET claim:99172 0 received ARSET claim:99172 3 waiting:reviewer42 ARSET claim:99172 5 approved ARGETRANGE claim:99172 0 5 # full workflow view, with NULLs for missing steps ARSCAN claim:99172 0 5 # only steps that have a state ARCOUNT claim:99172 # number of recorded steps ARLEN claim:99172 # highest reached step + 1 ``` **Skills knowledge base for agents.** Arrays are good at representing / grepping into Markdown files: ``` ARSET skill:metal_gpu 0 "...." ARSET skill:metal_gpu 1 "...." ARSET skill:metal_gpu 2 "...." ARGREP skill:metal_gpu - + RE "M3|M4" WITHVALUES ``` ARGREP has EXACT, MATCH, GLOB, RE, you can have multiple predicates, can select AND or OR behavior. **Bulk import results.** Sparse row annotations over millions of rows / CSV / ...: ``` ARSET import:job551 18552 ERR:bad_email ARSCAN import:job551 0 1000000 # Provides only rows that have something ``` ## TLDR If the position is part of the meaning, use an Array. If you want to aggregate or grep remotely, use an Array. Feedback welcome :) --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Shubham S Taple <155555100+ShubhamTaple@users.noreply.github.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com> Co-authored-by: Marc Gravell <marc.gravell@gmail.com> |
||
|
|
9c1ecd044e
|
Add INCREX command for atomic increment with TTL and bounds (#15045)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
Close #14278 ## Overview Rate limiters, sliding windows, request counters, and numerous other network-facing patterns share a common primitive: **atomically increment a counter and set its expiration**. Achieving this in Redis requires either multiple round-trips or a Lua script that bundles `INCR` / `INCRBY` / `INCRBYFLOAT` with `EXPIRE` / `PEXPIRE`. We propose a new command, **`INCREX`**, that collapses this two-step pattern into a single, native, O(1) command. `INCREX` atomically: 1. Increments (or decrements) a key's numeric value — by integer or float. 2. Optionally enforces lower and/or upper bounds, with a configurable overflow policy (error out, saturate, or no-op), enabling built-in cap enforcement (e.g., max request count) without additional client logic. 3. Optionally sets or removes the key's expiration. 4. Returns both the **new value** and the **actual increment applied**, giving the caller immediate feedback on whether the operation was saturated or skipped. ## Use Cases ### Basic Usage ``` # Increment by 1 (default) and set a 60-second TTL. > SET mykey 10 > INCREX mykey EX 60 1) (integer) 11 # new value 2) (integer) 1 # actual increment # Use 0 as initial value if the key doesn't exist. > DEL mykey > INCREX mykey 1) (integer) 1 # new value 2) (integer) 1 # actual increment # Default policy (OVERFLOW FAIL): exceeding a bound returns an error. > SET mykey 5 > INCREX mykey BYINT 20 UBOUND 10 (error) value is out of bounds # Opt into saturation with OVERFLOW SAT. > INCREX mykey BYINT 20 UBOUND 10 OVERFLOW SAT 1) (integer) 10 # saturated to upper bound 2) (integer) 5 # only 5 was actually applied # Skip the operation with OVERFLOW REJECT — the key and its TTL are # untouched, and the reply reports the current value with a zero delta. > SET mykey 5 > INCREX mykey BYINT 20 UBOUND 10 OVERFLOW REJECT 1) (integer) 5 # current (unchanged) value 2) (integer) 0 # nothing was applied # Increment by a float > SET mykey 1 > INCREX mykey BYFLOAT 0.5 1) "1.5" 2) "0.5" ``` ### Use Case: Rate Limiter **Before (Lua script):** ```lua -- KEYS[1] = rate limit key, ARGV[1] = limit, ARGV[2] = window in seconds local current = redis.call('INCR', KEYS[1]) if current > tonumber(ARGV[1]) then return 0 -- rejected end if current == 1 then redis.call('EXPIRE', KEYS[1], ARGV[2]) end return 1 -- allowed ``` Client invocation: ```python result = redis.eval(LUA_SCRIPT, 1, f"ratelimit:{user_id}", 100, 60) if result == 0: reject_request() ``` **After (INCREX):** ```python new_val, actual_incr = redis.execute_command( "INCREX", f"ratelimit:{user_id}", "UBOUND", 100, "OVERFLOW", "REJECT", "EX", 60, "ENX" ) if actual_incr == 0: # Rate limit exceeded — key left unchanged. reject_request() ``` `ENX` means: set expiration only if the key doesn't already have an expiration. This ensures the sliding window's TTL is only set on the first request. ### Use Case: Token Bucket Refill Refill tokens periodically up to a capacity ceiling, saturating at the cap instead of erroring: ``` > INCREX tokens:user123 BYINT 10 UBOUND 100 OVERFLOW SAT EX 3600 ENX 1) (integer) 10 2) (integer) 10 ``` Tokens cannot exceed 100, and the key auto-expires after inactivity. ### Use Case: Countdown / Resource Consumption Decrement a resource counter down to zero, saturating at the floor: ``` > SET credits:user123 50 > INCREX credits:user123 BYINT -1 LBOUND 0 OVERFLOW SAT 1) (integer) 49 2) (integer) -1 ``` When credits are exhausted, `OVERFLOW SAT` prevents negative balances without client-side checks. ## Parameter Reference ### Syntax ``` INCREX key [BYFLOAT increment | BYINT increment] [LBOUND lowerbound] [UBOUND upperbound] [OVERFLOW <FAIL | SAT | REJECT>] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | PERSIST] [ENX] ``` ### Parameters | Parameter | Description | |-----------|-------------| | `key` | The key to increment. Created with value `0` if it does not exist. | | `BYFLOAT increment` | Increment the value by the given long-double float. | | `BYINT increment` | Increment the value by the given 64-bit signed integer. | | `LBOUND lowerbound` | Set lower bound for the increment result. Defaults to `LLONG_MIN` (integer) or `-LDBL_MAX` (float). | | `UBOUND upperbound` | Set upper bound for the increment result. Defaults to `LLONG_MAX` (integer) or `LDBL_MAX` (float). | | `OVERFLOW <FAIL \| SAT \| REJECT>` | Set the overflow policy when the result would be out of bounds. `FAIL` rejects the operation with an error (default). `SAT` saturates the result to the bound. `REJECT` leaves the key and its TTL untouched and replies with the current value and a zero delta. | | `EX seconds` | Set the key's TTL to `seconds` seconds. | | `PX milliseconds` | Set the key's TTL to `milliseconds` milliseconds. | | `EXAT unix-time-seconds` | Set the key's expiration to the absolute Unix timestamp in seconds. | | `PXAT unix-time-milliseconds` | Set the key's expiration to the absolute Unix timestamp in milliseconds. | | `PERSIST` | Remove the key's existing TTL. | | `ENX` | Set the key's TTL/expiration if it has No eXpiration | If neither `BYINT` nor `BYFLOAT` is specified, the increment defaults to integer `1`. ### Return Value An **array of two elements**: 1. **New value** — the value of the key after the increment (or the unchanged current value under `OVERFLOW REJECT`). 2. **Actual increment** — the increment that was actually applied. May differ from the requested increment when `OVERFLOW SAT` saturates the result to a bound, and is always `0` when `OVERFLOW REJECT` skipped the operation. - In integer mode (default or `BYINT`): both elements are **integers**. - In float mode (`BYFLOAT`): both elements are **bulk strings** representing the float values on RESP2, and **RESP3 Doubles** on RESP3. ### Overflow Policy (FAIL vs. SAT vs. REJECT) Controlled by the optional `OVERFLOW` argument. A bound violation includes both exceeding an explicit `LBOUND`/`UBOUND` and overflowing the type limits when no explicit bound is given. - **`OVERFLOW FAIL` (default)**: if the computed result would violate a bound, the command returns an error and the key is left unchanged. This matches the existing semantics of `INCRBY` / `INCRBYFLOAT` on overflow. - **`OVERFLOW SAT`**: the result is silently capped at `UBOUND` / floored at `LBOUND` (or saturated to the type limits when no explicit bound is given). The second element of the reply reflects the saturated delta. If the delta cannot be represented as a 64-bit signed integer(default or `BYINT`), or would produce Infinity(`BYFLOAT`), an error is returned. - **`OVERFLOW REJECT`**: the operation is silently skipped — the key value and its TTL are left unchanged, no keyspace notification is fired, and nothing is replicated. The reply is `[current_value, 0]`, allowing the caller to detect the rejection without handling an error. ### Notes - If **no expiration option** is given, the key's existing TTL is preserved (like `INCR`). - `ENX` requires one of `EX`/`PX`/`EXAT`/`PXAT`. - If the result is saturated by `OVERFLOW SAT`, the expiration is still applied as specified. - Under `OVERFLOW REJECT` the expiration option is ignored on the rejected branch — TTL is preserved exactly as it was before the call. - **`BYINT` requires an integer-typed existing value; `BYFLOAT` accepts both.** Integers can be promoted to floats losslessly, but a stored float (e.g. `"1.5"`) cannot be parsed back as an integer. This is consistent with `INCR`/`INCRBY` (integer-only) and `INCRBYFLOAT` (accepts both). --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> Co-authored-by: Moti Cohen <moti.cohen@redis.com> Co-authored-by: oranagra <oran@redislabs.com> |
||
|
|
62551a7b12
|
Batched MGET/MSET dict prefetch with dictType-driven payload hints (#15133)
Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses across the keys of a single multi-key command. Builds on the cross-command batched prefetch framework introduced in #14017 and the dict-prefetch state machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the state machine into two new `dictType` callbacks so the same machinery can be reused for other dict-encoded types later (hash hashtable, sets, sorted sets) without paying for `kvobj`-specific code paths in the core loop. Bundles the work originally proposed in #14899 (MGET prefetch framework, by @mpozniak95) and #15043 (MSET batch prefetch). ## Design Two new optional callbacks on `dictType`: ```c typedef struct dictType { ... /* Bring the entry's key payload into cache before keyCompare runs. * Returns the address to prefetch, or NULL if the entry alone is enough. */ void *(*prefetchEntryKey)(const dictEntry *de); /* Called only after a key match. Returns the value-side payload to * prefetch (or NULL). */ void *(*prefetchEntryValue)(const dictEntry *de); } dictType; ``` `dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()` shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` / `OBJ_ENCODING_RAW` values — now lives in two small helpers in `server.c`: ```c static void *dbDictPrefetchEntryKey(const dictEntry *de) { return dictEntryIsKey(de) ? NULL : dictGetKey(de); } static void *dbDictPrefetchEntryValue(const dictEntry *de) { kvobj *kv = dictGetKey(de); return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW) ? kv->ptr : NULL; } ``` The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data` parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the dict's own type drives both ends. This also removes the foot-gun where callers (like `mgetCommand`) had to remember whether to pass `prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer references `kvobj`, `kvobjGetKey`, or any specific value layout. ## State machine Two file-local types in `memory_prefetch.c`: | Type | Role | |---|---| | `dictPrefetchLookup` | Per-key snapshot of an in-flight, software-pipelined `dictFind` (mirrors the locals a synchronous `dictFind` would carry across one bucket walk). | | `dictPrefetcher` | Driver that advances a batch of `dictPrefetchLookup`s through the FSM, yielding to the next in-flight lookup each time a prefetch is issued. | Five-stage lifecycle for each lookup, driven by the prefetcher: ```text │ start │ ┌────────▼─────────┐ ┌─────────►│ PREFETCH_BUCKET ├────►────────┐ │ └────────┬─────────┘ no more tables │ bucket│found │ │ │ │ entry not found - goto next table ┌────────▼────────┐ │ └────◄─────┤ PREFETCH_ENTRY │ ▼ ┌────────────►└────────┬────────┘ │ │ entry│found │ │ │ │ │ ┌───────────▼─────────────┐ │ │ │ PREFETCH_ENTRY_KEY │ ◄── dictType->prefetchEntryKey(de) │ └───────────┬─────────────┘ │ │ │ │ key mismatch - goto next entry │ │ │ ┌───────────▼─────────────┐ │ └──────◄───│ PREFETCH_ENTRY_VALUE │ ◄── keyCompare; on match, └───────────┬─────────────┘ dictType->prefetchEntryValue(de) │ │ ┌─────────▼─────────────┐ │ │ PREFETCH_DONE │◄────────┘ └───────────────────────┘ ``` `PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]` if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no more tables remain. `memory_prefetch.c` exposes a small lifecycle that any caller can drive: ```c dictPrefetcherInit(p, max_keys); /* one-shot heap alloc of lookups[] */ dictPrefetcherReset(p, dicts, keys, nkeys); /* configure for one batch */ dictPrefetcherRun(p); /* drive FSM until all PREFETCH_DONE */ dictPrefetcherFree(p); /* release */ ``` Each FSM stage is a named static function (`dictPrefetchBucket`, `dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`), so the `dictPrefetcherRun` driver is a four-line `switch` over the state. The state machine is dict-pure: no `kvobj` field on `dictPrefetchLookup`, no `kvobjGetKey` reach-through. Round-robin advance semantics — a state only advances the cursor if a prefetch was actually issued — are preserved, so the embedded-kvobj fast path (`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the extra prefetch and falls straight into the compare on the next loop iteration. The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`) embeds a `dictPrefetcher` initialized once at startup and reset per batch, so cross-command prefetching no longer allocates per call. ## Intra-command API ```c void dictPrefetchKeys(dict **dicts, void **keys, size_t nkeys); ``` A single multi-key command (e.g. MGET) can prefetch dict data for a batch of its own keys, reusing the same state machine that the cross-command path uses. Single-key calls (`nkeys <= 1`) early-return — nothing to interleave with. The implementation stack-allocates a fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no VLA, predictable stack usage), so the intra-command path doesn't touch the heap. ## Notes on the call sites A shared helper picks the next prefetch batch and warms it via `dictPrefetchKeys`: ```c /* Pick the next prefetch batch starting at argv[start] and warm it via * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for * key/value pairs (MSET). Returns the chosen batch size in items. */ static int prefetchKeysBatch(client *c, int slot, int start, int stride); ``` Adaptive batch sizing inside the helper: if at least two full batches (`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch (`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one call. This generalizes the small-request fast path so the trailing batch of a large request also gets the single-call benefit. - **MGET (`mgetCommand`)** — gated by `do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`, with `already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`. When `do_prefetch` is set, each iteration calls `prefetchKeysBatch(c, slot, j, 1)` and then sequentially `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is clear (cross-command path already warmed the keys, or batch prefetching is off), the loop takes all remaining items in one go and skips the prefetch. - **MSET / MSETNX (`msetGenericCommand`)** — same `do_prefetch` gate as MGET with `stride = 2`. For the NX flag the NX-check loop runs `lookupKeyWrite` (which already warmed everything via `prefetchKeysBatch`); the SET loop then disables further prefetch (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass. Going through the full state machine (rather than bucket-only) means `dbDictType`'s `prefetchEntryValue` callback runs on a key match — warming the old kvobj's payload, which `setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen)` then reads to compute the histogram delta. The slot dict is re-fetched per batch — in cluster mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS` + `expireIfNeeded`), so a cached pointer would otherwise dangle. - **Cross-command batch path (`addCommandToBatch`)** — sets `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch, even on partial-batch overflow (was: only when ALL keys fit). The intra-command path then uniformly skips supplemental prefetching for any command the batch touched. Rationale: running both paths (cross-command warm + intra-command supplement) caused a measured −9.6 % regression on x86 with pipeline-10, and the partial cross- command warmup is sufficient for the head of the keyset; the cold tail goes through normal lookup, which is still cheaper than running the FSM a second time on already-warm keys. - **Future types**: each dict's `dictType` can register its own `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash encoding, the field-sds and value-sds payloads), without touching `memory_prefetch.c`. ## Benchmark validation On x86, performance improvements are significant for larger batch sizes: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44% - 5Mkeys-string-mget-100B-100keys: +37.33% - 5Mkeys-string-mget-100B-30keys: +22.40% On ARM (Graviton4), the gains are even more pronounced: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34% - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76% Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions. --------- Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com> |
||
|
|
5a05863e97
|
t_string: rewrite SET GET propagation in place (#15114)
Optimize SET key value GET propagation rewriting in setGenericCommand() by removing GET arguments in-place with rewriteClientCommandArgument(). This avoids the overhead of allocating a new argv vector and incrementing reference counts for every retained argument. The optimization is scoped to the no-expire SET ... GET rewrite path. It also adds test coverage for cases with repeated GET tokens to ensure robust string semantics and consistent replication behavior. Changes: - Use rewriteClientCommandArgument(c, j, NULL) for in-place removal. - Eliminate redundant argv allocations and refcount increments. - Improve performance of SET GET in high-throughput write streams. |
||
|
|
63f02e7876
|
Fix double ERR prefix in XNACK error replies (#15091)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Several `addReplyError` and `addReplyErrorFormat` calls in `xnackCommand` included a redundant `"ERR "` prefix in the message string. Since `addReplyErrorLength` already prepends `-ERR ` to the RESP reply, clients received `ERR ERR ...` for these error paths. This PR removes the redundant prefix from all five affected calls and tightens the corresponding test patterns to match from the beginning of the error message (`"ERR ..."` instead of `"*...*"`), so any future double-prefix regression will be caught. |
||
|
|
0fa78fd8fd
|
perf: widen fast_float_strtod fast path to 17-19 digit mantissas (#15061)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
## Root cause Roughly 50% of random double scores generated by the ZADD listpack workload have 17-19 significant digits, which exceed `MAX_MANTISSA_FAST_PATH` (`2^53`). These inputs fall through to the `strtod()` fallback: ```c char static_buf[128]; memcpy(buf, nptr, len); /* memcpy back! */ buf[len] = '\0'; /* null-term */ double result = strtod(buf, ...); /* glibc strtod — ~10× slower on ARM */ ``` The original C++ `fast_float` library handled the same 17-19 digit inputs with Eisel-Lemire / bigint arithmetic without falling back to `strtod()`. That is what the pure-C replacement lost. ## Fix Compute `mantissa * 10^exponent` in 128-bit integer arithmetic using `__uint128_t`, then convert to double with a single IEEE round-to-nearest-even cast. Supported for `|exp| in [0, 19]` where `10^|exp|` fits in `uint64`; cases outside that range (or otherwise outside the fast path's preconditions) still fall through to `strtod()`. --------- Co-authored-by: debing.sun <debing.sun@redis.com> |
||
|
|
8aeea8c210
|
Increase threshold for HPEXPIRETIME persists after RDB reload test (#15047) | ||
|
|
fa6d4c3d63
|
Fix SIGABRT in HSETEX when a field appears twice in the FIELDS list (#14956)
HSETEX crashed on assert() with a SIGABRT when the same field appeared more than once in the FIELDS list and an expiry time was given (EX/PX/EXAT/PXAT). Root cause: hfieldPersist() and the KEEP_TTL path in hashTypeSet() both asserted that dictExpireMeta->expireMeta.trash == 0, meaning the hash must be globally registered in the HFE DS. This is incorrect during HSETEX execution because hashTypeSetExDone(), which registers the hash globally and clears trash, called only at the end of flow. The private per-field ebuckets are fully valid regardless of the global registration state. Fix: Remove both incorrect assertions. The operations on the private ebuckets (ebRemove in hfieldPersist, ebAdd in the KEEP_TTL path) are correct and do not require the hash to be globally registered. Tests: Added two regression tests covering the crash scenarios: - HSETEX EX with a duplicate field (existing field, expiry given) - HSETEX FNX EX with a duplicate field (no prior field, FNX condition passes) |
||
|
|
80f1ebda88
|
Add AGGREGATE COUNT option to ZUNION, ZINTER, ZUNIONSTORE, and ZINTERSTORE (#14892)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
### Overview
This PR adds a new `COUNT` aggregation mode to the `ZUNIONSTORE`,
`ZINTERSTORE`, `ZUNION`, and `ZINTER` sorted set commands. When
`AGGREGATE COUNT` is specified, the resulting score for each element
reflects how many input sets contain it (optionally scaled by
`WEIGHTS`), rather than combining the actual scores of the elements.
This enables a common use case — counting set membership frequency —
directly at the command level, without application-side workarounds.
### Problem Statement
For developers who need to know **how many input sorted sets contain
each element**, there is no single-command solution today.
**Example:** given several game leaderboards, find how many leaderboards
each player appears in.
The existing aggregation modes (`SUM`, `MIN`, `MAX`) all operate on the
elements' scores. To ignore scores and just count set membership, you'd
currently need to copy each sorted set with all scores set to 1, then
run `ZUNIONSTORE`/`ZINTERSTORE` with `SUM` — requiring multiple round
trips, temporary keys, and application-level locking to avoid races.
A `COUNT` aggregation mode solves this directly.
### Solution
Introduces `AGGREGATE COUNT` as a fourth aggregation mode:
- `ZINTER numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE
<SUM | MIN | MAX | COUNT>] [WITHSCORES]`
- `ZINTERSTORE destination numkeys key [key ...] [WEIGHTS weight [weight
...]] [AGGREGATE <SUM | MIN | MAX | COUNT>]`
- `ZUNION numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE
<SUM | MIN | MAX | COUNT>] [WITHSCORES]`
- `ZUNIONSTORE destination numkeys key [key ...] [WEIGHTS weight [weight
...]] [AGGREGATE <SUM | MIN | MAX | COUNT>]`
When `COUNT` is specified, **the scores in the input sets are ignored**.
Note that `WEIGHTS` is **not** ignored — each set contributes its weight
(default 1) per element, and the contributions are summed.
**Implementation details:**
A new helper function `zuiWeightedScore()` computes the per-set
contribution:
```c
inline static double zuiWeightedScore(double score, double weight, int aggregate) {
return (aggregate == REDIS_AGGR_COUNT) ? weight : weight * score;
}
```
The `zunionInterAggregate()` function treats `COUNT` identically to
`SUM` — it adds the per-set contributions. All four call sites where
`weight * score` was previously computed inline are updated to use
`zuiWeightedScore()`.
### Examples
```
> ZADD s1 1 foo 1 bar
> ZADD s2 2 foo 2 bar
> ZADD s3 3 foo
```
**With `SUM` (existing behavior, for comparison):**
```
> ZINTERSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE SUM
(integer) 1
> ZRANGE t1 0 -1 WITHSCORES
1) "foo"
2) "29"
> ZUNIONSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE SUM
(integer) 2
> ZRANGE t1 0 -1 WITHSCORES
1) "bar"
2) "20"
3) "foo"
4) "29"
```
**With `COUNT` and `WEIGHTS`:**
```
> ZINTERSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE COUNT
(integer) 1
> ZRANGE t1 0 -1 WITHSCORES
1) "foo"
2) "18"
> ZUNIONSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE COUNT
(integer) 2
> ZRANGE t1 0 -1 WITHSCORES
1) "bar"
2) "15"
3) "foo"
4) "18"
```
**With `COUNT` and no specified `WEIGHTS`** — resulting score equals the
number of input sorted sets containing the element:
```
> ZINTERSTORE t1 3 s1 s2 s3 AGGREGATE COUNT
(integer) 1
> ZRANGE t1 0 -1 WITHSCORES
1) "foo"
2) "3"
> ZUNIONSTORE t1 3 s1 s2 s3 AGGREGATE COUNT
(integer) 2
> ZRANGE t1 0 -1 WITHSCORES
1) "bar"
2) "2"
3) "foo"
4) "3"
```
### Backward Compatibility
This is a fully additive change. The new `COUNT` keyword is only
recognized after the `AGGREGATE` token in the four affected commands.
Existing commands, arguments, and default behavior (`AGGREGATE SUM`) are
completely unchanged. No new command is introduced, and no existing
response format is modified.
|
||
|
|
e1d35aca01
|
Fix HEXPIRE numfields overflow (#15021)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Validate HEXPIRE-family field counts without parser overflow keep flexible option order; only require fields fit in argv add tests for INT_MAX numfields across HEXPIRE/HPEXPIRE/HEXPIREAT/HPEXPIREAT |
||
|
|
e8da0e5b47
|
Fix brittle assert_match patterns for unexpected slowlog fields (#14948) | ||
|
|
0be39e5032
|
Fix missing consumer propagation on empty XREADGROUP (#14963)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
## Summary Fixes consumer replication inconsistency when `XREADGROUP` is called for a new consumer but no `XCLAIM` commands are propagated to the replica. Previously, consumer creation was only propagated to replicas when `noack=true`, relying on `XCLAIM` propagation to implicitly create the consumer in the non-NOACK path. However, if no messages exist to read, no `XCLAIM` is generated, and the consumer is silently lost on the replica. This is a follow-up to the original fix in [redis/redis#7140](https://github.com/redis/redis/issues/7140) / [redis/redis#7526](https://github.com/redis/redis/pull/7526), which introduced `XGROUP CREATECONSUMER` propagation but only for the `NOACK` case. ## Changes - **`xreadgroupCommand` (src/t_stream.c):** Replaced the `if (noack)` guard around the `streamPropagateConsumerCreation()` call with a deferred check after `streamReplyWithRange()`. Consumer creation is now propagated when `noack || propCount == 0` — that is, only when no `XCLAIM` commands were generated. This avoids redundant propagation in the common case where `XCLAIM` already implicitly creates the consumer on the replica, while correctly handling both the NOACK path (where PEL/XCLAIM is skipped entirely) and the no-messages path (where there is nothing to XCLAIM). - **Test (tests/unit/type/stream-cgroups.tcl):** Added replication test `"XREADGROUP propagates new consumer to replica"` that sets up a master-replica pair and verifies consumer propagation in two cases: (1) without NOACK when no messages are available to deliver, and (2) with NOACK when messages are delivered but XCLAIM is skipped. ## Benefits - **Master-replica consistency:** Consumers created by `XREADGROUP` are now visible on replicas whenever no `XCLAIM` would otherwise create them — covering both the NOACK path and the empty-stream path. - **No redundant propagation:** The noack || propCount == 0 condition avoids emitting a superfluous XGROUP CREATECONSUMER when XCLAIM commands are already propagated and would implicitly create the consumer on the replica. |
||
|
|
747dfe578e
|
Add XNACK command for releasing stream messages back to the group (#14797)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
### Overview
This PR enhances Redis Streams consumer groups by adding a new `XNACK`
command that allows consumers to explicitly release pending messages
back to the group without acknowledging them. Released (NACKed) entries
become immediately available for re-delivery to other consumers,
eliminating the idle-timeout delay currently required for message
recovery. The command supports three modes — SILENT, FAIL, and FATAL —
giving consumers fine-grained control over delivery counter semantics to
handle graceful shutdowns, transient failures, and poison messages
respectively.
### Problem Statement
For developers using Redis Streams with consumer groups, there are
several common scenarios where a consumer needs to release a message it
has claimed without acknowledging it:
1. **Transient internal failures**: A consumer may fail to process a
message because of problems unrelated to the message itself — for
example, it cannot connect to an external service to fetch required
context. The message is perfectly valid and should be retried promptly
by another consumer.
2. **Resource pressure**: A consumer under resource stress (low CPU, low
memory) may be unable to handle a specific message (e.g., a complex or
large message) within acceptable QoS. It should leave the opportunity to
other consumers in the group, with minimal delay.
3. **Graceful shutdown**: A consumer about to shut down would like to
immediately release all unprocessed messages it has claimed, so they can
be picked up by remaining consumers without waiting for idle timeouts.
4. **Poison / malicious messages**: A consumer may detect or suspect
that a claimed message is invalid or malicious and wants to mark it as
permanently failed (for dead-letter queue processing when available).
**Currently, a consumer cannot NACK a message.** It can either:
- **XACK** it — marks it as "processed" and removes it from the PEL
entirely, losing the ability to redeliver it
- **Leave it pending** — requires other consumers to discover it via
`XPENDING` and claim it via `XCLAIM`/`XAUTOCLAIM` or `XREADGROUP CLAIM`
after the idle timeout expires, introducing a long, unnecessary delay
In all these cases, the logic that applications must implement
introduces **message handling delays**, **implementation complexity**,
and **code duplication** across consumer implementations.
### Solution
Introduces a new `XNACK` (Negative ACKnowledge) command that explicitly
releases pending messages from their owning consumer back to the group's
PEL, making them immediately claimable via `XCLAIM` and `XAUTOCLAIM`,
and prioritized for re-delivery in `XREADGROUP CLAIM`:
```
XNACK key group <SILENT|FAIL|FATAL> IDS numids id [id ...] [RETRYCOUNT count] [FORCE]
```
When executed, the command:
1. **Disassociates** the entry from its owning consumer (`consumer =
NULL`)
2. **Repositions** the entry to the head of the PEL time-ordered list
(`delivery_time = 0`), making it immediately claimable with any
`min-idle-time` threshold
3. **Adjusts the delivery counter** based on the specified mode, giving
consumers fine-grained control over retry semantics
4. **Returns** the count of successfully NACKed entries
**Mode** controls the delivery counter adjustment and communicates the
reason for the NACK:
| Mode | Delivery Counter Behavior | Use Case |
|----------|---------------------------------------------------|---------------------------------------------|
| `SILENT` | Decrement by 1 (undo the delivery increment) | Consumer
shutdown / transient internal error — the delivery "didn't count" |
| `FAIL` | No change (keep the incremented value) | Message too complex
for this consumer, but may work for others — count this as an attempt |
| `FATAL` | Set to `LLONG_MAX` | Invalid / suspected malicious message —
mark as permanently failed |
The three modes map directly to the real-world scenarios above:
- **SILENT** for graceful shutdown or transient failures unrelated to
the message
- **FAIL** for resource-constrained consumers that cannot handle a
specific message
- **FATAL** for poison message detection and dead-letter queue
integration
**Optional parameters:**
- **`RETRYCOUNT count`**: Directly sets `delivery_count` to the
specified value, overriding the mode-based adjustment
- **`FORCE`**: Creates new unowned PEL entries for IDs that are not
already in the group PEL (the entry must exist in the stream). When
`FORCE` creates an entry, the delivery counter is set to `0` (or to
`RETRYCOUNT` if specified, or to `LLONG_MAX` if mode is `FATAL`). This
is used internally for AOF rewrite and replication.
### Response Format
The command returns an integer — the number of messages successfully
NACKed (released back to the group PEL):
```
127.0.0.1:6379> XADD mystream 1-0 f v1
"1-0"
127.0.0.1:6379> XADD mystream 2-0 f v2
"2-0"
127.0.0.1:6379> XGROUP CREATE mystream grp 0
OK
127.0.0.1:6379> XREADGROUP GROUP grp c1 STREAMS mystream >
1) 1) "mystream"
2) 1) 1) "1-0"
2) 1) "f"
2) "v1"
2) 1) "2-0"
2) 1) "f"
2) "v2"
127.0.0.1:6379> XNACK mystream grp FAIL IDS 2 1-0 2-0
(integer) 2
```
After XNACK, the entries appear with an empty consumer in XPENDING
output:
```
127.0.0.1:6379> XPENDING mystream grp - + 10
1) 1) "1-0"
2) ""
3) (integer) -1
4) (integer) 1
2) 1) "2-0"
2) ""
3) (integer) -1
4) (integer) 1
```
### NACK Zone: Data Structure Extension
To support unowned PEL entries and ensure they are prioritized for
re-delivery, a **NACK zone** is introduced at the head of the existing
PEL time-ordered doubly-linked list. A new `pel_nack_tail` pointer is
added to the `streamCG` structure:
**PEL ordering:**
```
[pel_time_head] <-> ... <-> [pel_nack_tail] <-> [owned entries...] <-> [pel_time_tail]
|_____________ NACK zone ______________| |_______ normal PEL ________|
```
The head of the PEL contains all NACKed messages (FIFO-ordered),
followed by all delivered messages that were not NACKed (same order as
today). This ensures NACKed messages are always prioritized over idle
pending messages.
The delivery order for `XREADGROUP` is therefore:
1. If `CLAIM` was specified: first deliver NACKed messages, then deliver
due pending messages (current behavior)
2. Deliver new entries after the group's last-delivered-id (current
behavior)
**Structure Design:**
- NACKed entries occupy positions from `pel_time_head` to
`pel_nack_tail` in the time-ordered list
- Their `delivery_time` is set to `0`, ensuring they always appear
"oldest" and are immediately claimable
- Their `consumer` pointer is set to `NULL`, marking them as unowned
- `pel_nack_tail` is `NULL` when no NACKed entries exist
**Key Properties:**
- **O(1) insertion**: New NACKed entries are inserted right after
`pel_nack_tail` (or at the list head if the zone is empty)
- **FIFO ordering** among NACKed entries: entries are NACKed in the
order they are released
- **Immediate claimability**: Since `delivery_time = 0`, NACKed entries
have maximum idle time and satisfy any `min-idle-time` threshold in
`XCLAIM` and `XAUTOCLAIM`, In `XREADGROUP CLAIM`, NACKed entries are
also prioritized over other pending entries due to their position at the
head of the PEL.
- **Zone integrity**: The `pelListInsertSorted` function is updated to
stop scanning at the `pel_nack_tail` boundary, ensuring owned entries
are never placed inside the NACK zone
### Impact on Existing Commands
All commands that interact with the PEL are updated to handle unowned
(`consumer = NULL`) entries:
- **XPENDING**: Shows NACKed entries with an empty consumer name
- **XCLAIM / XAUTOCLAIM**: Can claim NACKed entries (they satisfy any
min-idle-time since `delivery_time = 0`)
- **XREADGROUP CLAIM**: NACKed entries are picked up by the claim phase
- **XACK**: Works correctly on NACKed entries (removes from group PEL)
- **XINFO STREAM FULL**: Displays NACKed entries with an empty consumer
name
- **XGROUP DELCONSUMER**: Unaffected — NACKed entries are not in any
consumer's PEL
Propagation is also updated: when `XCLAIM` or `XAUTOCLAIM` encounters a
deleted stream entry for an unowned NACK, it propagates `XACK` (instead
of `XCLAIM`) to replicas and AOF, since there is no source consumer to
reference.
### Persistence
**RDB:**
- A new RDB type `RDB_TYPE_STREAM_LISTPACKS_5` (type 27) is introduced
- After saving consumer PEL entries, the NACK zone stream IDs are saved
separately (count + encoded IDs)
- On load, NACK zone entries are reconstructed by looking them up in the
group PEL, unlinking from their sorted position, and re-inserting into
the NACK zone via `pelListInsertNacked`
- Backward compatibility is preserved: old RDB types continue to load
with the existing validation (all entries must have consumers)
**AOF:**
- AOF rewrite emits `XNACK <key> <group> FAIL IDS <n> <id...> RETRYCOUNT
<cnt> FORCE` commands for entries in the NACK zone
- Consecutive entries with the same `delivery_count` are batched into a
single command (up to `AOF_REWRITE_ITEMS_PER_CMD` IDs per command)
### Defragmentation
The defragmentation logic is restructured to handle unowned entries:
- **`defragStreamCGPendingEntry`** (new): Walks the group-level PEL rax,
defragments each NACK, updates the doubly-linked list pointers
(`pel_prev`, `pel_next`), `pel_time_head`, `pel_time_tail`,
`pel_nack_tail`, and the consumer PEL back-pointer for owned entries
- **`defragStreamConsumerPendingEntry`** (simplified): Only fixes up
back-pointers to the possibly-relocated consumer and CG, since actual
defragmentation is now done at the group-level walk. Unowned (NACK zone)
entries have no consumer PEL walk, so the group-level pass is their only
chance
### Key Benefits
- **Immediate re-delivery**: NACKed entries are instantly claimable by
other consumers via `XCLAIM` and `XAUTOCLAIM` (since `delivery_time = 0`
satisfies any `min-idle-time`), and prioritized for re-delivery in
`XREADGROUP CLAIM`, eliminating idle-time delays that can range from
seconds to minutes
- **Explicit release semantics**: Consumers can release messages
intentionally, with fine-grained control over retry behavior — a
capability that exists in competing systems like RabbitMQ
- **Flexible retry control**: Three modes (SILENT, FAIL, FATAL) plus
RETRYCOUNT cover the full spectrum of failure handling strategies, from
graceful shutdown to poison message detection
- **Reduced application complexity**: Eliminates the need for
application-level workarounds involving XPENDING polling, arbitrary idle
timeouts, and manual XCLAIM orchestration
- **Dead-letter queue readiness**: FATAL mode + delivery count enables
straightforward poison message detection and future DLQ integration
- **Backward compatibility**: Fully optional new command with zero
breaking changes to existing behavior
|
||
|
|
a6a27f56f2
|
Test tcp deadlock fixes (#14946)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
This fix follows #14667 and #14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com> |
||
|
|
2ba0194fbe
|
Fix memory leak in ZDIFF algorithm 2 on early exit (#14932)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
## Problem `zdiffAlgorithm2()` can break out early once the destination cardinality reaches zero. In that path, a temporary SDS created by `zuiSdsFromValue()` is left dirty and never released, because the cleanup normally happens on the next `zuiNext()` call which is skipped due to the early `break`. `zuiClearIterator()` called after the loop does **not** clean up dirty values — only `zuiNext()` or explicit `zuiDiscardDirtyValue()` does. ## Fix Add `zuiDiscardDirtyValue(&zval)` before the early `break` to ensure the temporary SDS is freed on all exit paths. |
||
|
|
f4d176b3b7
|
KEYSIZES/ASM: simplify histograms, fix background trim, and refactor debug assertions (#14877)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
- Simplify KEYSIZES tracking and saving memory by removing per‑slot histogram state in kvstoreDictMetadata and routing all updates through kvsUpdateHistogram + updateKeysizesHist(db, type, ...) (per‑DB only). - Fix KEYSIZES consistency during ASM background trim by introducing asmTrimCtx, passing it through emptyDbDataAsync/BIO lazy‑free, computing histogram deltas in the background, and applying them on completion; add bg_trim_running to coordinate with validation. - Refactor and relax debug assertions into a unified dbg_assert_flags bitmask and dbgRunAssertions(db), and skip KEYSIZES/ALLOCSIZE checks during nested execution, RDB load, ASM import, and ASM background trim. - Update commands, module APIs, tests (including UNLINK async deletion coverage), and daily CI workflow to reflect the new histogram behavior and re‑enable ASM/cluster tests. - Revise the daily CI “debug-assert-keyspace” workflow to run ASM and slot-stats unit tests again Potential edge case with histogram accuracy: The histogram could become inaccurate if the database is flushed while ASM trim is running in the background. I considered adding a generation counter to detect this, but decided against it since this is purely an INFO/diagnostic feature and the edge case is quite rare. The target_kvstore pointer check prevents applying stale deltas to the wrong kvstore, and if the histogram does become incorrect we have debugServerAssert() to catch negative values - it won't cause crashes in production. This is a known limitation we can document and revisit if it becomes a real issue in practice. |
||
|
|
9accf8bd24
|
Refine error message for HSETEX command with PERSIST (#14880)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
Fixes #14879 > PERSIST can only be given for HGETEX and KEEPTTL for HSETEX but the code doesn’t verify it. **Behavior currently:** ``` 127.0.0.1:5555> HSETEX h ex 100 persist fields 1 k v (error) ERR Only one of EX, PX, EXAT, PXAT or KEEPTTL arguments can be specified ``` **With this PR, behavior would be:** ``` 127.0.0.1:5555> HSETEX h ex 100 persist fields 1 k v (error) ERR unknown argument: persist ``` |
||
|
|
462e603a1f
|
Fix stream_idmp_keys missing from database lifecycle ops (#14897)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
**Summary** Ensures `db->stream_idmp_keys` is managed consistently with `keys`, `expires`, and `subexpires` across every database lifecycle operation — flush, swap, temp-DB init/discard, lazy-free, and cluster slot migration. Without this fix, the dict was absent from these code paths, causing three classes of bugs: a SIGSEGV during diskless replication in `swapdb` mode (NULL pointer dereference in `initTempDb`), silently lost IDMP tracking after `SWAPDB` and diskless replication (pointers never swapped), and stale IDMP entries surviving `FLUSHDB` (dict never cleared). **Changes** - **`emptyDbStructure`** (`src/db.c`) — Clear `stream_idmp_keys` on flush so stale entries don't persist across `FLUSHDB`/`FLUSHALL`. - **`initTempDb` / `discardTempDb`** (`src/db.c`) — Create and release `stream_idmp_keys` for temp databases used during diskless replication RDB load. Fixes the SIGSEGV when `rdbLoadRioWithLoadingCtx` calls `dictAddRaw` on a NULL pointer. - **`dbSwapDatabases`** (`src/db.c`) — Swap `stream_idmp_keys` alongside the other per-DB dicts so IDMP tracking follows the data during `SWAPDB`. - **`swapMainDbWithTempDb`** (`src/db.c`) — Swap `stream_idmp_keys` when promoting a temp database after diskless replication, so the cron can discover and expire IDMP entries on replicas. - **`streamMoveIdmpKeys`** (`src/db.c`) — New helper that migrates IDMP entries by slot from one dict to another, used during cluster slot migration. - **`emptyDbAsync` / `emptyDbDataAsync`** (`src/lazyfree.c`) — Replace the dict with a fresh one and hand the old one to the background lazy-free job. - **`lazyfreeFreeDatabase`** (`src/lazyfree.c`) — Free `stream_idmp_keys` in the background job (now takes 4 args instead of 3). - **`asmTriggerBackgroundTrim`** (`src/cluster_asm.c`) — Move matching IDMP entries by slot into a temporary dict during background trim, then pass it to `emptyDbDataAsync` for async cleanup. - **`server.h`** — Updated `emptyDbDataAsync` signature; added `streamMoveIdmpKeys` declaration. - **Tests** (`tests/unit/type/stream.tcl`) — Four new integration tests covering IDMP expiry after `SAVE` + restart, tracking survival across `SWAPDB`, cleanup on `FLUSHDB`, and diskless replication in `swapdb` mode (both `rdbchannel=yes` and `rdbchannel=no`). **What this fixes** - **No crash on diskless replication** — `initTempDb` now initializes the dict, eliminating the NULL dereference during RDB load. - **Tracking preserved across swaps** — Both `SWAPDB` and diskless replication correctly transfer the dict, so the cron keeps expiring entries in the right database. - **Clean state after flush** — `FLUSHDB`/`FLUSHALL` clear the dict, preventing ghost entries from interfering with new streams. - **Correct cluster migration cleanup** — IDMP entries for migrated slots are moved and freed alongside the key data. |
||
|
|
1b615c774d
|
Fix FIELDS argument validation in HSETEX/HGETEX (#14883)
Fixes #14879 Improve validation of the FIELDS argument in HSETEX and HGETEX to ensure exactly one field is provided, rejecting both missing and multiple fields with consistent and accurate error messages. Align behavior across both commands. |
||
|
|
ee376cdc3c
|
Fix IDMP cron expiration not working after RDB load (#14869)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
**Summary** Registers streams with IDMP producers in `db->stream_idmp_keys` during RDB load, so that the periodic cron job can expire stale IDMP entries after a server restart. Without this fix, streams loaded from RDB were never registered, causing IDMP entries to accumulate indefinitely and never be cleaned up by the cron. **Changes** - **`rdbLoadRioWithLoadingCtx` (src/rdb.c):** After loading each key-value pair, added a check for `OBJ_STREAM` type with a non-NULL `idmp_producers` rax. When found, creates a string object for the key and inserts it into `db->stream_idmp_keys` via `dictAddRaw`. If the key already exists (duplicate insert), the reference is decremented to avoid a leak. - **Test `XADD IDMP cron expiration works after RDB load` (tests/unit/type/stream.tcl):** Added an integration test that creates a stream with IDMP producers, sets a short `IDMP-DURATION`, saves and restarts the server, then verifies that the cron eventually expires the entries (pids-tracked and iids-tracked drop to 0) and that previously-expired IIDs can be re-added as new entries. **Benefits** - **Cron expiration works after restart:** Streams with IDMP producers are now discoverable by the periodic cleanup cron after an RDB load, ensuring expired entries are reaped as expected. - **No memory leak on stale entries:** Without registration, IDMP entries that outlived their duration would persist in memory forever after a restart, growing unboundedly. This fix ensures they are cleaned up. - **Consistency between runtime and post-load behavior:** IDMP expiration now behaves identically whether the stream was created at runtime or restored from an RDB snapshot. |
||
|
|
62059a2438
|
Chore complete Tcl 9 support and fix regressions in test suite (#14845)
## Problem
PR https://github.com/redis/redis/pull/14787 introduced **Tcl 9 support
for the test suite**, but it still fails on my machine (**macOS 26.3,
Tcl 9.0.3**). Some tests fail and the runner may hang.
Example:
```bash
$ tclsh <<<'puts $tcl_version'
9.0.3
$ make test
[err]: BITCOUNT against test vector #2
Expected [r bitcount str] == 4
```
This is caused by **behavior changes in Tcl 9**, including:
- `string length` returning **character count** instead of **byte
count**
- binary sockets rejecting characters with code points **>255**
- differences in `string is wideinteger`
Parts of the Redis Tcl test framework rely on **byte-oriented
behavior**, which breaks under Tcl 9.
## Changes
### 1. Fix IPC payload encoding in test runner
`tests/unit/memefficiency.tcl` contains a **non-ASCII quote character**:
|
||
|
|
dd81afa2c8
|
Do not add a key into db if it has a past expiration time (#14784)
For **RESTORE** and **SET** command, if the expiration time is already elapsed, we skip adding it to the DB. However, we still increment the `expiredkeys` counter. From stats perspective we behave as if we inserted a new key (possibly an overwrite) and later expired it, but from the per-key KSN observability, we reflect what we've actually done in the db (deletion of old key, and no insertion of new one), so we don't confuse modules. **Changes**: - **RESTORE**: Increment the `expiredkeys` counter if the key has an expiration time in the past. - **SET**: If the key has an expiration time in the past, do not add it to the main DB, increment the `expiredkeys` counter instead. And we also delete the old key if it exists. For reference, **EXPIREAT** with TTL in the past, which implicitly deletes the key and return success. Now **SET** command has the same behavior. |
||
|
|
1f7bcecc94
|
Add XIDMPRECORD command and AOFRW emission to restore stream IDMP state (#14794)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
## Summary - Introduce a new XIDMPRECORD <key> <pid> <iid> <streamID> command that attaches IDMP (idempotency) metadata — a producer ID and an idempotency ID — to an existing stream message. This is an internal command used to replay IDMP state during AOF loading. - Modify rewriteStreamObject() in AOF rewrite (AOFRW) to emit XIDMPRECORD commands after the XADD and consumer-group commands, ensuring IDMP deduplication state is fully preserved across AOF rewrites and server restarts. - Additionally emit XCFGSET during AOFRW when per-stream IDMP configuration (IDMP-DURATION, IDMP-MAXSIZE) differs from server defaults, so custom settings survive rewrites. ## New command: XIDMPRECORD Syntax: `XIDMPRECORD <key> <pid> <iid> <streamID>` - Validates that the key exists, is a stream, the stream ID refers to an existing (non-deleted) entry, and both pid and iid are non-empty. - If the (pid, iid) pair already maps to the same stream ID, the command is idempotent and returns OK. - If the (pid, iid) pair already maps to a different stream ID, an error is returned. ## AOF rewrite changes (src/aof.c) - New helper rioWriteStreamIdmpEntry() writes a single XIDMPRECORD bulk command to the AOF rio. - After consumer groups are emitted, iterates all producers in s->idmp_producers and emits an XIDMPRECORD for every IDMP entry (in linked-list insertion order). - Emits XCFGSET for per-stream IDMP config when it diverges from server defaults. |
||
|
|
98328ae2ec
|
Pause dict auto-resize during multi-field deletion (#14783)
The idea comes directly from ValKey: https://github.com/valkey-io/valkey/pull/3144 Deleting many fields from a hash/zset/set stored as a dict can trigger repeated shrink/rehash work during the loop. --------- Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: debing.sun <debing.sun@redis.com> |
||
|
|
18538461d1
|
Add separate statistics for active expiration of keys and hash fields (#14727)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
### Summary Adds `expired_keys_active` and `expired_subkeys_active` counters to track keys and hash fields expired by the active expiration cycle, distinguishing them from lazy expirations. These new metrics are exposed in INFO stats output. ### Motivation Currently, Redis tracks the total number of expired keys (expired_keys) and expired hash fields (expired_subkeys), but there's no way to differentiate between expirations triggered by active expire and lazy expire. --------- Co-authored-by: Moti Cohen <moti.cohen@redis.com> |
||
|
|
221409788a
|
Add idempotency support to XADD via IDMPAUTO and IDMP parameters (#14615)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
# Overview This PR introduces idempotency support to Redis Streams' XADD command, enabling automatic deduplication of duplicate message submissions through optional IDMPAUTO and IDMP parameters with producer identification. This enables reliable at-least-once delivery while preventing duplicate entries in streams. ## Problem Statement Current Redis Streams implementations lack built-in idempotency mechanisms, making reliable at-least-once delivery impossible without accepting duplicates: - **Application-level tracking**: Developers must maintain separate data structures to track submitted messages - **Race conditions**: Network failures and retries can result in duplicate stream entries - **Complexity overhead**: Each producer must implement custom deduplication logic - **Memory inefficiency**: External deduplication systems duplicate Redis's storage capabilities This lack of native idempotency support creates reliability challenges in distributed systems where at-least-once delivery semantics are required but exactly-once processing is desired. ## Solution Extends XADD with optional idempotency parameters that include producer identification: ``` XADD key [NOMKSTREAM] [KEEPREF | DELREF | ACKED] [IDMPAUTO pid | IDMP pid iid] [MAXLEN | MINID [= | ~] threshold [LIMIT count]] <* | id> field value [field value ...] ``` ### Producer ID (pid) - **pid** (producer id): A unique identifier for each producer - Must be unique per producer instance - Producers must use the same pid after restart to access their persisted idempotency tracking - Enables per-producer idempotency tracking, isolating duplicate detection between different producers **Format**: Binary or string, recommended max 36 bytes **Generation**: - **Recommended**: UUID v4 for globally unique identification - **Alternative**: `hostname:process_id` or application-assigned IDs ### Idempotency Modes **IDMPAUTO pid (Automatic Idempotency)**: - Producer specifies its pid, Redis automatically calculates a unique idempotent ID (iid) based on entry content - Hash calculation combines XXH128 hashing of individual field-value pairs using an order-independent Sum + XOR approach with rotation (each pair: `XXH128(field || field_length || value)`) - 16-byte binary iid with extremely low accidental collision probability - XXH128 is a non-cryptographic hash function: fast and well-distributed, but does NOT prevent intentional collision attacks - For protection against adversarial collision crafting, use IDMP mode with cryptographically-signed idempotent IDs - Order-independent: field ordering does not affect the calculated iid - If (pid, iid) pair exists in producer's IDMP map: returns existing entry ID without creating duplicate entry - Generally slower than manual mode due to hash calculation overhead **IDMP pid iid (Manual Idempotency)**: - Caller provides explicit producer id (pid) and idempotent ID (iid) for deduplication - iid must be unique per message (either globally or per pid) - Faster processing than IDMPAUTO (no hash calculation overhead) - Enables shorter iids for reduced memory footprint - If (pid, iid) pair exists in producer's IDMP map: returns existing entry ID without comparing field contents - Caller responsible for iid uniqueness and consistency across retries Both modes can only be specified when entry ID is `*` (auto-generated). ### Deduplication Logic When XADD is called with idempotency parameters: 1. Redis checks if the message was recently added to the stream based on the (pid, iid) pair 2. If the (pid, iid) pair matches a recently-seen pair for that producer, the message is assumed to be identical 3. No duplicate message is added to the stream; the existing entry ID is returned 4. With **IDMP pid iid**: Redis does not compare the specified fields and their values—two messages with the same (pid, iid) are assumed identical 5. With **IDMPAUTO pid**: Redis calculates the iid from message content and checks for duplicates ## IDMP Map: Per-Producer Time and Capacity-Based Expiration Each producer with idempotency enabled maintains its own isolated IDMP map (iid → entry_id) with dual expiration criteria: **Time-based expiration (duration)**: - Each iid expires automatically after duration seconds from insertion - Provides operational guarantee: Redis will not forget an iid before duration elapses (unless capacity reached) - Configurable per-stream via XCFGSET **Capacity-based expiration (maxsize)**: - Each producer's map enforces maximum capacity of maxsize entries - When capacity reached, oldest iids for that producer are evicted regardless of remaining duration - Prevents unbounded memory growth during extended usage ### Configuration Commands **XINFO STREAM**: View current configuration and metrics Use `XINFO STREAM key` to retrieve idempotency configuration (idmp-duration, idmp-maxsize) along with tracking metrics. **XCFGSET**: Configure expiration parameters ``` XCFGSET key [IDMP-DURATION duration] [IDMP-MAXSIZE maxsize] ``` - **duration**: Seconds to retain each iid (range: 1- 86400 seconds) - **maxsize**: Maximum iids to track per producer (range: 1-10,000 entries) - Calling XCFGSET clears all existing producer IDMP maps for the stream **Default Configuration** (when XCFGSET not called): - Duration: 100 seconds - Maxsize: 100 iids per producer - Runtime configurable via: `stream-idmp-duration` and `stream-idmp-maxsize` ## Response Behavior **On first submission** (pid, iid) pair not in producer's map: - Entry added to stream with generated entry ID - (pid, iid) pair stored in producer's IDMP map with current timestamp - Returns new entry ID **On duplicate submission** (pid, iid) pair exists in producer's map: - No entry added to stream - Returns existing entry ID from producer's IDMP map - Identical response to original submission (client cannot distinguish) ## Stream Metadata XINFO STREAM extended with idempotency metrics and configuration: - **idmp-duration**: The duration value (in seconds) configured for the stream's IDMP map - **idmp-maxsize**: The maxsize value configured for the stream's IDMP map - **pids-tracked**: Current number of producers with active IDMP maps - **iids-tracked**: Current total number of iids across all producers' IDMP maps (reflects active iids that haven't expired or been evicted) - **iids-added**: Lifetime cumulative count of entries added with idempotency parameters - **iids-duplicates**: Lifetime cumulative count of duplicate iids detected across all producers ## Persistence and Restart Behavior **IDMP maps are fully persisted and restored across Redis restarts**: - **RDB/AOF**: All pid-iid pairs, timestamps, and configuration are included in snapshots and AOF logs - **Recovery**: On restart, all tracked (pid, iid) pairs remain valid and operational - **Producer Requirement**: Producers must reuse the same pid after restart to access their persisted IDMP map - **Configuration**: Stream-level settings (duration, maxsize) persist across restarts - **Important**: Calling XCFGSET after restart clears restored IDMP maps (same behavior as during runtime) ## Key Benefits - **Enables At-most-once Producer Semantics**: Makes it possible to safely retry message submissions without creating duplicates - **Automatic Retry Safety**: Network failures and retries cannot create duplicate entries - **Producer Isolation**: Each producer maintains independent idempotency tracking - **Memory Efficient**: Time and capacity-based expiration per producer prevents unbounded growth - **Flexible Implementation**: Choose automatic (IDMPAUTO) or manual (IDMP) based on performance needs - **Backward Compatible**: Fully optional parameters with zero impact on existing XADD behavior - **Collision Resistant**: XXH128 with Sum + XOR combination and field-length separators provides high-quality non-cryptographic hashing for IDMPAUTO with extremely low collision probability and prevents ambiguous concatenation attacks |
||
|
|
73249497d4
|
Fix ACL key-pattern bypass in MSETEX command (#14659)
MSETEX doesn't properly check ACL key permissions for all keys - only the first key is validated. MSETEX arguments look like: MSETEX <numkeys> key1 val1 key2 val2 ... EX seconds Keys are at every 2nd position (step=2). When Redis extracts keys for ACL checking, it calculates where the last key is: last = first + numkeys - 1; => calculation ignores step last = first + (numkeys-1) * step; With 2 keys starting at position 2: Bug: last = 2 + 2 - 1 = 3 → only checks position 2 Fix: last = 2 + (2-1)*2 = 4 → checks positions 2 and 4 Fixes #14657 |
||
|
|
9ca860be9e
|
Fix XTRIM/XADD with approx not deletes entries for DELREF/ACKED strategies (#14623)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
This bug was introduced by #14130 and found by guybe7 When using XTRIM/XADD with approx mode (~) and DELREF/ACKED delete strategies, if a node was eligible for removal but couldn't be removed directly (because consumer group references need to be checked), the code would incorrectly break out of the loop instead of continuing to process entries within the node. This fix allows the per-entry deletion logic to execute for eligible nodes when using non-KEEPREF strategies. |
||
|
|
23aca15c8c
|
Fix the flexibility of argument positions in the Redis API's (#14416)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
This PR implements flexible keyword-based argument parsing for all 12 hash field expiration commands, allowing users to specify arguments in any logical order rather than being constrained by rigid positional requirements. This enhancement follows Redis's modern design of keyword-based flexible argument ordering and significantly improves user experience. Commands with Flexible Parsing HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETEX, HSETEX some examples: HEXPIRE: * All these are equivalent and valid: HEXPIRE key EX 60 NX FIELDS 2 f1 f2 HEXPIRE key NX EX 60 FIELDS 2 f1 f2 HEXPIRE key FIELDS 2 f1 f2 EX 60 NX HEXPIRE key FIELDS 2 f1 f2 NX EX 60 HEXPIRE key NX FIELDS 2 f1 f2 EX 60 HGETEX: * All these are equivalent and valid: HGETEX key EX 60 FIELDS 2 f1 f2 HGETEX key FIELDS 2 f1 f2 EX 60 HSETEX: * All these are equivalent and valid: HSETEX key FNX EX 60 FIELDS 2 f1 v1 f2 v2 HSETEX key EX 60 FNX FIELDS 2 f1 v1 f2 v2 HSETEX key FIELDS 2 f1 v1 f2 v2 FNX EX 60 HSETEX key FIELDS 2 f1 v1 f2 v2 EX 60 FNX HSETEX key FNX FIELDS 2 f1 v1 f2 v2 EX 60 |
||
|
|
bb6389e823
|
Fix min_cgroup_last_id cache not updated when destroying consumer group (#14552)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
## Problem When destroying a consumer group with `XGROUP DESTROY`, the cached `min_cgroup_last_id` was not being invalidated. This caused incorrect behavior when using `XDELEX` with the `ACKED` option, as the cache still referenced the destroyed group's `last_id`. ## Solution Invalidate the `min_cgroup_last_id` cache when the destroyed group's `last_id` equals the cached minimum. The cache will be recalculated on the next call to `streamEntryIsReferenced()`. --------- Co-authored-by: guybe7 <guy.benoish@redislabs.com> |
||
|
|
0a6eacff1f
|
Add variable key-spec flags to SET IF* and DELEX (#14529)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
These commands behave as DEL and SET (blindly Remove or Overwrite) when they don't get IF* flags, and require the value of the key when they do run with these flags. Making sure they have the VARIABLE_FLAGS flag, and getKeysProc that can provide the right flags depending on the arguments used. (the plain flags when arguments are unknown are the common denominator ones) Move lookupKey call in DELEX to avoid double lookup, which also means (some, namely arity) syntax errors are checked (and reported) before checking the existence of the key. |
||
|
|
90ba7ba4dc
|
Fix XREADGROUP CLAIM to return delivery metadata as integers (#14524)
### Problem The XREADGROUP command with CLAIM parameter incorrectly returns delivery metadata (idle time and delivery count) as strings instead of integers, contradicting the Redis specification. ### Solution Updated the XREADGROUP CLAIM implementation to return delivery metadata fields as integers, aligning with the documented specification and maintaining consistency with Redis response conventions. --------- Co-authored-by: debing.sun <debing.sun@redis.com> |
||
|
|
d25e582a17
|
Fix flaky test of hfe persist rdb reload (#14525)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
So far occured once on daily in the test-sanitizer-address job |
||
|
|
189b7609f5
|
Add hfe rdb load test (#14511)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
Verify that following RDB load fields keep their expiration time. Verify that hashes that had HFEs not counted following rdb load in subexpiry (by command `info keyspace`) |
||
|
|
7f1bafc922 |
Fix XACKDEL stack overflow when IDs exceed STREAMID_STATIC_VECTOR_LEN (CVE-2025-62507)
Some checks failed
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Has been cancelled
This issue was introduced by redis/redis#14130. The problem is that when the number of IDs exceeds STREAMID_STATIC_VECTOR_LEN (8), the code forgot to reallocate memory for the IDs array, which causes a stack overflow. |
||
|
|
3e2003ee0f |
Fix HGETEX out-of-bounds read when FIELDS option missing numfields argument
When the HGETEX command is used with the FIELDS option but without the required numfields argument, the server would attempt to access an out-of-bounds argv index. This PR adds a check to ensure numfields is present before accessing it, returning an error if it is missing. Also includes a test case to cover this scenario. |
||
|
|
e436a0e548
|
Enforce 16-char hex digest length and case-insensitive comparison for IFDEQ/IFDNE (#14502)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
Fix https://github.com/redis/redis/issues/14496 This PR makes the following changes: - DIGEST: Always return 16 hex characters with leading zeros Example: "00006c38adf31777" instead of "6c38adf31777" - IFDEQ/IFDNE: Validate the digest must be exactly 16 characters - IFDEQ/IFDNE: Use strcasecmp for case-insensitive hex comparison Both uppercase and lowercase hex digits now work identically --------- Co-authored-by: Marc Gravell <marc.gravell@gmail.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com> |
||
|
|
379fec1426
|
Use fixed position keys parameter for MSETEX command (#14470)
In PR https://github.com/redis/redis/pull/14434, we made the keys parameter flexible, meaning it could appear anywhere among the command arguments. However, this also made key parsing more complex, since we could no longer determine the fixed position of key arguments. Therefore, in this PR, we reverted it back to using fixed positions for the keys. And also fix this [comment](https://github.com/redis/redis/pull/14434#discussion_r2459282563). --------- Co-authored-by: Yuan Wang <yuan.wang@redis.com> |
||
|
|
52ea47b792
|
Add MSETEX command (#14434)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
Introduce a new command MSETEX to set multiple string keys with a shared expiration in a single atomic operation. Also with flexible argument parsing. Syntax: MSETEX KEYS numkeys key value [key value …] [XX | NX] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL] Sets the given keys to their respective values. This command is an extension of the MSETNX that adds expiration and XX options. Options: EX seconds - Set the specified expiration time, in seconds PX milliseconds - Set the specified expiration time, in milliseconds EXAT timestamp-seconds - Set the specified Unix time at which the keys will expire, in seconds PXAT timestamp-milliseconds - Set the specified Unix time at which the keys will expire, in milliseconds KEEPTTL - Retain the time to live associated with the keys XX - Only set the keys and their expiration if all already exist NX - Only set the keys and their expiration if none exist Flexible Argument Parsing examples: - MSETEX EX 10 KEYS 2 k1 v1 k2 v2 - MSETEX KEYS 2 k1 v1 k2 v2 NX PX 5000 - MSETEX NX EX 10 KEYS 2 k1 v1 k2 v2 Return Values: Integer reply: 1 - All keys were set successfully Integer reply: 0 - No keys were set (due to NX/XX conditions) Error reply - Syntax error or invalid arguments |
||
|
|
090ca801ea
|
Add CLAIM parameter to XREADGROUP for automatic pending entry claiming (#14402)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Reply-schemas linter / reply-schemas-linter (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
## Overview
This PR enhances Redis Streams consumer groups by adding an optional
CLAIM parameter to the `XREADGROUP` command, enabling automatic claiming
of idle pending entries alongside normal message consumption in a single
operation.
## Problem Statement
Current Redis Streams consumer group implementations require developers
to manually orchestrate multiple commands to handle both pending and new
entries:
- `XPENDING` to discover idle pending entries
- `XCLAIM/XAUTOCLAIM` to claim idle entries
- `XREADGROUP` to consume new entries
This multi-command approach creates:
- **Performance overhead** from multiple round trips to Redis
- **Implementation complexity**, particularly when working with multiple
streams
- **Code duplication** across consumer implementations
## Solution
Extends XREADGROUP with a new optional CLAIM parameter:
`XREADGROUP GROUP group consumer [COUNT count] [BLOCK milliseconds]
[NOACK] [CLAIM min-idle-time] STREAMS key [key ...] id [id ...]`
When CLAIM min-idle-time is specified, the command operates in two
phases:
1. **Claim Phase:** Automatically claims pending entries idle for ≥
min-idle-time milliseconds
2. **Read Phase:** Processes new entries if the COUNT limit hasn't been
reached
## Response Format Changes
When the CLAIM option is used, the response format is extended to
include delivery metadata for each entry:
**Standard XREADGROUP response (without CLAIM):**
```
127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 STREAMS mystream >
1) 1) "mystream"
2) 1) 1) "1609459200000-0"
2) 1) "field1"
2) "value1"
```
**XREADGROUP response with CLAIM:**
```
127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 CLAIM 30000 STREAMS mystream >
1) 1) "mystream"
2) 1) 1) "1609459200000-0"
2) 1) "field1"
2) "value1"
3) 15000
4) 3
```
**Response structure with CLAIM:**
- **Field 1:** Stream entry ID (unchanged)
- **Field 2:** Field-value pairs (unchanged)
- **Field 3:** Idle time in milliseconds - the number of milliseconds
elapsed since this entry was last delivered to a consumer
- **Field 4:** Delivery count - the number of times this entry has been
delivered:
- `0` for new messages that haven't been delivered before
- `1+` for claimed messages (previously unacknowledged entries)
**Purpose of the new fields:**
These fields enable intelligent client-side processing decisions:
- Idle time enables time-based escalation strategies, detection of stuck
messages, and priority processing for critically delayed work
- Delivery count enables retry limits, dead-letter queue logic, poison
message detection, and alternative processing strategies based on
failure history
Together, these fields provide the visibility needed to build robust,
self-healing consumer systems without requiring additional XPENDING
queries.
**Note:** If the ID parameter is not `>`, the command returns entries
that are pending for the consumer, and the CLAIM option is ignored. In
this case, the response follows the standard format without the
additional delivery metadata fields.
## Key Benefits
- **Reduced Complexity:** Eliminates manual PEL management and
multi-command orchestration
- **Improved Performance:** Reduces round trips by 50-70% for workloads
processing both pending and new entries
- **Backward Compatibility:** Fully optional parameter with zero
breaking changes to existing behavior
- **Multi-Stream Support:** Works seamlessly across multiple streams in
a single command
- **Flexible Consumer Patterns:** Enables mixed consumer types within
the same group:
- Consumers without CLAIM that only handle new messages
- Consumers with CLAIM that process both pending and new entries
## Impact on Existing Commands
The XCLAIM and XAUTOCLAIM commands may potentially benefit from the new
pel_by_time index for improved performance, such optimizations require
further investigation and testing. Enhancements to XCLAIM and XAUTOCLAIM
are postponed for future work.
## Performance Benchmarks
### Latency Performance
Comprehensive performance testing demonstrates significant improvements
over the traditional XAUTOCLAIM approach:
**Test Methodology**
Two identical test scenarios were executed to compare XAUTOCLAIM against
XREADGROUP with CLAIM:
**Test Setup:**
1. Insert 20,000 messages into a stream
2. Read all messages with XREADGROUP to populate the pending entries
list (PEL)
3. Set IDLE time to 1100ms on 1,000 randomly selected pending messages
using XCLAIM
4. Set IDLE time to 50ms on all remaining 19,000 pending messages using
XCLAIM
5. Execute the target command with min-idle-time=1000ms and COUNT=1000
to claim the eligible messages
6. Repeat steps 3-5 for 1,000 iterations
**Test 1 - XAUTOCLAIM (Traditional Approach):**
```
XAUTOCLAIM Performance:
Average: 54.671ms
Median: 53.582ms
Min: 3.738ms
Max: 71.596ms
P95: 62.536ms
P99: 68.800ms
```
**Test 2 - XREADGROUP with CLAIM (New Approach):**
```
XREADGROUP CLAIM Performance:
Average: 2.426ms
Median: 2.571ms
Min: 1.287ms
Max: 4.653ms
P95: 3.370ms
P99: 4.212ms
```
**Performance Analysis**
The new XREADGROUP CLAIM implementation delivers **22.5x faster average
performance** compared to XAUTOCLAIM:
- **Average latency reduction:** 95.6% (54.671ms → 2.426ms)
- **Median latency reduction:** 95.2% (53.582ms → 2.571ms)
- **P95 latency reduction:** 94.6% (62.536ms → 3.370ms)
- **P99 latency reduction:** 93.9% (68.800ms → 4.212ms)
This performance improvement is achieved through the time-ordered PEL
index (pel_by_time), which enables O(log n + k) retrieval of idle
entries versus XAUTOCLAIM's less efficient scanning approach.
### Memory Performance
To evaluate the memory overhead of the pel_by_time index, comprehensive
memory testing was conducted comparing Redis with and without the index
under realistic workload conditions.
**Test Methodology:**
- Insert 200,000 new messages into a stream
- Read messages in blocks of 100 using XREADGROUP (populating the PEL
with 200,000 pending entries)
- Wait 5ms after each read block (simulating realistic processing delays
that affect rax tree compression)
- Measure memory usage before and after the reading phase
**Test Results - Without pel_by_time Index:**
```
Initial memory (used): 926.10 KB
After insertion (used): 6.80 MB
After reading (used): 41.53 MB
Memory increase from data: 5.90 MB
Memory increase from reading: 34.72 MB
Total memory increase: 40.62 MB
```
**Test Results - With pel_by_time Index:**
```
Initial memory (used): 927.44 KB
After insertion (used): 6.81 MB
After reading (used): 45.07 MB
Memory increase from data: 5.90 MB
Memory increase from reading: 38.27 MB
Total memory increase: 44.17 MB
```
**Memory Performance Analysis:**
The pel_by_time index introduces a measurable but reasonable memory
overhead:
**Used Memory Impact:**
- Memory increase from pel_by_time index: **3.55 MB** (38.27 MB - 34.72
MB)
- Per-entry overhead: **18.6 bytes** (3.55 MB / 200,000 entries)
- Percentage overhead: **8.7%** increase in total memory usage
**Per-Entry Memory Breakdown:**
The theoretical minimum for the pel_by_time index is 32 bytes per entry
(composite key only, no node values). The observed 18.6 bytes per entry
overhead is lower than the theoretical maximum, suggesting effective rax
tree compression is occurring despite the 5ms delays between reads.
## Technical Implementation
### New Data Structure: Time-Ordered PEL Index (`pel_by_time`)
To efficiently identify and claim idle pending entries, this PR
introduces a new rax tree structure to the consumer group
implementation:
**Structure Design:**
- Tree Type: Rax tree named pel_by_time added to each consumer group
- Key Composition: 32-byte composite key consisting of:
- `delivery_time` (timestamp when entry was last delivered)
- `streamId` (stream entry ID)
**Key Format:** `delivery_time` + `streamId` (concatenated)
**Node Value:** None - all necessary information is encoded in the key
itself for memory efficiency
**Key Properties:**
_Uniqueness Guarantee:_ While multiple pending entries may share the
same `delivery_time`, the `streamId` component ensures each key is
globally unique within the tree.
_Lexicographical Ordering:_ The rax tree naturally orders nodes
lexicographically by key. Since `delivery_time` forms the prefix of each
key, entries are automatically sorted by delivery time, with oldest
entries appearing first in the tree.
_Efficient Range Operations:_ This time-based ordering enables highly
efficient range searches. To find all entries idle for at least
`min-idle-time` milliseconds, we simply perform a range query from the
tree's beginning up to `current_time - min-idle-time`.
**Fast Retrieval:**
Once idle entries are identified via the `pel_by_time` index, the
embedded `streamId` in each key is used to quickly retrieve the full
pending message data structure for the subsequent `XREADGROUP` claim
operation.
**Performance Characteristics:**
- **Insertion:** O(log n) when adding entries to PEL
- **Range Search:** O(log n + k) where k is the number of idle entries
found
- **Memory Overhead:** 32 bytes per pending entry for the index key (no
additional node values stored)
This dual-index approach (existing PEL structures plus the new
time-ordered index) allows XREADGROUP with CLAIM to efficiently identify
claimable entries without scanning the entire PEL, making the operation
suitable for consumer groups with large pending entry lists.
### COUNT Behavior with CLAIM
When the `COUNT` option is used in conjunction with `CLAIM`, the command
follows a two-phase execution strategy to maximize the specified count
limit:
**Phase 1:** Claim Idle Pending Entries
- Retrieve claimable pending entries (idle for ≥ min-idle-time) up to
the COUNT limit
- These entries are claimed and returned to the consumer
**Phase 2:** Fetch New Messages (if needed)
- If the `COUNT` limit has not been satisfied by claimed pending
entries, the command proceeds to read new messages from the stream
- New messages are fetched up to the remaining available count:
`remaining_count = COUNT - claimed_entries`
This prioritization ensures that idle pending entries are always
processed first, preventing indefinite message stalling while still
allowing consumers to process new messages efficiently when pending
entries are scarce.
### BLOCK Behavior with CLAIM
When the CLAIM option is used in conjunction with the BLOCK option, the
command exhibits sophisticated blocking behavior that responds to both
new messages and pending entries becoming claimable:
**Blocking State Management:**
If there are no immediately claimable pending entries and no new
messages available in the stream, the `XREADGROUP` command enters a
blocking state for the specified duration. However, the implementation
must handle a critical scenario: pending entries that become idle (and
thus claimable) while the command is blocked must trigger an early
wakeup to serve those entries.
**Implementation: `stream_claim_pending_keys` Dictionary**
To enable this reactive blocking behavior, a new
`stream_claim_pending_keys` dictionary is introduced to the `redisDb`
structure:
- **Key:** Stream key being watched
- **Value:** The minimum timestamp when the next pending entry in this
stream will become claimable (i.e., will satisfy the min-idle-time
requirement)
**Multi-Client Coordination:**
When multiple XREADGROUP commands with BLOCK and CLAIM are executed
concurrently on the same stream, the dictionary value stores the
shortest claimable time across all waiting clients. This ensures the
earliest possible wakeup when any pending entry becomes available for
claiming.
**Wakeup Mechanism: `handleClaimableStreamEntries`**
The `handleClaimableStreamEntries` function is invoked regularly from
`blockedBeforeSleep` to monitor and react to claimable entries:
1. **Scan Phase:** Iterates through all entries in the
`stream_claim_pending_keys` dictionary
2. **Time Check:** Compares each entry's claimable timestamp against the
current time
3. **Signal Phase:** When `claimable_time ≤ current_time`, calls
`signalKeyAsReady` to wake up all clients blocked on that stream
4. **Client Processing:** Awakened clients attempt to claim and process
the newly available pending entries
**Resource Contention Handling:**
When the number of claimable entries is insufficient to satisfy all
awakened clients:
- Clients that successfully claim entries complete their operations
- Remaining clients recalculate the next minimum claimable time based on
remaining pending entries
- These clients update the `stream_claim_pending_keys` dictionary with
the new timestamp
- They re-enter the blocking state to wait for the next batch of
claimable entries
This design ensures fair resource distribution and prevents busy-waiting
while maintaining responsiveness to both new messages and aging pending
entries.
|
||
|
|
aed879ad0a
|
Optimistic locking for string objects - compare-and-set and compare-and-delete (#14435)
# Description
Add optimistic locking for string objects via compare-and-set and
compare-and-delete mechanism.
## What's changed
Introduction of new DIGEST command for string objects calculated via
XXH3 hash.
Extend SET command with new parameters supporting optimistic locking.
The new value is set only if checks against a given (old) value or a
given string digest pass.
Introduction of new DELEX command to support conditionally deleting a
key. Conditions are also checks against string value or string digest.
## Motivation
For developers who need to to implement a compare-and-set and
compare-and-delete single-key optimistic concurrency control this PR
provides single-command based implementation.
Compare-and-set and compare-and-delete are mostly used for [Optimistic
concurrency
control](https://en.wikipedia.org/wiki/Optimistic_concurrency_control):
a client (1) fetches the value, keeps the old value (or its digest, for
a large string) in memory, (2) manipulates a local copy of the value,
(3) applies the local changes to the server, but only if the server’s
value hasn’t been changed (still equal to the old value).
Note that compare-and-set [can also be
implemented](https://redis.io/docs/latest/develop/using-commands/transactions/#optimistic-locking-using-check-and-set)
with WATCH … MULTI … EXEC and Lua scripts. The new SET optional
arguments and the DELEX command do not enable new functionality,
however, they are much simpler and faster to use for the very common use
case of single-key optimistic concurrency control.
## Related issues and PRs
https://github.com/redis/redis/issues/12485
https://github.com/redis/redis/pull/8361
https://github.com/redis/redis/pull/4258
## Description of the new commands
### DIGEST
```
DIGEST key
```
Get the hash digest of the value stored in key, as an hex string.
Reply:
- Null if key does not exist
- error if key exists but holds a value which is not a string
- (bulk string) the XXH3 digest of the value stored in key, as an hex
string
### SET
```
SET key value [NX | XX | IFEQ match-value | IFNE match-value | IFDEQ match-digest | IFDNE match-digest] [GET] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
```
`IFEQ match-value` - Set the key’s value and expiration only if its
current value is equal to match-value. If key doesn’t exist - it won’t
be created.
`IFNE match-value` - Set the key’s value and expiration only if its
current value is not equal to match-value. If key doesn’t exist - it
will be created.
`IFDEQ match-digest` - Set the key’s value and expiration only if the
digest of its current value is equal to match-digest. If key doesn’t
exist - it won’t be created.
`IFDNE match-digest` - Set the key’s value and expiration only if the
digest of its current value is not equal to match-digest. If key doesn’t
exist - it will be created.
Reply update:
- If GET was not specified:
- Nil reply if either
- the key doesn’t exist and XX/IFEQ/IFDEQ was specified. The key was not
created.
- the key exists, and NX was specified or a specified
IFEQ/IFNE/IFDEQ/IFDNE condition is false. The key was not set.
- Simple string reply: OK: The key was set.
- If GET was specified, any of the following:
- Nil reply: The key didn't exist before this command (whether the key
was created or not).
- Bulk string reply: The previous value of the key (whether the key was
set or not).
### DELEX
```
DELEX key [IFEQ match-value | IFNE match-value | IFDEQ match-digest | IFDNE match-digest]
```
Conditionally removes the specified key. A key is ignored if it does not
exist.
`IFEQ match-value` - Delete the key only if its value is equal to
match-value
`IFNE match-value` - Delete the key only if its value is not equal to
match-value
`IFDEQ match-digest` - Delete the key only if the digest of its value is
equal to match-digest
`IFDNE match-digest` - Delete the key only if the digest of its value is
not equal to match-digest
Reply:
- error if key exists but holds a value that is not a string and
IFEQ/IFNE/IFDEQ/IFDNE is specified.
- (integer) 0 if not deleted (the key does not exist or a specified
IFEQ/IFNE/IFDEQ/IFDNE condition is false), or 1 if deleted.
### Notes
Added copy of xxhash repo to deps -
[version](
|
||
|
|
5b49119236
|
Fix crash in lookupKey() when executing_client is NULL (#14415)
Some checks are pending
CI / test-ubuntu-latest (push) Waiting to run
CI / test-sanitizer-address (push) Waiting to run
CI / build-debian-old (push) Waiting to run
CI / build-macos-latest (push) Waiting to run
CI / build-32bit (push) Waiting to run
CI / build-libc-malloc (push) Waiting to run
CI / build-centos-jemalloc (push) Waiting to run
CI / build-old-chain-jemalloc (push) Waiting to run
Codecov / code-coverage (push) Waiting to run
External Server Tests / test-external-standalone (push) Waiting to run
External Server Tests / test-external-cluster (push) Waiting to run
External Server Tests / test-external-nodebug (push) Waiting to run
Spellcheck / Spellcheck (push) Waiting to run
This PR is based on: https://github.com/valkey-io/valkey/pull/2347 This was introduced in https://github.com/redis/redis/pull/13512 The server crashes with a null pointer dereference when lookupKey() is called from handleClientsBlockedOnKey(). The crash occurs because server.executing_client is NULL, but the code attempts to access server.executing_client->cmd->proc without checking. **Crash scenario:** Client 1 enables CLIENT NO-TOUCH Client 2 blocks on BRPOP mylist 0 Client 1 executes RPUSH mylist elem When unblocking Client 2, lookupKey() dereferences NULL server.executing_client → crash **Solution** Added proper null checks before dereferencing server.executing_client: Check if LOOKUP_NOTOUCH flag is already set before attempting to modify it Verify both server.current_client and server.executing_client are not NULL before accessing their members Maintain the TOUCH command exception for scripts **Testing** Added regression test in tests/unit/type/list.tcl that reproduces and verifies the fix for this crash scenario. This fix is based on valkey-io/valkey#2347 Co-authored-by: Uri Yagelnik <uriy@amazon.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> |
||
|
|
083f38ef5a
|
Fix issues with server.allow_access_expired (#14262)
Some checks failed
CI / test-ubuntu-latest (push) Has been cancelled
CI / test-sanitizer-address (push) Has been cancelled
CI / build-debian-old (push) Has been cancelled
CI / build-macos-latest (push) Has been cancelled
CI / build-32bit (push) Has been cancelled
CI / build-libc-malloc (push) Has been cancelled
CI / build-centos-jemalloc (push) Has been cancelled
CI / build-old-chain-jemalloc (push) Has been cancelled
Codecov / code-coverage (push) Has been cancelled
External Server Tests / test-external-standalone (push) Has been cancelled
External Server Tests / test-external-cluster (push) Has been cancelled
External Server Tests / test-external-nodebug (push) Has been cancelled
Spellcheck / Spellcheck (push) Has been cancelled
Close https://github.com/redis/redis/issues/14214 1. When the server.allow_access_expired flag is set to 1, it allows access to expired keys that have not yet been evicted. All places involving access to expired keys should consider the impact of this parameter. 2. The modifications involve five methods: hfieldIsExpired, hashTypeNext, hashTypeLength, keyIsExpired, and hashTypeIsExpired. When the server.allow_access_expired flag is set to 1, these methods will not skip expired keys, otherwise they follow the normal logic execution. --------- Co-authored-by: debing.sun <debing.sun@redis.com> |
||
|
|
9b63e99d05
|
Refactor HFE: Introduce Per-Slot Expiration Store (estore) (#14294)
Hash field expiration is managed with two levels of data structures. 1. At the DB level, an ebuckets structure maintains the set of all hashes that contain fields with expiration. 2. At the per-hash level, an ebuckets structure tracks fields with expiration. This pull request refactors the 1st level to operate per slot instead, and introduces a new API called estore (expiration store). Its design aligns closely with the existing kvstore API, ensuring consistency and simplifying usage. The terminology at that level has been updated from “HFE” or “hexpire” to “subexpiry”, reflecting a broader scope that can later support other data types. |
||
|
|
60adba48aa
|
Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326)
This PR is based on https://github.com/valkey-io/valkey/pull/1303 This PR introduces a DEBUG_DEFRAG compilation option that enables activedefrag functionality even when the allocator is not jemalloc, and always forces defragmentation regardless of the amount or ratio of fragmentation. ## Using ``` make SANITIZER=address DEBUG_DEFRAG=<force|fully> ./runtest --debug-defrag ``` * DEBUG_DEFRAG=force * Ignore the threshold for defragmentation to ensure that defragmentation is always triggered. * Always reallocate pointers to probe for correctness issues in pointer reallocation. * DEBUG_DEFRAG=fully * Includes everything in the option `force`. * Additionally performs a full defrag on every defrag cycle, which is significantly slower but more accurate. --------- Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: oranagra <oran@redislabs.com> |
||
|
|
5f8e7852f4
|
Fix: Validate ENTRIESREAD in XGROUP command (#14259)
Fixes #14257 The XGROUP CREATE and SETID subcommands allowed setting an ENTRIESREAD value greater than the stream's total `entries_added` counter. This could lead to a logically inconsistent state. This commit adds a check to ensure the provided ENTRIESREAD value is not greater than the number of entries ever added to the stream. If ENTRIESREAD is too large, it gets set to the total number of entries in the stream, i.e. `s->entries_added`. |
||
|
|
e6c261f3fb
|
Fix MEMORY USAGE command (#14288)
After the key-value unification (kvobj), the MEMORY USAGE command may no longer account for the embedded key length stored within the kvobj. To fix this, replace sizeof(*o) with zmalloc_size((void *)o) to ensure the full allocated size is measured. In this context, the function objectComputeSize() was renamed and modified to kvobjComputeSize(). From computing only the value size to compute the key and its value. |
||
|
|
b9d9d4000b
|
Prevent crash when cgroups_ref is null in streamEntryIsReferenced() after reload (#14276)
This bug was introduced by https://github.com/redis/redis/pull/14130 found by @oranagra ### Summary Because `s->cgroup_ref` is created at runtime the first time a consumer group is linked with a message, but it is not released when all references are removed. However, after `debug reload` or restart, if the PEL is empty (meaning no consumer group is referencing any message), `s->cgroup_ref` will not be recreated. As a result, when executing XADD or XTRIM with `ACKED` option and checking whether a message that is being read but has not been ACKed can be deleted, the cgroup_ref being NULL will cause a crash. ### Code Path ``` xaddCommand -> streamTrim -> streamEntryIsReferenced ``` ### Solution Check if `s->cgroup_ref` is NULL in streamEntryIsReferenced(). |
||
|
|
bec644aab1
|
Fix missing kvobj reassignment after reallocation in MOVE command (#14233)
Introduced by https://github.com/redis/redis/issues/13806 Fixed a crash in the MOVE command when moving hash objects that have both key expiration and field expiration. The issue occurred in the following scenario: 1. A hash has both key expiration and field expiration. 2. During MOVE command, `setExpireByLink()` is called to set the expiration time for the target hash, which may reallocate the kvobj of hash. 3. Since the hash has field expiration, `hashTypeAddToExpires()` is called to update the minimum field expiration time Issue: However, the kvobj pointer wasn't updated with the return value from `setExpireByLink()`, causing `hashTypeAddToExpires()` to use freed memory. |