redis

mirror of https://github.com/redis/redis.git synced 2026-05-28 04:02:46 -04:00

Author	SHA1	Message	Date
debing.sun	95040d61d5	Replace INCREX out-of-bounds policy to a single SATURATE option (#15237 ) Follow https://github.com/redis/redis/issues/15045 ## Summary Simplify INCREX's out-of-bounds policy: The original INCREX shipped with three out-of-bounds policies — OVERFLOW FAIL, OVERFLOW SAT, OVERFLOW REJECT — but FAIL and REJECT are functionally redundant: both leave the key untouched when the result is out of bounds. They differ only in how the caller is notified (error reply vs. [current_value, 0] array reply), which forces the user to make a stylistic choice with no real semantic difference. This PR collapses the three policies into one clear behavior: * Default: the operation is rejected; the key value and TTL are left unchanged, and the reply is [current_value, 0]. Callers detect non-application by checking the applied-increment field; no error-handling branch is required. * SATURATE: the result is saturated to UBOUND / LBOUND, or to the type limits (LLONG_MAX/MIN for BYINT, ±LDBL_MAX for BYFLOAT) when no explicit bound is given. New syntax: INCREX <key> [BYFLOAT increment \| BYINT increment] [LBOUND lowerbound] [UBOUND upperbound] [SATURATE] [EX seconds \| PX milliseconds \| EXAT seconds-timestamp \| PXAT milliseconds-timestamp \| PERSIST] [ENX] --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>	2026-05-20 18:07:14 +08:00
Salvatore Sanfilippo	0d9576435f	Implement the new Redis Array type (#15162 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details # Redis Array For years, Redis has been missing a real indexed data structure for the use cases where the index and the spatial relationship of elements are semantic. Hashes give you random lookups, but you have to store an index as a key, and have no range visibility. Lists give you appending and trimming, but what is in the middle remains hard to access. Streams give you append-only events, which is another (useful, indeed) beast. None of these is what you want when the position itself has business meaning — slot 37, step 4, row 18552, day from 2934 to 2949, file line 11, 12, 15 and so forth. And, all those types, for different reasons, are all suboptimal when you want a ring buffer able to store the latest N observed samples of something. Up to now, users found ways (they always do \o/) using the fact that the data structures that are obvious in this universe are also extremely powerful, if well implemented. But this forces compromises. Arrays handle these index-first requirements natively, and usually with much better memory and CPU usage than the workarounds. If the use case is the right one, Arrays often provide much better space, time and usability at the same time. ## Internal encoding 1. When dense, an Array is essentially a more fancy C array. You don't pay anything for storing the index. 2. Yet, instead of going really flat, arrays are sliced into 4096-element slices, and each slice, when it contains just a few elements, uses a special sparse encoding. When a slice is empty it's just a `NULL` stored in the directory. 3. Small ints, floats, and short strings are pointer-tagged, so they cost zero additional memory beyond the pointer slot itself. 4. When very sparse, a super-directory of windowed directories is used. This allows the data type to be safe, instead of exhibiting pathological space or time behavior. This representation is only triggered when there are more than 8 million elements or very high indexes set. ## Use cases Arrays are mostly stateless if not for the fact that each array remembers the index of the latest added item, allowing `ARINSERT` and `ARRING` to work properly. Otherwise it is a set/get at this index game, with solid support for both setting / getting ranges, server-side scanning, returning only populated elements in a time which is proportional not to the range size, but to the population size. A few concrete examples, that may work as mental models for the set of problems that are similar to them (from the POV of the data modeling). Thermometer. A sensor reporting once per minute, with gaps: ``` ARSET temp:room12:day7 123 22.3 ARGETRANGE temp:room12:day7 600 660 # the 10:00–11:00 window, with NULLs ARSCAN temp:room12:day7 600 660 # only populated elements AROP temp:room12:day7 0 1439 MAX # peak of the day, server-side ``` Missing minutes cost little to nothing. Numeric aggregation runs inside Redis. Telemetry, IoT, meter readings, KPI rollups. Calendar. A clinic with 96 fifteen-minute slots per day: ``` ARSET sched:room12:day 32 booking:991 ARSCAN sched:room12:day 0 95 # only occupied slots ARGETRANGE sched:room12:day 48 63 # the afternoon full view to render ``` The slot number is the business key in this case. Room booking, parking spaces, warehouse bins, lockers, ... Ring buffer. ARRING replaces the classic LPUSH+LTRIM pattern. Imagine remote `dmesg`. ``` ARRING machine:123 200 "[141087.430123]: arm_cpu_init(): cpu 14 online" # Capped to 200 entries ARLASTITEMS machine:123 50 REV # 50 newest first ``` Faster than LPUSH+LTRIM, keep indexed access to past elements. Last-N alarms, recent fraud scores, access history, remote logs, device events. Ok here the use cases are mainly the ones of the old pattern: it is just a better fit and allows to access random items in the middle, aggregate server-side, and so forth. Workflow. Step number is the index, value is the status. Gaps are meaningful: ``` ARSET claim:99172 0 received ARSET claim:99172 3 waiting:reviewer42 ARSET claim:99172 5 approved ARGETRANGE claim:99172 0 5 # full workflow view, with NULLs for missing steps ARSCAN claim:99172 0 5 # only steps that have a state ARCOUNT claim:99172 # number of recorded steps ARLEN claim:99172 # highest reached step + 1 ``` Skills knowledge base for agents. Arrays are good at representing / grepping into Markdown files: ``` ARSET skill:metal_gpu 0 "...." ARSET skill:metal_gpu 1 "...." ARSET skill:metal_gpu 2 "...." ARGREP skill:metal_gpu - + RE "M3\|M4" WITHVALUES ``` ARGREP has EXACT, MATCH, GLOB, RE, you can have multiple predicates, can select AND or OR behavior. Bulk import results. Sparse row annotations over millions of rows / CSV / ...: ``` ARSET import:job551 18552 ERR:bad_email ARSCAN import:job551 0 1000000 # Provides only rows that have something ``` ## TLDR If the position is part of the meaning, use an Array. If you want to aggregate or grep remotely, use an Array. Feedback welcome :) --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Shubham S Taple <155555100+ShubhamTaple@users.noreply.github.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com> Co-authored-by: Marc Gravell <marc.gravell@gmail.com>	2026-05-14 00:56:44 +08:00
raffertyyu	9c1ecd044e	Add INCREX command for atomic increment with TTL and bounds (#15045 ) Some checks failed CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Has been cancelled Details Close #14278 ## Overview Rate limiters, sliding windows, request counters, and numerous other network-facing patterns share a common primitive: atomically increment a counter and set its expiration. Achieving this in Redis requires either multiple round-trips or a Lua script that bundles `INCR` / `INCRBY` / `INCRBYFLOAT` with `EXPIRE` / `PEXPIRE`. We propose a new command, `INCREX`, that collapses this two-step pattern into a single, native, O(1) command. `INCREX` atomically: 1. Increments (or decrements) a key's numeric value — by integer or float. 2. Optionally enforces lower and/or upper bounds, with a configurable overflow policy (error out, saturate, or no-op), enabling built-in cap enforcement (e.g., max request count) without additional client logic. 3. Optionally sets or removes the key's expiration. 4. Returns both the new value and the actual increment applied, giving the caller immediate feedback on whether the operation was saturated or skipped. ## Use Cases ### Basic Usage ``` # Increment by 1 (default) and set a 60-second TTL. > SET mykey 10 > INCREX mykey EX 60 1) (integer) 11 # new value 2) (integer) 1 # actual increment # Use 0 as initial value if the key doesn't exist. > DEL mykey > INCREX mykey 1) (integer) 1 # new value 2) (integer) 1 # actual increment # Default policy (OVERFLOW FAIL): exceeding a bound returns an error. > SET mykey 5 > INCREX mykey BYINT 20 UBOUND 10 (error) value is out of bounds # Opt into saturation with OVERFLOW SAT. > INCREX mykey BYINT 20 UBOUND 10 OVERFLOW SAT 1) (integer) 10 # saturated to upper bound 2) (integer) 5 # only 5 was actually applied # Skip the operation with OVERFLOW REJECT — the key and its TTL are # untouched, and the reply reports the current value with a zero delta. > SET mykey 5 > INCREX mykey BYINT 20 UBOUND 10 OVERFLOW REJECT 1) (integer) 5 # current (unchanged) value 2) (integer) 0 # nothing was applied # Increment by a float > SET mykey 1 > INCREX mykey BYFLOAT 0.5 1) "1.5" 2) "0.5" ``` ### Use Case: Rate Limiter Before (Lua script): ```lua -- KEYS[1] = rate limit key, ARGV[1] = limit, ARGV[2] = window in seconds local current = redis.call('INCR', KEYS[1]) if current > tonumber(ARGV[1]) then return 0 -- rejected end if current == 1 then redis.call('EXPIRE', KEYS[1], ARGV[2]) end return 1 -- allowed ``` Client invocation: ```python result = redis.eval(LUA_SCRIPT, 1, f"ratelimit:{user_id}", 100, 60) if result == 0: reject_request() ``` After (INCREX): ```python new_val, actual_incr = redis.execute_command( "INCREX", f"ratelimit:{user_id}", "UBOUND", 100, "OVERFLOW", "REJECT", "EX", 60, "ENX" ) if actual_incr == 0: # Rate limit exceeded — key left unchanged. reject_request() ``` `ENX` means: set expiration only if the key doesn't already have an expiration. This ensures the sliding window's TTL is only set on the first request. ### Use Case: Token Bucket Refill Refill tokens periodically up to a capacity ceiling, saturating at the cap instead of erroring: ``` > INCREX tokens:user123 BYINT 10 UBOUND 100 OVERFLOW SAT EX 3600 ENX 1) (integer) 10 2) (integer) 10 ``` Tokens cannot exceed 100, and the key auto-expires after inactivity. ### Use Case: Countdown / Resource Consumption Decrement a resource counter down to zero, saturating at the floor: ``` > SET credits:user123 50 > INCREX credits:user123 BYINT -1 LBOUND 0 OVERFLOW SAT 1) (integer) 49 2) (integer) -1 ``` When credits are exhausted, `OVERFLOW SAT` prevents negative balances without client-side checks. ## Parameter Reference ### Syntax ``` INCREX key [BYFLOAT increment \| BYINT increment] [LBOUND lowerbound] [UBOUND upperbound] [OVERFLOW <FAIL \| SAT \| REJECT>] [EX seconds \| PX milliseconds \| EXAT unix-time-seconds \| PXAT unix-time-milliseconds \| PERSIST] [ENX] ``` ### Parameters \| Parameter \| Description \| \|-----------\|-------------\| \| `key` \| The key to increment. Created with value `0` if it does not exist. \| \| `BYFLOAT increment` \| Increment the value by the given long-double float. \| \| `BYINT increment` \| Increment the value by the given 64-bit signed integer. \| \| `LBOUND lowerbound` \| Set lower bound for the increment result. Defaults to `LLONG_MIN` (integer) or `-LDBL_MAX` (float). \| \| `UBOUND upperbound` \| Set upper bound for the increment result. Defaults to `LLONG_MAX` (integer) or `LDBL_MAX` (float). \| \| `OVERFLOW <FAIL \\| SAT \\| REJECT>` \| Set the overflow policy when the result would be out of bounds. `FAIL` rejects the operation with an error (default). `SAT` saturates the result to the bound. `REJECT` leaves the key and its TTL untouched and replies with the current value and a zero delta. \| \| `EX seconds` \| Set the key's TTL to `seconds` seconds. \| \| `PX milliseconds` \| Set the key's TTL to `milliseconds` milliseconds. \| \| `EXAT unix-time-seconds` \| Set the key's expiration to the absolute Unix timestamp in seconds. \| \| `PXAT unix-time-milliseconds` \| Set the key's expiration to the absolute Unix timestamp in milliseconds. \| \| `PERSIST` \| Remove the key's existing TTL. \| \| `ENX` \| Set the key's TTL/expiration if it has No eXpiration \| If neither `BYINT` nor `BYFLOAT` is specified, the increment defaults to integer `1`. ### Return Value An array of two elements: 1. New value — the value of the key after the increment (or the unchanged current value under `OVERFLOW REJECT`). 2. Actual increment — the increment that was actually applied. May differ from the requested increment when `OVERFLOW SAT` saturates the result to a bound, and is always `0` when `OVERFLOW REJECT` skipped the operation. - In integer mode (default or `BYINT`): both elements are integers. - In float mode (`BYFLOAT`): both elements are bulk strings representing the float values on RESP2, and RESP3 Doubles on RESP3. ### Overflow Policy (FAIL vs. SAT vs. REJECT) Controlled by the optional `OVERFLOW` argument. A bound violation includes both exceeding an explicit `LBOUND`/`UBOUND` and overflowing the type limits when no explicit bound is given. - `OVERFLOW FAIL` (default): if the computed result would violate a bound, the command returns an error and the key is left unchanged. This matches the existing semantics of `INCRBY` / `INCRBYFLOAT` on overflow. - `OVERFLOW SAT`: the result is silently capped at `UBOUND` / floored at `LBOUND` (or saturated to the type limits when no explicit bound is given). The second element of the reply reflects the saturated delta. If the delta cannot be represented as a 64-bit signed integer(default or `BYINT`), or would produce Infinity(`BYFLOAT`), an error is returned. - `OVERFLOW REJECT`: the operation is silently skipped — the key value and its TTL are left unchanged, no keyspace notification is fired, and nothing is replicated. The reply is `[current_value, 0]`, allowing the caller to detect the rejection without handling an error. ### Notes - If no expiration option is given, the key's existing TTL is preserved (like `INCR`). - `ENX` requires one of `EX`/`PX`/`EXAT`/`PXAT`. - If the result is saturated by `OVERFLOW SAT`, the expiration is still applied as specified. - Under `OVERFLOW REJECT` the expiration option is ignored on the rejected branch — TTL is preserved exactly as it was before the call. - `BYINT` requires an integer-typed existing value; `BYFLOAT` accepts both. Integers can be promoted to floats losslessly, but a stored float (e.g. `"1.5"`) cannot be parsed back as an integer. This is consistent with `INCR`/`INCRBY` (integer-only) and `INCRBYFLOAT` (accepts both). --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> Co-authored-by: Moti Cohen <moti.cohen@redis.com> Co-authored-by: oranagra <oran@redislabs.com>	2026-05-11 18:22:27 +08:00
Filipe Oliveira (Redis)	62551a7b12	Batched MGET/MSET dict prefetch with dictType-driven payload hints (#15133 ) Reduce MGET / MSET latency by overlapping the dict-lookup memory accesses across the keys of a single multi-key command. Builds on the cross-command batched prefetch framework introduced in #14017 and the dict-prefetch state machine in `memory_prefetch.c`, and lifts the kvobject-aware bits out of the state machine into two new `dictType` callbacks so the same machinery can be reused for other dict-encoded types later (hash hashtable, sets, sorted sets) without paying for `kvobj`-specific code paths in the core loop. Bundles the work originally proposed in #14899 (MGET prefetch framework, by @mpozniak95) and #15043 (MSET batch prefetch). ## Design Two new optional callbacks on `dictType`: ```c typedef struct dictType { ... /* Bring the entry's key payload into cache before keyCompare runs. * Returns the address to prefetch, or NULL if the entry alone is enough. / void (prefetchEntryKey)(const dictEntry de); /* Called only after a key match. Returns the value-side payload to * prefetch (or NULL). / void (prefetchEntryValue)(const dictEntry de); } dictType; ``` `dbDictType` registers both. The kv-aware logic — the `dictEntryIsKey()` shortcut for embedded kvobjs, and `kv->ptr` for `OBJ_STRING` / `OBJ_ENCODING_RAW` values — now lives in two small helpers in `server.c`: ```c static void dbDictPrefetchEntryKey(const dictEntry de) { return dictEntryIsKey(de) ? NULL : dictGetKey(de); } static void dbDictPrefetchEntryValue(const dictEntry de) { kvobj kv = dictGetKey(de); return (kv->type == OBJ_STRING && kv->encoding == OBJ_ENCODING_RAW) ? kv->ptr : NULL; } ``` The `PrefetchGetValueDataFunc` typedef and the per-call `get_val_data` parameter on `dictPrefetchKeys()` / `dictPrefetch()` are removed — the dict's own type drives both ends. This also removes the foot-gun where callers (like `mgetCommand`) had to remember whether to pass `prefetchGetObjectValuePtr` or `NULL`. `memory_prefetch.c` no longer references `kvobj`, `kvobjGetKey`, or any specific value layout. ## State machine Two file-local types in `memory_prefetch.c`: \| Type \| Role \| \|---\|---\| \| `dictPrefetchLookup` \| Per-key snapshot of an in-flight, software-pipelined `dictFind` (mirrors the locals a synchronous `dictFind` would carry across one bucket walk). \| \| `dictPrefetcher` \| Driver that advances a batch of `dictPrefetchLookup`s through the FSM, yielding to the next in-flight lookup each time a prefetch is issued. \| Five-stage lifecycle for each lookup, driven by the prefetcher: ```text │ start │ ┌────────▼─────────┐ ┌─────────►│ PREFETCH_BUCKET ├────►────────┐ │ └────────┬─────────┘ no more tables │ bucket│found │ │ │ │ entry not found - goto next table ┌────────▼────────┐ │ └────◄─────┤ PREFETCH_ENTRY │ ▼ ┌────────────►└────────┬────────┘ │ │ entry│found │ │ │ │ │ ┌───────────▼─────────────┐ │ │ │ PREFETCH_ENTRY_KEY │ ◄── dictType->prefetchEntryKey(de) │ └───────────┬─────────────┘ │ │ │ │ key mismatch - goto next entry │ │ │ ┌───────────▼─────────────┐ │ └──────◄───│ PREFETCH_ENTRY_VALUE │ ◄── keyCompare; on match, └───────────┬─────────────┘ dictType->prefetchEntryValue(de) │ │ ┌─────────▼─────────────┐ │ │ PREFETCH_DONE │◄────────┘ └───────────────────────┘ ``` `PREFETCH_BUCKET` first picks `ht_table[0]`, then flips to `ht_table[1]` if the dict is mid-rehash, then transitions to `PREFETCH_DONE` if no more tables remain. `memory_prefetch.c` exposes a small lifecycle that any caller can drive: ```c dictPrefetcherInit(p, max_keys); / one-shot heap alloc of lookups[] / dictPrefetcherReset(p, dicts, keys, nkeys); / configure for one batch / dictPrefetcherRun(p); / drive FSM until all PREFETCH_DONE / dictPrefetcherFree(p); / release / ``` Each FSM stage is a named static function (`dictPrefetchBucket`, `dictPrefetchEntry`, `dictPrefetchEntryKey`, `dictPrefetchEntryValue`), so the `dictPrefetcherRun` driver is a four-line `switch` over the state. The state machine is dict-pure: no `kvobj` field on `dictPrefetchLookup`, no `kvobjGetKey` reach-through. Round-robin advance semantics — a state only advances the cursor if a prefetch was actually issued — are preserved, so the embedded-kvobj fast path (`dictEntryIsKey(de) == 1` → callback returns NULL) still skips the extra prefetch and falls straight into the compare on the next loop iteration. The cross-command path (`prefetchCommands` / `PrefetchCommandsBatch`) embeds a `dictPrefetcher` initialized once at startup and reset per batch, so cross-command prefetching no longer allocates per call. ## Intra-command API ```c void dictPrefetchKeys(dict dicts, void keys, size_t nkeys); ``` A single multi-key command (e.g. MGET) can prefetch dict data for a batch of its own keys, reusing the same state machine that the cross-command path uses. Single-key calls (`nkeys <= 1`) early-return — nothing to interleave with. The implementation stack-allocates a fixed-size lookup array bounded by `DICT_PREFETCH_MAX_SIZE = 64` (no VLA, predictable stack usage), so the intra-command path doesn't touch the heap. ## Notes on the call sites A shared helper picks the next prefetch batch and warms it via `dictPrefetchKeys`: ```c / Pick the next prefetch batch starting at argv[start] and warm it via * dictPrefetchKeys. 'stride' is 1 for keys-only args (MGET) or 2 for * key/value pairs (MSET). Returns the chosen batch size in items. / static int prefetchKeysBatch(client c, int slot, int start, int stride); ``` Adaptive batch sizing inside the helper: if at least two full batches (`PREFETCH_BATCH_SIZE * 2 = 32` items) remain, take one batch (`PREFETCH_BATCH_SIZE = 16`); otherwise take all remaining items in one call. This generalizes the small-request fast path so the trailing batch of a large request also gets the single-call benefit. - MGET (`mgetCommand`) — gated by `do_prefetch = server.prefetch_batch_max_size && !already_prefetched && numkeys > 1`, with `already_prefetched = c->current_pending_cmd && (c->current_pending_cmd->flags & PENDING_CMD_KEYS_PREFETCHED)`. When `do_prefetch` is set, each iteration calls `prefetchKeysBatch(c, slot, j, 1)` and then sequentially `lookupKeyRead`s + replies the chosen batch. When `do_prefetch` is clear (cross-command path already warmed the keys, or batch prefetching is off), the loop takes all remaining items in one go and skips the prefetch. - MSET / MSETNX (`msetGenericCommand`) — same `do_prefetch` gate as MGET with `stride = 2`. For the NX flag the NX-check loop runs `lookupKeyWrite` (which already warmed everything via `prefetchKeysBatch`); the SET loop then disables further prefetch (`do_prefetch &&= !nx`) so we don't re-prefetch on the second pass. Going through the full state machine (rather than bucket-only) means `dbDictType`'s `prefetchEntryValue` callback runs on a key match — warming the old kvobj's payload, which `setKey -> dbReplaceValue -> updateKeysizesHist(oldlen, newlen)` then reads to compute the histogram delta. The slot dict is re-fetched per batch — in cluster mode the slot dict can be freed mid-MSET (`KVSTORE_FREE_EMPTY_DICTS` + `expireIfNeeded`), so a cached pointer would otherwise dangle. - Cross-command batch path (`addCommandToBatch`) — sets `PENDING_CMD_KEYS_PREFETCHED` on every command added to the batch, even on partial-batch overflow (was: only when ALL keys fit). The intra-command path then uniformly skips supplemental prefetching for any command the batch touched. Rationale: running both paths (cross-command warm + intra-command supplement) caused a measured −9.6 % regression on x86 with pipeline-10, and the partial cross- command warmup is sufficient for the head of the keyset; the cold tail goes through normal lookup, which is still cheaper than running the FSM a second time on already-warm keys. - Future types: each dict's `dictType` can register its own `prefetchEntryKey` / `prefetchEntryValue` (e.g. for the hashtable hash encoding, the field-sds and value-sds payloads), without touching `memory_prefetch.c`. ## Benchmark validation On x86, performance improvements are significant for larger batch sizes: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +89.44% - 5Mkeys-string-mget-100B-100keys: +37.33% - 5Mkeys-string-mget-100B-30keys: +22.40% On ARM (Graviton4), the gains are even more pronounced: - 5Mkeys-string-mget-10B-100keys-pipeline-10: +128.34% - 5Mkeys-string-mget-100B-100keys-pipeline-10: +46.76% Overall, the improvement scales with batch size, while a few small-batch cases show marginal gains or slight regressions. --------- Co-authored-by: Marcin Poźniak <marcin.pozniak@intel.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>	2026-05-11 10:22:25 +08:00
Moti Cohen	5a05863e97	t_string: rewrite SET GET propagation in place (#15114 ) Optimize SET key value GET propagation rewriting in setGenericCommand() by removing GET arguments in-place with rewriteClientCommandArgument(). This avoids the overhead of allocating a new argv vector and incrementing reference counts for every retained argument. The optimization is scoped to the no-expire SET ... GET rewrite path. It also adds test coverage for cases with repeated GET tokens to ensure robust string semantics and consistent replication behavior. Changes: - Use rewriteClientCommandArgument(c, j, NULL) for in-place removal. - Eliminate redundant argv allocations and refcount increments. - Improve performance of SET GET in high-throughput write streams.	2026-04-28 10:05:59 +03:00
sggeorgiev	63f02e7876	Fix double ERR prefix in XNACK error replies (#15091 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Several `addReplyError` and `addReplyErrorFormat` calls in `xnackCommand` included a redundant `"ERR "` prefix in the message string. Since `addReplyErrorLength` already prepends `-ERR ` to the RESP reply, clients received `ERR ERR ...` for these error paths. This PR removes the redundant prefix from all five affected calls and tightens the corresponding test patterns to match from the beginning of the error message (`"ERR ..."` instead of `"..."`), so any future double-prefix regression will be caught.	2026-04-22 09:12:04 +03:00
Filipe Oliveira (Redis)	0fa78fd8fd	perf: widen fast_float_strtod fast path to 17-19 digit mantissas (#15061 ) Some checks failed CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Reply-schemas linter / reply-schemas-linter (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details ## Root cause Roughly 50% of random double scores generated by the ZADD listpack workload have 17-19 significant digits, which exceed `MAX_MANTISSA_FAST_PATH` (`2^53`). These inputs fall through to the `strtod()` fallback: ```c char static_buf[128]; memcpy(buf, nptr, len); /* memcpy back! / buf[len] = '\0'; / null-term / double result = strtod(buf, ...); / glibc strtod — ~10× slower on ARM / ``` The original C++ `fast_float` library handled the same 17-19 digit inputs with Eisel-Lemire / bigint arithmetic without falling back to `strtod()`. That is what the pure-C replacement lost. ## Fix Compute `mantissa 10^exponent` in 128-bit integer arithmetic using `__uint128_t`, then convert to double with a single IEEE round-to-nearest-even cast. Supported for `\|exp\| in [0, 19]` where `10^\|exp\|` fits in `uint64`; cases outside that range (or otherwise outside the fast path's preconditions) still fall through to `strtod()`. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2026-04-20 20:45:49 +08:00
Vitah Lin	8aeea8c210	Increase threshold for HPEXPIRETIME persists after RDB reload test (#15047 )	2026-04-17 16:36:02 +08:00
Moti Cohen	fa6d4c3d63	Fix SIGABRT in HSETEX when a field appears twice in the FIELDS list (#14956 ) HSETEX crashed on assert() with a SIGABRT when the same field appeared more than once in the FIELDS list and an expiry time was given (EX/PX/EXAT/PXAT). Root cause: hfieldPersist() and the KEEP_TTL path in hashTypeSet() both asserted that dictExpireMeta->expireMeta.trash == 0, meaning the hash must be globally registered in the HFE DS. This is incorrect during HSETEX execution because hashTypeSetExDone(), which registers the hash globally and clears trash, called only at the end of flow. The private per-field ebuckets are fully valid regardless of the global registration state. Fix: Remove both incorrect assertions. The operations on the private ebuckets (ebRemove in hfieldPersist, ebAdd in the KEEP_TTL path) are correct and do not require the hash to be globally registered. Tests: Added two regression tests covering the crash scenarios: - HSETEX EX with a duplicate field (existing field, expiry given) - HSETEX FNX EX with a duplicate field (no prior field, FNX condition passes)	2026-04-16 13:16:52 +03:00
Sergei Georgiev	80f1ebda88	Add AGGREGATE COUNT option to ZUNION, ZINTER, ZUNIONSTORE, and ZINTERSTORE (#14892 ) Some checks failed CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Has been cancelled Details ### Overview This PR adds a new `COUNT` aggregation mode to the `ZUNIONSTORE`, `ZINTERSTORE`, `ZUNION`, and `ZINTER` sorted set commands. When `AGGREGATE COUNT` is specified, the resulting score for each element reflects how many input sets contain it (optionally scaled by `WEIGHTS`), rather than combining the actual scores of the elements. This enables a common use case — counting set membership frequency — directly at the command level, without application-side workarounds. ### Problem Statement For developers who need to know how many input sorted sets contain each element, there is no single-command solution today. Example: given several game leaderboards, find how many leaderboards each player appears in. The existing aggregation modes (`SUM`, `MIN`, `MAX`) all operate on the elements' scores. To ignore scores and just count set membership, you'd currently need to copy each sorted set with all scores set to 1, then run `ZUNIONSTORE`/`ZINTERSTORE` with `SUM` — requiring multiple round trips, temporary keys, and application-level locking to avoid races. A `COUNT` aggregation mode solves this directly. ### Solution Introduces `AGGREGATE COUNT` as a fourth aggregation mode: - `ZINTER numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE <SUM \| MIN \| MAX \| COUNT>] [WITHSCORES]` - `ZINTERSTORE destination numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE <SUM \| MIN \| MAX \| COUNT>]` - `ZUNION numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE <SUM \| MIN \| MAX \| COUNT>] [WITHSCORES]` - `ZUNIONSTORE destination numkeys key [key ...] [WEIGHTS weight [weight ...]] [AGGREGATE <SUM \| MIN \| MAX \| COUNT>]` When `COUNT` is specified, the scores in the input sets are ignored. Note that `WEIGHTS` is not ignored — each set contributes its weight (default 1) per element, and the contributions are summed. Implementation details: A new helper function `zuiWeightedScore()` computes the per-set contribution: ```c inline static double zuiWeightedScore(double score, double weight, int aggregate) { return (aggregate == REDIS_AGGR_COUNT) ? weight : weight * score; } ``` The `zunionInterAggregate()` function treats `COUNT` identically to `SUM` — it adds the per-set contributions. All four call sites where `weight * score` was previously computed inline are updated to use `zuiWeightedScore()`. ### Examples ``` > ZADD s1 1 foo 1 bar > ZADD s2 2 foo 2 bar > ZADD s3 3 foo ``` With `SUM` (existing behavior, for comparison): ``` > ZINTERSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE SUM (integer) 1 > ZRANGE t1 0 -1 WITHSCORES 1) "foo" 2) "29" > ZUNIONSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE SUM (integer) 2 > ZRANGE t1 0 -1 WITHSCORES 1) "bar" 2) "20" 3) "foo" 4) "29" ``` With `COUNT` and `WEIGHTS`: ``` > ZINTERSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE COUNT (integer) 1 > ZRANGE t1 0 -1 WITHSCORES 1) "foo" 2) "18" > ZUNIONSTORE t1 3 s1 s2 s3 WEIGHTS 10 5 3 AGGREGATE COUNT (integer) 2 > ZRANGE t1 0 -1 WITHSCORES 1) "bar" 2) "15" 3) "foo" 4) "18" ``` With `COUNT` and no specified `WEIGHTS` — resulting score equals the number of input sorted sets containing the element: ``` > ZINTERSTORE t1 3 s1 s2 s3 AGGREGATE COUNT (integer) 1 > ZRANGE t1 0 -1 WITHSCORES 1) "foo" 2) "3" > ZUNIONSTORE t1 3 s1 s2 s3 AGGREGATE COUNT (integer) 2 > ZRANGE t1 0 -1 WITHSCORES 1) "bar" 2) "2" 3) "foo" 4) "3" ``` ### Backward Compatibility This is a fully additive change. The new `COUNT` keyword is only recognized after the `AGGREGATE` token in the four affected commands. Existing commands, arguments, and default behavior (`AGGREGATE SUM`) are completely unchanged. No new command is introduced, and no existing response format is modified.	2026-04-14 09:21:53 +03:00
Moti Cohen	e1d35aca01	Fix HEXPIRE numfields overflow (#15021 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Validate HEXPIRE-family field counts without parser overflow keep flexible option order; only require fields fit in argv add tests for INT_MAX numfields across HEXPIRE/HPEXPIRE/HEXPIREAT/HPEXPIREAT	2026-04-13 09:46:46 +03:00
h.o.t. neglected	e8da0e5b47	Fix brittle assert_match patterns for unexpected slowlog fields (#14948 )	2026-04-13 14:45:14 +08:00
Sergei Georgiev	0be39e5032	Fix missing consumer propagation on empty XREADGROUP (#14963 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details ## Summary Fixes consumer replication inconsistency when `XREADGROUP` is called for a new consumer but no `XCLAIM` commands are propagated to the replica. Previously, consumer creation was only propagated to replicas when `noack=true`, relying on `XCLAIM` propagation to implicitly create the consumer in the non-NOACK path. However, if no messages exist to read, no `XCLAIM` is generated, and the consumer is silently lost on the replica. This is a follow-up to the original fix in [redis/redis#7140](https://github.com/redis/redis/issues/7140) / [redis/redis#7526](https://github.com/redis/redis/pull/7526), which introduced `XGROUP CREATECONSUMER` propagation but only for the `NOACK` case. ## Changes - `xreadgroupCommand` (src/t_stream.c): Replaced the `if (noack)` guard around the `streamPropagateConsumerCreation()` call with a deferred check after `streamReplyWithRange()`. Consumer creation is now propagated when `noack \|\| propCount == 0` — that is, only when no `XCLAIM` commands were generated. This avoids redundant propagation in the common case where `XCLAIM` already implicitly creates the consumer on the replica, while correctly handling both the NOACK path (where PEL/XCLAIM is skipped entirely) and the no-messages path (where there is nothing to XCLAIM). - Test (tests/unit/type/stream-cgroups.tcl): Added replication test `"XREADGROUP propagates new consumer to replica"` that sets up a master-replica pair and verifies consumer propagation in two cases: (1) without NOACK when no messages are available to deliver, and (2) with NOACK when messages are delivered but XCLAIM is skipped. ## Benefits - Master-replica consistency: Consumers created by `XREADGROUP` are now visible on replicas whenever no `XCLAIM` would otherwise create them — covering both the NOACK path and the empty-stream path. - No redundant propagation: The noack \|\| propCount == 0 condition avoids emitting a superfluous XGROUP CREATECONSUMER when XCLAIM commands are already propagated and would implicitly create the consumer on the replica.	2026-04-08 14:59:22 +03:00
Sergei Georgiev	747dfe578e	Add XNACK command for releasing stream messages back to the group (#14797 ) Some checks failed CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Has been cancelled Details ### Overview This PR enhances Redis Streams consumer groups by adding a new `XNACK` command that allows consumers to explicitly release pending messages back to the group without acknowledging them. Released (NACKed) entries become immediately available for re-delivery to other consumers, eliminating the idle-timeout delay currently required for message recovery. The command supports three modes — SILENT, FAIL, and FATAL — giving consumers fine-grained control over delivery counter semantics to handle graceful shutdowns, transient failures, and poison messages respectively. ### Problem Statement For developers using Redis Streams with consumer groups, there are several common scenarios where a consumer needs to release a message it has claimed without acknowledging it: 1. Transient internal failures: A consumer may fail to process a message because of problems unrelated to the message itself — for example, it cannot connect to an external service to fetch required context. The message is perfectly valid and should be retried promptly by another consumer. 2. Resource pressure: A consumer under resource stress (low CPU, low memory) may be unable to handle a specific message (e.g., a complex or large message) within acceptable QoS. It should leave the opportunity to other consumers in the group, with minimal delay. 3. Graceful shutdown: A consumer about to shut down would like to immediately release all unprocessed messages it has claimed, so they can be picked up by remaining consumers without waiting for idle timeouts. 4. Poison / malicious messages: A consumer may detect or suspect that a claimed message is invalid or malicious and wants to mark it as permanently failed (for dead-letter queue processing when available). Currently, a consumer cannot NACK a message. It can either: - XACK it — marks it as "processed" and removes it from the PEL entirely, losing the ability to redeliver it - Leave it pending — requires other consumers to discover it via `XPENDING` and claim it via `XCLAIM`/`XAUTOCLAIM` or `XREADGROUP CLAIM` after the idle timeout expires, introducing a long, unnecessary delay In all these cases, the logic that applications must implement introduces message handling delays, implementation complexity, and code duplication across consumer implementations. ### Solution Introduces a new `XNACK` (Negative ACKnowledge) command that explicitly releases pending messages from their owning consumer back to the group's PEL, making them immediately claimable via `XCLAIM` and `XAUTOCLAIM`, and prioritized for re-delivery in `XREADGROUP CLAIM`: ``` XNACK key group <SILENT\|FAIL\|FATAL> IDS numids id [id ...] [RETRYCOUNT count] [FORCE] ``` When executed, the command: 1. Disassociates the entry from its owning consumer (`consumer = NULL`) 2. Repositions the entry to the head of the PEL time-ordered list (`delivery_time = 0`), making it immediately claimable with any `min-idle-time` threshold 3. Adjusts the delivery counter based on the specified mode, giving consumers fine-grained control over retry semantics 4. Returns the count of successfully NACKed entries Mode controls the delivery counter adjustment and communicates the reason for the NACK: \| Mode \| Delivery Counter Behavior \| Use Case \| \|----------\|---------------------------------------------------\|---------------------------------------------\| \| `SILENT` \| Decrement by 1 (undo the delivery increment) \| Consumer shutdown / transient internal error — the delivery "didn't count" \| \| `FAIL` \| No change (keep the incremented value) \| Message too complex for this consumer, but may work for others — count this as an attempt \| \| `FATAL` \| Set to `LLONG_MAX` \| Invalid / suspected malicious message — mark as permanently failed \| The three modes map directly to the real-world scenarios above: - SILENT for graceful shutdown or transient failures unrelated to the message - FAIL for resource-constrained consumers that cannot handle a specific message - FATAL for poison message detection and dead-letter queue integration Optional parameters: - `RETRYCOUNT count`: Directly sets `delivery_count` to the specified value, overriding the mode-based adjustment - `FORCE`: Creates new unowned PEL entries for IDs that are not already in the group PEL (the entry must exist in the stream). When `FORCE` creates an entry, the delivery counter is set to `0` (or to `RETRYCOUNT` if specified, or to `LLONG_MAX` if mode is `FATAL`). This is used internally for AOF rewrite and replication. ### Response Format The command returns an integer — the number of messages successfully NACKed (released back to the group PEL): ``` 127.0.0.1:6379> XADD mystream 1-0 f v1 "1-0" 127.0.0.1:6379> XADD mystream 2-0 f v2 "2-0" 127.0.0.1:6379> XGROUP CREATE mystream grp 0 OK 127.0.0.1:6379> XREADGROUP GROUP grp c1 STREAMS mystream > 1) 1) "mystream" 2) 1) 1) "1-0" 2) 1) "f" 2) "v1" 2) 1) "2-0" 2) 1) "f" 2) "v2" 127.0.0.1:6379> XNACK mystream grp FAIL IDS 2 1-0 2-0 (integer) 2 ``` After XNACK, the entries appear with an empty consumer in XPENDING output: ``` 127.0.0.1:6379> XPENDING mystream grp - + 10 1) 1) "1-0" 2) "" 3) (integer) -1 4) (integer) 1 2) 1) "2-0" 2) "" 3) (integer) -1 4) (integer) 1 ``` ### NACK Zone: Data Structure Extension To support unowned PEL entries and ensure they are prioritized for re-delivery, a NACK zone is introduced at the head of the existing PEL time-ordered doubly-linked list. A new `pel_nack_tail` pointer is added to the `streamCG` structure: PEL ordering: ``` [pel_time_head] <-> ... <-> [pel_nack_tail] <-> [owned entries...] <-> [pel_time_tail] \|_____________ NACK zone ______________\| \|_______ normal PEL ________\| ``` The head of the PEL contains all NACKed messages (FIFO-ordered), followed by all delivered messages that were not NACKed (same order as today). This ensures NACKed messages are always prioritized over idle pending messages. The delivery order for `XREADGROUP` is therefore: 1. If `CLAIM` was specified: first deliver NACKed messages, then deliver due pending messages (current behavior) 2. Deliver new entries after the group's last-delivered-id (current behavior) Structure Design: - NACKed entries occupy positions from `pel_time_head` to `pel_nack_tail` in the time-ordered list - Their `delivery_time` is set to `0`, ensuring they always appear "oldest" and are immediately claimable - Their `consumer` pointer is set to `NULL`, marking them as unowned - `pel_nack_tail` is `NULL` when no NACKed entries exist Key Properties: - O(1) insertion: New NACKed entries are inserted right after `pel_nack_tail` (or at the list head if the zone is empty) - FIFO ordering among NACKed entries: entries are NACKed in the order they are released - Immediate claimability: Since `delivery_time = 0`, NACKed entries have maximum idle time and satisfy any `min-idle-time` threshold in `XCLAIM` and `XAUTOCLAIM`, In `XREADGROUP CLAIM`, NACKed entries are also prioritized over other pending entries due to their position at the head of the PEL. - Zone integrity: The `pelListInsertSorted` function is updated to stop scanning at the `pel_nack_tail` boundary, ensuring owned entries are never placed inside the NACK zone ### Impact on Existing Commands All commands that interact with the PEL are updated to handle unowned (`consumer = NULL`) entries: - XPENDING: Shows NACKed entries with an empty consumer name - XCLAIM / XAUTOCLAIM: Can claim NACKed entries (they satisfy any min-idle-time since `delivery_time = 0`) - XREADGROUP CLAIM: NACKed entries are picked up by the claim phase - XACK: Works correctly on NACKed entries (removes from group PEL) - XINFO STREAM FULL: Displays NACKed entries with an empty consumer name - XGROUP DELCONSUMER: Unaffected — NACKed entries are not in any consumer's PEL Propagation is also updated: when `XCLAIM` or `XAUTOCLAIM` encounters a deleted stream entry for an unowned NACK, it propagates `XACK` (instead of `XCLAIM`) to replicas and AOF, since there is no source consumer to reference. ### Persistence RDB: - A new RDB type `RDB_TYPE_STREAM_LISTPACKS_5` (type 27) is introduced - After saving consumer PEL entries, the NACK zone stream IDs are saved separately (count + encoded IDs) - On load, NACK zone entries are reconstructed by looking them up in the group PEL, unlinking from their sorted position, and re-inserting into the NACK zone via `pelListInsertNacked` - Backward compatibility is preserved: old RDB types continue to load with the existing validation (all entries must have consumers) AOF: - AOF rewrite emits `XNACK <key> <group> FAIL IDS <n> <id...> RETRYCOUNT <cnt> FORCE` commands for entries in the NACK zone - Consecutive entries with the same `delivery_count` are batched into a single command (up to `AOF_REWRITE_ITEMS_PER_CMD` IDs per command) ### Defragmentation The defragmentation logic is restructured to handle unowned entries: - `defragStreamCGPendingEntry` (new): Walks the group-level PEL rax, defragments each NACK, updates the doubly-linked list pointers (`pel_prev`, `pel_next`), `pel_time_head`, `pel_time_tail`, `pel_nack_tail`, and the consumer PEL back-pointer for owned entries - `defragStreamConsumerPendingEntry` (simplified): Only fixes up back-pointers to the possibly-relocated consumer and CG, since actual defragmentation is now done at the group-level walk. Unowned (NACK zone) entries have no consumer PEL walk, so the group-level pass is their only chance ### Key Benefits - Immediate re-delivery: NACKed entries are instantly claimable by other consumers via `XCLAIM` and `XAUTOCLAIM` (since `delivery_time = 0` satisfies any `min-idle-time`), and prioritized for re-delivery in `XREADGROUP CLAIM`, eliminating idle-time delays that can range from seconds to minutes - Explicit release semantics: Consumers can release messages intentionally, with fine-grained control over retry behavior — a capability that exists in competing systems like RabbitMQ - Flexible retry control: Three modes (SILENT, FAIL, FATAL) plus RETRYCOUNT cover the full spectrum of failure handling strategies, from graceful shutdown to poison message detection - Reduced application complexity: Eliminates the need for application-level workarounds involving XPENDING polling, arbitrary idle timeouts, and manual XCLAIM orchestration - Dead-letter queue readiness: FATAL mode + delivery count enables straightforward poison message detection and future DLQ integration - Backward compatibility: Fully optional new command with zero breaking changes to existing behavior	2026-04-07 14:17:53 +03:00
debing.sun	a6a27f56f2	Test tcp deadlock fixes (#14946 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details This fix follows #14667 and #14886 Several tests pipelined large numbers of commands on deferring clients without draining replies. That can fill buffers and stall progress. Fix by draining replies every 500 pipelined requests to avoid TCP stalls. --------- Co-authored-by: oranagra <oran@redislabs.com>	2026-03-30 19:50:47 +08:00
Guimu	2ba0194fbe	Fix memory leak in ZDIFF algorithm 2 on early exit (#14932 ) Some checks failed CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details ## Problem `zdiffAlgorithm2()` can break out early once the destination cardinality reaches zero. In that path, a temporary SDS created by `zuiSdsFromValue()` is left dirty and never released, because the cleanup normally happens on the next `zuiNext()` call which is skipped due to the early `break`. `zuiClearIterator()` called after the loop does not clean up dirty values — only `zuiNext()` or explicit `zuiDiscardDirtyValue()` does. ## Fix Add `zuiDiscardDirtyValue(&zval)` before the early `break` to ensure the temporary SDS is freed on all exit paths.	2026-03-27 22:09:57 +08:00
Moti Cohen	f4d176b3b7	KEYSIZES/ASM: simplify histograms, fix background trim, and refactor debug assertions (#14877 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details - Simplify KEYSIZES tracking and saving memory by removing per‑slot histogram state in kvstoreDictMetadata and routing all updates through kvsUpdateHistogram + updateKeysizesHist(db, type, ...) (per‑DB only). - Fix KEYSIZES consistency during ASM background trim by introducing asmTrimCtx, passing it through emptyDbDataAsync/BIO lazy‑free, computing histogram deltas in the background, and applying them on completion; add bg_trim_running to coordinate with validation. - Refactor and relax debug assertions into a unified dbg_assert_flags bitmask and dbgRunAssertions(db), and skip KEYSIZES/ALLOCSIZE checks during nested execution, RDB load, ASM import, and ASM background trim. - Update commands, module APIs, tests (including UNLINK async deletion coverage), and daily CI workflow to reflect the new histogram behavior and re‑enable ASM/cluster tests. - Revise the daily CI “debug-assert-keyspace” workflow to run ASM and slot-stats unit tests again Potential edge case with histogram accuracy: The histogram could become inaccurate if the database is flushed while ASM trim is running in the background. I considered adding a generation counter to detect this, but decided against it since this is purely an INFO/diagnostic feature and the edge case is quite rare. The target_kvstore pointer check prevents applying stale deltas to the wrong kvstore, and if the histogram does become incorrect we have debugServerAssert() to catch negative values - it won't cause crashes in production. This is a known limitation we can document and revisit if it becomes a real issue in practice.	2026-03-22 15:26:06 +02:00
Cong Chen	9accf8bd24	Refine error message for HSETEX command with PERSIST (#14880 ) Some checks failed CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details Fixes #14879 > PERSIST can only be given for HGETEX and KEEPTTL for HSETEX but the code doesn’t verify it. Behavior currently: ``` 127.0.0.1:5555> HSETEX h ex 100 persist fields 1 k v (error) ERR Only one of EX, PX, EXAT, PXAT or KEEPTTL arguments can be specified ``` With this PR, behavior would be: ``` 127.0.0.1:5555> HSETEX h ex 100 persist fields 1 k v (error) ERR unknown argument: persist ```	2026-03-20 10:19:00 +08:00
Sergei Georgiev	462e603a1f	Fix stream_idmp_keys missing from database lifecycle ops (#14897 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Summary Ensures `db->stream_idmp_keys` is managed consistently with `keys`, `expires`, and `subexpires` across every database lifecycle operation — flush, swap, temp-DB init/discard, lazy-free, and cluster slot migration. Without this fix, the dict was absent from these code paths, causing three classes of bugs: a SIGSEGV during diskless replication in `swapdb` mode (NULL pointer dereference in `initTempDb`), silently lost IDMP tracking after `SWAPDB` and diskless replication (pointers never swapped), and stale IDMP entries surviving `FLUSHDB` (dict never cleared). Changes - `emptyDbStructure` (`src/db.c`) — Clear `stream_idmp_keys` on flush so stale entries don't persist across `FLUSHDB`/`FLUSHALL`. - `initTempDb` / `discardTempDb` (`src/db.c`) — Create and release `stream_idmp_keys` for temp databases used during diskless replication RDB load. Fixes the SIGSEGV when `rdbLoadRioWithLoadingCtx` calls `dictAddRaw` on a NULL pointer. - `dbSwapDatabases` (`src/db.c`) — Swap `stream_idmp_keys` alongside the other per-DB dicts so IDMP tracking follows the data during `SWAPDB`. - `swapMainDbWithTempDb` (`src/db.c`) — Swap `stream_idmp_keys` when promoting a temp database after diskless replication, so the cron can discover and expire IDMP entries on replicas. - `streamMoveIdmpKeys` (`src/db.c`) — New helper that migrates IDMP entries by slot from one dict to another, used during cluster slot migration. - `emptyDbAsync` / `emptyDbDataAsync` (`src/lazyfree.c`) — Replace the dict with a fresh one and hand the old one to the background lazy-free job. - `lazyfreeFreeDatabase` (`src/lazyfree.c`) — Free `stream_idmp_keys` in the background job (now takes 4 args instead of 3). - `asmTriggerBackgroundTrim` (`src/cluster_asm.c`) — Move matching IDMP entries by slot into a temporary dict during background trim, then pass it to `emptyDbDataAsync` for async cleanup. - `server.h` — Updated `emptyDbDataAsync` signature; added `streamMoveIdmpKeys` declaration. - Tests (`tests/unit/type/stream.tcl`) — Four new integration tests covering IDMP expiry after `SAVE` + restart, tracking survival across `SWAPDB`, cleanup on `FLUSHDB`, and diskless replication in `swapdb` mode (both `rdbchannel=yes` and `rdbchannel=no`). What this fixes - No crash on diskless replication — `initTempDb` now initializes the dict, eliminating the NULL dereference during RDB load. - Tracking preserved across swaps — Both `SWAPDB` and diskless replication correctly transfer the dict, so the cron keeps expiring entries in the right database. - Clean state after flush — `FLUSHDB`/`FLUSHALL` clear the dict, preventing ghost entries from interfering with new streams. - Correct cluster migration cleanup — IDMP entries for migrated slots are moved and freed alongside the key data.	2026-03-19 14:37:19 +02:00
Cong Chen	1b615c774d	Fix FIELDS argument validation in HSETEX/HGETEX (#14883 ) Fixes #14879 Improve validation of the FIELDS argument in HSETEX and HGETEX to ensure exactly one field is provided, rejecting both missing and multiple fields with consistent and accurate error messages. Align behavior across both commands.	2026-03-19 19:54:22 +08:00
Sergei Georgiev	ee376cdc3c	Fix IDMP cron expiration not working after RDB load (#14869 ) Some checks failed CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details Summary Registers streams with IDMP producers in `db->stream_idmp_keys` during RDB load, so that the periodic cron job can expire stale IDMP entries after a server restart. Without this fix, streams loaded from RDB were never registered, causing IDMP entries to accumulate indefinitely and never be cleaned up by the cron. Changes - `rdbLoadRioWithLoadingCtx` (src/rdb.c): After loading each key-value pair, added a check for `OBJ_STREAM` type with a non-NULL `idmp_producers` rax. When found, creates a string object for the key and inserts it into `db->stream_idmp_keys` via `dictAddRaw`. If the key already exists (duplicate insert), the reference is decremented to avoid a leak. - Test `XADD IDMP cron expiration works after RDB load` (tests/unit/type/stream.tcl): Added an integration test that creates a stream with IDMP producers, sets a short `IDMP-DURATION`, saves and restarts the server, then verifies that the cron eventually expires the entries (pids-tracked and iids-tracked drop to 0) and that previously-expired IIDs can be re-added as new entries. Benefits - Cron expiration works after restart: Streams with IDMP producers are now discoverable by the periodic cleanup cron after an RDB load, ensuring expired entries are reaped as expected. - No memory leak on stale entries: Without registration, IDMP entries that outlived their duration would persist in memory forever after a restart, growing unboundedly. This fix ensures they are cleaned up. - Consistency between runtime and post-load behavior: IDMP expiration now behaves identically whether the stream was created at runtime or restored from an RDB snapshot.	2026-03-13 14:00:28 +02:00
Vitah Lin	62059a2438	Chore complete Tcl 9 support and fix regressions in test suite (#14845 ) ## Problem PR https://github.com/redis/redis/pull/14787 introduced Tcl 9 support for the test suite, but it still fails on my machine (macOS 26.3, Tcl 9.0.3). Some tests fail and the runner may hang. Example: ```bash $ tclsh <<<'puts $tcl_version' 9.0.3 $ make test [err]: BITCOUNT against test vector #2 Expected [r bitcount str] == 4 ``` This is caused by behavior changes in Tcl 9, including: - `string length` returning character count instead of byte count - binary sockets rejecting characters with code points >255 - differences in `string is wideinteger` Parts of the Redis Tcl test framework rely on byte-oriented behavior, which breaks under Tcl 9. ## Changes ### 1. Fix IPC payload encoding in test runner `tests/unit/memefficiency.tcl` contains a non-ASCII quote character: `fe16003e66/tests/unit/memefficiency.tcl (L826)` Under Tcl 9 this can corrupt the IPC stream when the test runner serializes Tcl code blocks, causing the runner to hang. Instead of fixing only this character, the IPC payload is now explicitly encoded using: ```tcl encoding convertto utf-8 ``` This makes the protocol robust against future non-ASCII characters. ### 2. Avoid Tcl 9 glob backtracking issue Replaced: ```tcl string match "[^\u0000-\u00ff]" $a ``` with: ```tcl regexp {[^\u0000-\u00ff]} $a ``` This avoids a catastrophic backtracking issue in Tcl 9's glob matcher while preserving the same behavior. ### 3. Update DIGEST validation The previous check relied on:`string is wideinteger` which behaves differently in Tcl 9. The assertion now validates the expected DIGEST format directly.	2026-03-06 13:27:56 +08:00
Yuan Wang	dd81afa2c8	Do not add a key into db if it has a past expiration time (#14784 ) For RESTORE and SET command, if the expiration time is already elapsed, we skip adding it to the DB. However, we still increment the `expiredkeys` counter. From stats perspective we behave as if we inserted a new key (possibly an overwrite) and later expired it, but from the per-key KSN observability, we reflect what we've actually done in the db (deletion of old key, and no insertion of new one), so we don't confuse modules. Changes: - RESTORE: Increment the `expiredkeys` counter if the key has an expiration time in the past. - SET: If the key has an expiration time in the past, do not add it to the main DB, increment the `expiredkeys` counter instead. And we also delete the old key if it exists. For reference, EXPIREAT with TTL in the past, which implicitly deletes the key and return success. Now SET command has the same behavior.	2026-02-25 20:44:13 +08:00
Sergei Georgiev	1f7bcecc94	Add XIDMPRECORD command and AOFRW emission to restore stream IDMP state (#14794 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details ## Summary - Introduce a new XIDMPRECORD <key> <pid> <iid> <streamID> command that attaches IDMP (idempotency) metadata — a producer ID and an idempotency ID — to an existing stream message. This is an internal command used to replay IDMP state during AOF loading. - Modify rewriteStreamObject() in AOF rewrite (AOFRW) to emit XIDMPRECORD commands after the XADD and consumer-group commands, ensuring IDMP deduplication state is fully preserved across AOF rewrites and server restarts. - Additionally emit XCFGSET during AOFRW when per-stream IDMP configuration (IDMP-DURATION, IDMP-MAXSIZE) differs from server defaults, so custom settings survive rewrites. ## New command: XIDMPRECORD Syntax: `XIDMPRECORD <key> <pid> <iid> <streamID>` - Validates that the key exists, is a stream, the stream ID refers to an existing (non-deleted) entry, and both pid and iid are non-empty. - If the (pid, iid) pair already maps to the same stream ID, the command is idempotent and returns OK. - If the (pid, iid) pair already maps to a different stream ID, an error is returned. ## AOF rewrite changes (src/aof.c) - New helper rioWriteStreamIdmpEntry() writes a single XIDMPRECORD bulk command to the AOF rio. - After consumer groups are emitted, iterates all producers in s->idmp_producers and emits an XIDMPRECORD for every IDMP entry (in linked-list insertion order). - Emits XCFGSET for per-stream IDMP config when it diverges from server defaults.	2026-02-24 11:51:48 +02:00
Martin Dimitrov	98328ae2ec	Pause dict auto-resize during multi-field deletion (#14783 ) The idea comes directly from ValKey: https://github.com/valkey-io/valkey/pull/3144 Deleting many fields from a hash/zset/set stored as a dict can trigger repeated shrink/rehash work during the loop. --------- Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: debing.sun <debing.sun@redis.com>	2026-02-14 14:45:54 +08:00
debing.sun	18538461d1	Add separate statistics for active expiration of keys and hash fields (#14727 ) Some checks failed CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details ### Summary Adds `expired_keys_active` and `expired_subkeys_active` counters to track keys and hash fields expired by the active expiration cycle, distinguishing them from lazy expirations. These new metrics are exposed in INFO stats output. ### Motivation Currently, Redis tracks the total number of expired keys (expired_keys) and expired hash fields (expired_subkeys), but there's no way to differentiate between expirations triggered by active expire and lazy expire. --------- Co-authored-by: Moti Cohen <moti.cohen@redis.com>	2026-01-22 22:30:25 +08:00
Sergei Georgiev	221409788a	Add idempotency support to XADD via IDMPAUTO and IDMP parameters (#14615 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details # Overview This PR introduces idempotency support to Redis Streams' XADD command, enabling automatic deduplication of duplicate message submissions through optional IDMPAUTO and IDMP parameters with producer identification. This enables reliable at-least-once delivery while preventing duplicate entries in streams. ## Problem Statement Current Redis Streams implementations lack built-in idempotency mechanisms, making reliable at-least-once delivery impossible without accepting duplicates: - Application-level tracking: Developers must maintain separate data structures to track submitted messages - Race conditions: Network failures and retries can result in duplicate stream entries - Complexity overhead: Each producer must implement custom deduplication logic - Memory inefficiency: External deduplication systems duplicate Redis's storage capabilities This lack of native idempotency support creates reliability challenges in distributed systems where at-least-once delivery semantics are required but exactly-once processing is desired. ## Solution Extends XADD with optional idempotency parameters that include producer identification: ``` XADD key [NOMKSTREAM] [KEEPREF \| DELREF \| ACKED] [IDMPAUTO pid \| IDMP pid iid] [MAXLEN \| MINID [= \| ~] threshold [LIMIT count]] <* \| id> field value [field value ...] ``` ### Producer ID (pid) - pid (producer id): A unique identifier for each producer - Must be unique per producer instance - Producers must use the same pid after restart to access their persisted idempotency tracking - Enables per-producer idempotency tracking, isolating duplicate detection between different producers Format: Binary or string, recommended max 36 bytes Generation: - Recommended: UUID v4 for globally unique identification - Alternative: `hostname:process_id` or application-assigned IDs ### Idempotency Modes IDMPAUTO pid (Automatic Idempotency): - Producer specifies its pid, Redis automatically calculates a unique idempotent ID (iid) based on entry content - Hash calculation combines XXH128 hashing of individual field-value pairs using an order-independent Sum + XOR approach with rotation (each pair: `XXH128(field \|\| field_length \|\| value)`) - 16-byte binary iid with extremely low accidental collision probability - XXH128 is a non-cryptographic hash function: fast and well-distributed, but does NOT prevent intentional collision attacks - For protection against adversarial collision crafting, use IDMP mode with cryptographically-signed idempotent IDs - Order-independent: field ordering does not affect the calculated iid - If (pid, iid) pair exists in producer's IDMP map: returns existing entry ID without creating duplicate entry - Generally slower than manual mode due to hash calculation overhead IDMP pid iid (Manual Idempotency): - Caller provides explicit producer id (pid) and idempotent ID (iid) for deduplication - iid must be unique per message (either globally or per pid) - Faster processing than IDMPAUTO (no hash calculation overhead) - Enables shorter iids for reduced memory footprint - If (pid, iid) pair exists in producer's IDMP map: returns existing entry ID without comparing field contents - Caller responsible for iid uniqueness and consistency across retries Both modes can only be specified when entry ID is `` (auto-generated). ### Deduplication Logic When XADD is called with idempotency parameters: 1. Redis checks if the message was recently added to the stream based on the (pid, iid) pair 2. If the (pid, iid) pair matches a recently-seen pair for that producer, the message is assumed to be identical 3. No duplicate message is added to the stream; the existing entry ID is returned 4. With IDMP pid iid: Redis does not compare the specified fields and their values—two messages with the same (pid, iid) are assumed identical 5. With IDMPAUTO pid: Redis calculates the iid from message content and checks for duplicates ## IDMP Map: Per-Producer Time and Capacity-Based Expiration Each producer with idempotency enabled maintains its own isolated IDMP map (iid → entry_id) with dual expiration criteria: Time-based expiration (duration): - Each iid expires automatically after duration seconds from insertion - Provides operational guarantee: Redis will not forget an iid before duration elapses (unless capacity reached) - Configurable per-stream via XCFGSET Capacity-based expiration (maxsize): - Each producer's map enforces maximum capacity of maxsize entries - When capacity reached, oldest iids for that producer are evicted regardless of remaining duration - Prevents unbounded memory growth during extended usage ### Configuration Commands XINFO STREAM: View current configuration and metrics Use `XINFO STREAM key` to retrieve idempotency configuration (idmp-duration, idmp-maxsize) along with tracking metrics. XCFGSET: Configure expiration parameters ``` XCFGSET key [IDMP-DURATION duration] [IDMP-MAXSIZE maxsize] ``` - duration: Seconds to retain each iid (range: 1- 86400 seconds) - maxsize: Maximum iids to track per producer (range: 1-10,000 entries) - Calling XCFGSET clears all existing producer IDMP maps for the stream Default Configuration* (when XCFGSET not called): - Duration: 100 seconds - Maxsize: 100 iids per producer - Runtime configurable via: `stream-idmp-duration` and `stream-idmp-maxsize` ## Response Behavior On first submission (pid, iid) pair not in producer's map: - Entry added to stream with generated entry ID - (pid, iid) pair stored in producer's IDMP map with current timestamp - Returns new entry ID On duplicate submission (pid, iid) pair exists in producer's map: - No entry added to stream - Returns existing entry ID from producer's IDMP map - Identical response to original submission (client cannot distinguish) ## Stream Metadata XINFO STREAM extended with idempotency metrics and configuration: - idmp-duration: The duration value (in seconds) configured for the stream's IDMP map - idmp-maxsize: The maxsize value configured for the stream's IDMP map - pids-tracked: Current number of producers with active IDMP maps - iids-tracked: Current total number of iids across all producers' IDMP maps (reflects active iids that haven't expired or been evicted) - iids-added: Lifetime cumulative count of entries added with idempotency parameters - iids-duplicates: Lifetime cumulative count of duplicate iids detected across all producers ## Persistence and Restart Behavior IDMP maps are fully persisted and restored across Redis restarts: - RDB/AOF: All pid-iid pairs, timestamps, and configuration are included in snapshots and AOF logs - Recovery: On restart, all tracked (pid, iid) pairs remain valid and operational - Producer Requirement: Producers must reuse the same pid after restart to access their persisted IDMP map - Configuration: Stream-level settings (duration, maxsize) persist across restarts - Important: Calling XCFGSET after restart clears restored IDMP maps (same behavior as during runtime) ## Key Benefits - Enables At-most-once Producer Semantics: Makes it possible to safely retry message submissions without creating duplicates - Automatic Retry Safety: Network failures and retries cannot create duplicate entries - Producer Isolation: Each producer maintains independent idempotency tracking - Memory Efficient: Time and capacity-based expiration per producer prevents unbounded growth - Flexible Implementation: Choose automatic (IDMPAUTO) or manual (IDMP) based on performance needs - Backward Compatible: Fully optional parameters with zero impact on existing XADD behavior - Collision Resistant: XXH128 with Sum + XOR combination and field-length separators provides high-quality non-cryptographic hashing for IDMPAUTO with extremely low collision probability and prevents ambiguous concatenation attacks	2026-01-15 21:58:44 +08:00
Stav-Levi	73249497d4	Fix ACL key-pattern bypass in MSETEX command (#14659 ) MSETEX doesn't properly check ACL key permissions for all keys - only the first key is validated. MSETEX arguments look like: MSETEX <numkeys> key1 val1 key2 val2 ... EX seconds Keys are at every 2nd position (step=2). When Redis extracts keys for ACL checking, it calculates where the last key is: last = first + numkeys - 1; => calculation ignores step last = first + (numkeys-1) * step; With 2 keys starting at position 2: Bug: last = 2 + 2 - 1 = 3 → only checks position 2 Fix: last = 2 + (2-1)*2 = 4 → checks positions 2 and 4 Fixes #14657	2026-01-08 08:41:55 +02:00
debing.sun	9ca860be9e	Fix XTRIM/XADD with approx not deletes entries for DELREF/ACKED strategies (#14623 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details This bug was introduced by #14130 and found by guybe7 When using XTRIM/XADD with approx mode (~) and DELREF/ACKED delete strategies, if a node was eligible for removal but couldn't be removed directly (because consumer group references need to be checked), the code would incorrectly break out of the loop instead of continuing to process entries within the node. This fix allows the per-entry deletion logic to execute for eligible nodes when using non-KEEPREF strategies.	2026-01-05 21:17:36 +08:00
Stav-Levi	23aca15c8c	Fix the flexibility of argument positions in the Redis API's (#14416 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details This PR implements flexible keyword-based argument parsing for all 12 hash field expiration commands, allowing users to specify arguments in any logical order rather than being constrained by rigid positional requirements. This enhancement follows Redis's modern design of keyword-based flexible argument ordering and significantly improves user experience. Commands with Flexible Parsing HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETEX, HSETEX some examples: HEXPIRE: * All these are equivalent and valid: HEXPIRE key EX 60 NX FIELDS 2 f1 f2 HEXPIRE key NX EX 60 FIELDS 2 f1 f2 HEXPIRE key FIELDS 2 f1 f2 EX 60 NX HEXPIRE key FIELDS 2 f1 f2 NX EX 60 HEXPIRE key NX FIELDS 2 f1 f2 EX 60 HGETEX: * All these are equivalent and valid: HGETEX key EX 60 FIELDS 2 f1 f2 HGETEX key FIELDS 2 f1 f2 EX 60 HSETEX: * All these are equivalent and valid: HSETEX key FNX EX 60 FIELDS 2 f1 v1 f2 v2 HSETEX key EX 60 FNX FIELDS 2 f1 v1 f2 v2 HSETEX key FIELDS 2 f1 v1 f2 v2 FNX EX 60 HSETEX key FIELDS 2 f1 v1 f2 v2 EX 60 FNX HSETEX key FNX FIELDS 2 f1 v1 f2 v2 EX 60	2025-12-14 09:35:12 +02:00
debing.sun	bb6389e823	Fix min_cgroup_last_id cache not updated when destroying consumer group (#14552 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details ## Problem When destroying a consumer group with `XGROUP DESTROY`, the cached `min_cgroup_last_id` was not being invalidated. This caused incorrect behavior when using `XDELEX` with the `ACKED` option, as the cache still referenced the destroyed group's `last_id`. ## Solution Invalidate the `min_cgroup_last_id` cache when the destroyed group's `last_id` equals the cached minimum. The cache will be recalculated on the next call to `streamEntryIsReferenced()`. --------- Co-authored-by: guybe7 <guy.benoish@redislabs.com>	2025-11-21 22:37:17 +08:00
Oran Agra	0a6eacff1f	Add variable key-spec flags to SET IF* and DELEX (#14529 ) Some checks failed CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Has been cancelled Details These commands behave as DEL and SET (blindly Remove or Overwrite) when they don't get IF* flags, and require the value of the key when they do run with these flags. Making sure they have the VARIABLE_FLAGS flag, and getKeysProc that can provide the right flags depending on the arguments used. (the plain flags when arguments are unknown are the common denominator ones) Move lookupKey call in DELEX to avoid double lookup, which also means (some, namely arity) syntax errors are checked (and reported) before checking the existence of the key.	2025-11-12 11:36:10 +02:00
Sergei Georgiev	90ba7ba4dc	Fix XREADGROUP CLAIM to return delivery metadata as integers (#14524 ) ### Problem The XREADGROUP command with CLAIM parameter incorrectly returns delivery metadata (idle time and delivery count) as strings instead of integers, contradicting the Redis specification. ### Solution Updated the XREADGROUP CLAIM implementation to return delivery metadata fields as integers, aligning with the documented specification and maintaining consistency with Redis response conventions. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2025-11-11 19:05:22 +08:00
Moti Cohen	d25e582a17	Fix flaky test of hfe persist rdb reload (#14525 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details So far occured once on daily in the test-sanitizer-address job	2025-11-10 17:15:37 +02:00
Moti Cohen	189b7609f5	Add hfe rdb load test (#14511 ) Some checks failed CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details Verify that following RDB load fields keep their expiration time. Verify that hashes that had HFEs not counted following rdb load in subexpiry (by command `info keyspace`)	2025-11-09 09:49:54 +02:00
debing.sun	7f1bafc922	Fix XACKDEL stack overflow when IDs exceed STREAMID_STATIC_VECTOR_LEN (CVE-2025-62507) Some checks failed CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Has been cancelled Details This issue was introduced by redis/redis#14130. The problem is that when the number of IDs exceeds STREAMID_STATIC_VECTOR_LEN (8), the code forgot to reallocate memory for the IDs array, which causes a stack overflow.	2025-11-05 15:33:34 +02:00
sggeorgiev	3e2003ee0f	Fix HGETEX out-of-bounds read when FIELDS option missing numfields argument When the HGETEX command is used with the FIELDS option but without the required numfields argument, the server would attempt to access an out-of-bounds argv index. This PR adds a check to ensure numfields is present before accessing it, returning an error if it is missing. Also includes a test case to cover this scenario.	2025-11-05 15:33:34 +02:00
debing.sun	e436a0e548	Enforce 16-char hex digest length and case-insensitive comparison for IFDEQ/IFDNE (#14502 ) Some checks failed CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details Fix https://github.com/redis/redis/issues/14496 This PR makes the following changes: - DIGEST: Always return 16 hex characters with leading zeros Example: "00006c38adf31777" instead of "6c38adf31777" - IFDEQ/IFDNE: Validate the digest must be exactly 16 characters - IFDEQ/IFDNE: Use strcasecmp for case-insensitive hex comparison Both uppercase and lowercase hex digits now work identically --------- Co-authored-by: Marc Gravell <marc.gravell@gmail.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>	2025-11-03 16:59:50 +08:00
debing.sun	379fec1426	Use fixed position keys parameter for MSETEX command (#14470 ) In PR https://github.com/redis/redis/pull/14434, we made the keys parameter flexible, meaning it could appear anywhere among the command arguments. However, this also made key parsing more complex, since we could no longer determine the fixed position of key arguments. Therefore, in this PR, we reverted it back to using fixed positions for the keys. And also fix this [comment](https://github.com/redis/redis/pull/14434#discussion_r2459282563). --------- Co-authored-by: Yuan Wang <yuan.wang@redis.com>	2025-10-27 17:20:29 +08:00
Stav-Levi	52ea47b792	Add MSETEX command (#14434 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details Introduce a new command MSETEX to set multiple string keys with a shared expiration in a single atomic operation. Also with flexible argument parsing. Syntax: MSETEX KEYS numkeys key value [key value …] [XX \| NX] [EX seconds \| PX milliseconds \| EXAT unix-time-seconds \| PXAT unix-time-milliseconds \| KEEPTTL] Sets the given keys to their respective values. This command is an extension of the MSETNX that adds expiration and XX options. Options: EX seconds - Set the specified expiration time, in seconds PX milliseconds - Set the specified expiration time, in milliseconds EXAT timestamp-seconds - Set the specified Unix time at which the keys will expire, in seconds PXAT timestamp-milliseconds - Set the specified Unix time at which the keys will expire, in milliseconds KEEPTTL - Retain the time to live associated with the keys XX - Only set the keys and their expiration if all already exist NX - Only set the keys and their expiration if none exist Flexible Argument Parsing examples: - MSETEX EX 10 KEYS 2 k1 v1 k2 v2 - MSETEX KEYS 2 k1 v1 k2 v2 NX PX 5000 - MSETEX NX EX 10 KEYS 2 k1 v1 k2 v2 Return Values: Integer reply: 1 - All keys were set successfully Integer reply: 0 - No keys were set (due to NX/XX conditions) Error reply - Syntax error or invalid arguments	2025-10-23 19:12:02 +03:00
sggeorgiev	090ca801ea	Add CLAIM parameter to XREADGROUP for automatic pending entry claiming (#14402 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Reply-schemas linter / reply-schemas-linter (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details ## Overview This PR enhances Redis Streams consumer groups by adding an optional CLAIM parameter to the `XREADGROUP` command, enabling automatic claiming of idle pending entries alongside normal message consumption in a single operation. ## Problem Statement Current Redis Streams consumer group implementations require developers to manually orchestrate multiple commands to handle both pending and new entries: - `XPENDING` to discover idle pending entries - `XCLAIM/XAUTOCLAIM` to claim idle entries - `XREADGROUP` to consume new entries This multi-command approach creates: - Performance overhead from multiple round trips to Redis - Implementation complexity, particularly when working with multiple streams - Code duplication across consumer implementations ## Solution Extends XREADGROUP with a new optional CLAIM parameter: `XREADGROUP GROUP group consumer [COUNT count] [BLOCK milliseconds] [NOACK] [CLAIM min-idle-time] STREAMS key [key ...] id [id ...]` When CLAIM min-idle-time is specified, the command operates in two phases: 1. Claim Phase: Automatically claims pending entries idle for ≥ min-idle-time milliseconds 2. Read Phase: Processes new entries if the COUNT limit hasn't been reached ## Response Format Changes When the CLAIM option is used, the response format is extended to include delivery metadata for each entry: Standard XREADGROUP response (without CLAIM): ``` 127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 STREAMS mystream > 1) 1) "mystream" 2) 1) 1) "1609459200000-0" 2) 1) "field1" 2) "value1" ``` XREADGROUP response with CLAIM: ``` 127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 CLAIM 30000 STREAMS mystream > 1) 1) "mystream" 2) 1) 1) "1609459200000-0" 2) 1) "field1" 2) "value1" 3) 15000 4) 3 ``` Response structure with CLAIM: - Field 1: Stream entry ID (unchanged) - Field 2: Field-value pairs (unchanged) - Field 3: Idle time in milliseconds - the number of milliseconds elapsed since this entry was last delivered to a consumer - Field 4: Delivery count - the number of times this entry has been delivered: - `0` for new messages that haven't been delivered before - `1+` for claimed messages (previously unacknowledged entries) Purpose of the new fields: These fields enable intelligent client-side processing decisions: - Idle time enables time-based escalation strategies, detection of stuck messages, and priority processing for critically delayed work - Delivery count enables retry limits, dead-letter queue logic, poison message detection, and alternative processing strategies based on failure history Together, these fields provide the visibility needed to build robust, self-healing consumer systems without requiring additional XPENDING queries. Note: If the ID parameter is not `>`, the command returns entries that are pending for the consumer, and the CLAIM option is ignored. In this case, the response follows the standard format without the additional delivery metadata fields. ## Key Benefits - Reduced Complexity: Eliminates manual PEL management and multi-command orchestration - Improved Performance: Reduces round trips by 50-70% for workloads processing both pending and new entries - Backward Compatibility: Fully optional parameter with zero breaking changes to existing behavior - Multi-Stream Support: Works seamlessly across multiple streams in a single command - Flexible Consumer Patterns: Enables mixed consumer types within the same group: - Consumers without CLAIM that only handle new messages - Consumers with CLAIM that process both pending and new entries ## Impact on Existing Commands The XCLAIM and XAUTOCLAIM commands may potentially benefit from the new pel_by_time index for improved performance, such optimizations require further investigation and testing. Enhancements to XCLAIM and XAUTOCLAIM are postponed for future work. ## Performance Benchmarks ### Latency Performance Comprehensive performance testing demonstrates significant improvements over the traditional XAUTOCLAIM approach: Test Methodology Two identical test scenarios were executed to compare XAUTOCLAIM against XREADGROUP with CLAIM: Test Setup: 1. Insert 20,000 messages into a stream 2. Read all messages with XREADGROUP to populate the pending entries list (PEL) 3. Set IDLE time to 1100ms on 1,000 randomly selected pending messages using XCLAIM 4. Set IDLE time to 50ms on all remaining 19,000 pending messages using XCLAIM 5. Execute the target command with min-idle-time=1000ms and COUNT=1000 to claim the eligible messages 6. Repeat steps 3-5 for 1,000 iterations Test 1 - XAUTOCLAIM (Traditional Approach): ``` XAUTOCLAIM Performance: Average: 54.671ms Median: 53.582ms Min: 3.738ms Max: 71.596ms P95: 62.536ms P99: 68.800ms ``` Test 2 - XREADGROUP with CLAIM (New Approach): ``` XREADGROUP CLAIM Performance: Average: 2.426ms Median: 2.571ms Min: 1.287ms Max: 4.653ms P95: 3.370ms P99: 4.212ms ``` Performance Analysis The new XREADGROUP CLAIM implementation delivers 22.5x faster average performance compared to XAUTOCLAIM: - Average latency reduction: 95.6% (54.671ms → 2.426ms) - Median latency reduction: 95.2% (53.582ms → 2.571ms) - P95 latency reduction: 94.6% (62.536ms → 3.370ms) - P99 latency reduction: 93.9% (68.800ms → 4.212ms) This performance improvement is achieved through the time-ordered PEL index (pel_by_time), which enables O(log n + k) retrieval of idle entries versus XAUTOCLAIM's less efficient scanning approach. ### Memory Performance To evaluate the memory overhead of the pel_by_time index, comprehensive memory testing was conducted comparing Redis with and without the index under realistic workload conditions. Test Methodology: - Insert 200,000 new messages into a stream - Read messages in blocks of 100 using XREADGROUP (populating the PEL with 200,000 pending entries) - Wait 5ms after each read block (simulating realistic processing delays that affect rax tree compression) - Measure memory usage before and after the reading phase Test Results - Without pel_by_time Index: ``` Initial memory (used): 926.10 KB After insertion (used): 6.80 MB After reading (used): 41.53 MB Memory increase from data: 5.90 MB Memory increase from reading: 34.72 MB Total memory increase: 40.62 MB ``` Test Results - With pel_by_time Index: ``` Initial memory (used): 927.44 KB After insertion (used): 6.81 MB After reading (used): 45.07 MB Memory increase from data: 5.90 MB Memory increase from reading: 38.27 MB Total memory increase: 44.17 MB ``` Memory Performance Analysis: The pel_by_time index introduces a measurable but reasonable memory overhead: Used Memory Impact: - Memory increase from pel_by_time index: 3.55 MB (38.27 MB - 34.72 MB) - Per-entry overhead: 18.6 bytes (3.55 MB / 200,000 entries) - Percentage overhead: 8.7% increase in total memory usage Per-Entry Memory Breakdown: The theoretical minimum for the pel_by_time index is 32 bytes per entry (composite key only, no node values). The observed 18.6 bytes per entry overhead is lower than the theoretical maximum, suggesting effective rax tree compression is occurring despite the 5ms delays between reads. ## Technical Implementation ### New Data Structure: Time-Ordered PEL Index (`pel_by_time`) To efficiently identify and claim idle pending entries, this PR introduces a new rax tree structure to the consumer group implementation: Structure Design: - Tree Type: Rax tree named pel_by_time added to each consumer group - Key Composition: 32-byte composite key consisting of: - `delivery_time` (timestamp when entry was last delivered) - `streamId` (stream entry ID) Key Format: `delivery_time` + `streamId` (concatenated) Node Value: None - all necessary information is encoded in the key itself for memory efficiency Key Properties: _Uniqueness Guarantee:_ While multiple pending entries may share the same `delivery_time`, the `streamId` component ensures each key is globally unique within the tree. _Lexicographical Ordering:_ The rax tree naturally orders nodes lexicographically by key. Since `delivery_time` forms the prefix of each key, entries are automatically sorted by delivery time, with oldest entries appearing first in the tree. _Efficient Range Operations:_ This time-based ordering enables highly efficient range searches. To find all entries idle for at least `min-idle-time` milliseconds, we simply perform a range query from the tree's beginning up to `current_time - min-idle-time`. Fast Retrieval: Once idle entries are identified via the `pel_by_time` index, the embedded `streamId` in each key is used to quickly retrieve the full pending message data structure for the subsequent `XREADGROUP` claim operation. Performance Characteristics: - Insertion: O(log n) when adding entries to PEL - Range Search: O(log n + k) where k is the number of idle entries found - Memory Overhead: 32 bytes per pending entry for the index key (no additional node values stored) This dual-index approach (existing PEL structures plus the new time-ordered index) allows XREADGROUP with CLAIM to efficiently identify claimable entries without scanning the entire PEL, making the operation suitable for consumer groups with large pending entry lists. ### COUNT Behavior with CLAIM When the `COUNT` option is used in conjunction with `CLAIM`, the command follows a two-phase execution strategy to maximize the specified count limit: Phase 1: Claim Idle Pending Entries - Retrieve claimable pending entries (idle for ≥ min-idle-time) up to the COUNT limit - These entries are claimed and returned to the consumer Phase 2: Fetch New Messages (if needed) - If the `COUNT` limit has not been satisfied by claimed pending entries, the command proceeds to read new messages from the stream - New messages are fetched up to the remaining available count: `remaining_count = COUNT - claimed_entries` This prioritization ensures that idle pending entries are always processed first, preventing indefinite message stalling while still allowing consumers to process new messages efficiently when pending entries are scarce. ### BLOCK Behavior with CLAIM When the CLAIM option is used in conjunction with the BLOCK option, the command exhibits sophisticated blocking behavior that responds to both new messages and pending entries becoming claimable: Blocking State Management: If there are no immediately claimable pending entries and no new messages available in the stream, the `XREADGROUP` command enters a blocking state for the specified duration. However, the implementation must handle a critical scenario: pending entries that become idle (and thus claimable) while the command is blocked must trigger an early wakeup to serve those entries. Implementation: `stream_claim_pending_keys` Dictionary To enable this reactive blocking behavior, a new `stream_claim_pending_keys` dictionary is introduced to the `redisDb` structure: - Key: Stream key being watched - Value: The minimum timestamp when the next pending entry in this stream will become claimable (i.e., will satisfy the min-idle-time requirement) Multi-Client Coordination: When multiple XREADGROUP commands with BLOCK and CLAIM are executed concurrently on the same stream, the dictionary value stores the shortest claimable time across all waiting clients. This ensures the earliest possible wakeup when any pending entry becomes available for claiming. Wakeup Mechanism: `handleClaimableStreamEntries` The `handleClaimableStreamEntries` function is invoked regularly from `blockedBeforeSleep` to monitor and react to claimable entries: 1. Scan Phase: Iterates through all entries in the `stream_claim_pending_keys` dictionary 2. Time Check: Compares each entry's claimable timestamp against the current time 3. Signal Phase: When `claimable_time ≤ current_time`, calls `signalKeyAsReady` to wake up all clients blocked on that stream 4. Client Processing: Awakened clients attempt to claim and process the newly available pending entries Resource Contention Handling: When the number of claimable entries is insufficient to satisfy all awakened clients: - Clients that successfully claim entries complete their operations - Remaining clients recalculate the next minimum claimable time based on remaining pending entries - These clients update the `stream_claim_pending_keys` dictionary with the new timestamp - They re-enter the blocking state to wait for the next batch of claimable entries This design ensures fair resource distribution and prevents busy-waiting while maintaining responsiveness to both new messages and aging pending entries.	2025-10-21 20:35:43 +08:00
Mincho Paskalev	aed879ad0a	Optimistic locking for string objects - compare-and-set and compare-and-delete (#14435 ) # Description Add optimistic locking for string objects via compare-and-set and compare-and-delete mechanism. ## What's changed Introduction of new DIGEST command for string objects calculated via XXH3 hash. Extend SET command with new parameters supporting optimistic locking. The new value is set only if checks against a given (old) value or a given string digest pass. Introduction of new DELEX command to support conditionally deleting a key. Conditions are also checks against string value or string digest. ## Motivation For developers who need to to implement a compare-and-set and compare-and-delete single-key optimistic concurrency control this PR provides single-command based implementation. Compare-and-set and compare-and-delete are mostly used for [Optimistic concurrency control](https://en.wikipedia.org/wiki/Optimistic_concurrency_control): a client (1) fetches the value, keeps the old value (or its digest, for a large string) in memory, (2) manipulates a local copy of the value, (3) applies the local changes to the server, but only if the server’s value hasn’t been changed (still equal to the old value). Note that compare-and-set [can also be implemented](https://redis.io/docs/latest/develop/using-commands/transactions/#optimistic-locking-using-check-and-set) with WATCH … MULTI … EXEC and Lua scripts. The new SET optional arguments and the DELEX command do not enable new functionality, however, they are much simpler and faster to use for the very common use case of single-key optimistic concurrency control. ## Related issues and PRs https://github.com/redis/redis/issues/12485 https://github.com/redis/redis/pull/8361 https://github.com/redis/redis/pull/4258 ## Description of the new commands ### DIGEST ``` DIGEST key ``` Get the hash digest of the value stored in key, as an hex string. Reply: - Null if key does not exist - error if key exists but holds a value which is not a string - (bulk string) the XXH3 digest of the value stored in key, as an hex string ### SET ``` SET key value [NX \| XX \| IFEQ match-value \| IFNE match-value \| IFDEQ match-digest \| IFDNE match-digest] [GET] [EX seconds \| PX milliseconds \| EXAT unix-time-seconds \| PXAT unix-time-milliseconds \| KEEPTTL] ``` `IFEQ match-value` - Set the key’s value and expiration only if its current value is equal to match-value. If key doesn’t exist - it won’t be created. `IFNE match-value` - Set the key’s value and expiration only if its current value is not equal to match-value. If key doesn’t exist - it will be created. `IFDEQ match-digest` - Set the key’s value and expiration only if the digest of its current value is equal to match-digest. If key doesn’t exist - it won’t be created. `IFDNE match-digest` - Set the key’s value and expiration only if the digest of its current value is not equal to match-digest. If key doesn’t exist - it will be created. Reply update: - If GET was not specified: - Nil reply if either - the key doesn’t exist and XX/IFEQ/IFDEQ was specified. The key was not created. - the key exists, and NX was specified or a specified IFEQ/IFNE/IFDEQ/IFDNE condition is false. The key was not set. - Simple string reply: OK: The key was set. - If GET was specified, any of the following: - Nil reply: The key didn't exist before this command (whether the key was created or not). - Bulk string reply: The previous value of the key (whether the key was set or not). ### DELEX ``` DELEX key [IFEQ match-value \| IFNE match-value \| IFDEQ match-digest \| IFDNE match-digest] ``` Conditionally removes the specified key. A key is ignored if it does not exist. `IFEQ match-value` - Delete the key only if its value is equal to match-value `IFNE match-value` - Delete the key only if its value is not equal to match-value `IFDEQ match-digest` - Delete the key only if the digest of its value is equal to match-digest `IFDNE match-digest` - Delete the key only if the digest of its value is not equal to match-digest Reply: - error if key exists but holds a value that is not a string and IFEQ/IFNE/IFDEQ/IFDNE is specified. - (integer) 0 if not deleted (the key does not exist or a specified IFEQ/IFNE/IFDEQ/IFDNE condition is false), or 1 if deleted. ### Notes Added copy of xxhash repo to deps - [version](`c961fbe61a`) --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Yuan Wang <wangyuancode@163.com>	2025-10-21 10:32:49 +03:00
Moti Cohen	5b49119236	Fix crash in lookupKey() when `executing_client` is NULL (#14415 ) Some checks are pending CI / test-ubuntu-latest (push) Waiting to run Details CI / test-sanitizer-address (push) Waiting to run Details CI / build-debian-old (push) Waiting to run Details CI / build-macos-latest (push) Waiting to run Details CI / build-32bit (push) Waiting to run Details CI / build-libc-malloc (push) Waiting to run Details CI / build-centos-jemalloc (push) Waiting to run Details CI / build-old-chain-jemalloc (push) Waiting to run Details Codecov / code-coverage (push) Waiting to run Details External Server Tests / test-external-standalone (push) Waiting to run Details External Server Tests / test-external-cluster (push) Waiting to run Details External Server Tests / test-external-nodebug (push) Waiting to run Details Spellcheck / Spellcheck (push) Waiting to run Details This PR is based on: https://github.com/valkey-io/valkey/pull/2347 This was introduced in https://github.com/redis/redis/pull/13512 The server crashes with a null pointer dereference when lookupKey() is called from handleClientsBlockedOnKey(). The crash occurs because server.executing_client is NULL, but the code attempts to access server.executing_client->cmd->proc without checking. Crash scenario: Client 1 enables CLIENT NO-TOUCH Client 2 blocks on BRPOP mylist 0 Client 1 executes RPUSH mylist elem When unblocking Client 2, lookupKey() dereferences NULL server.executing_client → crash Solution Added proper null checks before dereferencing server.executing_client: Check if LOOKUP_NOTOUCH flag is already set before attempting to modify it Verify both server.current_client and server.executing_client are not NULL before accessing their members Maintain the TOUCH command exception for scripts Testing Added regression test in tests/unit/type/list.tcl that reproduces and verifies the fix for this crash scenario. This fix is based on valkey-io/valkey#2347 Co-authored-by: Uri Yagelnik <uriy@amazon.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com>	2025-10-13 12:12:38 +03:00
张宇杭	083f38ef5a	Fix issues with server.allow_access_expired (#14262 ) Some checks failed CI / test-ubuntu-latest (push) Has been cancelled Details CI / test-sanitizer-address (push) Has been cancelled Details CI / build-debian-old (push) Has been cancelled Details CI / build-macos-latest (push) Has been cancelled Details CI / build-32bit (push) Has been cancelled Details CI / build-libc-malloc (push) Has been cancelled Details CI / build-centos-jemalloc (push) Has been cancelled Details CI / build-old-chain-jemalloc (push) Has been cancelled Details Codecov / code-coverage (push) Has been cancelled Details External Server Tests / test-external-standalone (push) Has been cancelled Details External Server Tests / test-external-cluster (push) Has been cancelled Details External Server Tests / test-external-nodebug (push) Has been cancelled Details Spellcheck / Spellcheck (push) Has been cancelled Details Close https://github.com/redis/redis/issues/14214 1. When the server.allow_access_expired flag is set to 1, it allows access to expired keys that have not yet been evicted. All places involving access to expired keys should consider the impact of this parameter. 2. The modifications involve five methods: hfieldIsExpired, hashTypeNext, hashTypeLength, keyIsExpired, and hashTypeIsExpired. When the server.allow_access_expired flag is set to 1, these methods will not skip expired keys, otherwise they follow the normal logic execution. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2025-10-12 11:02:03 +08:00
Moti Cohen	9b63e99d05	Refactor HFE: Introduce Per-Slot Expiration Store (estore) (#14294 ) Hash field expiration is managed with two levels of data structures. 1. At the DB level, an ebuckets structure maintains the set of all hashes that contain fields with expiration. 2. At the per-hash level, an ebuckets structure tracks fields with expiration. This pull request refactors the 1st level to operate per slot instead, and introduces a new API called estore (expiration store). Its design aligns closely with the existing kvstore API, ensuring consistency and simplifying usage. The terminology at that level has been updated from “HFE” or “hexpire” to “subexpiry”, reflecting a broader scope that can later support other data types.	2025-09-11 16:45:17 +03:00
debing.sun	60adba48aa	Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326 ) This PR is based on https://github.com/valkey-io/valkey/pull/1303 This PR introduces a DEBUG_DEFRAG compilation option that enables activedefrag functionality even when the allocator is not jemalloc, and always forces defragmentation regardless of the amount or ratio of fragmentation. ## Using ``` make SANITIZER=address DEBUG_DEFRAG=<force\|fully> ./runtest --debug-defrag ``` * DEBUG_DEFRAG=force * Ignore the threshold for defragmentation to ensure that defragmentation is always triggered. * Always reallocate pointers to probe for correctness issues in pointer reallocation. * DEBUG_DEFRAG=fully * Includes everything in the option `force`. * Additionally performs a full defrag on every defrag cycle, which is significantly slower but more accurate. --------- Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: oranagra <oran@redislabs.com>	2025-09-10 12:52:20 +08:00
Giuseppe Coco	5f8e7852f4	Fix: Validate ENTRIESREAD in XGROUP command (#14259 ) Fixes #14257 The XGROUP CREATE and SETID subcommands allowed setting an ENTRIESREAD value greater than the stream's total `entries_added` counter. This could lead to a logically inconsistent state. This commit adds a check to ensure the provided ENTRIESREAD value is not greater than the number of entries ever added to the stream. If ENTRIESREAD is too large, it gets set to the total number of entries in the stream, i.e. `s->entries_added`.	2025-09-01 08:36:38 +08:00
Moti Cohen	e6c261f3fb	Fix MEMORY USAGE command (#14288 ) After the key-value unification (kvobj), the MEMORY USAGE command may no longer account for the embedded key length stored within the kvobj. To fix this, replace sizeof(o) with zmalloc_size((void )o) to ensure the full allocated size is measured. In this context, the function objectComputeSize() was renamed and modified to kvobjComputeSize(). From computing only the value size to compute the key and its value.	2025-08-20 13:54:45 +03:00
debing.sun	b9d9d4000b	Prevent crash when cgroups_ref is null in streamEntryIsReferenced() after reload (#14276 ) This bug was introduced by https://github.com/redis/redis/pull/14130 found by @oranagra ### Summary Because `s->cgroup_ref` is created at runtime the first time a consumer group is linked with a message, but it is not released when all references are removed. However, after `debug reload` or restart, if the PEL is empty (meaning no consumer group is referencing any message), `s->cgroup_ref` will not be recreated. As a result, when executing XADD or XTRIM with `ACKED` option and checking whether a message that is being read but has not been ACKed can be deleted, the cgroup_ref being NULL will cause a crash. ### Code Path ``` xaddCommand -> streamTrim -> streamEntryIsReferenced ``` ### Solution Check if `s->cgroup_ref` is NULL in streamEntryIsReferenced().	2025-08-15 15:15:16 +08:00
debing.sun	bec644aab1	Fix missing kvobj reassignment after reallocation in MOVE command (#14233 ) Introduced by https://github.com/redis/redis/issues/13806 Fixed a crash in the MOVE command when moving hash objects that have both key expiration and field expiration. The issue occurred in the following scenario: 1. A hash has both key expiration and field expiration. 2. During MOVE command, `setExpireByLink()` is called to set the expiration time for the target hash, which may reallocate the kvobj of hash. 3. Since the hash has field expiration, `hashTypeAddToExpires()` is called to update the minimum field expiration time Issue: However, the kvobj pointer wasn't updated with the return value from `setExpireByLink()`, causing `hashTypeAddToExpires()` to use freed memory.	2025-07-30 22:24:56 +08:00

1 2 3 4 5 ...

465 commits