redis

mirror of https://github.com/redis/redis.git synced 2026-05-27 11:43:04 -04:00

Author	SHA1	Message	Date
debing.sun	18538461d1	Add separate statistics for active expiration of keys and hash fields (#14727)

### Summary

Adds `expired_keys_active` and `expired_subkeys_active` counters to track keys and hash fields expired by the active expiration cycle, distinguishing them from lazy expirations. These new metrics are exposed in INFO stats output.

### Motivation

Currently, Redis tracks the total number of expired keys (expired_keys) and expired hash fields (expired_subkeys), but there's no way to differentiate between expirations triggered by active expire and lazy expire.

---------

Co-authored-by: Moti Cohen <moti.cohen@redis.com>	2026-01-22 22:30:25 +08:00
Sergei Georgiev	221409788a	Add idempotency support to XADD via IDMPAUTO and IDMP parameters (#14615)

# Overview

This PR introduces idempotency support to Redis Streams' XADD command, enabling automatic deduplication of duplicate message submissions through optional IDMPAUTO and IDMP parameters with producer identification. This enables reliable at-least-once delivery while preventing duplicate entries in streams.

## Problem Statement

Current Redis Streams implementations lack built-in idempotency mechanisms, so reliable at-least-once delivery is impossible without accepting duplicates:

- Application-level tracking: Developers must maintain separate data structures to track submitted messages
- Race conditions: Network failures and retries can result in duplicate stream entries
- Complexity overhead: Each producer must implement custom deduplication logic
- Memory inefficiency: External deduplication systems duplicate Redis's storage capabilities

This lack of native idempotency support creates reliability challenges in distributed systems where at-least-once delivery semantics are required but exactly-once processing is desired.

## Solution

Extends XADD with optional idempotency parameters that include producer identification:

```
XADD key [NOMKSTREAM] [KEEPREF | DELREF | ACKED] [IDMPAUTO pid | IDMP pid iid] [MAXLEN | MINID [= | ~] threshold [LIMIT count]] <* | id> field value [field value ...]
```

### Producer ID (pid)

- pid (producer id): A unique identifier for each producer
- Must be unique per producer instance
- Producers must use the same pid after restart to access their persisted idempotency tracking
- Enables per-producer idempotency tracking, isolating duplicate detection between different producers

Format: Binary or string, recommended max 36 bytes

Generation:

- Recommended: UUID v4 for globally unique identification
- Alternative: `hostname:process_id` or application-assigned IDs

### Idempotency Modes

IDMPAUTO pid (Automatic Idempotency):

- Producer specifies its pid, Redis automatically calculates a unique idempotent ID (iid) based on entry content
- Hash calculation combines XXH128 hashing of individual field-value pairs using an order-independent Sum + XOR approach with rotation (each pair: `XXH128(field || field_length || value)`)
- 16-byte binary iid with extremely low accidental collision probability
- XXH128 is a non-cryptographic hash function: fast and well-distributed, but does NOT prevent intentional collision attacks
- For protection against adversarial collision crafting, use IDMP mode with cryptographically-signed idempotent IDs
- Order-independent: field ordering does not affect the calculated iid
- If (pid, iid) pair exists in producer's IDMP map: returns existing entry ID without creating duplicate entry
- Generally slower than manual mode due to hash calculation overhead

IDMP pid iid (Manual Idempotency):

- Caller provides explicit producer id (pid) and idempotent ID (iid) for deduplication
- iid must be unique per message (either globally or per pid)
- Faster processing than IDMPAUTO (no hash calculation overhead)
- Enables shorter iids for reduced memory footprint
- If (pid, iid) pair exists in producer's IDMP map: returns existing entry ID without comparing field contents
- Caller responsible for iid uniqueness and consistency across retries

Both modes can only be specified when entry ID is `*` (auto-generated).

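The order-independence claimed for IDMPAUTO can be illustrated with a small sketch. This is not the Redis implementation: XXH128 is not in the Python standard library, so BLAKE2b truncated to 16 bytes stands in for it, and the rotation amounts and the final mixing step are assumptions chosen only to show that sum and XOR are insensitive to field order.

```python
import hashlib

MASK128 = (1 << 128) - 1


def pair_hash(field: bytes, value: bytes) -> int:
    """Hash one field-value pair; the field length separates field from value."""
    # Stand-in for XXH128(field || field_length || value).
    data = field + len(field).to_bytes(8, "little") + value
    digest = hashlib.blake2b(data, digest_size=16).digest()
    return int.from_bytes(digest, "little")


def rotl128(x: int, r: int) -> int:
    """Rotate a 128-bit integer left by r bits (0 < r < 128)."""
    return ((x << r) | (x >> (128 - r))) & MASK128


def auto_iid(pairs: list[tuple[bytes, bytes]]) -> bytes:
    """Combine per-pair hashes so that the order of pairs does not matter."""
    total = 0
    mixed = 0
    for field, value in pairs:
        h = pair_hash(field, value)
        total = (total + h) & MASK128   # addition is commutative
        mixed ^= rotl128(h, 17)         # XOR is commutative (rotation is illustrative)
    return (total ^ rotl128(mixed, 64)).to_bytes(16, "little")
```

Including the field length in the hashed bytes is what keeps `("ab", "c")` and `("a", "bc")` from hashing to the same input.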
### Deduplication Logic

When XADD is called with idempotency parameters:

1. Redis checks if the message was recently added to the stream based on the (pid, iid) pair
2. If the (pid, iid) pair matches a recently-seen pair for that producer, the message is assumed to be identical
3. No duplicate message is added to the stream; the existing entry ID is returned
4. With IDMP pid iid: Redis does not compare the specified fields and their values; two messages with the same (pid, iid) are assumed identical
5. With IDMPAUTO pid: Redis calculates the iid from message content and checks for duplicates

## IDMP Map: Per-Producer Time and Capacity-Based Expiration

Each producer with idempotency enabled maintains its own isolated IDMP map (iid → entry_id) with dual expiration criteria:

Time-based expiration (duration):

- Each iid expires automatically after duration seconds from insertion
- Provides operational guarantee: Redis will not forget an iid before duration elapses (unless capacity reached)
- Configurable per-stream via XCFGSET

Capacity-based expiration (maxsize):

- Each producer's map enforces maximum capacity of maxsize entries
- When capacity reached, oldest iids for that producer are evicted regardless of remaining duration
- Prevents unbounded memory growth during extended usage

### Configuration Commands

XINFO STREAM: View current configuration and metrics

Use `XINFO STREAM key` to retrieve idempotency configuration (idmp-duration, idmp-maxsize) along with tracking metrics.

XCFGSET: Configure expiration parameters

```
XCFGSET key [IDMP-DURATION duration] [IDMP-MAXSIZE maxsize]
```

- duration: Seconds to retain each iid (range: 1-86400 seconds)
- maxsize: Maximum iids to track per producer (range: 1-10,000 entries)
- Calling XCFGSET clears all existing producer IDMP maps for the stream

Default Configuration (when XCFGSET not called):

- Duration: 100 seconds
- Maxsize: 100 iids per producer
- Runtime configurable via: `stream-idmp-duration` and `stream-idmp-maxsize`

## Response Behavior

On first submission ((pid, iid) pair not in producer's map):

- Entry added to stream with generated entry ID
- (pid, iid) pair stored in producer's IDMP map with current timestamp
- Returns new entry ID

On duplicate submission ((pid, iid) pair exists in producer's map):

- No entry added to stream
- Returns existing entry ID from producer's IDMP map
- Identical response to original submission (client cannot distinguish)

## Stream Metadata

XINFO STREAM extended with idempotency metrics and configuration:

- idmp-duration: The duration value (in seconds) configured for the stream's IDMP map
- idmp-maxsize: The maxsize value configured for the stream's IDMP map
- pids-tracked: Current number of producers with active IDMP maps
- iids-tracked: Current total number of iids across all producers' IDMP maps (reflects active iids that haven't expired or been evicted)
- iids-added: Lifetime cumulative count of entries added with idempotency parameters
- iids-duplicates: Lifetime cumulative count of duplicate iids detected across all producers

## Persistence and Restart Behavior

IDMP maps are fully persisted and restored across Redis restarts:

- RDB/AOF: All pid-iid pairs, timestamps, and configuration are included in snapshots and AOF logs
- Recovery: On restart, all tracked (pid, iid) pairs remain valid and operational
- Producer Requirement: Producers must reuse the same pid after restart to access their persisted IDMP map
- Configuration: Stream-level settings (duration, maxsize) persist across restarts
- Important: Calling XCFGSET after restart clears restored IDMP maps (same behavior as during runtime)

## Key Benefits

- Enables At-most-once Producer Semantics: Makes it possible to safely retry message submissions without creating duplicates
- Automatic Retry Safety: Network failures and retries cannot create duplicate entries
- Producer Isolation: Each producer maintains independent idempotency tracking
- Memory Efficient: Time and capacity-based expiration per producer prevents unbounded growth
- Flexible Implementation: Choose automatic (IDMPAUTO) or manual (IDMP) based on performance needs
- Backward Compatible: Fully optional parameters with zero impact on existing XADD behavior
- Collision Resistant: XXH128 with Sum + XOR combination and field-length separators provides high-quality non-cryptographic hashing for IDMPAUTO with extremely low collision probability and prevents ambiguous concatenation attacks	2026-01-15 21:58:44 +08:00
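The per-producer IDMP map with its two expiration criteria, described in the commit above, can be modelled roughly as follows. This is a toy sketch: the class name, method names, and the injectable clock are hypothetical, and only the defaults (100 seconds, 100 iids) come from the commit message. Redis's internal data structure is not shown in the source.

```python
import time
from collections import OrderedDict


class ProducerIdmpMap:
    """Toy model of one producer's map: iid -> (entry_id, inserted_at)."""

    def __init__(self, duration=100, maxsize=100, clock=time.monotonic):
        self.duration = duration      # seconds to remember each iid
        self.maxsize = maxsize        # maximum iids tracked for this producer
        self.clock = clock            # injectable clock, for testing
        self.entries = OrderedDict()  # insertion order doubles as age order

    def _expire(self):
        """Drop iids whose age has reached `duration` (oldest first)."""
        now = self.clock()
        while self.entries:
            iid, (_, inserted) = next(iter(self.entries.items()))
            if now - inserted < self.duration:
                break
            del self.entries[iid]

    def add(self, iid, new_entry_id):
        """Return (entry_id, is_duplicate) for a submission with this iid."""
        self._expire()
        if iid in self.entries:
            return self.entries[iid][0], True   # duplicate: reuse stored entry ID
        while len(self.entries) >= self.maxsize:
            self.entries.popitem(last=False)    # capacity reached: evict oldest
        self.entries[iid] = (new_entry_id, self.clock())
        return new_entry_id, False
```

A duplicate submission gets back the stored entry ID, which matches the "client cannot distinguish" behaviour described under Response Behavior.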
Stav-Levi	73249497d4	Fix ACL key-pattern bypass in MSETEX command (#14659)

MSETEX doesn't properly check ACL key permissions for all keys: only the first key is validated.

MSETEX arguments look like: MSETEX <numkeys> key1 val1 key2 val2 ... EX seconds

Keys are at every 2nd position (step=2). When Redis extracts keys for ACL checking, it calculates where the last key is:

- Bug: last = first + numkeys - 1; (the calculation ignores step)
- Fix: last = first + (numkeys-1) * step;

With 2 keys starting at position 2:

- Bug: last = 2 + 2 - 1 = 3 → only checks position 2
- Fix: last = 2 + (2-1)*2 = 4 → checks positions 2 and 4

Fixes #14657	2026-01-08 08:41:55 +02:00
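The arithmetic in the MSETEX ACL fix can be checked with a short sketch. The function names are hypothetical and this is not the Redis key-extraction code; it only reproduces the two formulas quoted in the commit message.

```python
def key_positions(first: int, numkeys: int, step: int) -> list[int]:
    """Argument positions of all keys, using the corrected formula."""
    last = first + (numkeys - 1) * step   # accounts for the step between keys
    return list(range(first, last + 1, step))


def key_positions_buggy(first: int, numkeys: int, step: int) -> list[int]:
    """Same walk, but with the pre-fix formula that ignores step."""
    last = first + numkeys - 1
    return list(range(first, last + 1, step))
```

With `first=2, numkeys=2, step=2` the corrected version visits positions 2 and 4, while the buggy one stops at position 2, so the second key is never checked.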
debing.sun	9ca860be9e	Fix XTRIM/XADD with approx not deletes entries for DELREF/ACKED strategies (#14623) This bug was introduced by #14130 and found by guybe7. When using XTRIM/XADD with approx mode (~) and DELREF/ACKED delete strategies, if a node was eligible for removal but couldn't be removed directly (because consumer group references need to be checked), the code would incorrectly break out of the loop instead of continuing to process entries within the node. This fix allows the per-entry deletion logic to execute for eligible nodes when using non-KEEPREF strategies.	2026-01-05 21:17:36 +08:00
Stav-Levi	23aca15c8c	Fix the flexibility of argument positions in the Redis API's (#14416)

This PR implements flexible keyword-based argument parsing for all 12 hash field expiration commands, allowing users to specify arguments in any logical order rather than being constrained by rigid positional requirements. This enhancement follows Redis's modern design of keyword-based flexible argument ordering and significantly improves user experience.

Commands with Flexible Parsing: HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETEX, HSETEX

Some examples:

HEXPIRE (all of these are equivalent and valid):

- HEXPIRE key EX 60 NX FIELDS 2 f1 f2
- HEXPIRE key NX EX 60 FIELDS 2 f1 f2
- HEXPIRE key FIELDS 2 f1 f2 EX 60 NX
- HEXPIRE key FIELDS 2 f1 f2 NX EX 60
- HEXPIRE key NX FIELDS 2 f1 f2 EX 60

HGETEX (all of these are equivalent and valid):

- HGETEX key EX 60 FIELDS 2 f1 f2
- HGETEX key FIELDS 2 f1 f2 EX 60

HSETEX (all of these are equivalent and valid):

- HSETEX key FNX EX 60 FIELDS 2 f1 v1 f2 v2
- HSETEX key EX 60 FNX FIELDS 2 f1 v1 f2 v2
- HSETEX key FIELDS 2 f1 v1 f2 v2 FNX EX 60
- HSETEX key FIELDS 2 f1 v1 f2 v2 EX 60 FNX
- HSETEX key FNX FIELDS 2 f1 v1 f2 v2 EX 60	2025-12-14 09:35:12 +02:00
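Keyword-based parsing of the kind shown in the HEXPIRE examples can be sketched as a single loop that dispatches on the keyword rather than the position. The function below is a hypothetical illustration that handles only the `EX`, `NX` and `FIELDS` tokens from those examples. It is not the Redis parser, and it treats a missing `numfields` as a syntax error.

```python
def parse_flexible(args: list[str]) -> dict:
    """Parse `[EX seconds] [NX] FIELDS numfields field ...` in any order."""
    out = {"ex": None, "nx": False, "fields": None}
    i = 0
    while i < len(args):
        tok = args[i].upper()
        if tok == "EX" and i + 1 < len(args):
            out["ex"] = int(args[i + 1])
            i += 2
        elif tok == "NX":
            out["nx"] = True
            i += 1
        elif tok == "FIELDS" and i + 1 < len(args):
            n = int(args[i + 1])
            fields = args[i + 2 : i + 2 + n]
            if n <= 0 or len(fields) != n:
                raise ValueError("numfields does not match the fields supplied")
            out["fields"] = fields
            i += 2 + n                    # skip the keyword, the count, and the fields
        else:
            raise ValueError(f"syntax error near {args[i]!r}")
    if out["fields"] is None:
        raise ValueError("FIELDS is required")
    return out
```

Because the field names are consumed by count, a field that happens to be called `EX` or `NX` is not mistaken for a keyword.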
debing.sun	bb6389e823	Fix min_cgroup_last_id cache not updated when destroying consumer group (#14552)

## Problem

When destroying a consumer group with `XGROUP DESTROY`, the cached `min_cgroup_last_id` was not being invalidated. This caused incorrect behavior when using `XDELEX` with the `ACKED` option, as the cache still referenced the destroyed group's `last_id`.

## Solution

Invalidate the `min_cgroup_last_id` cache when the destroyed group's `last_id` equals the cached minimum. The cache will be recalculated on the next call to `streamEntryIsReferenced()`.

---------

Co-authored-by: guybe7 <guy.benoish@redislabs.com>	2025-11-21 22:37:17 +08:00
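The invalidation rule in this fix can be sketched with a toy model. The class and attribute names are hypothetical, stream IDs are represented as `(ms, seq)` tuples, and the lazy recomputation stands in for what the commit attributes to `streamEntryIsReferenced()`.

```python
class ToyStream:
    """Minimal model of a stream that caches the smallest group last_id."""

    def __init__(self):
        self.groups = {}               # group name -> last_id as (ms, seq)
        self.min_last_id_cache = None  # None means "not computed yet"

    def min_cgroup_last_id(self):
        """Return the cached minimum, recomputing it lazily if needed."""
        if self.min_last_id_cache is None and self.groups:
            self.min_last_id_cache = min(self.groups.values())
        return self.min_last_id_cache

    def destroy_group(self, name):
        """Remove a group; invalidate the cache only if it held the minimum."""
        last_id = self.groups.pop(name)
        if last_id == self.min_last_id_cache:
            self.min_last_id_cache = None
```

Destroying a group whose `last_id` is not the cached minimum leaves the cache valid, so no recomputation is triggered in that case.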
Oran Agra	0a6eacff1f	Add variable key-spec flags to SET IF* and DELEX (#14529) These commands behave like DEL and SET (blindly remove or overwrite) when they don't get IF* flags, and require the value of the key when they do run with these flags. This change makes sure they have the VARIABLE_FLAGS flag and a getKeysProc that can provide the right flags depending on the arguments used (the plain flags, used when arguments are unknown, are the common-denominator ones). It also moves the lookupKey call in DELEX to avoid a double lookup, which means that some syntax errors (namely arity) are checked and reported before checking the existence of the key.	2025-11-12 11:36:10 +02:00
Sergei Georgiev	90ba7ba4dc	Fix XREADGROUP CLAIM to return delivery metadata as integers (#14524)

### Problem

The XREADGROUP command with CLAIM parameter incorrectly returns delivery metadata (idle time and delivery count) as strings instead of integers, contradicting the Redis specification.

### Solution

Updated the XREADGROUP CLAIM implementation to return delivery metadata fields as integers, aligning with the documented specification and maintaining consistency with Redis response conventions.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>	2025-11-11 19:05:22 +08:00
Moti Cohen	d25e582a17	Fix flaky test of hfe persist rdb reload (#14525) So far this has occurred once, in the daily run of the test-sanitizer-address job.	2025-11-10 17:15:37 +02:00
Moti Cohen	189b7609f5	Add hfe rdb load test (#14511) Verify that, following an RDB load, fields keep their expiration time. Verify that hashes that had HFEs are not counted in subexpiry following an RDB load (checked with the `info keyspace` command).	2025-11-09 09:49:54 +02:00
debing.sun	7f1bafc922	Fix XACKDEL stack overflow when IDs exceed STREAMID_STATIC_VECTOR_LEN (CVE-2025-62507) This issue was introduced by redis/redis#14130. The problem is that when the number of IDs exceeds STREAMID_STATIC_VECTOR_LEN (8), the code forgot to reallocate memory for the IDs array, which causes a stack overflow.	2025-11-05 15:33:34 +02:00
sggeorgiev	3e2003ee0f	Fix HGETEX out-of-bounds read when FIELDS option missing numfields argument When the HGETEX command is used with the FIELDS option but without the required numfields argument, the server would attempt to access an out-of-bounds argv index. This PR adds a check to ensure numfields is present before accessing it, returning an error if it is missing. Also includes a test case to cover this scenario.	2025-11-05 15:33:34 +02:00
debing.sun	e436a0e548	Enforce 16-char hex digest length and case-insensitive comparison for IFDEQ/IFDNE (#14502)

Fix https://github.com/redis/redis/issues/14496

This PR makes the following changes:

- DIGEST: Always return 16 hex characters with leading zeros. Example: "00006c38adf31777" instead of "6c38adf31777"
- IFDEQ/IFDNE: Validate that the digest is exactly 16 characters
- IFDEQ/IFDNE: Use strcasecmp for case-insensitive hex comparison, so uppercase and lowercase hex digits now work identically

---------

Co-authored-by: Marc Gravell <marc.gravell@gmail.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>	2025-11-03 16:59:50 +08:00
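The three rules in this commit (zero-padded output, exact length check, case-insensitive comparison) can be sketched as follows. The function names are hypothetical and this is not the Redis code, which the commit says uses `strcasecmp`. Lower-casing both sides is used here as the Python equivalent for hex strings.

```python
def format_digest(value: int) -> str:
    """Render a 64-bit digest as exactly 16 hex characters, zero-padded."""
    return format(value, "016x")


def digest_matches(expected: str, actual: str) -> bool:
    """Compare a caller-supplied digest with the stored one, ignoring case."""
    if len(expected) != 16:
        raise ValueError("digest must be exactly 16 hexadecimal characters")
    return expected.lower() == actual.lower()
```

For the example in the commit message, `format_digest(0x6c38adf31777)` yields `00006c38adf31777` rather than the unpadded `6c38adf31777`.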
debing.sun	379fec1426	Use fixed position keys parameter for MSETEX command (#14470) In PR https://github.com/redis/redis/pull/14434, we made the keys parameter flexible, meaning it could appear anywhere among the command arguments. However, this also made key parsing more complex, since we could no longer determine the fixed position of key arguments. Therefore, this PR reverts to using fixed positions for the keys. It also addresses this [comment](https://github.com/redis/redis/pull/14434#discussion_r2459282563). --------- Co-authored-by: Yuan Wang <yuan.wang@redis.com>	2025-10-27 17:20:29 +08:00
Stav-Levi	52ea47b792	Add MSETEX command (#14434)

Introduce a new command, MSETEX, to set multiple string keys with a shared expiration in a single atomic operation. The command also supports flexible argument parsing.

Syntax: MSETEX KEYS numkeys key value [key value …] [XX | NX] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]

Sets the given keys to their respective values. This command is an extension of MSETNX that adds expiration and XX options.

Options:

- EX seconds - Set the specified expiration time, in seconds
- PX milliseconds - Set the specified expiration time, in milliseconds
- EXAT timestamp-seconds - Set the specified Unix time at which the keys will expire, in seconds
- PXAT timestamp-milliseconds - Set the specified Unix time at which the keys will expire, in milliseconds
- KEEPTTL - Retain the time to live associated with the keys
- XX - Only set the keys and their expiration if all already exist
- NX - Only set the keys and their expiration if none exist

Flexible Argument Parsing examples:

- MSETEX EX 10 KEYS 2 k1 v1 k2 v2
- MSETEX KEYS 2 k1 v1 k2 v2 NX PX 5000
- MSETEX NX EX 10 KEYS 2 k1 v1 k2 v2

Return Values:

- Integer reply: 1 - All keys were set successfully
- Integer reply: 0 - No keys were set (due to NX/XX conditions)
- Error reply - Syntax error or invalid arguments	2025-10-23 19:12:02 +03:00
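The all-or-nothing NX/XX behaviour described for MSETEX can be modelled with a plain dictionary. This is a simplified sketch: the function signature is hypothetical, expiration is only recorded alongside the value rather than enforced, and KEEPTTL is not modelled.

```python
def msetex(store: dict, pairs: dict, ttl=None, nx=False, xx=False) -> int:
    """Set every key in `pairs` with a shared TTL, or set nothing at all."""
    if nx and any(key in store for key in pairs):
        return 0                      # NX: refuse if any key already exists
    if xx and not all(key in store for key in pairs):
        return 0                      # XX: refuse unless every key exists
    for key, value in pairs.items():
        store[key] = (value, ttl)     # record the value with its shared TTL
    return 1
```

The return values mirror the integer replies listed above: 1 when all keys were set, 0 when the NX or XX condition blocked the whole operation.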
sggeorgiev	090ca801ea	Add CLAIM parameter to XREADGROUP for automatic pending entry claiming (#14402)

## Overview

This PR enhances Redis Streams consumer groups by adding an optional CLAIM parameter to the `XREADGROUP` command, enabling automatic claiming of idle pending entries alongside normal message consumption in a single operation.

## Problem Statement

Current Redis Streams consumer group implementations require developers to manually orchestrate multiple commands to handle both pending and new entries:

- `XPENDING` to discover idle pending entries
- `XCLAIM/XAUTOCLAIM` to claim idle entries
- `XREADGROUP` to consume new entries

This multi-command approach creates:

- Performance overhead from multiple round trips to Redis
- Implementation complexity, particularly when working with multiple streams
- Code duplication across consumer implementations

## Solution

Extends XREADGROUP with a new optional CLAIM parameter:

`XREADGROUP GROUP group consumer [COUNT count] [BLOCK milliseconds] [NOACK] [CLAIM min-idle-time] STREAMS key [key ...] id [id ...]`

When CLAIM min-idle-time is specified, the command operates in two phases:

1. Claim Phase: Automatically claims pending entries idle for ≥ min-idle-time milliseconds
2. Read Phase: Processes new entries if the COUNT limit hasn't been reached

## Response Format Changes

When the CLAIM option is used, the response format is extended to include delivery metadata for each entry:

Standard XREADGROUP response (without CLAIM):

```
127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 STREAMS mystream >
1) 1) "mystream"
   2) 1) 1) "1609459200000-0"
         2) 1) "field1"
            2) "value1"
```

XREADGROUP response with CLAIM:

```
127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 CLAIM 30000 STREAMS mystream >
1) 1) "mystream"
   2) 1) 1) "1609459200000-0"
         2) 1) "field1"
            2) "value1"
         3) 15000
         4) 3
```

Response structure with CLAIM:

- Field 1: Stream entry ID (unchanged)
- Field 2: Field-value pairs (unchanged)
- Field 3: Idle time in milliseconds - the number of milliseconds elapsed since this entry was last delivered to a consumer
- Field 4: Delivery count - the number of times this entry has been delivered:
  - `0` for new messages that haven't been delivered before
  - `1+` for claimed messages (previously unacknowledged entries)

Purpose of the new fields:

These fields enable intelligent client-side processing decisions:

- Idle time enables time-based escalation strategies, detection of stuck messages, and priority processing for critically delayed work
- Delivery count enables retry limits, dead-letter queue logic, poison message detection, and alternative processing strategies based on failure history

Together, these fields provide the visibility needed to build robust, self-healing consumer systems without requiring additional XPENDING queries.

Note: If the ID parameter is not `>`, the command returns entries that are pending for the consumer, and the CLAIM option is ignored. In this case, the response follows the standard format without the additional delivery metadata fields.

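A consumer might use the two extra fields along these lines. This is a hedged sketch: the `Delivery` type, the `route` function and the `MAX_DELIVERIES` threshold are hypothetical, and the raw entry is assumed to arrive as a four-element list in the order shown in the response structure above.

```python
from typing import NamedTuple


class Delivery(NamedTuple):
    entry_id: str
    fields: dict
    idle_ms: int
    delivery_count: int


def parse_entry(raw) -> Delivery:
    """Turn [id, [f1, v1, ...], idle_ms, delivery_count] into a Delivery."""
    entry_id, flat, idle_ms, count = raw
    fields = dict(zip(flat[0::2], flat[1::2]))  # pair up the flat field/value list
    return Delivery(entry_id, fields, idle_ms, count)


MAX_DELIVERIES = 5  # hypothetical retry limit, not a Redis setting


def route(d: Delivery) -> str:
    """Pick a handling strategy from the delivery count."""
    if d.delivery_count == 0:
        return "process"        # new message, never delivered before
    if d.delivery_count >= MAX_DELIVERIES:
        return "dead-letter"    # likely a poison message
    return "retry"              # claimed after an earlier failed delivery
```

The same structure could branch on `idle_ms` for time-based escalation; that decision is left out here to keep the sketch short.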
## Key Benefits

- Reduced Complexity: Eliminates manual PEL management and multi-command orchestration
- Improved Performance: Reduces round trips by 50-70% for workloads processing both pending and new entries
- Backward Compatibility: Fully optional parameter with zero breaking changes to existing behavior
- Multi-Stream Support: Works seamlessly across multiple streams in a single command
- Flexible Consumer Patterns: Enables mixed consumer types within the same group:
  - Consumers without CLAIM that only handle new messages
  - Consumers with CLAIM that process both pending and new entries

## Impact on Existing Commands

The XCLAIM and XAUTOCLAIM commands may benefit from the new pel_by_time index for improved performance, but such optimizations require further investigation and testing. Enhancements to XCLAIM and XAUTOCLAIM are postponed for future work.

## Performance Benchmarks

### Latency Performance

Comprehensive performance testing demonstrates significant improvements over the traditional XAUTOCLAIM approach:

Test Methodology

Two identical test scenarios were executed to compare XAUTOCLAIM against XREADGROUP with CLAIM:

Test Setup:

1. Insert 20,000 messages into a stream
2. Read all messages with XREADGROUP to populate the pending entries list (PEL)
3. Set IDLE time to 1100ms on 1,000 randomly selected pending messages using XCLAIM
4. Set IDLE time to 50ms on all remaining 19,000 pending messages using XCLAIM
5. Execute the target command with min-idle-time=1000ms and COUNT=1000 to claim the eligible messages
6. Repeat steps 3-5 for 1,000 iterations

Test 1 - XAUTOCLAIM (Traditional Approach):

```
XAUTOCLAIM Performance:
Average: 54.671ms
Median: 53.582ms
Min: 3.738ms
Max: 71.596ms
P95: 62.536ms
P99: 68.800ms
```

Test 2 - XREADGROUP with CLAIM (New Approach):

```
XREADGROUP CLAIM Performance:
Average: 2.426ms
Median: 2.571ms
Min: 1.287ms
Max: 4.653ms
P95: 3.370ms
P99: 4.212ms
```

Performance Analysis

The new XREADGROUP CLAIM implementation delivers 22.5x faster average performance compared to XAUTOCLAIM:

- Average latency reduction: 95.6% (54.671ms → 2.426ms)
- Median latency reduction: 95.2% (53.582ms → 2.571ms)
- P95 latency reduction: 94.6% (62.536ms → 3.370ms)
- P99 latency reduction: 93.9% (68.800ms → 4.212ms)

This performance improvement is achieved through the time-ordered PEL index (pel_by_time), which enables O(log n + k) retrieval of idle entries versus XAUTOCLAIM's less efficient scanning approach.

### Memory Performance

To evaluate the memory overhead of the pel_by_time index, comprehensive memory testing was conducted comparing Redis with and without the index under realistic workload conditions.

Test Methodology: - Insert 200,000 new messages into a stream - Read messages in blocks of 100 using XREADGROUP (populating the PEL with 200,000 pending entries) - Wait 5ms after each read block (simulating realistic processing delays that affect rax tree compression) - Measure memory usage before and after the reading phase Test Results - Without pel_by_time Index: ``` Initial memory (used): 926.10 KB After insertion (used): 6.80 MB After reading (used): 41.53 MB Memory increase from data: 5.90 MB Memory increase from reading: 34.72 MB Total memory increase: 40.62 MB ``` Test Results - With pel_by_time Index: ``` Initial memory (used): 927.44 KB After insertion (used): 6.81 MB After reading (used): 45.07 MB Memory increase from data: 5.90 MB Memory increase from reading: 38.27 MB Total memory increase: 44.17 MB ``` Memory Performance Analysis: The pel_by_time index introduces a measurable but reasonable memory overhead: Used Memory Impact: - Memory increase from pel_by_time index: 3.55 MB (38.27 MB - 34.72 MB) - Per-entry overhead: 18.6 bytes (3.55 MB / 200,000 entries) - Percentage overhead: 8.7% increase in total memory usage (3.55 MB / 40.62 MB) Per-Entry Memory Breakdown: Without any compression, the pel_by_time index would need 32 bytes per entry for the key alone (composite key only, no node values). The observed overhead of 18.6 bytes per entry is lower than that uncompressed figure, suggesting that rax prefix compression remains effective despite the 5ms delays between reads.
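The overhead figures in the memory analysis can be re-derived from the two measurement tables. The sketch below is only a check of that arithmetic; the function names are hypothetical and not part of Redis, and it assumes that 1 MB means 1024 * 1024 bytes and that the percentage is taken relative to the total increase measured without the index (the reading that reproduces the reported 8.7%).

```c
#include <assert.h>

/* Hypothetical helper: index overhead in bytes per pending entry, computed
 * from the two "memory increase from reading" figures (given in MB).
 * Assumes 1 MB = 1024 * 1024 bytes. */
double sketch_per_entry_overhead_bytes(double with_index_mb,
                                       double without_index_mb,
                                       long entries) {
    assert(entries > 0);
    return (with_index_mb - without_index_mb) * 1024.0 * 1024.0 /
           (double)entries;
}

/* Hypothetical helper: index overhead as a percentage of the total memory
 * increase measured without the index (assumed baseline). */
double sketch_overhead_percent(double with_index_mb,
                               double without_index_mb,
                               double baseline_total_mb) {
    assert(baseline_total_mb > 0.0);
    return (with_index_mb - without_index_mb) / baseline_total_mb * 100.0;
}
```

With the reported values, `sketch_per_entry_overhead_bytes(38.27, 34.72, 200000)` comes out at roughly 18.6 bytes and `sketch_overhead_percent(38.27, 34.72, 40.62)` at roughly 8.7%.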
## Technical Implementation ### New Data Structure: Time-Ordered PEL Index (`pel_by_time`) To efficiently identify and claim idle pending entries, this PR introduces a new rax tree structure to the consumer group implementation: Structure Design: - Tree Type: Rax tree named pel_by_time added to each consumer group - Key Composition: 32-byte composite key consisting of: - `delivery_time` (timestamp when entry was last delivered) - `streamId` (stream entry ID) Key Format: `delivery_time` + `streamId` (concatenated) Node Value: None - all necessary information is encoded in the key itself for memory efficiency Key Properties: _Uniqueness Guarantee:_ While multiple pending entries may share the same `delivery_time`, the `streamId` component ensures each key is globally unique within the tree. _Lexicographical Ordering:_ The rax tree naturally orders nodes lexicographically by key. Since `delivery_time` forms the prefix of each key, entries are automatically sorted by delivery time, with oldest entries appearing first in the tree. _Efficient Range Operations:_ This time-based ordering enables highly efficient range searches. To find all entries idle for at least `min-idle-time` milliseconds, we simply perform a range query from the tree's beginning up to `current_time - min-idle-time`. Fast Retrieval: Once idle entries are identified via the `pel_by_time` index, the embedded `streamId` in each key is used to quickly retrieve the full pending message data structure for the subsequent `XREADGROUP` claim operation. 
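To illustrate why a `delivery_time` + `streamId` key sorts by delivery time, here is a minimal sketch of one possible encoding. It is not the PR's code: all names are hypothetical, and the layout (three big-endian 64-bit fields, 24 bytes) is an assumption, since the description states a 32-byte key without giving its exact layout. Big-endian encoding is what makes byte-wise (lexicographic) comparison agree with numeric order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_KEY_LEN 24 /* assumed layout: 8 (time) + 8 (id ms) + 8 (id seq) */

/* Store v in big-endian order so that memcmp() order equals numeric order. */
static void sketch_put_be64(unsigned char *dst, uint64_t v) {
    for (int i = 7; i >= 0; i--) {
        dst[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Build the composite key: delivery time first, then the stream ID. */
void sketch_encode_key(unsigned char out[SKETCH_KEY_LEN],
                       uint64_t delivery_time_ms,
                       uint64_t id_ms, uint64_t id_seq) {
    assert(out != NULL);
    sketch_put_be64(out, delivery_time_ms);
    sketch_put_be64(out + 8, id_ms);
    sketch_put_be64(out + 16, id_seq);
}

/* An entry is claimable when its key is <= the upper bound key
 * (now - min_idle, max possible ID), i.e. it falls inside the range
 * [start of tree, current_time - min-idle-time]. */
int sketch_key_is_claimable(const unsigned char key[SKETCH_KEY_LEN],
                            uint64_t now_ms, uint64_t min_idle_ms) {
    unsigned char bound[SKETCH_KEY_LEN];
    if (now_ms < min_idle_ms) return 0; /* nothing can be that idle yet */
    sketch_encode_key(bound, now_ms - min_idle_ms, UINT64_MAX, UINT64_MAX);
    return memcmp(key, bound, SKETCH_KEY_LEN) <= 0;
}
```

In a real ordered tree the bound key would be used once as the end of a range scan rather than compared against every entry; the per-key check here only demonstrates the ordering property.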
Performance Characteristics: - Insertion: O(log n) when adding entries to PEL - Range Search: O(log n + k) where k is the number of idle entries found - Memory Overhead: 32 bytes per pending entry for the index key (no additional node values stored) This dual-index approach (existing PEL structures plus the new time-ordered index) allows XREADGROUP with CLAIM to efficiently identify claimable entries without scanning the entire PEL, making the operation suitable for consumer groups with large pending entry lists. ### COUNT Behavior with CLAIM When the `COUNT` option is used in conjunction with `CLAIM`, the command follows a two-phase execution strategy to maximize the specified count limit: Phase 1: Claim Idle Pending Entries - Retrieve claimable pending entries (idle for ≥ min-idle-time) up to the COUNT limit - These entries are claimed and returned to the consumer Phase 2: Fetch New Messages (if needed) - If the `COUNT` limit has not been satisfied by claimed pending entries, the command proceeds to read new messages from the stream - New messages are fetched up to the remaining available count: `remaining_count = COUNT - claimed_entries` This prioritization ensures that idle pending entries are always processed first, preventing indefinite message stalling while still allowing consumers to process new messages efficiently when pending entries are scarce. ### BLOCK Behavior with CLAIM When the CLAIM option is used in conjunction with the BLOCK option, the command exhibits sophisticated blocking behavior that responds to both new messages and pending entries becoming claimable: Blocking State Management: If there are no immediately claimable pending entries and no new messages available in the stream, the `XREADGROUP` command enters a blocking state for the specified duration. 
However, the implementation must handle a critical scenario: pending entries that become idle (and thus claimable) while the command is blocked must trigger an early wakeup to serve those entries. Implementation: `stream_claim_pending_keys` Dictionary To enable this reactive blocking behavior, a new `stream_claim_pending_keys` dictionary is introduced to the `redisDb` structure: - Key: Stream key being watched - Value: The minimum timestamp when the next pending entry in this stream will become claimable (i.e., will satisfy the min-idle-time requirement) Multi-Client Coordination: When multiple XREADGROUP commands with BLOCK and CLAIM are executed concurrently on the same stream, the dictionary value stores the shortest claimable time across all waiting clients. This ensures the earliest possible wakeup when any pending entry becomes available for claiming. Wakeup Mechanism: `handleClaimableStreamEntries` The `handleClaimableStreamEntries` function is invoked regularly from `blockedBeforeSleep` to monitor and react to claimable entries: 1. Scan Phase: Iterates through all entries in the `stream_claim_pending_keys` dictionary 2. Time Check: Compares each entry's claimable timestamp against the current time 3. Signal Phase: When `claimable_time ≤ current_time`, calls `signalKeyAsReady` to wake up all clients blocked on that stream 4. 
Client Processing: Awakened clients attempt to claim and process the newly available pending entries Resource Contention Handling: When the number of claimable entries is insufficient to satisfy all awakened clients: - Clients that successfully claim entries complete their operations - Remaining clients recalculate the next minimum claimable time based on remaining pending entries - These clients update the `stream_claim_pending_keys` dictionary with the new timestamp - They re-enter the blocking state to wait for the next batch of claimable entries This design ensures fair resource distribution and prevents busy-waiting while maintaining responsiveness to both new messages and aging pending entries.	2025-10-21 20:35:43 +08:00
Mincho Paskalev	aed879ad0a	Optimistic locking for string objects - compare-and-set and compare-and-delete (#14435 ) # Description Add optimistic locking for string objects via compare-and-set and compare-and-delete mechanisms. ## What's changed Introduces a new DIGEST command that returns the XXH3 hash digest of a string object. Extends the SET command with new parameters supporting optimistic locking: the new value is set only if checks against a given (old) value or a given string digest pass. Introduces a new DELEX command to support conditionally deleting a key. Its conditions are also checks against a string value or a string digest. ## Motivation For developers who need to implement single-key optimistic concurrency control with compare-and-set and compare-and-delete, this PR provides a single-command implementation. Compare-and-set and compare-and-delete are mostly used for [Optimistic concurrency control](https://en.wikipedia.org/wiki/Optimistic_concurrency_control): a client (1) fetches the value, keeps the old value (or its digest, for a large string) in memory, (2) manipulates a local copy of the value, (3) applies the local changes to the server, but only if the server’s value hasn’t been changed (still equal to the old value). Note that compare-and-set [can also be implemented](https://redis.io/docs/latest/develop/using-commands/transactions/#optimistic-locking-using-check-and-set) with WATCH … MULTI … EXEC and Lua scripts. The new SET optional arguments and the DELEX command do not enable new functionality; however, they are much simpler and faster to use for the very common use case of single-key optimistic concurrency control. ## Related issues and PRs https://github.com/redis/redis/issues/12485 https://github.com/redis/redis/pull/8361 https://github.com/redis/redis/pull/4258 ## Description of the new commands ### DIGEST ``` DIGEST key ``` Get the hash digest of the value stored in key, as a hex string.
Reply: - Null if key does not exist - error if key exists but holds a value which is not a string - (bulk string) the XXH3 digest of the value stored in key, as an hex string ### SET ``` SET key value [NX \| XX \| IFEQ match-value \| IFNE match-value \| IFDEQ match-digest \| IFDNE match-digest] [GET] [EX seconds \| PX milliseconds \| EXAT unix-time-seconds \| PXAT unix-time-milliseconds \| KEEPTTL] ``` `IFEQ match-value` - Set the key’s value and expiration only if its current value is equal to match-value. If key doesn’t exist - it won’t be created. `IFNE match-value` - Set the key’s value and expiration only if its current value is not equal to match-value. If key doesn’t exist - it will be created. `IFDEQ match-digest` - Set the key’s value and expiration only if the digest of its current value is equal to match-digest. If key doesn’t exist - it won’t be created. `IFDNE match-digest` - Set the key’s value and expiration only if the digest of its current value is not equal to match-digest. If key doesn’t exist - it will be created. Reply update: - If GET was not specified: - Nil reply if either - the key doesn’t exist and XX/IFEQ/IFDEQ was specified. The key was not created. - the key exists, and NX was specified or a specified IFEQ/IFNE/IFDEQ/IFDNE condition is false. The key was not set. - Simple string reply: OK: The key was set. - If GET was specified, any of the following: - Nil reply: The key didn't exist before this command (whether the key was created or not). - Bulk string reply: The previous value of the key (whether the key was set or not). ### DELEX ``` DELEX key [IFEQ match-value \| IFNE match-value \| IFDEQ match-digest \| IFDNE match-digest] ``` Conditionally removes the specified key. A key is ignored if it does not exist. 
`IFEQ match-value` - Delete the key only if its value is equal to match-value `IFNE match-value` - Delete the key only if its value is not equal to match-value `IFDEQ match-digest` - Delete the key only if the digest of its value is equal to match-digest `IFDNE match-digest` - Delete the key only if the digest of its value is not equal to match-digest Reply: - error if key exists but holds a value that is not a string and IFEQ/IFNE/IFDEQ/IFDNE is specified. - (integer) 0 if not deleted (the key does not exist or a specified IFEQ/IFNE/IFDEQ/IFDNE condition is false), or 1 if deleted. ### Notes Added copy of xxhash repo to deps - [version](`c961fbe61a`) --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Yuan Wang <wangyuancode@163.com>	2025-10-21 10:32:49 +03:00
Moti Cohen	5b49119236	Fix crash in lookupKey() when `executing_client` is NULL (#14415 ) This PR is based on: https://github.com/valkey-io/valkey/pull/2347 This was introduced in https://github.com/redis/redis/pull/13512 The server crashes with a null pointer dereference when lookupKey() is called from handleClientsBlockedOnKey(). The crash occurs because server.executing_client is NULL, but the code attempts to access server.executing_client->cmd->proc without checking. Crash scenario: Client 1 enables CLIENT NO-TOUCH Client 2 blocks on BRPOP mylist 0 Client 1 executes RPUSH mylist elem When unblocking Client 2, lookupKey() dereferences NULL server.executing_client → crash Solution Added proper null checks before dereferencing server.executing_client: Check if LOOKUP_NOTOUCH flag is already set before attempting to modify it Verify both server.current_client and server.executing_client are not NULL before accessing their members Maintain the TOUCH command exception for scripts Testing Added regression test in tests/unit/type/list.tcl that reproduces and verifies the fix for this crash scenario. 
This fix is based on valkey-io/valkey#2347 Co-authored-by: Uri Yagelnik <uriy@amazon.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com>	2025-10-13 12:12:38 +03:00
张宇杭	083f38ef5a	Fix issues with server.allow_access_expired (#14262 ) Close https://github.com/redis/redis/issues/14214 1. When the server.allow_access_expired flag is set to 1, it allows access to expired keys that have not yet been evicted. All places involving access to expired keys should consider the impact of this parameter. 2. The modifications involve five methods: hfieldIsExpired, hashTypeNext, hashTypeLength, keyIsExpired, and hashTypeIsExpired. When the server.allow_access_expired flag is set to 1, these methods will not skip expired keys, otherwise they follow the normal logic execution. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2025-10-12 11:02:03 +08:00
Moti Cohen	9b63e99d05	Refactor HFE: Introduce Per-Slot Expiration Store (estore) (#14294 ) Hash field expiration is managed with two levels of data structures. 1. At the DB level, an ebuckets structure maintains the set of all hashes that contain fields with expiration. 2. At the per-hash level, an ebuckets structure tracks fields with expiration. This pull request refactors the 1st level to operate per slot instead, and introduces a new API called estore (expiration store). Its design aligns closely with the existing kvstore API, ensuring consistency and simplifying usage. The terminology at that level has been updated from “HFE” or “hexpire” to “subexpiry”, reflecting a broader scope that can later support other data types.	2025-09-11 16:45:17 +03:00
debing.sun	60adba48aa	Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326 ) This PR is based on https://github.com/valkey-io/valkey/pull/1303 This PR introduces a DEBUG_DEFRAG compilation option that enables activedefrag functionality even when the allocator is not jemalloc, and always forces defragmentation regardless of the amount or ratio of fragmentation. ## Using ``` make SANITIZER=address DEBUG_DEFRAG=<force\|fully> ./runtest --debug-defrag ``` * DEBUG_DEFRAG=force * Ignore the threshold for defragmentation to ensure that defragmentation is always triggered. * Always reallocate pointers to probe for correctness issues in pointer reallocation. * DEBUG_DEFRAG=fully * Includes everything in the option `force`. * Additionally performs a full defrag on every defrag cycle, which is significantly slower but more accurate. --------- Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: oranagra <oran@redislabs.com>	2025-09-10 12:52:20 +08:00
Giuseppe Coco	5f8e7852f4	Fix: Validate ENTRIESREAD in XGROUP command (#14259 ) Fixes #14257 The XGROUP CREATE and SETID subcommands allowed setting an ENTRIESREAD value greater than the stream's total `entries_added` counter. This could lead to a logically inconsistent state. This commit adds a check to ensure the provided ENTRIESREAD value is not greater than the number of entries ever added to the stream. If ENTRIESREAD is too large, it gets set to the total number of entries in the stream, i.e. `s->entries_added`.	2025-09-01 08:36:38 +08:00
Moti Cohen	e6c261f3fb	Fix MEMORY USAGE command (#14288 ) After the key-value unification (kvobj), the MEMORY USAGE command may no longer account for the embedded key length stored within the kvobj. To fix this, replace sizeof(*o) with zmalloc_size((void *)o) to ensure the full allocated size is measured. As part of this change, the function objectComputeSize() was renamed to kvobjComputeSize() and modified: instead of computing only the value size, it now computes the size of both the key and its value.	2025-08-20 13:54:45 +03:00
debing.sun	b9d9d4000b	Prevent crash when cgroups_ref is null in streamEntryIsReferenced() after reload (#14276 ) This bug was introduced by https://github.com/redis/redis/pull/14130 and was found by @oranagra ### Summary `s->cgroup_ref` is created at runtime the first time a consumer group is linked with a message, and it is not released when all references are removed. However, after `debug reload` or a restart, if the PEL is empty (meaning no consumer group is referencing any message), `s->cgroup_ref` will not be recreated. As a result, when XADD or XTRIM is executed with the `ACKED` option and checks whether a message that has been read but not yet ACKed can be deleted, the NULL cgroup_ref causes a crash. ### Code Path ``` xaddCommand -> streamTrim -> streamEntryIsReferenced ``` ### Solution Check if `s->cgroup_ref` is NULL in streamEntryIsReferenced().	2025-08-15 15:15:16 +08:00
debing.sun	bec644aab1	Fix missing kvobj reassignment after reallocation in MOVE command (#14233 ) Introduced by https://github.com/redis/redis/issues/13806 Fixed a crash in the MOVE command when moving hash objects that have both key expiration and field expiration. The issue occurred in the following scenario: 1. A hash has both key expiration and field expiration. 2. During MOVE command, `setExpireByLink()` is called to set the expiration time for the target hash, which may reallocate the kvobj of hash. 3. Since the hash has field expiration, `hashTypeAddToExpires()` is called to update the minimum field expiration time Issue: However, the kvobj pointer wasn't updated with the return value from `setExpireByLink()`, causing `hashTypeAddToExpires()` to use freed memory.	2025-07-30 22:24:56 +08:00
Yuan Wang	db4fc2a833	Fix HINCRBYFLOAT removes field expiration on replica (#14224 ) Fixes #14218 Before, we replicate HINCRBYFLOAT as an HSET command with the final value in order to make sure that differences in float precision or formatting will not create differences in replicas or after an AOF restart. However, on the replica side, if the field has an expiration time, HSET will remove it, even though the master retains it. This leads to inconsistencies between the master and the replica. To address this, we now use the HSETEX command with the KEEPTTL flag instead of HSET, ensuring that the field’s TTL is preserved. This bug was introduced in version 7.4, but the HSETEX command was only implemented from version 8.0. Therefore, this patch does not fix the issue in the 7.4 branch, a separate commit is needed to address it in 7.4.	2025-07-28 21:09:46 +08:00
debing.sun	45c8fcc992	Only mark the client reprocessing flag when unblocked on keys (#14165 ) This PR is based on https://github.com/valkey-io/valkey/pull/2109 When we refactored the blocking framework we introduced the client reprocessing infrastructure. When a client that was blocked on keys is unblocked, it attempts to reprocess the command. One challenge was to keep track of the command timeout, since we are reprocessing and do not want to re-register the client with a fresh timeout each time. The solution was to consider the client reprocessing flag when the client is blockedOnKeys: ```c if (!(c->flags & CLIENT_REPROCESSING_COMMAND)) { /* If the client is re-processing the command, we do not set the timeout * because we need to retain the client's original timeout. */ c->bstate.timeout = timeout; } ``` However, this introduced a new issue. There are cases where the client is blocked consecutively for different reasons, for example: ``` CLIENT PAUSE 10000 ALL BZPOPMAX zset 1 ``` would leave the client blocked on the zset endlessly if nothing is written to it. Credits to @uriyage for locating this with his fuzzer testing. The suggested solution is to only flag the client when it is specifically unblocked on keys. Signed-off-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: Binbin <binloveplay1314@qq.com>	2025-07-21 20:05:47 +08:00
debing.sun	fa040a72c0	Add XDELEX and XACKDEL commands for stream (#14130 ) ## Summary and detailed design for new stream command ## XDELEX ### Syntax ``` XDELEX key [KEEPREF \| DELREF \| ACKED] IDS numids id [id ...] ``` ### Description The `XDELEX` command extends the Redis Streams `XDEL` command, offering enhanced control over message entry deletion with respect to consumer groups. It accepts optional `DELREF` or `ACKED` parameters to modify its behavior: - KEEPREF: Deletes the specified entries from the stream, but preserves existing references to these entries in all consumer groups' PEL. This behavior is similar to XDEL. - DELREF: Deletes the specified entries from the stream and also removes all references to these entries from all consumer groups' pending entry lists, effectively cleaning up all traces of the messages. - ACKED: Only trims entries that were read and acknowledged by all consumer groups. Note: The `IDS` block can appear at any position in the command, consistent with other commands. ### Reply Array reply, for each `id`: - `-1`: No such `id` exists in the provided stream `key`. - `1`: Entry was deleted from the stream. - `2`: Entry was not deleted, but there are still dangling references. (ACKED option) ## XACKDEL ### Syntax ``` XACKDEL key group [KEEPREF \| DELREF \| ACKED] IDS numids id [id ...] ``` ### Description The `XACKDEL` command combines `XACK` and `XDEL` functionalities in Redis Streams. It acknowledges specified message IDs in the given consumer group and attempts to delete corresponding stream entries. It accepts optional `DELREF` or `ACKED` parameters: - KEEPREF: Acknowledges the messages in the specified consumer group and deletes the entries from the stream, but preserves existing references to these entries in all consumer groups' PEL. 
- DELREF: Acknowledges the messages in the specified consumer group, deletes the entries from the stream, and also removes all references to these entries from all consumer groups' pending entry lists, effectively cleaning up all traces of the messages. - ACKED: Acknowledges the messages in the specified consumer group and only trims entries that were read and acknowledged by all consumer groups. ### Reply Array reply, for each `id`: - `-1`: No such `id` exists in the provided stream `key`. - `1`: Entry was acknowledged and deleted from the stream. - `2`: Entry was acknowledged but not deleted, but there are still dangling references. (ACKED option) # Redis Streams Commands Extension ## XTRIM ### Syntax ``` XTRIM key <MAXLEN \| MINID> [= \| ~] threshold [LIMIT count] [KEEPREF \| DELREF \| ACKED] ``` ### Description The `XTRIM` command trims a stream by removing entries based on specified criteria, extended to include optional `DELREF` or `ACKED` parameters for consumer group handling: - KEEPREF: Trims the stream according to the specified strategy (MAXLEN or MINID) regardless of whether entries are referenced by any consumer groups, but preserves existing references to these entries in all consumer groups' PEL. - DELREF: Trims the stream according to the specified strategy and also removes all references to the trimmed entries from all consumer groups' PEL. - ACKED: Only trims entries that were read and acknowledged by all consumer groups. ### Reply No change. ## XADD ### Syntax ``` XADD key [NOMKSTREAM] [<MAXLEN \| MINID> [= \| ~] threshold [LIMIT count]] [KEEPREF \| DELREF \| ACKED] <* \| id> field value [field value ...] 
``` ### Description The `XADD` command appends a new entry to a stream and optionally trims it in the same operation, extended to include optional `DELREF` or `ACKED` parameters for trimming behavior: - KEEPREF: When trimming, removes entries from the stream according to the specified strategy (MAXLEN or MINID), regardless of whether they are referenced by any consumer groups, but preserves existing references to these entries in all consumer groups' PEL. - DELREF: When trimming, removes entries from the stream according to the specified strategy and also removes all references to these entries from all consumer groups' PEL. - ACKED: When trimming, only removes entries that were read and acknowledged by all consumer groups. Note that if the number of referenced entries is larger than MAXLEN, trimming still stops, so the stream may stay longer than MAXLEN. ### Reply No change. ## Key implementation Since we currently have no simple way to track the association between an entry and consumer groups without iterating over all groups, we introduce two mechanisms to establish this link. This allows us to determine whether an entry has been seen by all consumer groups, and to identify which groups are referencing it. With these links, we can break the association when the entry is either acknowledged or deleted. 1) Added reference tracking between stream messages and consumer groups using `cgroups_ref` The cgroups_ref is implemented as a rax that maps stream message IDs to lists of consumer groups that reference those messages, and streamNACK stores the corresponding nodes of this list, so that the corresponding groups can be removed from the list during `ACK`. In this way, we can determine whether an entry has been seen but not yet acknowledged. 2) Store a cached minimum last_id in the stream structure. The reason for doing this is that an entry may never have been seen by some consumer group. In this case, we treat this entry as not consumed either. 
If there is an "ACKED" option, we cannot directly delete this entry either. When a consumer group updates its last_id, we don’t immediately update the cached minimum last_id. Instead, we check whether the group’s previous last_id was equal to the current minimum, or whether the new last_id is smaller than the current minimum (when using `XGROUP SETID`). If either is true, we mark the cached minimum last_id as invalid, and defer the actual update until the next time it’s needed. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: moticless <moticless@github.com> Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> Co-authored-by: Slavomir Kaslev <slavomir.kaslev@gmail.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com>	2025-07-01 21:00:42 +08:00
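The lazy invalidation of the cached minimum last_id described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the Redis implementation: the types and function names are hypothetical, and the consumer groups are modelled as a plain array of IDs rather than the stream's real group structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t ms, seq; } sketch_id;   /* hypothetical stream ID */

typedef struct {
    sketch_id min_last_id;  /* cached minimum last_id over all groups */
    int min_valid;          /* 0 means the cache must be recomputed */
} sketch_stream;

static int sketch_id_cmp(sketch_id a, sketch_id b) {
    if (a.ms != b.ms) return a.ms < b.ms ? -1 : 1;
    if (a.seq != b.seq) return a.seq < b.seq ? -1 : 1;
    return 0;
}

/* Called when one group's last_id changes from old_id to new_id. The cache
 * is only marked invalid; the recomputation is deferred. */
void sketch_on_group_last_id_change(sketch_stream *s,
                                    sketch_id old_id, sketch_id new_id) {
    if (!s->min_valid) return;
    if (sketch_id_cmp(old_id, s->min_last_id) == 0 ||  /* group held the minimum */
        sketch_id_cmp(new_id, s->min_last_id) < 0)     /* e.g. XGROUP SETID backwards */
        s->min_valid = 0;
}

/* Return the minimum last_id, recomputing it only if the cache is invalid. */
sketch_id sketch_min_last_id(sketch_stream *s,
                             const sketch_id *group_ids, size_t ngroups) {
    assert(ngroups > 0);
    if (!s->min_valid) {
        s->min_last_id = group_ids[0];
        for (size_t i = 1; i < ngroups; i++)
            if (sketch_id_cmp(group_ids[i], s->min_last_id) < 0)
                s->min_last_id = group_ids[i];
        s->min_valid = 1;
    }
    return s->min_last_id;
}
```

The point of the design is that a group advancing past entries it does not hold the minimum for costs nothing, while the full scan over groups runs only when the minimum is actually needed again.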
debing.sun	5ff81f68a3	Fix XPENDING reply schema for empty reply (#14129 ) When the PEL is empty, the reply of `XPENDING` without `start` option will be: ``` 1) (integer) 0 2) (nil) 3) (nil) 4) (nil) ``` It is not an empty array, so we need to create an individual reply schema for it.	2025-07-01 17:35:09 +08:00
yzc-yzc	117424f85c	Fix negative offset issue for ZRANGEBY[SCORE\|LEX] command (#14043 ) Fix #13952 This PR ensures that ZRANGE_SCORE/LEX command with a negative offset will return empty.	2025-06-20 13:51:52 +08:00
guybe7	6349a7c4f9	Add GETRANGE tests with negative indices (#13950 ) Inspired by https://github.com/redis/redis/pull/12272	2025-05-27 09:41:28 +08:00
Moti Cohen	e1789e4368	keyspace - Unify key and value & use dict no_value=1 (#13806 ) The idea of packing the key (`sds`), value (`robj`) and optionally TTL into a single struct in memory was mentioned a few times in the past by the community in various flavors. This approach improves memory efficiency, reduces pointer dereferences for faster lookups, and simplifies expiration management by keeping all relevant data in one place. This change goes along with setting keyspace's dict to no_value=1, and saving considerable amount of memory. Two more motivations that well aligned with this unification are: - Prepare the groundwork for replacing EXPIRE scan based implementation and evaluate instead new `ebuckets` data structure that was introduced as part of [Hash Field Expiration feature](https://redis.io/blog/hash-field-expiration-architecture-and-benchmarks/). Using this data structure requires embedding the ExpireMeta structure within each object. - Consider replacing dict with a more space efficient open addressing approach hash table that might rely on keeping a single pointer to object. Before this PR, I POC'ed on a variant of open addressing hash-table and was surprised to find that dict with no_value actually could provide a good balance between performance, memory efficiency, and simplicity. This realization prompted the separation of the unification step from the evaluation of a new hash table to avoid introducing too many changes at once and to evaluate its impact independently before considering replacement of existing hash-table. On an earlier [commit](https://github.com/redis/redis/pull/13683) I extended dict no_value optimization (which saves keeping dictEntry where possible) to be relevant also for objects with even addresses in memory. Combining it with this unification saves a considerable amount of memory for keyspace. # kvobj This PR adopts Valkey’s [packing](`3eb8314be6`) layout and logic for key, value, and TTL. 
However, unlike the Valkey implementation, which retained a common `robj` throughout the project, this PR distinguishes between the general-purpose, overused `robj`, and the new `kvobj`, which embeds both the key and value and is used by the keyspace. Conceptually, `robj` serves as a base class, while `kvobj` acts as a derived class. Two new flags are introduced into the redis object, `iskvobj` and `expirable`: ``` struct redisObject { unsigned type:4; unsigned encoding:4; unsigned lru:LRU_BITS; unsigned iskvobj : 1; /* new flag */ unsigned expirable : 1; /* new flag */ unsigned refcount : 30; /* modified: 32bits->30bits */ void *ptr; }; typedef struct redisObject robj; typedef struct redisObject kvobj; ``` When the `iskvobj` flag is set, the object also includes the key, which is appended to the end of the object. If the `expirable` flag is set, an additional 8 bytes are added to the object. If the object is of type string, and the string is rather short, then it will be embedded as well. As a result, all keys in the keyspace are promoted to be of type `kvobj`. This term attempts to align with the existing Redis object, robj, and the kvstore data structure. # EXPIRE Implementation As `kvobj` embeds the expiration time as well, looking up expiration times is now an O(1) operation. The EXPIRE hash table is now set to `no_value` mode, directly referencing `kvobj` entries, which in turn saves memory. Next, I plan to evaluate replacing the EXPIRE implementation with the [ebuckets](https://github.com/redis/redis/blob/unstable/src/ebuckets.h) data structure, which would eliminate keyspace scans for expired keys. This requires embedding `ExpireMeta` within each `kvobj` of each key with expiration. In such an implementation, the `expirable` flag will be shifted to indicate whether `ExpireMeta` is attached. 
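As a rough illustration of the layout described above, the sketch below mirrors the bit-field header and estimates the allocation size of a kvobj. It is not the Redis code: the names are hypothetical, `SKETCH_LRU_BITS` is assumed to be 24, the key is modelled as raw bytes plus a terminator (the real embedded key carries its own string header), and embedding of short string values is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_LRU_BITS 24                       /* assumption, mirrors LRU_BITS */
#define SKETCH_REFCOUNT_MAX ((1u << 30) - 1u)    /* refcount now has 30 bits */

struct sketch_obj {
    unsigned type : 4;
    unsigned encoding : 4;
    unsigned lru : SKETCH_LRU_BITS;
    unsigned iskvobj : 1;     /* object also carries its key */
    unsigned expirable : 1;   /* object also carries an expiration time */
    unsigned refcount : 30;   /* shrunk from 32 bits to make room for the flags */
    void *ptr;
};

/* The two flag bits are taken from refcount, so both bit-field groups still
 * add up to 32 bits each and the header does not grow. */
static_assert(4 + 4 + SKETCH_LRU_BITS == 32, "first bit-field word is 32 bits");
static_assert(1 + 1 + 30 == 32, "second bit-field word is 32 bits");

/* Hypothetical size estimate: header, optional 8-byte expiration time,
 * then the key bytes appended at the end of the object. */
size_t sketch_kvobj_alloc_size(size_t keylen, int expirable) {
    size_t size = sizeof(struct sketch_obj);
    if (expirable) size += sizeof(int64_t);   /* the additional 8 bytes */
    size += keylen + 1;                       /* key bytes plus terminator */
    return size;
}
```

Keeping the optional fields behind flags means that keys without a TTL do not pay the 8 bytes for an expiration time.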
# Implementation notes ## Manipulating keyspace (find, modify, insert) Initially, unifying the key and value into a single object and storing it in dict with the `no_value` optimization seemed like a quick win. However, it (quickly) became clear that this change required deeper modifications to how keys are manipulated. The challenge was handling cases where a dictEntry is optimized out due to the no_value optimization. In such cases, many of the APIs that return the dictEntry from a lookup become insufficient, as the returned value might just be the key itself. To address this issue, this PR takes a new-old approach: returning a "link" to the looked-up key's `dictEntry` instead of the `dictEntry` itself. The term `link` was already somewhat available in the dict API, and is well aligned with the new dictEntLink declaration: ``` typedef dictEntry **dictEntLink; ``` This PR introduces two new function APIs to dict to leverage the link returned from the search: ``` dictEntLink dictFindLink(dict *d, const void *key, dictEntLink *bucket); void dictSetKeyAtLink(dict *d, void *key, dictEntLink *link, int newItem); ``` After calling `link = dictFindLink(...)`, any necessary updates must be performed immediately afterwards by calling `dictSetKeyAtLink()`, without any intervening operations on the given dict. Otherwise, `dictEntLink` may become invalid. Example: ``` /* replace existing key */ link = dictFindLink(d, key, &bucket); /* ... Do something, but don't modify the dict ... */ /* assert(link != NULL); */ dictSetKeyAtLink(d, kv, &link, 0); /* Add new value (If no space for the new key, dict will be expanded and bucket will be looked up again.) */ link = dictFindLink(d, key, &bucket); /* ... Do something, but don't modify the dict ... */ /* assert(link == NULL); */ dictSetKeyAtLink(d, kv, &bucket, 1); ``` ## dict.h - The dict API has become cluttered with many unused functions. I have removed these from dict.h. 
- Additionally, APIs specifically related to hash maps (no_value=0), primarily those handling key-value access, have been gathered and isolated. - Removed entirely internal functions ending with “ByHash()” that were originally added for optimization and not required any more. - Few other legacy dict functions were adapted at API level to work with the term dictEntLink as well. - Simplified and generalized an optimization that related to comparison of length of keys of type strings. ## Hash Field Expiration Until now each hash object with expiration on fields needed to maintain a reference to its key-name (of the hash object), such that in case it will be active-expired, then it will be possible to resolve the key-name for the notification sake. Now there is no need anymore. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2025-05-12 10:15:17 +03:00
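The "link" idea described in this commit can be illustrated with a minimal sketch. This is not Redis's `dict`: it is a toy chained hash set, and every name in it (`toySet`, `toyFindLink`, `toySetKeyAtLink`) is hypothetical. It only shows why returning the address of the slot that refers to an entry lets a caller replace or insert a key without a second lookup; it does not model the `no_value` case where the slot holds the key itself.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy chained hash set. NOT Redis's dict; all names are hypothetical. */
#define TOY_BUCKETS 8

typedef struct toyNode {
    const char *key;
    struct toyNode *next;
} toyNode;

/* A "link" is the address of the pointer that refers to a node. */
typedef toyNode **toyLink;

typedef struct toySet {
    toyNode *table[TOY_BUCKETS];
} toySet;

static unsigned toyHash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TOY_BUCKETS;
}

/* Return a link to the node holding `key`, or NULL if the key is absent.
 * `*bucket` is always set to the head slot where a new key would go. */
static toyLink toyFindLink(toySet *s, const char *key, toyLink *bucket) {
    toyLink link = &s->table[toyHash(key)];
    *bucket = link;
    for (; *link != NULL; link = &(*link)->next)
        if (strcmp((*link)->key, key) == 0) return link;
    return NULL;
}

/* Store `key` at `link`. With newItem == 0 the existing node's key is
 * overwritten in place; with newItem == 1 a node is pushed at the bucket
 * head that `link` points to. Returns 1 on success, 0 on allocation failure. */
static int toySetKeyAtLink(const char *key, toyLink link, int newItem) {
    if (!newItem) {
        (*link)->key = key;
        return 1;
    }
    toyNode *n = malloc(sizeof(*n));
    if (n == NULL) return 0;
    n->key = key;
    n->next = *link;
    *link = n;
    return 1;
}
```

As in the commit's own example, the link is only meaningful if the table is not modified between the find and the set.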
nesty92	8468ded667	Fix incorrect lag due to trimming stream via XTRIM or XADD command (#13958 ) This PR fixes the lag calculation by ensuring that when the consumer group's last_id is behind the first entry, the consumer group's entries read is considered invalid and is recalculated from the start of the stream. Supplement to PR #13473 Close #13957 Signed-off-by: Ernesto Alejandro Santana Hidalgo <ernesto.alejandrosantana@gmail.com>	2025-04-22 10:11:10 +08:00
Cong Chen	981aa5c12f	Fix timing issue in HEXPIREAT test (#13873 ) This fixes an error that occurs in the job [test-valgrind-no-malloc-usable-size-test](https://github.com/redis/redis/actions/runs/13912357739/job/38929051397) of the Daily workflow: ``` *** [err]: HEXPIREAT - Set time and then get TTL (listpackex) in tests/unit/type/hash-field-expire.tcl Expected '999' to be between to '1000' and '2000' (context: type eval line 6 cmd {assert_range [r hpttl myhash FIELDS 1 field1] 1000 2000} proc ::test) ```	2025-03-26 10:00:38 +08:00
Filipe Oliveira (Redis)	3e012c9260	Fix string2d usage in case of hexadecimal strings parsing and overflow (#13845 ) Since https://github.com/redis/redis/pull/11884, input that was accepted as valid before 8.0 (a hexadecimal string) has returned an error. This PR addresses it. To avoid performance penalties, it hints to the compiler that the fallbacks are unlikely to happen. Furthermore, we were ignoring std::result_out_of_range outputs from fast_float. This PR addresses that as well and includes tests for both identified scenarios. --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2025-03-19 20:08:45 +08:00
Yuan Wang	f1d6542b1a	Stabilize tcl test cases (#13829 ) Recently encountered some errors, as below. HGETEX/HSETEX with PXAT/EXAT options: after getting the TTL, we calculate the current time with `[clock seconds]`, which may be delayed and cause results greater than expected. Dismiss memory test error: now that rdb-channel replication has been introduced, the full synchronization might finish before the child process exits, so calling `bgsave` immediately after full sync may fail.	2025-02-25 16:31:53 +08:00
Denis Nevmerzhitskii	33f03f6fc8	Fix wrong behavior of XREAD + after last entry of stream have been removed (#13632 ) Close #13628 This PR changes the behavior of the special `+` id of the XREAD command. It now uses `streamLastValidID` to find the last entry instead of the `last_id` field of the stream object. This PR also adds a test for the issue. Notes The initial idea of updating `last_id` while executing XDEL seems to be wrong: `last_id` is used to store the last generated id, not the id of the last entry. --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: guybe7 <guy.benoish@redislabs.com>	2025-02-25 13:40:24 +08:00
Ozan Tezcan	e2608478b6	Add HGETDEL, HGETEX and HSETEX hash commands (#13798 ) This PR adds three new hash commands: HGETDEL, HGETEX and HSETEX. These commands enable the user to perform multiple operations atomically in one step, e.g. set a hash field and update its TTL with a single command. Previously, this was only possible by calling the hset and hexpire commands one after the other. - HGETDEL command ``` HGETDEL <key> FIELDS <numfields> field [field ...] ``` Description Get and delete the value of one or more fields of a given hash key Reply Array reply: list of the value associated with each field or nil if the field doesn’t exist. - HGETEX command ``` HGETEX <key> [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | PERSIST] FIELDS <numfields> field [field ...] ``` Description Get the value of one or more fields of a given hash key, and optionally set their expiration Options: EX seconds: Set the specified expiration time, in seconds. PX milliseconds: Set the specified expiration time, in milliseconds. EXAT timestamp-seconds: Set the specified Unix time at which the field will expire, in seconds. PXAT timestamp-milliseconds: Set the specified Unix time at which the field will expire, in milliseconds. PERSIST: Remove the time to live associated with the field. Reply Array reply: list of the value associated with each field or nil if the field doesn’t exist. - HSETEX command ``` HSETEX <key> [FNX | FXX] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL] FIELDS <numfields> field value [field value...] ``` Description Set the value of one or more fields of a given hash key, and optionally set their expiration Options: FNX: Only set the fields if all do not already exist. FXX: Only set the fields if all already exist. EX seconds: Set the specified expiration time, in seconds. PX milliseconds: Set the specified expiration time, in milliseconds. 
EXAT timestamp-seconds: Set the specified Unix time at which the field will expire, in seconds. PXAT timestamp-milliseconds: Set the specified Unix time at which the field will expire, in milliseconds. KEEPTTL: Retain the time to live associated with the field. Note: If no option is provided, any associated expiration time will be discarded similar to how SET command behaves. Reply Integer reply: 0 if no fields were set Integer reply: 1 if all the fields were set	2025-02-14 17:13:35 +03:00
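The all-or-nothing meaning of the `FNX` and `FXX` options described in this commit can be sketched as a small predicate. This is an illustration only, not the server's code: the enum, the function name `hsetexShouldSet` and the `exists` array are hypothetical stand-ins for however the command actually inspects the hash.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; only the FNX/FXX semantics come from the commit text. */
typedef enum { COND_NONE, COND_FNX, COND_FXX } fieldCond;

/* exists[i] is true if the i-th field is already present in the hash.
 * Returns true if the whole batch may be written (reply 1), false if
 * nothing should be written (reply 0). */
static bool hsetexShouldSet(fieldCond cond, const bool *exists, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (cond == COND_FNX && exists[i]) return false;  /* FNX: all must be absent */
        if (cond == COND_FXX && !exists[i]) return false; /* FXX: all must be present */
    }
    return true;
}
```

With a mix of existing and missing fields, both `FNX` and `FXX` reject the whole batch, which matches the two possible integer replies listed above.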
YaacovHazan	0aeb86d78d	Revert "Improve GETRANGE command behavior (#12272 )" Although the commit #6ceadfb58 improves GETRANGE command behavior, we can't accept it as we should avoid breaking changes for non-critical bug fixes. This reverts commit `6ceadfb580`.	2025-02-05 20:49:42 +02:00
Yuan Wang	64a40b20d9	Async IO Threads (#13695 ) ## Introduction Redis introduced IO Thread in 6.0, allowing IO threads to handle client request reading, command parsing and reply writing, thereby improving performance. The current IO thread implementation has a few drawbacks. - The main thread is blocked during IO thread read/write operations and must wait for all IO threads to complete their current tasks before it can continue execution. In other words, the entire process is synchronous. This prevents the efficient utilization of multi-core CPUs for parallel processing. - When the number of clients and requests increases moderately, it causes all IO threads to reach full CPU utilization due to the busy wait mechanism used by the IO threads. This makes it challenging for us to determine which part of Redis has reached its bottleneck. - When IO threads are enabled with TLS and io-threads-do-reads, a disconnection of a connection with pending data may result in it being assigned to multiple IO threads simultaneously. This can cause race conditions and trigger assertion failures. Related issue: redis#12540 Therefore, we designed an asynchronous IO threads solution. The IO threads adopt an event-driven model, with the main thread dedicated to command processing, meanwhile, the IO threads handle client read and write operations in parallel. ## Implementation ### Overall As before, we did not change the fact that all client commands must be executed on the main thread, because Redis was originally designed to be single-threaded, and processing commands in a multi-threaded manner would inevitably introduce numerous race and synchronization issues. But now each IO thread has independent event loop, therefore, IO threads can use a multiplexing approach to handle client read and write operations, eliminating the CPU overhead caused by busy-waiting. 
the execution process can be briefly described as follows: the main thread assigns clients to IO threads after accepting connections, IO threads will notify the main thread when clients finish reading and parsing queries, then the main thread processes queries from IO threads and generates replies, IO threads handle writing reply to clients after receiving clients list from main thread, and then continue to handle client read and write events. ### Each IO thread has independent event loop We now assign each IO thread its own event loop. This approach eliminates the need for the main thread to perform the costly `epoll_wait` operation for handling connections (except for specific ones). Instead, the main thread processes requests from the IO threads and hands them back once completed, fully offloading read and write events to the IO threads. Additionally, all TLS operations, including handling pending data, have been moved entirely to the IO threads. This resolves the issue where io-threads-do-reads could not be used with TLS. ### Event-notified client queue To facilitate communication between the IO threads and the main thread, we designed an event-notified client queue. Each IO thread and the main thread have two such queues to store clients waiting to be processed. These queues are also integrated with the event loop to enable handling. We use pthread_mutex to ensure the safety of queue operations, as well as data visibility and ordering, and race conditions are minimized, as each IO thread and the main thread operate on independent queues, avoiding thread suspension due to lock contention. And we implemented an event notifier based on `eventfd` or `pipe` to support event-driven handling. ### Thread safety Since the main thread and IO threads can execute in parallel, we must handle data race issues carefully. client->flags The primary tasks of IO threads are reading and writing, i.e. `readQueryFromClient` and `writeToClient`. 
However, IO threads and the main thread may concurrently modify or access `client->flags`, leading to potential race conditions. To address this, we introduced an io-flags variable to record operations performed by IO threads, thereby avoiding race conditions on `client->flags`. Pause IO thread In the main thread, we may want to operate data of IO threads, maybe uninstall event handler, access or operate query/output buffer or resize event loop, we need a clean and safe context to do that. We pause IO thread in `IOThreadBeforeSleep`, do some jobs and then resume it. To avoid thread suspended, we use busy waiting to confirm the target status. Besides we use atomic variable to make sure memory visibility and ordering. We introduce these functions to pause/resume IO Threads as below. ``` pauseIOThread, resumeIOThread pauseAllIOThreads, resumeAllIOThreads pauseIOThreadsRange, resumeIOThreadsRange ``` Testing has shown that `pauseIOThread` is highly efficient, allowing the main thread to execute nearly 200,000 operations per second during stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can handle up to nearly 56,000 operations per second. But operations performed between pausing and resuming IO threads must be quick; otherwise, they could cause the IO threads to reach full CPU utilization. freeClient and freeClientAsync The main thread may need to terminate a client currently running on an IO thread, for example, due to ACL rule changes, reaching the output buffer limit, or evicting a client. In such cases, we need to pause the IO thread to safely operate on the client. maxclients and maxmemory-clients updating When adjusting `maxclients`, we need to resize the event loop for all IO threads. Similarly, when modifying `maxmemory-clients`, we need to traverse all clients to calculate their memory usage. To ensure safe operations, we pause all IO threads during these adjustments. 
Client info reading The main thread may need to read a client’s fields to generate a descriptive string, such as for the `CLIENT LIST` command or logging purposes. In such cases, we need to pause the IO thread handling that client. If information for all clients needs to be displayed, all IO threads must be paused. Tracking redirect Redis supports the tracking feature and can even send invalidation messages to a connection with a specified ID. But the target client may be running on IO thread, directly manipulating the client’s output buffer is not thread-safe, and the IO thread may not be aware that the client requires a response. In such cases, we pause the IO thread handling the client, modify the output buffer, and install a write event handler to ensure proper handling. clientsCron In the `clientsCron` function, the main thread needs to traverse all clients to perform operations such as timeout checks, verifying whether they have reached the soft output buffer limit, resizing the output/query buffer, or updating memory usage. To safely operate on a client, the IO thread handling that client must be paused. If we were to pause the IO thread for each client individually, the efficiency would be very low. Conversely, pausing all IO threads simultaneously would be costly, especially when there are many IO threads, as clientsCron is invoked relatively frequently. To address this, we adopted a batched approach for pausing IO threads. At most, 8 IO threads are paused at a time. The operations mentioned above are only performed on clients running in the paused IO threads, significantly reducing overhead while maintaining safety. ### Observability In the current design, the main thread always assigns clients to the IO thread with the least clients. To clearly observe the number of clients handled by each IO thread, we added the new section in INFO output. The `INFO THREADS` section can show the client count for each IO thread. 
``` # Threads io_thread_0:clients=0 io_thread_1:clients=2 io_thread_2:clients=2 ``` Additionally, in the `CLIENT LIST` output, we also added a field to indicate the thread to which each client is assigned. `id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name= lib-ver= io-thread=1` ## Trade-off ### Special Clients For certain special types of clients, keeping them running on IO threads would result in severe race issues that are difficult to resolve. Therefore, we chose not to offload these clients to the IO threads. For replica, monitor, subscribe, and tracking clients, main thread may directly write them a reply when conditions are met. Race issues are difficult to resolve, so we have them processed in the main thread. This includes the Lua debug clients as well, since we may operate connection directly. For blocking client, after the IO thread reads and parses a command and hands it over to the main thread, if the client is identified as a blocking type, it will be remained in the main thread. Once the blocking operation completes and the reply is generated, the client is transferred back to the IO thread to send the reply and wait for event triggers. ### Clients Eviction To support client eviction, it is necessary to update each client’s memory usage promptly during operations such as read, write, or command execution. However, when a client operates on an IO thread, it is not feasible to update the memory usage immediately due to the risk of data races. As a result, memory usage can only be updated either in the main thread while processing commands or in the `ClientsCron` periodically. The downside of this approach is that updates might experience a delay of up to one second, which could impact the precision of memory management for eviction. To avoid incorrectly evicting clients. 
We adopted a best-effort compensation solution: when we decide to evict a client, we update its memory usage again before evicting. If the memory used by the client has not decreased, or its memory usage bucket has not changed, we evict it; otherwise we do not. However, we have not completely solved this problem. Due to the delay in memory usage updates, we may still make incorrect decisions about the need to evict clients. ### Defragment In the majority of cases we do NOT use the data from argv directly in the db. 1. key names We store a copy that we allocate in the main thread, see `sdsdup()` in `dbAdd()`. 2. hash key and value We store the key as hfield and the value as sds, see `hfieldNew()` and `sdsdup()` in `hashTypeSet()`. 3. other datatypes They don't even use SDS, so there are no reference issues. But in some cases the data from a client's argv may be retained by the main thread. As a result, during fragmentation cleanup, we need to move allocations from the IO thread’s arena to the main thread’s arena. We always allocate new memory in the main thread’s arena, but the memory released by IO threads may not yet have been reclaimed. This ultimately causes the fragmentation rate to be higher compared to creating and allocating entirely within a single thread. The following cases will lead to memory allocated by the IO thread being kept by the main thread. 1. string related commands: `append`, `getset`, `mset` and `set`. If `tryObjectEncoding()` does not change argv, we will keep it directly in the main thread, see the code in `tryObjectEncoding()` (specifically `trimStringObjectIfNeeded()`) 2. block related commands: the key names will be kept in `c->db->blocking_keys`. 3. watch command: the key names will be kept in `c->db->watched_keys`. 4. [s]subscribe command: the channel name will be kept in `serverPubSubChannels`. 5. script load command: the script will be kept in `server.lua_scripts`. 6. 
some module API: `RM_RetainString`, `RM_HoldString` Those issues will be handled in other PRs. ## Testing ### Functional Testing The commit with IO threads enabled has passed all TCL tests, but we made some changes: Client query buffer: In the original code, when using a reusable query buffer, ownership of the query buffer would be released after the command was processed. However, with IO threads enabled, the client transitions from an IO thread to the main thread for processing. This causes the ownership release to occur earlier than the command execution. As a result, when IO threads are enabled, the client's information will never indicate that a shared query buffer is in use. Therefore, we skip the corresponding query buffer tests in this case. Defragment: Add a new defragmentation test to verify the effect of IO threads on defragmentation. Command delay: For deferred clients in TCL tests, delays may occur because clients are assigned to different threads for execution. To address this, we introduced conditional waiting: the process proceeds to the next step only when the `client list` contains the corresponding commands. ### Sanitizer Testing The commit passed all TCL tests and reported no errors when compiled with the `-fsanitize=thread` and `-fsanitize=address` options enabled. But we made the following modification: we suppressed the sanitizer warnings for clients with watched keys when updating `client->flags`. IO threads read `client->flags` but never modify it or read the `CLIENT_DIRTY_CAS` bit, and the main thread only modifies this bit, so we believe there is no actual data race. ## Others ### IO thread number In the new multi-threaded design, the main thread is primarily focused on command processing to improve performance. Typically, the main thread does not handle regular client I/O operations but is responsible for clients such as replication and tracking clients. To avoid breaking changes, we still consider the main thread as the first IO thread. 
When the io-threads configuration is set to a low value (e.g., 2), performance does not show a significant improvement compared to a single-threaded setup for simple commands (such as SET or GET), as the main thread does not consume much CPU for these simple operations. This results in underutilized multi-core capacity. However, for more complex commands, having a low number of IO threads may still be beneficial. Therefore, it’s important to adjust the `io-threads` based on your own performance tests. Additionally, you can clearly monitor the CPU utilization of the main thread and IO threads using `top -H -p $redis_pid`. This allows you to easily identify where the bottleneck is. If the IO thread is the bottleneck, increasing the `io-threads` will improve performance. If the main thread is the bottleneck, the overall performance can only be scaled by increasing the number of shards or replicas. --------- Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: oranagra <oran@redislabs.com>	2024-12-23 14:16:40 +08:00
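The batched pausing described for `clientsCron` in this commit ("at most, 8 IO threads are paused at a time") can be sketched as a rotating range selector. This is a hedged illustration, not the server's implementation: `nextPauseBatch`, `threadRange`, the cursor argument, and the choice to rotate one batch per call are assumptions. The only details taken from the text are the batch size of 8 and the main thread counting as thread 0.

```c
#include <stddef.h>

#define PAUSE_BATCH 8 /* "at most, 8 IO threads are paused at a time" */

/* Inclusive range of IO thread ids; end < start means "nothing to pause".
 * Hypothetical type, used only for this sketch. */
typedef struct { int start; int end; } threadRange;

/* Pick the next batch of IO threads to pause. Thread 0 is assumed to be
 * the main thread, so batches cover ids 1 .. nthreads-1. `*cursor` keeps
 * the position between calls and wraps around once all threads were covered. */
static threadRange nextPauseBatch(int nthreads, int *cursor) {
    threadRange r = { 1, 0 };
    if (nthreads <= 1) return r;                 /* only the main thread exists */
    if (*cursor < 1 || *cursor >= nthreads) *cursor = 1;
    r.start = *cursor;
    r.end = r.start + PAUSE_BATCH - 1;
    if (r.end > nthreads - 1) r.end = nthreads - 1;
    *cursor = r.end + 1;                         /* wraps on the next call */
    return r;
}
```

A caller would pause the returned range (the commit names `pauseIOThreadsRange` and `resumeIOThreadsRange` for this), touch only the clients assigned to those threads, and then resume them.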
debing.sun	3fc7ef8f81	Fix race in stream-cgroups test (#13593 ) failed CI: https://github.com/redis/redis/actions/runs/11171608362/job/31056659165 https://github.com/redis/redis/actions/runs/11226025974/job/31205787575	2024-10-12 09:23:19 +08:00
Moti Cohen	5f28bd96db	Fix race in HFE tests (#13563 ) Test 1 - give more time for expiration Test 2 - Evaluate expiration time boundaries [+1,+2] before setting expiration [+1] Test 3 - Avoid race on test HFEs propagated to replica	2024-09-23 10:30:29 +03:00
Moti Cohen	9a89e32a95	HFE - Fix key ref by the hash on RENAME/MOVE/SWAPDB/RESTORE (#13539 ) If the hash previously had HFEs (hash-fields with expiration) but later no longer does, the key ref in the hash might become outdated after a MOVE, COPY, RENAME or RESTORE operation. These commands maintain the key ref only if HFEs are present. That is, we can only be sure that key ref is valid as long as the hash has HFEs.	2024-09-12 12:40:12 +03:00
Moti Cohen	569584d463	HFE - Simplify logic of HGETALL command (#13425 )	2024-09-05 12:48:44 +03:00
Zihao Lin	6ceadfb580	Improve GETRANGE command behavior (#12272 ) Fixed the issue about GETRANGE and SUBSTR command return unexpected result caused by the `start` and `end` out of definition range of string. --- ## break change Before this PR, when negative `end` was out of range (i.e., end < -strlen), we would fix it to 0 to get the substring, which also resulted in the first character still being returned for this kind of out of range. After this PR, we ensure that `GETRANGE` returns an empty bulk when the negative end index is out of range. Closes #11738 --------- Co-authored-by: debing.sun <debing.sun@redis.com>	2024-08-20 12:34:43 +08:00
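The difference described in this commit can be made concrete with a small sketch of the index arithmetic. It is an approximation written for illustration: `getrangeBounds` and its `strict_end` switch are hypothetical, and the actual patch may express the condition differently. Note that the change was later reverted (commit `0aeb86d78d` above), so `strict_end = false` corresponds to the behavior that remained.

```c
#include <stdbool.h>

/* Resolve GETRANGE-style indexes against a string of length `len`.
 * Returns false for an empty result; otherwise writes the inclusive
 * byte range [*s, *e]. `strict_end` enables the behavior described in
 * this PR: a negative end beyond the start of the string yields an
 * empty reply instead of being clamped to 0. Hypothetical sketch. */
static bool getrangeBounds(long long len, long long start, long long end,
                           bool strict_end, long long *s, long long *e) {
    if (len == 0) return false;
    if (start < 0 && end < 0 && start > end) return false;
    if (strict_end && end < -len) return false; /* negative end out of range */
    if (start < 0) start += len;
    if (end < 0) end += len;
    if (start < 0) start = 0;
    if (end < 0) end = 0;                       /* legacy clamp */
    if (end >= len) end = len - 1;
    if (start > end) return false;
    *s = start;
    *e = end;
    return true;
}
```

For a 5-byte string with `start = 0` and `end = -10`, the legacy path resolves to the range [0, 0] (the first character), while the strict path reports an empty result.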
debing.sun	2b88db90aa	Fix incorrect lag due to trimming stream via XTRIM command (#13473 ) ## Describe When using the `XTRIM` command to trim a stream, it does not update the maximal tombstone (`max_deleted_entry_id`). This leads to an issue where the lag calculation incorrectly assumes that there are no tombstones after the consumer group's last_id, resulting in an inaccurate lag. The reason XTRIM doesn't need to update the maximal tombstone is that it always trims from the beginning of the stream. This means that it consistently changes the position of the first entry, leading to the following scenarios: 1) First entry trimmed after maximal tombstone: If the first entry is trimmed to a position after the maximal tombstone, all tombstones will be before the first entry, so they won't affect the consumer group's lag. 2) First entry trimmed before maximal tombstone: If the first entry is trimmed to a position before the maximal tombstone, the maximal tombstone will not be updated. ## Solution Therefore, this PR optimizes the lag calculation by ensuring that when both the consumer group's last_id and the maximal tombstone are behind the first entry, the consumer group's lag is always equal to the number of remaining elements in the stream. Supplement to PR https://github.com/redis/redis/pull/13338	2024-08-16 23:13:31 +08:00
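The shortcut stated in the Solution paragraph of this commit can be sketched as follows. This is a simplified model, not the server's lag estimation code: `sid`, `sidCompare` and `lagShortcut` are hypothetical names, and only the rule itself (both the group's last_id and the maximal tombstone behind the first entry implies lag equals the stream length) comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a stream ID: milliseconds plus sequence. */
typedef struct { uint64_t ms; uint64_t seq; } sid;

/* Order two IDs: negative if a < b, zero if equal, positive if a > b. */
static int sidCompare(sid a, sid b) {
    if (a.ms != b.ms) return a.ms < b.ms ? -1 : 1;
    if (a.seq != b.seq) return a.seq < b.seq ? -1 : 1;
    return 0;
}

/* If both the group's last delivered ID and the maximal tombstone are
 * behind the first entry, every remaining entry is unread, so the lag is
 * the stream length. Returns false when the shortcut does not apply and
 * the caller has to fall back to another estimate. */
static bool lagShortcut(sid first, sid group_last, sid max_tombstone,
                        uint64_t length, uint64_t *lag) {
    if (sidCompare(group_last, first) < 0 &&
        sidCompare(max_tombstone, first) < 0) {
        *lag = length;
        return true;
    }
    return false;
}
```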
debing.sun	b94b714f81	Fix error message for XREAD command with wrong parameter (#13474 ) Fixes an omission from #13117. When the number of streams is incorrect, the error message for `XREAD` needs to include the '+' symbol.	2024-08-14 21:40:43 +08:00
Moti Cohen	806459f481	On HDEL last field with expiry, update global HFE DS (#13470 ) Hash field expiration is optimized to avoid frequently updating the global HFE DS on each field deletion. Eventually active expiration will run and update or remove the hash from the global HFE DS gracefully. Nevertheless, the "subexpiry" statistic might report a wrong number of hashes with HFE to the user if HDEL deletes the last field with expiration in a hash (while fields without expiration remain). Following this change, if HDEL deletes the last field with expiration in the hash, it also removes the hash from the global HFE DS.	2024-08-11 16:39:03 +03:00
debing.sun	93fb83b4cb	Fix incorrect lag field in XINFO when tombstone is after the last_id of consume group (#13338 ) Fix #13337 This PR fixes two bugs that caused lag calculation errors. 1. When the latest tombstone is before the first entry, the tombstone may still be after the last id of the consumer group. 2. When a tombstone is after the last id of the consumer group, the group's counter will be invalid, so we should calculate entries_read using estimates.	2024-07-30 22:31:31 +08:00
Moti Cohen	a84cc20aef	HFE - Fix statistic to count also lazy expired and rename INFO params (#13372 ) * INFO command : rename `hashes_with_expiry_fields` to `subexpiry` * INFO command : rename `expired_hash_fields` to `expired_subkeys` * Fix statistic of `expired_subkeys` to count also lazy expired * Remove TODOs comments leftover in TCL * Fix potential flaky test of rdb load of hash-field-expiration	2024-07-02 18:22:10 +03:00


440 commits