440 commits

Author SHA1 Message Date
debing.sun
18538461d1
Add separate statistics for active expiration of keys and hash fields (#14727)
### Summary

Adds `expired_keys_active` and `expired_subkeys_active` counters to
track keys and hash fields expired by the active expiration cycle,
distinguishing them from lazy expirations.
These new metrics are exposed in INFO stats output.

### Motivation

Currently, Redis tracks the total number of expired keys (expired_keys)
and expired hash fields (expired_subkeys), but there's no way to
differentiate between expirations triggered by active expire and lazy
expire.
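
A monitoring script could derive lazy-expiration counts by subtracting the new active counters from the existing totals. The sketch below parses an `INFO stats`-style payload. The sample numbers are invented, the helper names are hypothetical, and it assumes the totals (`expired_keys`, `expired_subkeys`) include actively expired items, which this summary implies but does not state outright.

```python
def parse_info(text):
    """Parse 'name:value' lines from an INFO-style payload into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        name, _, value = line.partition(":")
        if value.isdigit():
            stats[name] = int(value)
    return stats


def lazy_expirations(stats):
    """Return (lazy_keys, lazy_subkeys), assuming totals include active expirations."""
    return (
        stats["expired_keys"] - stats["expired_keys_active"],
        stats["expired_subkeys"] - stats["expired_subkeys_active"],
    )


# Invented sample values, for illustration only.
SAMPLE = """# Stats
expired_keys:120
expired_keys_active:90
expired_subkeys:40
expired_subkeys_active:25
"""
```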

---------

Co-authored-by: Moti Cohen <moti.cohen@redis.com>
2026-01-22 22:30:25 +08:00
Sergei Georgiev
221409788a
Add idempotency support to XADD via IDMPAUTO and IDMP parameters (#14615)
# Overview

This PR introduces idempotency support to Redis Streams' XADD command,
enabling automatic deduplication of duplicate message submissions
through optional IDMPAUTO and IDMP parameters with producer
identification. This enables reliable at-least-once delivery while
preventing duplicate entries in streams.

## Problem Statement

Current Redis Streams implementations lack built-in idempotency
mechanisms, making reliable at-least-once delivery impossible without
accepting duplicates:

- **Application-level tracking**: Developers must maintain separate data
structures to track submitted messages
- **Race conditions**: Network failures and retries can result in
duplicate stream entries
- **Complexity overhead**: Each producer must implement custom
deduplication logic
- **Memory inefficiency**: External deduplication systems duplicate
Redis's storage capabilities

This lack of native idempotency support creates reliability challenges
in distributed systems where at-least-once delivery semantics are
required but exactly-once processing is desired.

## Solution

Extends XADD with optional idempotency parameters that include producer
identification:

```
XADD key [NOMKSTREAM] [KEEPREF | DELREF | ACKED] [IDMPAUTO pid | IDMP pid iid] [MAXLEN | MINID [= | ~] threshold [LIMIT count]] <* | id> field value [field value ...]
```

### Producer ID (pid)

- **pid** (producer id): A unique identifier for each producer
- Must be unique per producer instance
- Producers must use the same pid after restart to access their
persisted idempotency tracking
- Enables per-producer idempotency tracking, isolating duplicate
detection between different producers

**Format**: Binary or string, recommended max 36 bytes

**Generation**: 
- **Recommended**: UUID v4 for globally unique identification
- **Alternative**: `hostname:process_id` or application-assigned IDs

### Idempotency Modes

**IDMPAUTO pid (Automatic Idempotency)**:

- Producer specifies its pid, Redis automatically calculates a unique
idempotent ID (iid) based on entry content
- Hash calculation combines XXH128 hashing of individual field-value
pairs using an order-independent Sum + XOR approach with rotation (each
pair: `XXH128(field || field_length || value)`)
- 16-byte binary iid with extremely low accidental collision probability
- XXH128 is a non-cryptographic hash function: fast and
well-distributed, but does NOT prevent intentional collision attacks
- For protection against adversarial collision crafting, use IDMP mode
with cryptographically-signed idempotent IDs
- Order-independent: field ordering does not affect the calculated iid
- If (pid, iid) pair exists in producer's IDMP map: returns existing
entry ID without creating duplicate entry
- Generally slower than manual mode due to hash calculation overhead
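
To illustrate why an order-independent combination works, here is a rough sketch, not the actual Redis implementation. It uses BLAKE2b truncated to 16 bytes as a stand-in for XXH128 (which is not in the Python standard library), and the rotation amounts and final mixing step are arbitrary assumptions. The function names are hypothetical.

```python
import hashlib

MASK128 = (1 << 128) - 1


def pair_hash(field: bytes, value: bytes) -> int:
    """128-bit hash of one field-value pair; the field length acts as a separator."""
    data = field + len(field).to_bytes(8, "little") + value
    # Stand-in for XXH128 (assumption): BLAKE2b with a 16-byte digest.
    return int.from_bytes(hashlib.blake2b(data, digest_size=16).digest(), "little")


def rotl128(x: int, r: int) -> int:
    """Rotate a 128-bit integer left by r bits (0 < r < 128)."""
    return ((x << r) | (x >> (128 - r))) & MASK128


def auto_iid(pairs) -> bytes:
    """Combine per-pair hashes so that pair order does not affect the result."""
    total, mixed = 0, 0
    for field, value in pairs:
        h = pair_hash(field, value)
        total = (total + h) & MASK128  # addition is commutative
        mixed ^= rotl128(h, 17)        # XOR is commutative as well
    return (total ^ rotl128(mixed, 64)).to_bytes(16, "little")
```

Using both a sum and an XOR matters in this sketch because XOR alone would let two identical pairs cancel each other out, while the sum still changes. The length separator keeps `("ab", "c")` and `("a", "bc")` from hashing the same bytes.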

**IDMP pid iid (Manual Idempotency)**:

- Caller provides explicit producer id (pid) and idempotent ID (iid) for
deduplication
- iid must be unique per message (either globally or per pid)
- Faster processing than IDMPAUTO (no hash calculation overhead)
- Enables shorter iids for reduced memory footprint
- If (pid, iid) pair exists in producer's IDMP map: returns existing
entry ID without comparing field contents
- Caller responsible for iid uniqueness and consistency across retries

Both modes can only be specified when entry ID is `*` (auto-generated).

### Deduplication Logic

When XADD is called with idempotency parameters:

1. Redis checks if the message was recently added to the stream based on
the (pid, iid) pair
2. If the (pid, iid) pair matches a recently-seen pair for that
producer, the message is assumed to be identical
3. No duplicate message is added to the stream; the existing entry ID is
returned
4. With **IDMP pid iid**: Redis does not compare the specified fields
and their values—two messages with the same (pid, iid) are assumed
identical
5. With **IDMPAUTO pid**: Redis calculates the iid from message content
and checks for duplicates

## IDMP Map: Per-Producer Time and Capacity-Based Expiration

Each producer with idempotency enabled maintains its own isolated IDMP
map (iid → entry_id) with dual expiration criteria:

**Time-based expiration (duration)**:

- Each iid expires automatically after duration seconds from insertion
- Provides operational guarantee: Redis will not forget an iid before
duration elapses (unless capacity reached)
- Configurable per-stream via XCFGSET

**Capacity-based expiration (maxsize)**:

- Each producer's map enforces maximum capacity of maxsize entries
- When capacity reached, oldest iids for that producer are evicted
regardless of remaining duration
- Prevents unbounded memory growth during extended usage
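
The interaction of the two expiration criteria can be modeled in a few lines. This is a toy model under stated assumptions, not the Redis data structure: the class and method names are hypothetical, time is passed in explicitly as non-decreasing integer seconds, and entry IDs are faked with a counter.

```python
from collections import OrderedDict


class IdmpMaps:
    """Toy model of per-producer iid -> entry_id maps with dual expiration."""

    def __init__(self, duration=100, maxsize=100):
        self.duration = duration  # seconds each iid is remembered
        self.maxsize = maxsize    # maximum iids tracked per producer
        self.maps = {}            # pid -> OrderedDict(iid -> (entry_id, inserted_at))
        self.seq = 0              # stand-in for stream entry ID generation

    def xadd(self, pid, iid, now):
        """Return (entry_id, is_duplicate) for a submission at time `now`."""
        m = self.maps.setdefault(pid, OrderedDict())
        # Time-based expiration: drop iids inserted `duration` or more seconds ago.
        while m:
            oldest_iid, (_, inserted_at) = next(iter(m.items()))
            if now - inserted_at < self.duration:
                break
            del m[oldest_iid]
        if iid in m:
            return m[iid][0], True  # duplicate: return the existing entry ID
        self.seq += 1
        entry_id = f"{now * 1000}-{self.seq}"
        m[iid] = (entry_id, now)
        # Capacity-based expiration: evict the oldest iids beyond `maxsize`.
        while len(m) > self.maxsize:
            m.popitem(last=False)
        return entry_id, False
```

Because each producer has its own map, the same iid submitted under two different pids is treated as two distinct messages.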

### Configuration Commands

**XINFO STREAM**: View current configuration and metrics

Use `XINFO STREAM key` to retrieve idempotency configuration
(idmp-duration, idmp-maxsize) along with tracking metrics.

**XCFGSET**: Configure expiration parameters

```
XCFGSET key [IDMP-DURATION duration] [IDMP-MAXSIZE maxsize]
```

- **duration**: Seconds to retain each iid (range: 1-86400 seconds)
- **maxsize**: Maximum iids to track per producer (range: 1-10,000
entries)
- Calling XCFGSET clears all existing producer IDMP maps for the stream

**Default Configuration** (when XCFGSET not called):

- Duration: 100 seconds
- Maxsize: 100 iids per producer
- Runtime configurable via: `stream-idmp-duration` and
`stream-idmp-maxsize`

## Response Behavior

**On first submission** (pid, iid) pair not in producer's map:

- Entry added to stream with generated entry ID
- (pid, iid) pair stored in producer's IDMP map with current timestamp
- Returns new entry ID

**On duplicate submission** (pid, iid) pair exists in producer's map:

- No entry added to stream
- Returns existing entry ID from producer's IDMP map
- Identical response to original submission (client cannot distinguish)

## Stream Metadata

XINFO STREAM extended with idempotency metrics and configuration:

- **idmp-duration**: The duration value (in seconds) configured for the
stream's IDMP map
- **idmp-maxsize**: The maxsize value configured for the stream's IDMP
map
- **pids-tracked**: Current number of producers with active IDMP maps
- **iids-tracked**: Current total number of iids across all producers'
IDMP maps (reflects active iids that haven't expired or been evicted)
- **iids-added**: Lifetime cumulative count of entries added with
idempotency parameters
- **iids-duplicates**: Lifetime cumulative count of duplicate iids
detected across all producers

## Persistence and Restart Behavior

**IDMP maps are fully persisted and restored across Redis restarts**:

- **RDB/AOF**: All pid-iid pairs, timestamps, and configuration are
included in snapshots and AOF logs
- **Recovery**: On restart, all tracked (pid, iid) pairs remain valid
and operational
- **Producer Requirement**: Producers must reuse the same pid after
restart to access their persisted IDMP map
- **Configuration**: Stream-level settings (duration, maxsize) persist
across restarts
- **Important**: Calling XCFGSET after restart clears restored IDMP maps
(same behavior as during runtime)

## Key Benefits

- **Enables At-most-once Producer Semantics**: Makes it possible to
safely retry message submissions without creating duplicates
- **Automatic Retry Safety**: Network failures and retries cannot create
duplicate entries
- **Producer Isolation**: Each producer maintains independent
idempotency tracking
- **Memory Efficient**: Time and capacity-based expiration per producer
prevents unbounded growth
- **Flexible Implementation**: Choose automatic (IDMPAUTO) or manual
(IDMP) based on performance needs
- **Backward Compatible**: Fully optional parameters with zero impact on
existing XADD behavior
- **Collision Resistant**: XXH128 with Sum + XOR combination and
field-length separators provides high-quality non-cryptographic hashing
for IDMPAUTO with extremely low collision probability and prevents
ambiguous concatenation attacks
2026-01-15 21:58:44 +08:00
Stav-Levi
73249497d4
Fix ACL key-pattern bypass in MSETEX command (#14659)
MSETEX doesn't properly check ACL key permissions for all keys - only
the first key is validated.

MSETEX arguments look like: MSETEX <numkeys> key1 val1 key2 val2 ... EX
seconds

Keys are at every 2nd position (step=2). When Redis extracts keys for
ACL checking, it calculates where the last key is:

Bug: last = first + numkeys - 1;        => calculation ignores step
Fix: last = first + (numkeys-1) * step;

With 2 keys starting at position 2:

Bug: last = 2 + 2 - 1 = 3 → only checks position 2
Fix: last = 2 + (2-1)*2 = 4 → checks positions 2 and 4
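
The arithmetic above can be checked with a small sketch. The function name and the `fixed` flag are hypothetical; it only models which argument positions end up being scanned for a given `last`, not the actual key-extraction code.

```python
def key_positions(first, numkeys, step, fixed=True):
    """Return the argv positions scanned for keys, given the computed `last`."""
    if fixed:
        last = first + (numkeys - 1) * step
    else:
        last = first + numkeys - 1  # pre-fix calculation: ignores step
    return list(range(first, last + 1, step))
```

With `first=2, numkeys=2, step=2`, the pre-fix calculation yields only position 2, while the fixed one yields positions 2 and 4.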

Fixes #14657
2026-01-08 08:41:55 +02:00
debing.sun
9ca860be9e
Fix XTRIM/XADD with approx not deletes entries for DELREF/ACKED strategies (#14623)
This bug was introduced by #14130 and found by guybe7.

When using XTRIM/XADD with approx mode (~) and DELREF/ACKED delete
strategies, if a node was eligible for removal but couldn't be removed
directly (because consumer group references need to be checked), the
code would incorrectly break out of the loop instead of continuing to
process entries within the node. This fix allows the per-entry deletion
logic to execute for eligible nodes when using non-KEEPREF strategies.
2026-01-05 21:17:36 +08:00
Stav-Levi
23aca15c8c
Fix the flexibility of argument positions in the Redis API's (#14416)
This PR implements flexible keyword-based argument parsing for all 12
hash field expiration commands, allowing users to specify arguments in
any logical order rather than being constrained by rigid positional
requirements.
This enhancement follows Redis's modern design of keyword-based flexible
argument ordering and significantly improves user experience.

Commands with Flexible Parsing
HEXPIRE, HPEXPIRE, HEXPIREAT, HPEXPIREAT, HGETEX, HSETEX

Some examples:
HEXPIRE: 
* All these are equivalent and valid:
HEXPIRE key EX 60 NX FIELDS 2 f1 f2
HEXPIRE key NX EX 60 FIELDS 2 f1 f2  
HEXPIRE key FIELDS 2 f1 f2 EX 60 NX
HEXPIRE key FIELDS 2 f1 f2 NX EX 60
HEXPIRE key NX FIELDS 2 f1 f2 EX 60

HGETEX:
* All these are equivalent and valid:
HGETEX key EX 60 FIELDS 2 f1 f2
HGETEX key FIELDS 2 f1 f2 EX 60

HSETEX:
* All these are equivalent and valid:
HSETEX key FNX EX 60 FIELDS 2 f1 v1 f2 v2
HSETEX key EX 60 FNX FIELDS 2 f1 v1 f2 v2
HSETEX key FIELDS 2 f1 v1 f2 v2 FNX EX 60
HSETEX key FIELDS 2 f1 v1 f2 v2 EX 60 FNX
HSETEX key FNX FIELDS 2 f1 v1 f2 v2 EX 60
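
A keyword-driven parser for the HEXPIRE forms shown above might look roughly like the following. It is an illustrative sketch in Python rather than the server's C code; the function name and result layout are hypothetical, and it handles only the tokens that appear in these examples plus the other condition flags (XX, GT, LT).

```python
def parse_hexpire_args(args):
    """Parse the tokens after the key; keyword clauses may come in any order."""
    result = {"ttl": None, "condition": None, "fields": None}
    i = 0
    while i < len(args):
        token = args[i].upper()
        if token == "EX":
            if i + 1 >= len(args):
                raise ValueError("EX requires a value")
            result["ttl"] = int(args[i + 1])
            i += 2
        elif token in ("NX", "XX", "GT", "LT"):
            result["condition"] = token
            i += 1
        elif token == "FIELDS":
            if i + 1 >= len(args):
                raise ValueError("FIELDS requires numfields")
            numfields = int(args[i + 1])
            fields = args[i + 2 : i + 2 + numfields]
            if numfields <= 0 or len(fields) != numfields:
                raise ValueError("numfields does not match the fields given")
            result["fields"] = fields
            i += 2 + numfields  # field names are consumed by count, not by keyword
        else:
            raise ValueError(f"unexpected token: {args[i]}")
    if result["ttl"] is None or result["fields"] is None:
        raise ValueError("missing EX or FIELDS clause")
    return result
```

Consuming the field names by count is what keeps the grammar unambiguous: a field literally named `EX` is still read as a field because it falls inside the `numfields` window.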
2025-12-14 09:35:12 +02:00
debing.sun
bb6389e823
Fix min_cgroup_last_id cache not updated when destroying consumer group (#14552)
## Problem

When destroying a consumer group with `XGROUP DESTROY`, the cached
`min_cgroup_last_id` was not being invalidated. This caused incorrect
behavior when using `XDELEX` with the `ACKED` option, as the cache still
referenced the destroyed group's `last_id`.

## Solution

Invalidate the `min_cgroup_last_id` cache when the destroyed group's
`last_id` equals the cached minimum. The cache will be recalculated on
the next call to `streamEntryIsReferenced()`.
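
The invalidation rule can be pictured with a toy model. The class below is hypothetical and represents stream IDs as `(ms, seq)` tuples; it only mirrors the idea of dropping the cached minimum when the destroyed group was the one that defined it.

```python
class StreamGroups:
    """Toy model of consumer groups with a cached minimum last_id."""

    def __init__(self):
        self.groups = {}        # group name -> last_id as (ms, seq)
        self.cached_min = None  # None means "invalid, recompute on demand"

    def create(self, name, last_id):
        self.groups[name] = last_id
        self.cached_min = None

    def min_last_id(self):
        """Return the minimum last_id, recomputing it lazily if the cache is invalid."""
        if self.cached_min is None and self.groups:
            self.cached_min = min(self.groups.values())
        return self.cached_min

    def destroy(self, name):
        last_id = self.groups.pop(name)
        # The fix: drop the cache if the destroyed group defined the minimum.
        if last_id == self.cached_min:
            self.cached_min = None
```

Without the check in `destroy`, `min_last_id()` would keep returning the destroyed group's `last_id`.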

---------

Co-authored-by: guybe7 <guy.benoish@redislabs.com>
2025-11-21 22:37:17 +08:00
Oran Agra
0a6eacff1f
Add variable key-spec flags to SET IF* and DELEX (#14529)
These commands behave like DEL and SET (blindly removing or overwriting)
when they don't get IF* flags, and require the value of the key when they
do run with these flags.

This change makes sure they have the VARIABLE_FLAGS flag and a getKeysProc
that provides the right flags depending on the arguments used. (The plain
flags used when arguments are unknown are the common-denominator ones.)

Move lookupKey call in DELEX to avoid double lookup, which also means
(some, namely arity) syntax errors are checked (and reported) before
checking the existence of the key.
2025-11-12 11:36:10 +02:00
Sergei Georgiev
90ba7ba4dc
Fix XREADGROUP CLAIM to return delivery metadata as integers (#14524)
### Problem
The XREADGROUP command with CLAIM parameter incorrectly returns delivery
metadata (idle time and delivery count) as strings instead of integers,
contradicting the Redis specification.

### Solution
Updated the XREADGROUP CLAIM implementation to return delivery metadata
fields as integers, aligning with the documented specification and
maintaining consistency with Redis response conventions.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-11-11 19:05:22 +08:00
Moti Cohen
d25e582a17
Fix flaky test of hfe persist rdb reload (#14525)
So far this has occurred once, in the daily run of the test-sanitizer-address job.
2025-11-10 17:15:37 +02:00
Moti Cohen
189b7609f5
Add hfe rdb load test (#14511)
Verify that fields keep their expiration time after an RDB load.
Verify that hashes that had HFEs are not counted in subexpiry after an
RDB load (as reported by the `info keyspace` command).
2025-11-09 09:49:54 +02:00
debing.sun
7f1bafc922
Fix XACKDEL stack overflow when IDs exceed STREAMID_STATIC_VECTOR_LEN (CVE-2025-62507)
This issue was introduced by redis/redis#14130.
The problem is that when the number of IDs exceeds STREAMID_STATIC_VECTOR_LEN (8), the code forgot to reallocate memory for the IDs array, which causes a stack overflow.
2025-11-05 15:33:34 +02:00
sggeorgiev
3e2003ee0f
Fix HGETEX out-of-bounds read when FIELDS option missing numfields argument
When the HGETEX command is used with the FIELDS option but without the required
numfields argument, the server would attempt to access an out-of-bounds argv index.

This PR adds a check to ensure numfields is present before accessing it,
returning an error if it is missing. Also includes a test case to cover this scenario.
2025-11-05 15:33:34 +02:00
debing.sun
e436a0e548
Enforce 16-char hex digest length and case-insensitive comparison for IFDEQ/IFDNE (#14502)
Fix https://github.com/redis/redis/issues/14496

This PR makes the following changes:
- DIGEST: Always return 16 hex characters with leading zeros
  Example: "00006c38adf31777" instead of "6c38adf31777"

- IFDEQ/IFDNE: Validate the digest must be exactly 16 characters

- IFDEQ/IFDNE: Use strcasecmp for case-insensitive hex comparison
  Both uppercase and lowercase hex digits now work identically
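
The three rules above can be expressed compactly. This is a sketch in Python rather than the server code; the function names are hypothetical, and the length check mirrors only what the bullet points state.

```python
def format_digest(value: int) -> str:
    """Render a 64-bit digest as exactly 16 lowercase hex characters."""
    return format(value & 0xFFFFFFFFFFFFFFFF, "016x")


def digests_equal(stored: str, supplied: str) -> bool:
    """Compare digests case-insensitively, rejecting input of the wrong length."""
    if len(supplied) != 16:
        raise ValueError("digest must be exactly 16 characters")
    return stored.lower() == supplied.lower()
```

Zero-padding matters here: without it, a digest with leading zero nibbles would be shorter than 16 characters and fail the length validation.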

---------

Co-authored-by: Marc Gravell <marc.gravell@gmail.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2025-11-03 16:59:50 +08:00
debing.sun
379fec1426
Use fixed position keys parameter for MSETEX command (#14470)
In PR https://github.com/redis/redis/pull/14434, we made the keys
parameter flexible, meaning it could appear anywhere among the command
arguments. However, this also made key parsing more complex, since we
could no longer determine the fixed position of key arguments.
Therefore, in this PR, we reverted it back to using fixed positions for
the keys.

This PR also addresses this
[comment](https://github.com/redis/redis/pull/14434#discussion_r2459282563).

---------

Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2025-10-27 17:20:29 +08:00
Stav-Levi
52ea47b792
Add MSETEX command (#14434)
Introduce a new command, MSETEX, that sets multiple string keys with a
shared expiration in a single atomic operation. The command also supports
flexible argument parsing.

Syntax:
MSETEX KEYS numkeys key value [key value …] [XX | NX] [EX seconds | PX
milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds |
KEEPTTL]

Sets the given keys to their respective values.
This command is an extension of MSETNX that adds expiration and XX
options.

Options: 
EX seconds - Set the specified expiration time, in seconds
PX milliseconds - Set the specified expiration time, in milliseconds
EXAT timestamp-seconds - Set the specified Unix time at which the keys
will expire, in seconds
PXAT timestamp-milliseconds - Set the specified Unix time at which the
keys will expire, in milliseconds
KEEPTTL - Retain the time to live associated with the keys
XX - Only set the keys and their expiration if all already exist
NX - Only set the keys and their expiration if none exist

Flexible Argument Parsing examples:
  - MSETEX EX 10 KEYS 2 k1 v1 k2 v2
  - MSETEX KEYS 2 k1 v1 k2 v2 NX PX 5000
  - MSETEX NX EX 10 KEYS 2 k1 v1 k2 v2
  
Return Values:
Integer reply: 1 - All keys were set successfully
Integer reply: 0 - No keys were set (due to NX/XX conditions)
Error reply - Syntax error or invalid arguments
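
The all-or-nothing behavior of NX and XX can be modeled over a plain dictionary. This is a simplified sketch, not the server implementation: the function name and store layout are hypothetical, only a relative TTL in seconds is modeled, and argument parsing is left out.

```python
def msetex(store, pairs, now, condition=None, ttl=None):
    """Toy MSETEX: set all pairs or none; returns 1 or 0.

    `store` maps key -> (value, expire_at or None).
    """
    def alive(key):
        entry = store.get(key)
        return entry is not None and (entry[1] is None or entry[1] > now)

    existing = [alive(k) for k in pairs]
    if condition == "NX" and any(existing):
        return 0  # at least one key exists: set nothing
    if condition == "XX" and not all(existing):
        return 0  # at least one key is missing: set nothing
    expire_at = None if ttl is None else now + ttl
    for key, value in pairs.items():
        store[key] = (value, expire_at)  # shared expiration for every key
    return 1
```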
2025-10-23 19:12:02 +03:00
sggeorgiev
090ca801ea
Add CLAIM parameter to XREADGROUP for automatic pending entry claiming (#14402)
## Overview
This PR enhances Redis Streams consumer groups by adding an optional
CLAIM parameter to the `XREADGROUP` command, enabling automatic claiming
of idle pending entries alongside normal message consumption in a single
operation.
## Problem Statement
Current Redis Streams consumer group implementations require developers
to manually orchestrate multiple commands to handle both pending and new
entries:

- `XPENDING` to discover idle pending entries
- `XCLAIM/XAUTOCLAIM` to claim idle entries
- `XREADGROUP` to consume new entries

This multi-command approach creates:
- **Performance overhead** from multiple round trips to Redis
- **Implementation complexity**, particularly when working with multiple
streams
- **Code duplication** across consumer implementations

## Solution
Extends XREADGROUP with a new optional CLAIM parameter:
`XREADGROUP GROUP group consumer [COUNT count] [BLOCK milliseconds]
[NOACK] [CLAIM min-idle-time] STREAMS key [key ...] id [id ...]`

When CLAIM min-idle-time is specified, the command operates in two
phases:

1. **Claim Phase:** Automatically claims pending entries idle for ≥
min-idle-time milliseconds
2. **Read Phase:** Processes new entries if the COUNT limit hasn't been
reached

## Response Format Changes
When the CLAIM option is used, the response format is extended to
include delivery metadata for each entry:
**Standard XREADGROUP response (without CLAIM):**
```
127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 STREAMS mystream >
1) 1) "mystream"
   2) 1) 1) "1609459200000-0"
         2) 1) "field1"
            2) "value1"
```
**XREADGROUP response with CLAIM:**
```
127.0.0.1:6379> XREADGROUP GROUP mygroup consumer1 CLAIM 30000 STREAMS mystream >
1) 1) "mystream"
   2) 1) 1) "1609459200000-0"
         2) 1) "field1"
            2) "value1"
         3) 15000
         4) 3
```
**Response structure with CLAIM:**

- **Field 1:** Stream entry ID (unchanged)
- **Field 2:** Field-value pairs (unchanged)
- **Field 3:** Idle time in milliseconds - the number of milliseconds
elapsed since this entry was last delivered to a consumer
- **Field 4:** Delivery count - the number of times this entry has been
delivered:
  
  - `0` for new messages that haven't been delivered before
  - `1+` for claimed messages (previously unacknowledged entries)

**Purpose of the new fields:**
These fields enable intelligent client-side processing decisions:

- Idle time enables time-based escalation strategies, detection of stuck
messages, and priority processing for critically delayed work
- Delivery count enables retry limits, dead-letter queue logic, poison
message detection, and alternative processing strategies based on
failure history

Together, these fields provide the visibility needed to build robust,
self-healing consumer systems without requiring additional XPENDING
queries.
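
As a rough illustration of such client-side decisions, the sketch below routes entries from a reply shaped like the CLAIM example above, represented as nested Python lists. The function name and the `max_deliveries` threshold are hypothetical, and no client library is involved.

```python
def route_entries(reply, max_deliveries=5):
    """Split an XREADGROUP ... CLAIM reply into entries to process and to dead-letter."""
    process, dead_letter = [], []
    for stream, entries in reply:
        for entry in entries:
            entry_id, flat = entry[0], entry[1]
            # Metadata is present only when CLAIM was used with the `>` ID.
            idle_ms, deliveries = (entry[2], entry[3]) if len(entry) >= 4 else (0, 0)
            fields = dict(zip(flat[0::2], flat[1::2]))
            record = (stream, entry_id, fields, idle_ms, deliveries)
            if deliveries >= max_deliveries:
                dead_letter.append(record)  # too many attempts: likely a poison message
            else:
                process.append(record)
    return process, dead_letter
```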

**Note:** If the ID parameter is not `>`, the command returns entries
that are pending for the consumer, and the CLAIM option is ignored. In
this case, the response follows the standard format without the
additional delivery metadata fields.
## Key Benefits

- **Reduced Complexity:** Eliminates manual PEL management and
multi-command orchestration
- **Improved Performance:** Reduces round trips by 50-70% for workloads
processing both pending and new entries
- **Backward Compatibility:** Fully optional parameter with zero
breaking changes to existing behavior
- **Multi-Stream Support:** Works seamlessly across multiple streams in
a single command
- **Flexible Consumer Patterns:** Enables mixed consumer types within
the same group:
  -   Consumers without CLAIM that only handle new messages
  -   Consumers with CLAIM that process both pending and new entries

## Impact on Existing Commands
The XCLAIM and XAUTOCLAIM commands may benefit from the new pel_by_time
index for improved performance, but such optimizations require further
investigation and testing. Enhancements to XCLAIM and XAUTOCLAIM are
postponed for future work.

## Performance Benchmarks
### Latency Performance
Comprehensive performance testing demonstrates significant improvements
over the traditional XAUTOCLAIM approach:
**Test Methodology**
Two identical test scenarios were executed to compare XAUTOCLAIM against
XREADGROUP with CLAIM:
**Test Setup:**
1. Insert 20,000 messages into a stream
2. Read all messages with XREADGROUP to populate the pending entries
list (PEL)
3. Set IDLE time to 1100ms on 1,000 randomly selected pending messages
using XCLAIM
4. Set IDLE time to 50ms on all remaining 19,000 pending messages using
XCLAIM
5. Execute the target command with min-idle-time=1000ms and COUNT=1000
to claim the eligible messages
6. Repeat steps 3-5 for 1,000 iterations

**Test 1 - XAUTOCLAIM (Traditional Approach):**
```
XAUTOCLAIM Performance:
  Average:    54.671ms
  Median:     53.582ms
  Min:        3.738ms
  Max:        71.596ms
  P95:        62.536ms
  P99:        68.800ms
```
**Test 2 - XREADGROUP with CLAIM (New Approach):**
```
XREADGROUP CLAIM Performance:
  Average:    2.426ms
  Median:     2.571ms
  Min:        1.287ms
  Max:        4.653ms
  P95:        3.370ms
  P99:        4.212ms
```
**Performance Analysis**
The new XREADGROUP CLAIM implementation delivers **22.5x faster average
performance** compared to XAUTOCLAIM:
- **Average latency reduction:** 95.6% (54.671ms → 2.426ms)
- **Median latency reduction:** 95.2% (53.582ms → 2.571ms)
- **P95 latency reduction:** 94.6% (62.536ms → 3.370ms)
- **P99 latency reduction:** 93.9% (68.800ms → 4.212ms)

This performance improvement is achieved through the time-ordered PEL
index (pel_by_time), which enables O(log n + k) retrieval of idle
entries versus XAUTOCLAIM's less efficient scanning approach.
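
The idea behind a time-ordered index can be sketched with a sorted list standing in for the rax tree. This is an analogy under stated assumptions, not the actual structure: the class name is hypothetical, delivery times are integer milliseconds, and list insertion here is O(n) where a tree would be logarithmic. The lookup still shows the shape of an O(log n + k) query: one binary search, then a slice of the k oldest entries.

```python
import bisect


class PelByTime:
    """Toy time-ordered index of pending entries: sorted (delivery_time_ms, entry_id)."""

    def __init__(self):
        self.index = []

    def add(self, delivery_time_ms, entry_id):
        bisect.insort(self.index, (delivery_time_ms, entry_id))

    def claimable(self, now_ms, min_idle_ms, count):
        """Return up to `count` entry IDs idle for at least `min_idle_ms`."""
        cutoff = now_ms - min_idle_ms
        # A 1-tuple sorts before any (cutoff + 1, id) and after any (cutoff, id),
        # so `end` is the number of entries delivered at or before `cutoff`.
        end = bisect.bisect_left(self.index, (cutoff + 1,))
        return [entry_id for _, entry_id in self.index[: min(end, count)]]
```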

### Memory Performance
To evaluate the memory overhead of the pel_by_time index, comprehensive
memory testing was conducted comparing Redis with and without the index
under realistic workload conditions.

**Test Methodology:**
- Insert 200,000 new messages into a stream
- Read messages in blocks of 100 using XREADGROUP (populating the PEL
with 200,000 pending entries)
- Wait 5ms after each read block (simulating realistic processing delays
that affect rax tree compression)
- Measure memory usage before and after the reading phase

**Test Results - Without pel_by_time Index:**
```
Initial memory (used):                                   926.10 KB
After insertion (used):                                    6.80 MB
After reading (used):                                     41.53 MB
Memory increase from data:                                 5.90 MB
Memory increase from reading:                             34.72 MB
Total memory increase:                                    40.62 MB
```
**Test Results - With pel_by_time Index:**
```
Initial memory (used):                                   927.44 KB
After insertion (used):                                    6.81 MB
After reading (used):                                     45.07 MB
Memory increase from data:                                 5.90 MB
Memory increase from reading:                             38.27 MB
Total memory increase:                                    44.17 MB
```
**Memory Performance Analysis:**
The pel_by_time index introduces a measurable but reasonable memory
overhead:

**Used Memory Impact:**
- Memory increase from pel_by_time index: **3.55 MB** (38.27 MB - 34.72
MB)
- Per-entry overhead: **18.6 bytes** (3.55 MB / 200,000 entries)
- Percentage overhead: **8.7%** increase in total memory usage

**Per-Entry Memory Breakdown:**
The nominal cost of the pel_by_time index is 32 bytes per entry
(composite key only, no node values). The observed overhead of 18.6 bytes
per entry is below that nominal figure, suggesting that effective rax
tree compression is occurring despite the 5ms delays between reads.
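
The per-entry and percentage figures can be reproduced from the measurements above. The calculation assumes that "MB" in the test output means 2^20 bytes, which is what makes the numbers match:

```latex
\[
\frac{3.55 \times 2^{20}\ \text{bytes}}{200{,}000\ \text{entries}} \approx 18.6\ \text{bytes/entry},
\qquad
\frac{44.17 - 40.62}{40.62} \approx 8.7\%
\]
```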

## Technical Implementation

### New Data Structure: Time-Ordered PEL Index (`pel_by_time`)
To efficiently identify and claim idle pending entries, this PR
introduces a new rax tree structure to the consumer group
implementation:
**Structure Design:**

- Tree Type: Rax tree named pel_by_time added to each consumer group
- Key Composition: 32-byte composite key consisting of:
  - `delivery_time` (timestamp when entry was last delivered)
  - `streamId` (stream entry ID)

**Key Format:** `delivery_time` + `streamId` (concatenated)
**Node Value:** None - all necessary information is encoded in the key
itself for memory efficiency
**Key Properties:**
_Uniqueness Guarantee:_ While multiple pending entries may share the
same `delivery_time`, the `streamId` component ensures each key is
globally unique within the tree.
_Lexicographical Ordering:_ The rax tree naturally orders nodes
lexicographically by key. Since `delivery_time` forms the prefix of each
key, entries are automatically sorted by delivery time, with oldest
entries appearing first in the tree.
_Efficient Range Operations:_ This time-based ordering enables highly
efficient range searches. To find all entries idle for at least
`min-idle-time` milliseconds, we simply perform a range query from the
tree's beginning up to `current_time - min-idle-time`.
**Fast Retrieval:** 
Once idle entries are identified via the `pel_by_time` index, the
embedded `streamId` in each key is used to quickly retrieve the full
pending message data structure for the subsequent `XREADGROUP` claim
operation.
**Performance Characteristics:**
- **Insertion:** O(log n) when adding entries to PEL
- **Range Search:** O(log n + k) where k is the number of idle entries
found
- **Memory Overhead:** 32 bytes per pending entry for the index key (no
additional node values stored)

This dual-index approach (existing PEL structures plus the new
time-ordered index) allows XREADGROUP with CLAIM to efficiently identify
claimable entries without scanning the entire PEL, making the operation
suitable for consumer groups with large pending entry lists.
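
To illustrate why a `delivery_time` prefix gives time ordering "for free", here is a minimal sketch of a composite key encoded in big-endian order, so that byte-wise comparison matches numeric comparison. The function names and the 24-byte layout (three 64-bit fields) are assumptions made for this sketch only; they are not taken from the PR, which describes a 32-byte key.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout for this sketch: three 64-bit big-endian fields.
 * The field widths are an assumption, not the PR's actual layout. */
#define PEL_TIME_KEY_LEN 24

/* Store v at p in big-endian order so that memcmp() order == numeric order. */
static void put_be64(unsigned char *p, uint64_t v) {
    for (int i = 7; i >= 0; i--) {
        p[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Build delivery_time || stream_ms || stream_seq. */
void pel_time_key_encode(unsigned char out[PEL_TIME_KEY_LEN],
                         uint64_t delivery_time_ms,
                         uint64_t stream_ms, uint64_t stream_seq) {
    put_be64(out, delivery_time_ms);
    put_be64(out + 8, stream_ms);
    put_be64(out + 16, stream_seq);
}

/* Lexicographic comparison, the order in which a radix tree visits keys. */
int pel_time_key_cmp(const unsigned char *a, const unsigned char *b) {
    return memcmp(a, b, PEL_TIME_KEY_LEN);
}

/* Upper bound for "idle for at least min_idle_ms": every key whose
 * delivery_time is <= now - min_idle sorts at or before this key. */
void pel_time_key_idle_bound(unsigned char out[PEL_TIME_KEY_LEN],
                             uint64_t now_ms, uint64_t min_idle_ms) {
    uint64_t cutoff = now_ms > min_idle_ms ? now_ms - min_idle_ms : 0;
    pel_time_key_encode(out, cutoff, UINT64_MAX, UINT64_MAX);
}
```

A range scan from the smallest key up to the bound produced by `pel_time_key_idle_bound` then visits exactly the entries that have been idle long enough, oldest first. A little-endian encoding would break this property, because byte order and numeric order would no longer agree.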

### COUNT Behavior with CLAIM
When the `COUNT` option is used in conjunction with `CLAIM`, the command
follows a two-phase execution strategy to make full use of the specified
count limit:
**Phase 1:** Claim Idle Pending Entries

- Retrieve claimable pending entries (idle for ≥ min-idle-time) up to
the COUNT limit
- These entries are claimed and returned to the consumer

**Phase 2:** Fetch New Messages (if needed)

- If the `COUNT` limit has not been satisfied by claimed pending
entries, the command proceeds to read new messages from the stream
- New messages are fetched up to the remaining available count:
`remaining_count = COUNT - claimed_entries`

This prioritization ensures that idle pending entries are always
processed first, preventing indefinite message stalling while still
allowing consumers to process new messages efficiently when pending
entries are scarce.
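
The two phases amount to splitting one budget between claimed and new entries. A minimal sketch of that arithmetic follows; `count_split` and `split_count` are hypothetical names used only for illustration:

```c
#include <stddef.h>

/* Hypothetical result of splitting a COUNT budget between the two phases. */
typedef struct {
    size_t claimed; /* idle pending entries served in phase 1 */
    size_t fresh;   /* new stream entries served in phase 2 */
} count_split;

/* claimable: pending entries idle for >= min-idle-time.
 * new_available: entries not yet delivered to the group. */
count_split split_count(size_t count, size_t claimable, size_t new_available) {
    count_split r;
    r.claimed = claimable < count ? claimable : count;
    size_t remaining = count - r.claimed; /* COUNT - claimed_entries */
    r.fresh = new_available < remaining ? new_available : remaining;
    return r;
}
```

For example, with COUNT 10, 3 claimable entries and plenty of new messages, the split is 3 claimed plus 7 new; with 25 claimable entries, all 10 slots go to claimed entries and no new messages are read.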


### BLOCK Behavior with CLAIM
When the CLAIM option is used in conjunction with the BLOCK option, the
command exhibits sophisticated blocking behavior that responds to both
new messages and pending entries becoming claimable:

**Blocking State Management:**
If there are no immediately claimable pending entries and no new
messages available in the stream, the `XREADGROUP` command enters a
blocking state for the specified duration. However, the implementation
must handle a critical scenario: pending entries that become idle (and
thus claimable) while the command is blocked must trigger an early
wakeup to serve those entries.

**Implementation: `stream_claim_pending_keys` Dictionary**
To enable this reactive blocking behavior, a new
`stream_claim_pending_keys` dictionary is introduced to the `redisDb`
structure:
- **Key:** Stream key being watched
- **Value:** The minimum timestamp when the next pending entry in this
stream will become claimable (i.e., will satisfy the min-idle-time
requirement)

**Multi-Client Coordination:**
When multiple XREADGROUP commands with BLOCK and CLAIM are executed
concurrently on the same stream, the dictionary value stores the
shortest claimable time across all waiting clients. This ensures the
earliest possible wakeup when any pending entry becomes available for
claiming.

**Wakeup Mechanism: `handleClaimableStreamEntries`**
The `handleClaimableStreamEntries` function is invoked regularly from
`blockedBeforeSleep` to monitor and react to claimable entries:
1. **Scan Phase:** Iterates through all entries in the
`stream_claim_pending_keys` dictionary
2. **Time Check:** Compares each entry's claimable timestamp against the
current time
3. **Signal Phase:** When `claimable_time ≤ current_time`, calls
`signalKeyAsReady` to wake up all clients blocked on that stream
4. **Client Processing:** Awakened clients attempt to claim and process
the newly available pending entries

**Resource Contention Handling:**
When the number of claimable entries is insufficient to satisfy all
awakened clients:

- Clients that successfully claim entries complete their operations
- Remaining clients recalculate the next minimum claimable time based on
remaining pending entries
- These clients update the `stream_claim_pending_keys` dictionary with
the new timestamp
- They re-enter the blocking state to wait for the next batch of
claimable entries

This design ensures fair resource distribution and prevents busy-waiting
while maintaining responsiveness to both new messages and aging pending
entries.
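
The registration and scan logic described above can be sketched roughly as follows. This is a simplified model, not the PR's code: a plain array stands in for the `stream_claim_pending_keys` dictionary, a `signaled` flag stands in for the call to `signalKeyAsReady`, and all type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of stream_claim_pending_keys. */
typedef struct {
    const char *key;       /* stream key being watched */
    uint64_t claimable_ms; /* earliest time an entry becomes claimable */
    int signaled;          /* set when the key is marked ready */
} claim_watch;

/* Multi-client coordination: keep the earliest wakeup time for a key. */
void claim_watch_register(claim_watch *w, uint64_t claimable_ms) {
    if (claimable_ms < w->claimable_ms) w->claimable_ms = claimable_ms;
}

/* Scan all watched keys and mark the due ones as ready.
 * Returns the number of keys that are due at now_ms. */
size_t claim_watch_scan(claim_watch *watches, size_t n, uint64_t now_ms) {
    size_t ready = 0;
    for (size_t i = 0; i < n; i++) {
        if (watches[i].claimable_ms <= now_ms) {
            watches[i].signaled = 1; /* the real code would signal the key */
            ready++;
        }
    }
    return ready;
}
```

Storing only the minimum timestamp per key keeps the scan proportional to the number of watched streams rather than the number of blocked clients.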
2025-10-21 20:35:43 +08:00
Mincho Paskalev
aed879ad0a
Optimistic locking for string objects - compare-and-set and compare-and-delete (#14435)
# Description

Add optimistic locking for string objects via compare-and-set and
compare-and-delete mechanism.

## What's changed

Introduction of new DIGEST command for string objects calculated via
XXH3 hash.

Extend SET command with new parameters supporting optimistic locking.
The new value is set only if checks against a given (old) value or a
given string digest pass.

Introduction of new DELEX command to support conditionally deleting a
key. Conditions are also checks against string value or string digest.

## Motivation

For developers who need to implement compare-and-set and
compare-and-delete single-key optimistic concurrency control, this PR
provides a single-command implementation.

Compare-and-set and compare-and-delete are mostly used for [Optimistic
concurrency
control](https://en.wikipedia.org/wiki/Optimistic_concurrency_control):
a client (1) fetches the value, keeps the old value (or its digest, for
a large string) in memory, (2) manipulates a local copy of the value,
(3) applies the local changes to the server, but only if the server’s
value hasn’t been changed (still equal to the old value).

Note that compare-and-set [can also be
implemented](https://redis.io/docs/latest/develop/using-commands/transactions/#optimistic-locking-using-check-and-set)
with WATCH … MULTI … EXEC and Lua scripts. The new SET optional
arguments and the DELEX command do not enable new functionality,
however, they are much simpler and faster to use for the very common use
case of single-key optimistic concurrency control.

## Related issues and PRs

https://github.com/redis/redis/issues/12485
https://github.com/redis/redis/pull/8361
https://github.com/redis/redis/pull/4258

## Description of the new commands

### DIGEST

```
DIGEST key
```

Get the hash digest of the value stored in key, as a hex string.

Reply:
- Null if key does not exist
- error if key exists but holds a value which is not a string
- (bulk string) the XXH3 digest of the value stored in key, as a hex
string

### SET

```
SET key value [NX | XX | IFEQ match-value | IFNE match-value | IFDEQ match-digest | IFDNE match-digest] [GET] [EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT unix-time-milliseconds | KEEPTTL]
```

`IFEQ match-value` - Set the key’s value and expiration only if its
current value is equal to match-value. If key doesn’t exist - it won’t
be created.
`IFNE match-value` - Set the key’s value and expiration only if its
current value is not equal to match-value. If key doesn’t exist - it
will be created.
`IFDEQ match-digest` - Set the key’s value and expiration only if the
digest of its current value is equal to match-digest. If key doesn’t
exist - it won’t be created.
`IFDNE match-digest` - Set the key’s value and expiration only if the
digest of its current value is not equal to match-digest. If key doesn’t
exist - it will be created.

Reply update:
- If GET was not specified:
   - Nil reply if either
      - the key doesn’t exist and XX/IFEQ/IFDEQ was specified. The key was
        not created.
      - the key exists, and NX was specified or a specified
        IFEQ/IFNE/IFDEQ/IFDNE condition is false. The key was not set.
   - Simple string reply: OK: The key was set.
- If GET was specified, any of the following:
   - Nil reply: The key didn't exist before this command (whether the key
     was created or not).
   - Bulk string reply: The previous value of the key (whether the key was
     set or not).
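
The interaction between the conditions and a missing key is easy to get wrong, so here is a small decision function that mirrors the rules above. It is only a sketch: the enum and function names are hypothetical, and `strcmp` is used for brevity even though real Redis values are binary-safe. IFDEQ and IFDNE follow the same rules as IFEQ and IFNE, applied to digests instead of values.

```c
#include <stdbool.h>
#include <string.h>

/* Condition names mirror the SET options; the enum itself is hypothetical. */
typedef enum { COND_NONE, COND_NX, COND_XX, COND_IFEQ, COND_IFNE } set_cond;

/* current == NULL means the key does not exist.
 * Returns true if the SET is allowed to proceed. */
bool set_condition_passes(set_cond cond, const char *current, const char *match) {
    bool exists = current != NULL;
    switch (cond) {
    case COND_NONE: return true;
    case COND_NX:   return !exists;
    case COND_XX:   return exists;
    case COND_IFEQ: return exists && strcmp(current, match) == 0;  /* never creates */
    case COND_IFNE: return !exists || strcmp(current, match) != 0; /* may create */
    }
    return false;
}
```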

### DELEX

```
DELEX key [IFEQ match-value | IFNE match-value | IFDEQ match-digest | IFDNE match-digest]
```

Conditionally removes the specified key. A key is ignored if it does not
exist.

`IFEQ match-value` - Delete the key only if its value is equal to
match-value
`IFNE match-value` - Delete the key only if its value is not equal to
match-value
`IFDEQ match-digest` - Delete the key only if the digest of its value is
equal to match-digest
`IFDNE match-digest` - Delete the key only if the digest of its value is
not equal to match-digest

Reply: 
- error if key exists but holds a value that is not a string and
IFEQ/IFNE/IFDEQ/IFDNE is specified.
- (integer) 0 if not deleted (the key does not exist or a specified
IFEQ/IFNE/IFDEQ/IFDNE condition is false), or 1 if deleted.

### Notes

Added copy of xxhash repo to deps -
[version](c961fbe61a)

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Yuan Wang <wangyuancode@163.com>
2025-10-21 10:32:49 +03:00
Moti Cohen
5b49119236
Fix crash in lookupKey() when executing_client is NULL (#14415)
This PR is based on:
https://github.com/valkey-io/valkey/pull/2347

This was introduced in https://github.com/redis/redis/pull/13512

The server crashes with a null pointer dereference when lookupKey() is
called from handleClientsBlockedOnKey(). The crash occurs because
server.executing_client is NULL, but the code attempts to access
server.executing_client->cmd->proc without checking.

**Crash scenario:**
1. Client 1 enables CLIENT NO-TOUCH
2. Client 2 blocks on BRPOP mylist 0
3. Client 1 executes RPUSH mylist elem
4. When unblocking Client 2, lookupKey() dereferences NULL
server.executing_client → crash

**Solution**
Added proper null checks before dereferencing server.executing_client:
- Check if LOOKUP_NOTOUCH flag is already set before attempting to modify
it
- Verify both server.current_client and server.executing_client are not
NULL before accessing their members
- Maintain the TOUCH command exception for scripts
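
The shape of the guarded check can be sketched as below. The structures are trimmed-down, hypothetical stand-ins for the server's client and command types, and the behavior when the executing client is NULL (leaving the no-touch flag unset) is an assumption of this sketch rather than something stated above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the server structures. */
typedef enum { CMD_TOUCH, CMD_OTHER } command_id;
typedef struct { command_id id; } command;
typedef struct { bool no_touch; command *cmd; } client;

/* Decide whether a lookup should skip updating the key's access time.
 * current and executing may both be NULL, e.g. while unblocking a client. */
bool lookup_should_skip_touch(const client *current, const client *executing,
                              bool already_no_touch) {
    if (already_no_touch) return true; /* flag already set: nothing to decide */
    return current != NULL && current->no_touch &&
           executing != NULL && executing->cmd != NULL &&
           executing->cmd->id != CMD_TOUCH; /* TOUCH must still touch */
}
```

Every pointer is tested before it is dereferenced, so the unblocking path with no executing client falls through to a plain `false` instead of crashing.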

**Testing**
Added regression test in tests/unit/type/list.tcl that reproduces and
verifies the fix for this crash scenario.

This fix is based on valkey-io/valkey#2347

Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
2025-10-13 12:12:38 +03:00
张宇杭
083f38ef5a
Fix issues with server.allow_access_expired (#14262)
Close https://github.com/redis/redis/issues/14214

1. When the server.allow_access_expired flag is set to 1, it allows
access to expired keys that have not yet been evicted. All places
involving access to expired keys should consider the impact of this
parameter.
2. The modifications involve five functions: hfieldIsExpired,
hashTypeNext, hashTypeLength, keyIsExpired, and hashTypeIsExpired. When
the server.allow_access_expired flag is set to 1, these functions do not
skip expired keys; otherwise they behave as before.
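
In simplified form, each of these checks gains an early exit on the flag. The sketch below uses a hypothetical free function with explicit parameters instead of the global server state, and treats a negative expiration time as "no TTL":

```c
#include <stdbool.h>

/* Hypothetical, parameterized version of an expiration check.
 * expire_at_ms < 0 means the key has no expiration set. */
bool key_is_expired(bool allow_access_expired, long long expire_at_ms,
                    long long now_ms) {
    if (allow_access_expired) return false; /* expired keys stay visible */
    if (expire_at_ms < 0) return false;     /* no TTL: never expires */
    return now_ms > expire_at_ms;
}
```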

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-10-12 11:02:03 +08:00
Moti Cohen
9b63e99d05
Refactor HFE: Introduce Per-Slot Expiration Store (estore) (#14294)
Hash field expiration is managed with two levels of data structures.

1. At the DB level, an ebuckets structure maintains the set of all
hashes that contain fields with expiration.
2. At the per-hash level, an ebuckets structure tracks fields with
expiration.

This pull request refactors the 1st level to operate per slot instead,
and introduces a new API called estore (expiration store). Its design
aligns closely with the existing kvstore API, ensuring consistency and
simplifying usage. The terminology at that level has been updated from
“HFE” or “hexpire” to “subexpiry”, reflecting a broader scope that can
later support other data types.
2025-09-11 16:45:17 +03:00
debing.sun
60adba48aa
Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326)
This PR is based on https://github.com/valkey-io/valkey/pull/1303

This PR introduces a DEBUG_DEFRAG compilation option that enables
activedefrag functionality even when the allocator is not jemalloc, and
always forces defragmentation regardless of the amount or ratio of
fragmentation.

## Using
```
make SANITIZER=address DEBUG_DEFRAG=<force|fully>
./runtest --debug-defrag
```

* DEBUG_DEFRAG=force
   * Ignore the threshold for defragmentation to ensure that
defragmentation is always triggered.
   * Always reallocate pointers to probe for correctness issues in pointer
reallocation.

* DEBUG_DEFRAG=fully
   * Includes everything in the option `force`.
   * Additionally performs a full defrag on every defrag cycle, which is
significantly slower but more accurate.

---------

Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: oranagra <oran@redislabs.com>
2025-09-10 12:52:20 +08:00
Giuseppe Coco
5f8e7852f4
Fix: Validate ENTRIESREAD in XGROUP command (#14259)
Fixes #14257

The XGROUP CREATE and SETID subcommands allowed setting an ENTRIESREAD
value greater than the stream's total `entries_added` counter. This
could lead to a logically inconsistent state.

This commit adds a check to ensure the provided ENTRIESREAD value is not
greater than the number of entries ever added to the stream. If
ENTRIESREAD is too large, it gets set to the total number of entries in
the stream, i.e. `s->entries_added`.
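
The check is a simple clamp. A minimal sketch, with a hypothetical function name:

```c
/* Clamp a requested ENTRIESREAD value to the stream's entries_added counter. */
long long clamp_entries_read(long long requested, long long entries_added) {
    return requested > entries_added ? entries_added : requested;
}
```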
2025-09-01 08:36:38 +08:00
Moti Cohen
e6c261f3fb
Fix MEMORY USAGE command (#14288)
After the key-value unification (kvobj), the MEMORY USAGE command may no
longer account for the embedded key length stored within the kvobj. To
fix this, replace sizeof(*o) with zmalloc_size((void *)o) to ensure the
full allocated size is measured.

As part of this change, the function objectComputeSize() was renamed to
kvobjComputeSize() and modified to compute the size of the key together
with its value, rather than the value alone.
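
The difference between `sizeof(*o)` and the allocated size is easiest to see with a flexible array member. In this sketch the struct is a hypothetical stand-in for an object with an embedded key, and the allocation size is recorded by hand because portable C has no equivalent of `zmalloc_size`:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical object with the key embedded after the fixed header. */
typedef struct {
    unsigned type;
    size_t alloc_size; /* bytes requested from the allocator */
    char key[];        /* embedded key, not counted by sizeof */
} embedded_obj;

/* Allocate header and key in one block; returns NULL on failure. */
embedded_obj *embedded_obj_new(const char *key) {
    size_t keylen = strlen(key) + 1; /* include the terminating NUL */
    size_t total = sizeof(embedded_obj) + keylen;
    embedded_obj *o = malloc(total);
    if (o == NULL) return NULL;
    o->type = 0;
    o->alloc_size = total;
    memcpy(o->key, key, keylen);
    return o;
}
```

Here `sizeof(*o)` covers only the fixed header, while `o->alloc_size` also includes the embedded key, which is the part the old accounting missed.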
2025-08-20 13:54:45 +03:00
debing.sun
b9d9d4000b
Prevent crash when cgroups_ref is null in streamEntryIsReferenced() after reload (#14276)
This bug was introduced by https://github.com/redis/redis/pull/14130
found by @oranagra

### Summary

`s->cgroups_ref` is created at runtime the first time a consumer
group is linked with a message, and it is not released when all
references are removed.

However, after `debug reload` or restart, if the PEL is empty (meaning
no consumer group is referencing any message), `s->cgroups_ref` will not
be recreated.

As a result, when executing XADD or XTRIM with the `ACKED` option and
checking whether a message that is being read but has not been ACKed can
be deleted, the NULL `cgroups_ref` will cause a crash.

### Code Path
```
xaddCommand -> streamTrim -> streamEntryIsReferenced
```

### Solution

Check if `s->cgroups_ref` is NULL in streamEntryIsReferenced().
2025-08-15 15:15:16 +08:00
debing.sun
bec644aab1
Fix missing kvobj reassignment after reallocation in MOVE command (#14233)
Introduced by https://github.com/redis/redis/issues/13806

Fixed a crash in the MOVE command when moving hash objects that have
both key expiration and field expiration.

The issue occurred in the following scenario:
1. A hash has both key expiration and field expiration.
2. During MOVE command, `setExpireByLink()` is called to set the
expiration time for the target hash, which may reallocate the kvobj of
hash.
3. Since the hash has field expiration, `hashTypeAddToExpires()` is
called to update the minimum field expiration time

Issue:
However, the kvobj pointer wasn't updated with the return value from
`setExpireByLink()`, causing `hashTypeAddToExpires()` to use freed
memory.
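
The underlying pattern is the usual rule for any function that may reallocate its argument: keep using the returned pointer, never the old one. A minimal sketch with hypothetical types and names (this is not the kvobj layout):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical object whose payload is followed by an optional expire time. */
typedef struct {
    int has_expire;
    size_t len;     /* bytes of payload */
    char payload[];
} sketch_obj;

sketch_obj *sketch_obj_new(const char *s) {
    size_t len = strlen(s);
    sketch_obj *o = malloc(sizeof(sketch_obj) + len);
    if (o == NULL) return NULL;
    o->has_expire = 0;
    o->len = len;
    memcpy(o->payload, s, len);
    return o;
}

/* May move the object: the caller must continue with the returned pointer.
 * On failure NULL is returned and the original object is left untouched. */
sketch_obj *sketch_obj_set_expire(sketch_obj *o, long long when_ms) {
    sketch_obj *grown = realloc(o, sizeof(sketch_obj) + o->len + sizeof(when_ms));
    if (grown == NULL) return NULL;
    memcpy(grown->payload + grown->len, &when_ms, sizeof(when_ms));
    grown->has_expire = 1;
    return grown;
}
```

A caller that writes `sketch_obj_set_expire(o, t);` and then keeps using `o` may be reading freed memory, which is the same class of bug as the one fixed here.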
2025-07-30 22:24:56 +08:00
Yuan Wang
db4fc2a833
Fix HINCRBYFLOAT removes field expiration on replica (#14224)
Fixes #14218

Previously, we replicated HINCRBYFLOAT as an HSET command with the final
value in order to make sure that differences in float precision or
formatting would not create differences in replicas or after an AOF
restart.
However, on the replica side, if the field has an expiration time, HSET
will remove it, even though the master retains it. This leads to
inconsistencies between the master and the replica.

To address this, we now use the HSETEX command with the KEEPTTL flag
instead of HSET, ensuring that the field’s TTL is preserved.

This bug was introduced in version 7.4, but the HSETEX command was only
implemented from version 8.0. Therefore, this patch does not fix the
issue in the 7.4 branch; a separate commit is needed to address it
there.
2025-07-28 21:09:46 +08:00
debing.sun
45c8fcc992
Only mark the client reprocessing flag when unblocked on keys (#14165)
This PR is based on https://github.com/valkey-io/valkey/pull/2109

When we refactored the blocking framework we introduced the client
reprocessing infrastructure. When a client that was blocked on keys is
unblocked, it attempts to reprocess the command. One challenge was to
keep track of the command timeout, since we are reprocessing and do not
want to re-register the client with a fresh timeout each time. The
solution was to consider the client reprocessing flag when the client is
blockedOnKeys:

```c
    if (!(c->flags & CLIENT_REPROCESSING_COMMAND)) {
        /* If the client is re-processing the command, we do not set the timeout
         * because we need to retain the client's original timeout. */
        c->bstate.timeout = timeout;
    }
```

However, this introduced a new issue. There are cases where the client
goes through consecutive blocks of different types, for example:
```
CLIENT PAUSE 10000 ALL
BZPOPMAX zset 1
```
would leave the client blocked on the zset endlessly if nothing is
written to it.

**Credits to @uriyage for locating this with his fuzzer testing**

The suggested solution is to only flag the client when it is
specifically unblocked on keys.

Signed-off-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2025-07-21 20:05:47 +08:00
debing.sun
fa040a72c0
Add XDELEX and XACKDEL commands for stream (#14130)
## Summary and detailed design for new stream command

## XDELEX

### Syntax
```
XDELEX key [KEEPREF | DELREF | ACKED] IDS numids id [id ...]
```

### Description
The `XDELEX` command extends the Redis Streams `XDEL` command, offering
enhanced control over message entry deletion with respect to consumer
groups. It accepts an optional `KEEPREF`, `DELREF`, or `ACKED` parameter
to modify its behavior:

- **KEEPREF:** Deletes the specified entries from the stream, but
preserves existing references to these entries in all consumer groups'
PEL. This behavior is similar to XDEL.
- **DELREF:** Deletes the specified entries from the stream and also
removes all references to these entries from all consumer groups'
pending entry lists, effectively cleaning up all traces of the messages.
- **ACKED:** Only deletes entries that were read and acknowledged by all
consumer groups.

**Note:** The `IDS` block can appear at any position in the command,
consistent with other commands.

### Reply
Array reply, for each `id`:
- `-1`: No such `id` exists in the provided stream `key`.
- `1`: Entry was deleted from the stream.
- `2`: Entry was not deleted because there are still dangling
references. (ACKED option)

## XACKDEL

### Syntax
```
XACKDEL key group [KEEPREF | DELREF | ACKED] IDS numids id [id ...]
```

### Description
The `XACKDEL` command combines `XACK` and `XDEL` functionalities in
Redis Streams. It acknowledges specified message IDs in the given
consumer group and attempts to delete corresponding stream entries. It
accepts an optional `KEEPREF`, `DELREF`, or `ACKED` parameter:

- **KEEPREF:** Acknowledges the messages in the specified consumer group
and deletes the entries from the stream, but preserves existing
references to these entries in all consumer groups' PEL.
- **DELREF:** Acknowledges the messages in the specified consumer group,
deletes the entries from the stream, and also removes all references to
these entries from all consumer groups' pending entry lists, effectively
cleaning up all traces of the messages.
- **ACKED:** Acknowledges the messages in the specified consumer group
and only trims entries that were read and acknowledged by all consumer
groups.


### Reply
Array reply, for each `id`:
- `-1`: No such `id` exists in the provided stream `key`.
- `1`: Entry was acknowledged and deleted from the stream.
- `2`: Entry was acknowledged but not deleted because there are still
dangling references. (ACKED option)

# Redis Streams Commands Extension

## XTRIM

### Syntax
```
XTRIM key <MAXLEN | MINID> [= | ~] threshold [LIMIT count] [KEEPREF | DELREF | ACKED]
```

### Description
The `XTRIM` command trims a stream by removing entries based on
specified criteria, extended to include an optional `KEEPREF`, `DELREF`,
or `ACKED` parameter for consumer group handling:

- **KEEPREF:** Trims the stream according to the specified strategy
(MAXLEN or MINID) regardless of whether entries are referenced by any
consumer groups, but preserves existing references to these entries in
all consumer groups' PEL.
- **DELREF:** Trims the stream according to the specified strategy and
also removes all references to the trimmed entries from all consumer
groups' PEL.
- **ACKED:** Only trims entries that were read and acknowledged by all
consumer groups.

### Reply
No change.

## XADD

### Syntax
```
XADD key [NOMKSTREAM] [<MAXLEN | MINID> [= | ~] threshold [LIMIT count]] [KEEPREF | DELREF | ACKED] <* | id> field value [field value ...]
```

### Description
The `XADD` command appends a new entry to a stream and optionally trims
it in the same operation, extended to include an optional `KEEPREF`,
`DELREF`, or `ACKED` parameter for trimming behavior:

- **KEEPREF:** When trimming, removes entries from the stream according
to the specified strategy (MAXLEN or MINID), regardless of whether they
are referenced by any consumer groups, but preserves existing references
to these entries in all consumer groups' PEL.
- **DELREF:** When trimming, removes entries from the stream according
to the specified strategy and also removes all references to these
entries from all consumer groups' PEL.
- **ACKED:** When trimming, only removes entries that were read and
acknowledged by all consumer groups. Note that if the number of
referenced entries is bigger than MAXLEN, we will still stop.

### Reply
No change.

## Key implementation

Since we currently have no simple way to track the association between
an entry and consumer groups without iterating over all groups, we
introduce two mechanisms to establish this link. This allows us to
determine whether an entry has been seen by all consumer groups, and to
identify which groups are referencing it. With these links, we can break
the association when the entry is either acknowledged or deleted.

1) Added reference tracking between stream messages and consumer groups
using `cgroups_ref`
The cgroups_ref is implemented as a rax that maps stream message IDs to
lists of consumer groups that reference those messages, and streamNACK
stores the corresponding nodes of this list, so that the corresponding
groups can be deleted during `ACK`.
In this way, we can determine whether an entry has been seen but not
acknowledged.
2) Store a cached minimum last_id in the stream structure.
The reason for doing this is that there is a situation where an entry
has never been seen by a consumer group. In this case, we consider the
entry not consumed either. If the "ACKED" option is given, we
cannot directly delete this entry either.
When a consumer group updates its last_id, we don’t immediately update
the cached minimum last_id. Instead, we check whether the group’s
previous last_id was equal to the current minimum, or whether the new
last_id is smaller than the current minimum (when using `XGROUP SETID`).
If either is true, we mark the cached minimum last_id as invalid, and
defer the actual update until the next time it’s needed.
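
The lazy invalidation scheme for the cached minimum can be sketched as follows. To keep the example short, stream IDs are reduced to a single `uint64_t` and groups live in a fixed array; both simplifications, and all names, are assumptions of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 8

/* Hypothetical, simplified stream: one last_id per consumer group. */
typedef struct {
    uint64_t last_id[MAX_GROUPS];
    size_t ngroups;
    uint64_t min_last_id; /* cached minimum across all groups */
    bool min_valid;       /* false when the cache must be recomputed */
} stream_sketch;

/* Update a group's last_id and invalidate the cache only when needed. */
void group_set_last_id(stream_sketch *s, size_t g, uint64_t new_id) {
    uint64_t old = s->last_id[g];
    s->last_id[g] = new_id;
    if (s->min_valid && (old == s->min_last_id || new_id < s->min_last_id))
        s->min_valid = false; /* defer the recomputation */
}

/* Return the minimum last_id, recomputing it only if it was invalidated. */
uint64_t stream_min_last_id(stream_sketch *s) {
    if (!s->min_valid) {
        uint64_t min = UINT64_MAX;
        for (size_t i = 0; i < s->ngroups; i++)
            if (s->last_id[i] < min) min = s->last_id[i];
        s->min_last_id = min;
        s->min_valid = true;
    }
    return s->min_last_id;
}
```

Most updates move a group that does not hold the minimum forward, so they leave the cache valid and cost O(1); the O(groups) scan runs only after an update that could actually change the minimum.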

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: moticless <moticless@github.com>
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Slavomir Kaslev <slavomir.kaslev@gmail.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2025-07-01 21:00:42 +08:00
debing.sun
5ff81f68a3
Fix XPENDING reply schema for empty reply (#14129)
When the PEL is empty, the reply of `XPENDING` without `start` option
will be:
```
1) (integer) 0
2) (nil)
3) (nil)
4) (nil)
```

It is not an empty array, so we need to create an individual reply
schema for it.
2025-07-01 17:35:09 +08:00
yzc-yzc
117424f85c
Fix negative offset issue for ZRANGEBY[SCORE|LEX] command (#14043)
Fix #13952

This PR ensures that the ZRANGEBYSCORE and ZRANGEBYLEX commands return
an empty result when given a negative offset.
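
The intended `LIMIT offset count` semantics can be summarized in a small helper. This is an illustrative sketch with a hypothetical function name, not the server's code:

```c
#include <stddef.h>

/* Number of elements returned out of `matches` candidates for
 * LIMIT offset count. A negative count means "no limit";
 * a negative offset yields an empty result. */
size_t limit_result_len(size_t matches, long offset, long count) {
    if (offset < 0) return 0;
    if ((size_t)offset >= matches) return 0;
    size_t avail = matches - (size_t)offset;
    if (count < 0) return avail;
    return (size_t)count < avail ? (size_t)count : avail;
}
```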
2025-06-20 13:51:52 +08:00
guybe7
6349a7c4f9
Add GETRANGE tests with negative indices (#13950)
Inspired by https://github.com/redis/redis/pull/12272
2025-05-27 09:41:28 +08:00
Moti Cohen
e1789e4368
keyspace - Unify key and value & use dict no_value=1 (#13806)
The idea of packing the key (`sds`), value (`robj`) and optionally TTL
into a single struct in memory was mentioned a few times in the past by
the community in various flavors. This approach improves memory
efficiency, reduces pointer dereferences for faster lookups, and
simplifies expiration management by keeping all relevant data in one
place. This change goes along with setting keyspace's dict to
no_value=1, and saving considerable amount of memory.

Two more motivations that align well with this unification are:

- Prepare the groundwork for replacing the scan-based EXPIRE
implementation and instead evaluate the new `ebuckets` data structure
that was introduced
as part of [Hash Field Expiration
feature](https://redis.io/blog/hash-field-expiration-architecture-and-benchmarks/).
Using this data structure requires embedding the ExpireMeta structure
within each object.
- Consider replacing dict with a more space efficient open addressing
approach hash table that might rely on keeping a single pointer to
object.

Before this PR, I POC'ed on a variant of open addressing hash-table and
was surprised to find that dict with no_value actually could provide a
good balance between performance, memory efficiency, and simplicity.
This realization prompted the separation of the unification step from
the evaluation of a new hash table to avoid introducing too many changes
at once and to evaluate its impact independently before considering
replacement of existing hash-table. On an earlier
[commit](https://github.com/redis/redis/pull/13683) I extended dict
no_value optimization (which saves keeping dictEntry where possible) to
be relevant also for objects with even addresses in memory. Combining it
with this unification saves a considerable amount of memory for
keyspace.

# kvobj
This PR adopts Valkey’s
[packing](3eb8314be6)
layout and logic for key, value, and TTL. However, unlike Valkey
implementation, which retained a common `robj` throughout the project,
this PR distinguishes between the general-purpose, overused `robj`, and
the new `kvobj`, which embeds both the key and value and used by the
keyspace. Conceptually, `robj` serves as a base class, while `kvobj`
acts as a derived class.

Two new flags are introduced into the redis object, `iskvobj` and `expirable`:
```
struct redisObject {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:LRU_BITS;
    unsigned iskvobj : 1;             /* new flag */
    unsigned expirable : 1;           /* new flag */
    unsigned refcount : 30;           /* modified: 32bits->30bits */
    void *ptr;
};

typedef struct redisObject robj;
typedef struct redisObject kvobj;
```
When the `iskvobj` flag is set, the object includes also the key and it
is appended to the end of the object. If the `expirable` flag is set, an
additional 8 bytes are added to the object. If the object is of type
string, and the string is rather short, then it will be embedded as
well.
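
As a rough illustration of the layout described above, the following sketch reproduces the bit-field header and adds a hypothetical helper, `toyKvAllocSize()`, that estimates the size of a single allocation holding the header, the optional 8-byte expiry, and the key. The `toy*` names, the 24-bit LRU width, and the NUL-terminated key are assumptions made for the example; the real object stores the key as an sds string and may also embed a short string value, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_LRU_BITS 24 /* assumption: width of the LRU field */

struct toyObject {
    unsigned type : 4;
    unsigned encoding : 4;
    unsigned lru : TOY_LRU_BITS;
    unsigned iskvobj : 1;    /* key is appended after the header */
    unsigned expirable : 1;  /* 8 bytes of expiry follow the header */
    unsigned refcount : 30;  /* two bits were taken for the flags above */
    void *ptr;
};

/* Hypothetical helper: bytes needed for one allocation that holds the
 * header, the optional expiry, and a NUL-terminated key. */
static size_t toyKvAllocSize(size_t keylen, int expirable) {
    size_t size = sizeof(struct toyObject);
    if (expirable) size += sizeof(int64_t);
    return size + keylen + 1;
}

static void toyKvDemo(void) {
    struct toyObject o = {0};
    o.iskvobj = 1;
    o.expirable = 1;
    o.refcount = (1u << 30) - 1; /* largest value that fits in 30 bits */
    assert(o.refcount == 0x3FFFFFFFu);
    assert(toyKvAllocSize(3, 1) == sizeof(o) + 8 + 4);
    assert(toyKvAllocSize(3, 0) == sizeof(o) + 4);
}
```

Because the two flags are carved out of the former 32-bit refcount, the header keeps its previous size; only objects that actually carry a key or an expiry pay for the extra bytes.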

As a result, all keys in the keyspace are promoted to be of type
`kvobj`. This term attempts to align with the existing Redis object,
robj, and the kvstore data structure.

# EXPIRE Implementation
As `kvobj` embeds expiration time as well, looking up expiration times
is now an O(1) operation. The EXPIRE hash-table is now also set to
`no_value` mode and references `kvobj` entries directly, which in turn
saves memory.

Next, I plan to evaluate replacing the EXPIRE implementation with the
[ebuckets](https://github.com/redis/redis/blob/unstable/src/ebuckets.h)
data structure, which would eliminate keyspace scans for expired keys.
This requires embedding `ExpireMeta` within each `kvobj` of each key
with expiration. In such implementation, the `expirable` flag will be
shifted to indicate whether `ExpireMeta` is attached.


# Implementation notes

## Manipulating keyspace (find, modify, insert)
Initially, unifying the key and value into a single object and storing
it in dict with `no_value` optimization seemed like a quick win.
However, it (quickly) became clear that this change required deeper
modifications to how keys are manipulated. The challenge was handling
cases where a dictEntry is optimized out by the no_value optimization.
In such cases, many of the APIs that return the dictEntry from a lookup
become insufficient, as the returned pointer might be the key itself. To
address this issue, the PR takes a new-old approach: returning a "link"
to the looked-up key's `dictEntry` instead of the `dictEntry` itself.
The term `link` was already somewhat available in the dict API, and is
well aligned with the new dictEntLink declaration:
```
typedef dictEntry **dictEntLink;
```
This PR introduces two new function APIs to dict to leverage returned
link from the search:
```
dictEntLink dictFindLink(dict *d, const void *key, dictEntLink *bucket);
void dictSetKeyAtLink(dict *d, void *key, dictEntLink *link, int newItem);
```
After calling `link = dictFindLink(...)`, any necessary updates must be
performed immediately after by calling `dictSetKeyAtLink()` without any
intervening operations on given dict. Otherwise, `dictEntLink` may
become invalid. Example:
```
/* replace existing key */
link = dictFindLink(d, key, &bucket);
// ... Do something, but don't modify the dict ...
// assert(link != NULL);
dictSetKeyAtLink(d, kv, &link, 0);
     
/* Add new value (If no space for the new key, dict will be expanded and 
   bucket will be looked up again.) */  
link = dictFindLink(d, key, &bucket);
// ... Do something, but don't modify the dict ...
// assert(link == NULL);
dictSetKeyAtLink(d, kv, &bucket, 1);
```
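
To make the "link" idea concrete outside of dict.c, here is a minimal sketch on a toy chained hash table. It is not the dict implementation: `toyDict`, `toyFindLink()` and `toySetAtLink()` are hypothetical names, there is no rehashing, and entries are never optimized away. It only shows why a pointer to the slot that points at the entry is enough to replace an entry in place, and why the bucket is needed when the key is missing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NBUCKETS 8

typedef struct toyEntry {
    const char *key;
    struct toyEntry *next;
} toyEntry;

typedef toyEntry **toyLink; /* mirrors `typedef dictEntry **dictEntLink;` */

typedef struct toyDict {
    toyEntry *table[TOY_NBUCKETS];
} toyDict;

static unsigned toyHash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TOY_NBUCKETS;
}

/* Returns the link that points at the entry holding `key`, or NULL if the
 * key is absent. If `bucket` is not NULL it receives the head of the chain
 * where a new entry for `key` would be inserted. */
static toyLink toyFindLink(toyDict *d, const char *key, toyLink *bucket) {
    toyLink link = &d->table[toyHash(key)];
    if (bucket) *bucket = link;
    for (; *link != NULL; link = &(*link)->next)
        if (strcmp((*link)->key, key) == 0) return link;
    return NULL;
}

/* newItem != 0: `link` is a bucket head, push `e` in front of the chain.
 * newItem == 0: `link` points at an existing entry, swap it for `e`. */
static void toySetAtLink(toyEntry *e, toyLink link, int newItem) {
    e->next = newItem ? *link : (*link)->next;
    *link = e;
}

/* Returns 0 when the insert-then-replace sequence behaves as expected. */
static int toyLinkDemo(void) {
    toyDict d = {{NULL}};
    toyEntry a = {"foo", NULL}, b = {"foo", NULL};
    toyLink bucket, link;

    link = toyFindLink(&d, "foo", &bucket); /* miss: only bucket is usable */
    if (link != NULL) return 1;
    toySetAtLink(&a, bucket, 1);

    link = toyFindLink(&d, "foo", &bucket); /* hit: link points at &a */
    if (link == NULL || *link != &a) return 2;
    toySetAtLink(&b, link, 0);

    link = toyFindLink(&d, "foo", NULL);
    return (link != NULL && *link == &b) ? 0 : 3;
}
```

As in the real API, the link is only valid until the table is next modified, which is why the update has to follow the lookup directly.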
## dict.h 
- The dict API has become cluttered with many unused functions. I have
removed these from dict.h.
- Additionally, APIs specifically related to hash maps (no_value=0),
primarily those handling key-value access, have been gathered and
isolated.
- Removed entirely internal functions ending with “*ByHash()” that were
originally added for optimization and are no longer required.
- A few other legacy dict functions were adapted at the API level to work
with the term dictEntLink as well.
- Simplified and generalized an optimization related to comparing the
lengths of string keys.

## Hash Field Expiration
Until now, each hash object with expiration on fields needed to maintain
a reference to its key-name (of the hash object), so that if it was
active-expired, the key-name could be resolved for the sake of the
notification. This is no longer needed.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-05-12 10:15:17 +03:00
nesty92
8468ded667
Fix incorrect lag due to trimming stream via XTRIM or XADD command (#13958)
This PR fixes the lag calculation by ensuring that when the consumer group's
last_id is behind the first entry, the consumer group's entries_read is
considered invalid and recalculated from the start of the stream.

Supplement to PR #13473 

Close #13957

Signed-off-by: Ernesto Alejandro Santana Hidalgo <ernesto.alejandrosantana@gmail.com>
2025-04-22 10:11:10 +08:00
Cong Chen
981aa5c12f
Fix timing issue in HEXPIREAT test (#13873)
This fixes an error that occurs in the job
[test-valgrind-no-malloc-usable-size-test](https://github.com/redis/redis/actions/runs/13912357739/job/38929051397)
of the Daily workflow:

```
*** [err]: HEXPIREAT - Set time and then get TTL (listpackex) in tests/unit/type/hash-field-expire.tcl
Expected '999' to be between to '1000' and '2000' (context: type eval line 6 cmd {assert_range [r hpttl myhash FIELDS 1 field1] 1000 2000} proc ::test)
```
2025-03-26 10:00:38 +08:00
Filipe Oliveira (Redis)
3e012c9260
Fix string2d usage in case of hexadecimal strings parsing and overflow (#13845)
Since https://github.com/redis/redis/pull/11884, input that was accepted
as valid before 8.0 (a hexadecimal string) has returned an error. This PR
addresses it. To avoid performance penalties, it hints to the compiler
that the fallbacks are unlikely to happen.
Furthermore, we were ignoring `std::errc::result_out_of_range` results
from fast_float. This PR addresses that as well and includes tests for
both identified scenarios.
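
The two scenarios can be illustrated with the C standard library alone. The sketch below is not the patched `string2d()` and does not use fast_float; `toyString2d()` is a hypothetical stand-in built on `strtod()`, which already accepts hexadecimal input and reports overflow through `errno`.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <math.h>
#include <stdlib.h>

/* Parses the whole NUL-terminated string as a double.
 * Returns 1 on success and 0 on failure. Hexadecimal input such as "0x10"
 * is accepted; overflow, NaN, leading whitespace and trailing garbage are
 * rejected. */
static int toyString2d(const char *s, double *out) {
    char *end;
    double v;
    if (s[0] == '\0' || isspace((unsigned char)s[0])) return 0;
    errno = 0;
    v = strtod(s, &end);
    if (end == s || *end != '\0') return 0;                /* not fully consumed */
    if (errno == ERANGE && (v == HUGE_VAL || v == -HUGE_VAL)) return 0; /* overflow */
    if (isnan(v)) return 0;
    *out = v;
    return 1;
}

static void toyString2dDemo(void) {
    double v = 0;
    int ok = toyString2d("0x10", &v);  /* hexadecimal string */
    assert(ok == 1 && v == 16.0);
    ok = toyString2d("1e999", &v);     /* too large for a double */
    assert(ok == 0);
}
```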

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-03-19 20:08:45 +08:00
Yuan Wang
f1d6542b1a
Stabilize tcl test cases (#13829)
Recently encountered some errors, as below:

HGETEX/HSETEX with PXAT/EXAT options: after getting the TTL, we calculate
the current time with `[clock seconds]`, which may be delayed and produce
results greater than expected.

Dismiss memory test error: now that we have introduced rdb-channel
replication, the full synchronization might finish before the child
process exits, so calling `bgsave` immediately after full sync may fail.
2025-02-25 16:31:53 +08:00
Denis Nevmerzhitskii
33f03f6fc8
Fix wrong behavior of XREAD + after last entry of stream have been removed (#13632)
Close #13628

This PR changes behavior of special `+` id of XREAD command. Now it uses
`streamLastValidID` to find last entry instead of `last_id` field of
stream object.
This PR adds test for the issue.

**Notes**

The initial idea of updating `last_id` while executing XDEL seems to be wrong.
`last_id` is used to store the last generated id, not the id of the last entry.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: guybe7 <guy.benoish@redislabs.com>
2025-02-25 13:40:24 +08:00
Ozan Tezcan
e2608478b6
Add HGETDEL, HGETEX and HSETEX hash commands (#13798)
This PR adds three new hash commands: HGETDEL, HGETEX and HSETEX. These
commands enable users to perform multiple operations atomically in one
step, e.g. set a hash field and update its TTL with a single command.
Previously, this was only possible by calling the hset and hexpire
commands one after the other.

- **HGETDEL command**

  ```
  HGETDEL <key> FIELDS <numfields> field [field ...]
  ```
  
  **Description**  
  Get and delete the value of one or more fields of a given hash key
  
  **Reply**  
Array reply: list of the value associated with each field or nil if the
field doesn’t exist.

- **HGETEX command**

  ```
   HGETEX <key>  
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | PERSIST]
     FIELDS <numfields> field [field ...]
  ```

  **Description**
Get the value of one or more fields of a given hash key, and optionally
set their expiration

  **Options:**
  EX seconds: Set the specified expiration time, in seconds.
  PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
  PERSIST: Remove the time to live associated with the field.

  **Reply** 
Array reply: list of the value associated with each field or nil if the
field doesn’t exist.

- **HSETEX command**

  ```
  HSETEX <key>
     [FNX | FXX]
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | KEEPTTL]
     FIELDS <numfields> field value [field value...]
  ```
  **Description**
Set the value of one or more fields of a given hash key, and optionally
set their expiration

  **Options:**
  FNX: Only set the fields if all do not already exist.
  FXX: Only set the fields if all already exist.

  EX seconds: Set the specified expiration time, in seconds.
  PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
  KEEPTTL: Retain the time to live associated with the field.

  
Note: If no option is provided, any associated expiration time will be
discarded similar to how SET command behaves.

  **Reply**
  Integer reply: 0 if no fields were set
  Integer reply: 1 if all the fields were set
2025-02-14 17:13:35 +03:00
YaacovHazan
0aeb86d78d Revert "Improve GETRANGE command behavior (#12272)"
Although the commit #6ceadfb58 improves GETRANGE command behavior,
we can't accept it as we should avoid breaking changes for non-critical bug fixes.

This reverts commit 6ceadfb580.
2025-02-05 20:49:42 +02:00
Yuan Wang
64a40b20d9
Async IO Threads (#13695)
## Introduction
Redis introduced IO Thread in 6.0, allowing IO threads to handle client
request reading, command parsing and reply writing, thereby improving
performance. The current IO thread implementation has a few drawbacks.
- The main thread is blocked during IO thread read/write operations and
must wait for all IO threads to complete their current tasks before it
can continue execution. In other words, the entire process is
synchronous. This prevents the efficient utilization of multi-core CPUs
for parallel processing.

- When the number of clients and requests increases moderately, it
causes all IO threads to reach full CPU utilization due to the busy wait
mechanism used by the IO threads. This makes it challenging for us to
determine which part of Redis has reached its bottleneck.

- When IO threads are enabled with TLS and io-threads-do-reads, a
disconnection of a connection with pending data may result in it being
assigned to multiple IO threads simultaneously. This can cause race
conditions and trigger assertion failures. Related issue:
redis#12540

Therefore, we designed an asynchronous IO threads solution. The IO
threads adopt an event-driven model, with the main thread dedicated to
command processing, meanwhile, the IO threads handle client read and
write operations in parallel.

## Implementation
### Overall
As before, we did not change the fact that all client commands must be
executed on the main thread, because Redis was originally designed to be
single-threaded, and processing commands in a multi-threaded manner
would inevitably introduce numerous race and synchronization issues. But
now each IO thread has an independent event loop, so IO threads can
use a multiplexing approach to handle client read and write operations,
eliminating the CPU overhead caused by busy-waiting.

The execution process can be briefly described as follows:
the main thread assigns clients to IO threads after accepting
connections; IO threads notify the main thread when clients
finish reading and parsing queries; the main thread then processes
the queries from IO threads and generates replies; IO threads write
the replies to clients after receiving the clients list from the main
thread, and then continue to handle client read and write events.

### Each IO thread has independent event loop
We now assign each IO thread its own event loop. This approach
eliminates the need for the main thread to perform the costly
`epoll_wait` operation for handling connections (except for specific
ones). Instead, the main thread processes requests from the IO threads
and hands them back once completed, fully offloading read and write
events to the IO threads.

Additionally, all TLS operations, including handling pending data, have
been moved entirely to the IO threads. This resolves the issue where
io-threads-do-reads could not be used with TLS.

### Event-notified client queue
To facilitate communication between the IO threads and the main thread,
we designed an event-notified client queue. Each IO thread and the main
thread have two such queues to store clients waiting to be processed.
These queues are also integrated with the event loop, so that queued
clients are handled as events. We use pthread_mutex to ensure the safety
of queue operations, as well as data visibility and ordering. Race
conditions are minimized because each IO thread and the main thread
operate on independent queues, which avoids thread suspension due to
lock contention. We also implemented an event notifier based on
`eventfd` or `pipe` to support event-driven handling.
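
A minimal sketch of such a queue, assuming a fixed-size array of client ids and a `pipe`-based notifier, could look as follows. The `toy*` names are hypothetical and this is not the queue used in the PR; it only shows the pairing of a mutex-protected list with a file descriptor that the consumer's event loop can watch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

#define TOY_QUEUE_CAP 64

typedef struct toyClientQueue {
    pthread_mutex_t lock;
    int clients[TOY_QUEUE_CAP]; /* client ids stand in for client pointers */
    size_t len;
    int notify_fds[2];          /* pipe: [0] is watched by the consumer */
} toyClientQueue;

static int toyQueueInit(toyClientQueue *q) {
    q->len = 0;
    if (pthread_mutex_init(&q->lock, NULL) != 0) return -1;
    return pipe(q->notify_fds);
}

/* Producer: enqueue under the lock, then wake the consumer with one byte. */
static int toyQueuePush(toyClientQueue *q, int client) {
    int queued = 0;
    char byte = 1;
    pthread_mutex_lock(&q->lock);
    if (q->len < TOY_QUEUE_CAP) {
        q->clients[q->len++] = client;
        queued = 1;
    }
    pthread_mutex_unlock(&q->lock);
    if (!queued) return -1;
    return write(q->notify_fds[1], &byte, 1) == 1 ? 0 : -1;
}

/* Consumer: call when notify_fds[0] is readable. Consumes the pending
 * notification bytes and moves the whole batch into out[], which must have
 * room for TOY_QUEUE_CAP entries. Returns the number of clients taken. */
static size_t toyQueueDrain(toyClientQueue *q, int *out) {
    char buf[TOY_QUEUE_CAP];
    size_t n;
    if (read(q->notify_fds[0], buf, sizeof(buf)) <= 0) return 0;
    pthread_mutex_lock(&q->lock);
    n = q->len;
    for (size_t i = 0; i < n; i++) out[i] = q->clients[i];
    q->len = 0;
    pthread_mutex_unlock(&q->lock);
    return n;
}

/* Pushes three clients and drains them; returns how many were drained. */
static size_t toyQueueDemo(void) {
    toyClientQueue q;
    int out[TOY_QUEUE_CAP];
    size_t n;
    if (toyQueueInit(&q) != 0) return 0;
    for (int id = 1; id <= 3; id++)
        if (toyQueuePush(&q, id) != 0) return 0;
    n = toyQueueDrain(&q, out);
    assert(n == 3 && out[0] == 1 && out[1] == 2 && out[2] == 3);
    close(q.notify_fds[0]);
    close(q.notify_fds[1]);
    pthread_mutex_destroy(&q.lock);
    return n;
}
```

Since each queue has a single producer and a single consumer, the lock is held only for a short copy and is rarely contended.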

### Thread safety
Since the main thread and IO threads can execute in parallel, we must
handle data race issues carefully.

**client->flags**
The primary tasks of IO threads are reading and writing, i.e.
`readQueryFromClient` and `writeToClient`. However, IO threads and the
main thread may concurrently modify or access `client->flags`, leading
to potential race conditions. To address this, we introduced an io-flags
variable to record operations performed by IO threads, thereby avoiding
race conditions on `client->flags`.

**Pause IO thread**
In the main thread, we may want to operate on data owned by IO threads,
for example to uninstall an event handler, access or modify the
query/output buffer, or resize the event loop, and we need a clean and
safe context to do that. We pause the IO thread in `IOThreadBeforeSleep`,
do the work and then resume it. To avoid suspending the thread, we use
busy waiting to confirm the target status. We also use an atomic
variable to ensure memory visibility and ordering. We introduce the
following functions to pause/resume IO threads:
```
pauseIOThread, resumeIOThread
pauseAllIOThreads, resumeAllIOThreads
pauseIOThreadsRange, resumeIOThreadsRange
```
Testing has shown that `pauseIOThread` is highly efficient, allowing the
main thread to execute nearly 200,000 operations per second during
stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can
handle up to nearly 56,000 operations per second. But operations
performed between pausing and resuming IO threads must be quick;
otherwise, they could cause the IO threads to reach full CPU
utilization.
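
The handshake can be sketched with C11 atomics and one worker thread. This is an illustration under assumptions, not the code from the PR: the state names, `toyPause()`/`toyResume()` and the `events` counter are hypothetical, and the real implementation also wakes the IO thread through its event notifier so that it reaches its before-sleep hook.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { TOY_UNPAUSED, TOY_PAUSING, TOY_PAUSED, TOY_RESUMING };

typedef struct toyIOThread {
    atomic_int paused; /* handshake state shared by both threads */
    atomic_int stop;   /* set by the main thread to end the loop */
    long events;       /* IO-thread data; main may touch it only while paused */
} toyIOThread;

/* IO thread side: runs once per loop iteration, like a before-sleep hook. */
static void toyHandlePause(toyIOThread *t) {
    if (atomic_load(&t->paused) != TOY_PAUSING) return;
    atomic_store(&t->paused, TOY_PAUSED);                 /* acknowledge */
    while (atomic_load(&t->paused) != TOY_RESUMING) { }   /* busy wait */
    atomic_store(&t->paused, TOY_UNPAUSED);
}

static void *toyIOThreadMain(void *arg) {
    toyIOThread *t = arg;
    while (!atomic_load(&t->stop)) {
        toyHandlePause(t);
        t->events++; /* stands in for handling read/write events */
    }
    return NULL;
}

/* Main thread side: request the pause and spin until it is acknowledged. */
static void toyPause(toyIOThread *t) {
    atomic_store(&t->paused, TOY_PAUSING);
    while (atomic_load(&t->paused) != TOY_PAUSED) { }
}

static void toyResume(toyIOThread *t) {
    atomic_store(&t->paused, TOY_RESUMING);
    while (atomic_load(&t->paused) != TOY_UNPAUSED) { }
}

/* Pauses the worker, edits its data, resumes it; returns the final counter. */
static long toyPauseDemo(void) {
    toyIOThread t;
    pthread_t tid;
    atomic_init(&t.paused, TOY_UNPAUSED);
    atomic_init(&t.stop, 0);
    t.events = 0;
    if (pthread_create(&tid, NULL, toyIOThreadMain, &t) != 0) return -1;
    toyPause(&t);
    t.events = 42; /* safe: the IO thread is parked in toyHandlePause() */
    toyResume(&t);
    atomic_store(&t.stop, 1);
    pthread_join(tid, NULL);
    return t.events;
}
```

Both sides spin instead of sleeping, which keeps the handshake cheap as long as the work done between pause and resume is short.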

**freeClient and freeClientAsync**
The main thread may need to terminate a client currently running on an
IO thread, for example, due to ACL rule changes, reaching the output
buffer limit, or evicting a client. In such cases, we need to pause the
IO thread to safely operate on the client.

**maxclients and maxmemory-clients updating**
When adjusting `maxclients`, we need to resize the event loop for all IO
threads. Similarly, when modifying `maxmemory-clients`, we need to
traverse all clients to calculate their memory usage. To ensure safe
operations, we pause all IO threads during these adjustments.

**Client info reading**
The main thread may need to read a client’s fields to generate a
descriptive string, such as for the `CLIENT LIST` command or logging
purposes. In such cases, we need to pause the IO thread handling that
client. If information for all clients needs to be displayed, all IO
threads must be paused.

**Tracking redirect**
Redis supports the tracking feature and can even send invalidation
messages to a connection with a specified ID. But the target client may
be running on an IO thread, so directly manipulating the client’s output
buffer is not thread-safe, and the IO thread may not be aware that the
client requires a response. In such cases, we pause the IO thread
handling the client, modify the output buffer, and install a write event
handler to ensure proper handling.

**clientsCron**
In the `clientsCron` function, the main thread needs to traverse all
clients to perform operations such as timeout checks, verifying whether
they have reached the soft output buffer limit, resizing the
output/query buffer, or updating memory usage. To safely operate on a
client, the IO thread handling that client must be paused.
If we were to pause the IO thread for each client individually, the
efficiency would be very low. Conversely, pausing all IO threads
simultaneously would be costly, especially when there are many IO
threads, as clientsCron is invoked relatively frequently.
To address this, we adopted a batched approach for pausing IO threads.
At most, 8 IO threads are paused at a time. The operations mentioned
above are only performed on clients running in the paused IO threads,
significantly reducing overhead while maintaining safety.
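
One way to picture the batching is a cursor that walks over the IO threads in windows of at most 8. The helper below is a hypothetical sketch, not the code from the PR, and it assumes that worker threads are numbered 1..nthreads with id 0 reserved for the main thread.

```c
#include <assert.h>

#define TOY_PAUSE_BATCH 8 /* at most this many IO threads paused per call */

/* Computes the next inclusive range [*start, *end] of IO thread ids to
 * pause and advances *cursor so that the following call continues with the
 * next batch, wrapping around after the last thread. Requires nthreads >= 1. */
static void toyNextPauseRange(int nthreads, int *cursor, int *start, int *end) {
    if (*cursor < 1 || *cursor > nthreads) *cursor = 1;
    *start = *cursor;
    *end = *start + TOY_PAUSE_BATCH - 1;
    if (*end > nthreads) *end = nthreads;
    *cursor = (*end == nthreads) ? 1 : *end + 1;
}

static void toyBatchDemo(void) {
    int cursor = 0, start, end;
    toyNextPauseRange(20, &cursor, &start, &end);
    assert(start == 1 && end == 8);
    toyNextPauseRange(20, &cursor, &start, &end);
    assert(start == 9 && end == 16);
    toyNextPauseRange(20, &cursor, &start, &end);
    assert(start == 17 && end == 20);
    toyNextPauseRange(20, &cursor, &start, &end);
    assert(start == 1 && end == 8); /* wrapped around */
}
```

Each cron invocation would then pause only the threads in the returned range and process the clients that belong to them.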

### Observability
In the current design, the main thread always assigns clients to the IO
thread with the least clients. To clearly observe the number of clients
handled by each IO thread, we added the new section in INFO output. The
`INFO THREADS` section can show the client count for each IO thread.
```
# Threads
io_thread_0:clients=0
io_thread_1:clients=2
io_thread_2:clients=2
```
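
The assignment policy itself is a simple minimum search. The sketch below is a hypothetical illustration, assuming that index 0 is the main thread (which does not receive regular clients) and that a per-thread client counter is available.

```c
#include <assert.h>

/* Returns the id of the IO thread with the fewest clients. Index 0 is
 * assumed to be the main thread and is skipped; requires nthreads >= 2.
 * Ties go to the lowest id. */
static int toyPickIOThread(const int *clients, int nthreads) {
    int best = 1;
    for (int i = 2; i < nthreads; i++)
        if (clients[i] < clients[best]) best = i;
    return best;
}

static void toyPickDemo(void) {
    int even[] = {0, 2, 2};      /* same counts as the INFO sample above */
    int skewed[] = {0, 4, 4, 3};
    assert(toyPickIOThread(even, 3) == 1);
    assert(toyPickIOThread(skewed, 4) == 3);
}
```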

Additionally, in the `CLIENT LIST` output, we also added a field to
indicate the thread to which each client is assigned.

`id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name=
lib-ver= io-thread=1`

## Trade-off
### Special Clients
For certain special types of clients, keeping them running on IO threads
would result in severe race issues that are difficult to resolve.
Therefore, we chose not to offload these clients to the IO threads.

For replica, monitor, subscribe, and tracking clients, the main thread
may directly write them a reply when conditions are met. Race issues are
difficult to resolve, so we have them processed in the main thread. This
includes the Lua debug clients as well, since we may operate on the
connection directly.

For a blocking client, after the IO thread reads and parses a command and
hands it over to the main thread, if the client is identified as a
blocking type, it remains in the main thread. Once the blocking
operation completes and the reply is generated, the client is
transferred back to the IO thread to send the reply and wait for event
triggers.

### Clients Eviction
To support client eviction, it is necessary to update each client’s
memory usage promptly during operations such as read, write, or command
execution. However, when a client operates on an IO thread, it is not
feasible to update the memory usage immediately due to the risk of data
races. As a result, memory usage can only be updated either in the main
thread while processing commands or periodically in `clientsCron`.
The downside of this approach is that updates might experience a delay
of up to one second, which could impact the precision of memory
management for eviction.

To avoid incorrectly evicting clients, we adopted a best-effort
compensation solution: when we decide to evict a client, we update its
memory usage again before evicting. If the memory used by the client
has not decreased or its memory usage bucket has not changed, we evict
it; otherwise, we do not.

However, we have not completely solved this problem. Due to the delay in
memory usage updates, it may lead us to make incorrect decisions about
the need to evict clients.

### Defragment
In the majority of cases we do NOT use the data from argv directly in
the db.
1. key names
We store a copy that we allocate in the main thread, see `sdsdup()` in
`dbAdd()`.
2. hash key and value
We store key as hfield and store value as sds, see `hfieldNew()` and
`sdsdup()` in `hashTypeSet()`.
3. other datatypes
   They don't even use SDS, so there is no reference issues.

But in some cases the data from argv may be retained by the main
thread.
As a result, during fragmentation cleanup, we need to move allocations
from the IO thread’s arena to the main thread’s arena. We always
allocate new memory in the main thread’s arena, but the memory released
by IO threads may not yet have been reclaimed. This ultimately causes
the fragmentation rate to be higher compared to creating and allocating
entirely within a single thread.
The following cases lead to memory allocated by the IO thread
being kept by the main thread.
1. string related command: `append`, `getset`, `mset` and `set`.
If `tryObjectEncoding()` does not change argv, we will keep it directly
in the main thread, see the code in `tryObjectEncoding()`(specifically
`trimStringObjectIfNeeded()`)
2. block related command.
    the key names will be kept in `c->db->blocking_keys`.
3. watch command
    the key names will be kept in `c->db->watched_keys`.
4. [s]subscribe command
    channel name will be kept in `serverPubSubChannels`.
5. script load command
    script will be kept in `server.lua_scripts`.
6. some module API: `RM_RetainString`, `RM_HoldString`

Those issues will be handled in other PRs.

## Testing
### Functional Testing
With IO threads enabled, the commit passes all TCL tests, but we made
some changes:
**Client query buffer**: In the original code, when using a reusable
query buffer, ownership of the query buffer would be released after the
command was processed. However, with IO threads enabled, the client
transitions from an IO thread to the main thread for processing. This
causes the ownership release to occur earlier than the command
execution. As a result, when IO threads are enabled, the client's
information will never indicate that a shared query buffer is in use.
Therefore, we skip the corresponding query buffer tests in this case.
**Defragment**: Add a new defragmentation test to verify the effect of
io threads on defragmentation.
**Command delay**: For deferred clients in TCL tests, due to clients
being assigned to different threads for execution, delays may occur. To
address this, we introduced conditional waiting: the process proceeds to
the next step only when the `client list` contains the corresponding
commands.

### Sanitizer Testing
The commit passed all TCL tests and reported no errors when compiled
with the `-fsanitize=thread` and `-fsanitize=address` options enabled.
But we made the following modification: we suppressed the sanitizer
warnings for clients with watched keys when updating `client->flags`.
IO threads read `client->flags` but never modify it or read the
`CLIENT_DIRTY_CAS` bit, and only the main thread modifies this bit, so
we believe there is no actual data race.

## Others
### IO thread number
In the new multi-threaded design, the main thread is primarily focused
on command processing to improve performance. Typically, the main thread
does not handle regular client I/O operations but is responsible for
clients such as replication and tracking clients. To avoid breaking
changes, we still consider the main thread as the first IO thread.

When the io-threads configuration is set to a low value (e.g., 2),
performance does not show a significant improvement compared to a
single-threaded setup for simple commands (such as SET or GET), as the
main thread does not consume much CPU for these simple operations. This
results in underutilized multi-core capacity. However, for more complex
commands, having a low number of IO threads may still be beneficial.
Therefore, it’s important to adjust the `io-threads` based on your own
performance tests.

Additionally, you can clearly monitor the CPU utilization of the main
thread and IO threads using `top -H -p $redis_pid`. This allows you to
easily identify where the bottleneck is. If the IO thread is the
bottleneck, increasing the `io-threads` will improve performance. If the
main thread is the bottleneck, the overall performance can only be
scaled by increasing the number of shards or replicas.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: oranagra <oran@redislabs.com>
2024-12-23 14:16:40 +08:00
debing.sun
3fc7ef8f81
Fix race in stream-cgroups test (#13593)
failed CI:
https://github.com/redis/redis/actions/runs/11171608362/job/31056659165
https://github.com/redis/redis/actions/runs/11226025974/job/31205787575
2024-10-12 09:23:19 +08:00
Moti Cohen
5f28bd96db
Fix race in HFE tests (#13563)
Test 1 - give more time for expiration
Test 2 - Evaluate expiration time boundaries [+1,+2] before setting expiration [+1]
Test 3 - Avoid race on test HFEs propagated to replica
2024-09-23 10:30:29 +03:00
Moti Cohen
9a89e32a95
HFE - Fix key ref by the hash on RENAME/MOVE/SWAPDB/RESTORE (#13539)
If the hash previously had HFEs (hash-fields with expiration) but later no longer
does, the key ref in the hash might become outdated after a MOVE, COPY,
RENAME or RESTORE operation. These commands maintain the key ref only
if HFEs are present. That is, we can only be sure that key ref is valid as long as the
hash has HFEs.
2024-09-12 12:40:12 +03:00
Moti Cohen
569584d463
HFE - Simplify logic of HGETALL command (#13425) 2024-09-05 12:48:44 +03:00
Zihao Lin
6ceadfb580
Improve GETRANGE command behavior (#12272)
Fixed the issue about GETRANGE and SUBSTR command
return unexpected result caused by the `start` and `end` out of
definition range of string.

---
## Breaking change
Before this PR, when negative `end` was out of range (i.e., end <
-strlen), we would fix it to 0 to get the substring, which also resulted
in the first character still being returned for this kind of out of
range.
After this PR, we ensure that `GETRANGE` returns an empty bulk when the
negative end index is out of range.
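
The difference can be sketched as an index-normalization helper with a switch between the two behaviours. This is an illustration based only on the description above, not the actual command code; `toyGetRange()` and its `strict_end` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Normalizes GETRANGE-style indexes against a string of `len` bytes.
 * Returns the number of bytes to return (0 means an empty reply) and
 * stores the first byte offset in *from when the result is non-empty.
 * strict_end == 0: an out-of-range negative end is clamped to 0 (old).
 * strict_end != 0: an out-of-range negative end yields an empty reply. */
static size_t toyGetRange(long long start, long long end, size_t len,
                          int strict_end, size_t *from) {
    long long slen = (long long)len;
    if (len == 0) return 0;
    if (start < 0) start += slen; /* negative indexes count from the end */
    if (end < 0) end += slen;
    if (strict_end && end < 0) return 0;
    if (start < 0) start = 0;
    if (end < 0) end = 0;
    if (end >= slen) end = slen - 1;
    if (start > end) return 0;
    *from = (size_t)start;
    return (size_t)(end - start + 1);
}

static void toyGetRangeDemo(void) {
    size_t from = 99;
    /* "hello" with start=0, end=-100 */
    assert(toyGetRange(0, -100, 5, 0, &from) == 1 && from == 0); /* "h" */
    assert(toyGetRange(0, -100, 5, 1, &from) == 0);              /* empty */
    /* in-range negative indexes behave the same in both modes */
    assert(toyGetRange(-3, -1, 5, 1, &from) == 3 && from == 2);  /* "llo" */
}
```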

Closes #11738

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2024-08-20 12:34:43 +08:00
debing.sun
2b88db90aa
Fix incorrect lag due to trimming stream via XTRIM command (#13473)
## Describe
When using the `XTRIM` command to trim a stream, it does not update the
maximal tombstone (`max_deleted_entry_id`). This leads to an issue where
the lag calculation incorrectly assumes that there are no tombstones
after the consumer group's last_id, resulting in an inaccurate lag.

The reason XTRIM doesn't need to update the maximal tombstone is that it
always trims from the beginning of the stream. This means that it
consistently changes the position of the first entry, leading to the
following scenarios:

1) First entry trimmed after maximal tombstone:
If the first entry is trimmed to a position after the maximal tombstone,
all tombstones will be before the first entry, so they won't affect the
consumer group's lag.

2) First entry trimmed before maximal tombstone:
If the first entry is trimmed to a position before the maximal
tombstone, the maximal tombstone will not be updated.

## Solution
Therefore, this PR optimizes the lag calculation by ensuring that when
both the consumer group's last_id and the maximal tombstone are behind
the first entry, the consumer group's lag is always equal to the number
of remaining elements in the stream.
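
Expressed as code, the shortcut is a pair of ID comparisons. The sketch below is a simplified, hypothetical rendering of the rule stated above (the `toy*` names are not from the Redis source); it covers only this one case and leaves every other case to the regular lag estimation.

```c
#include <assert.h>
#include <stdint.h>

typedef struct toyStreamID {
    uint64_t ms;
    uint64_t seq;
} toyStreamID;

/* Returns <0, 0 or >0 when a is smaller than, equal to or greater than b. */
static int toyIDCompare(toyStreamID a, toyStreamID b) {
    if (a.ms != b.ms) return a.ms < b.ms ? -1 : 1;
    if (a.seq != b.seq) return a.seq < b.seq ? -1 : 1;
    return 0;
}

/* If both the group's last delivered ID and the maximal tombstone are
 * behind the first entry, every remaining entry is still unread: sets *lag
 * to the stream length and returns 1. Otherwise returns 0 and leaves *lag
 * untouched. */
static int toyLagShortcut(toyStreamID first, toyStreamID last_delivered,
                          toyStreamID max_deleted, uint64_t length,
                          uint64_t *lag) {
    if (toyIDCompare(last_delivered, first) < 0 &&
        toyIDCompare(max_deleted, first) < 0) {
        *lag = length;
        return 1;
    }
    return 0;
}

static void toyLagDemo(void) {
    toyStreamID first = {10, 0}, group = {3, 0}, tomb = {5, 0};
    uint64_t lag = 0;
    int hit = toyLagShortcut(first, group, tomb, 7, &lag);
    assert(hit == 1 && lag == 7);
}
```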

Supplement to PR https://github.com/redis/redis/pull/13338
2024-08-16 23:13:31 +08:00
debing.sun
b94b714f81
Fix error message for XREAD command with wrong parameter (#13474)
Fixes an omission from #13117.
When the number of streams is incorrect, the error message for `XREAD`
needs to include the '+' symbol.
2024-08-14 21:40:43 +08:00
Moti Cohen
806459f481
On HDEL last field with expiry, update global HFE DS (#13470)
Hash field expiration is optimized to avoid frequently updating the global HFE DS
on each field deletion. Eventually active-expiration will run and update or remove
the hash from the global HFE DS gracefully. Nevertheless, the "subexpiry" statistic
might report a wrong number of hashes with HFE to the user if HDEL deletes
the last field with expiration in a hash (while fields without expiration remain).

Following this change, if HDEL deletes the last field with expiration in the hash,
it also removes the hash from the global HFE DS.
2024-08-11 16:39:03 +03:00
debing.sun
93fb83b4cb
Fix incorrect lag field in XINFO when tombstone is after the last_id of consume group (#13338)
Fix #13337

This PR fixes two bugs that caused lag calculation errors.
1. When the latest tombstone is before the first entry, the tombstone
may still be after the last id of the consumer group.
2. When a tombstone is after the last id of the consumer group, the group's
counter will be invalid, so we should calculate entries_read by using
estimates.
2024-07-30 22:31:31 +08:00
Moti Cohen
a84cc20aef
HFE - Fix statistic to count also lazy expired and rename INFO params (#13372)
* INFO command : rename `hashes_with_expiry_fields` to `subexpiry`
* INFO command : rename `expired_hash_fields` to `expired_subkeys`
* Fix statistic of `expired_subkeys` to count also lazy expired
* Remove leftover TODO comments in TCL
* Fix potential flaky test of rdb load of hash-field-expiration
2024-07-02 18:22:10 +03:00