mirror of
https://github.com/redis/redis.git
synced 2026-06-08 16:24:26 -04:00
318 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2e46d2e735
|
Hold GCRA out of the release (#15191)
After the GCRA algorithm was introduced into Redis (https://github.com/redis/redis/pull/14826), followed by the new RATE_LIMIT object type (https://github.com/redis/redis/pull/14905), it was decided internally not to ship GCRA in the new release. Since no decision has been made yet on whether it will be kept in the future, this PR only makes the GCRA-related code dead: the commands are inaccessible and AOF/RDB load and save are disabled.

---------

Co-authored-by: debing.sun <debing.sun@redis.com> |
||
|
|
0d9576435f
|
Implement the new Redis Array type (#15162)
# Redis Array

For years, Redis has been missing a real indexed data structure for the use cases where the index and the spatial relationship of elements are semantic. Hashes give you random lookups, but you have to store an index as a key, and have no range visibility. Lists give you appending and trimming, but what is in the middle remains hard to access. Streams give you append-only events, which is another (useful, indeed) beast.

None of these is what you want when the *position itself* has business meaning: slot 37, step 4, row 18552, day from 2934 to 2949, file line 11, 12, 15 and so forth. And, all those types, for different reasons, are all suboptimal when you want a **ring buffer** able to store the latest N observed samples of something.

Up to now, users found ways (they always do \o/) using the fact that the data structures that are obvious in this universe are also extremely powerful, if well implemented. But this forces compromises. Arrays handle these index-first requirements natively, and usually with much better memory and CPU usage than the workarounds. If the use case is the right one, Arrays often provide much better space, time and usability at the same time.

## Internal encoding

1. When dense, an Array is essentially a more fancy C array. You don't pay anything for storing the index.
2. Yet, instead of going really flat, arrays are sliced into 4096-element slices, and each slice, when it contains just a few elements, uses a special sparse encoding. When a slice is empty it's just a `NULL` stored in the directory.
3. Small ints, floats, and short strings are pointer-tagged, so they cost zero additional memory beyond the pointer slot itself.
4. When very sparse, a super-directory of windowed directories is used. This allows the data type to be safe, instead of exhibiting pathological space or time behavior. This representation is only triggered when there are more than 8 million elements or very high indexes set.
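
The slice directory from points 1 and 2 can be pictured with a small model. What follows is a hedged Python sketch, not the Redis C implementation: `SlicedArray`, its method names, and the use of a dict for a slice's contents are illustrative assumptions. Only the 4096-element slice size and the empty-slice marker (`NULL`, modeled here as `None`) come from the description above. Pointer tagging and the super-directory are not modeled.

```python
SLICE_SIZE = 4096  # slice width stated in the commit message


class SlicedArray:
    """Toy model: a directory of fixed-size slices, with None for empty slices."""

    def __init__(self):
        # One entry per slice: None (empty) or a dict {offset: value}.
        self.directory = []

    def set(self, index, value):
        slice_no, offset = divmod(index, SLICE_SIZE)
        # Grow the directory with empty (None) slots up to the needed slice.
        while len(self.directory) <= slice_no:
            self.directory.append(None)
        if self.directory[slice_no] is None:
            self.directory[slice_no] = {}
        self.directory[slice_no][offset] = value

    def get(self, index):
        slice_no, offset = divmod(index, SLICE_SIZE)
        if slice_no >= len(self.directory) or self.directory[slice_no] is None:
            return None
        return self.directory[slice_no].get(offset)

    def scan(self, start, end):
        """Yield (index, value) for populated slots in [start, end]."""
        first = start // SLICE_SIZE
        last = min(end // SLICE_SIZE, len(self.directory) - 1)
        for slice_no in range(first, last + 1):
            entries = self.directory[slice_no]
            if entries is None:
                continue  # an empty slice costs one check, not 4096
            base = slice_no * SLICE_SIZE
            for offset in sorted(entries):
                index = base + offset
                if start <= index <= end:
                    yield index, entries[offset]
```

In this model a scan over a wide range touches each empty slice only once, which is the intuition behind scans whose cost follows the population rather than the range size.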
## Use cases

Arrays are mostly stateless if not for the fact that each array remembers the index of the latest added item, allowing `ARINSERT` and `ARRING` to work properly. Otherwise it is a set/get at this index game, with solid support for both setting / getting ranges, server-side scanning, returning only populated elements in a time which is proportional not to the range size, but to the population size.

A few concrete examples, that may work as mental models for the set of problems that are similar to them (from the POV of the data modeling).

**Thermometer.** A sensor reporting once per minute, with gaps:

```
ARSET temp:room12:day7 123 22.3
ARGETRANGE temp:room12:day7 600 660   # the 10:00–11:00 window, with NULLs
ARSCAN temp:room12:day7 600 660       # only populated elements
AROP temp:room12:day7 0 1439 MAX      # peak of the day, server-side
```

Missing minutes cost little to nothing. Numeric aggregation runs inside Redis. Telemetry, IoT, meter readings, KPI rollups.

**Calendar.** A clinic with 96 fifteen-minute slots per day:

```
ARSET sched:room12:day 32 booking:991
ARSCAN sched:room12:day 0 95          # only occupied slots
ARGETRANGE sched:room12:day 48 63     # the afternoon full view to render
```

The slot number is the business key in this case. Room booking, parking spaces, warehouse bins, lockers, ...

**Ring buffer.** ARRING replaces the classic LPUSH+LTRIM pattern. Imagine remote `dmesg`.

```
ARRING machine:123 200 "[141087.430123]: arm_cpu_init(): cpu 14 online"   # Capped to 200 entries
ARLASTITEMS machine:123 50 REV                                            # 50 newest first
```

Faster than LPUSH+LTRIM, keep indexed access to past elements. Last-N alarms, recent fraud scores, access history, remote logs, device events. Ok here the use cases are mainly the ones of the old pattern: it is just a better fit and allows to access random items in the middle, aggregate server-side, and so forth.

**Workflow.** Step number is the index, value is the status. Gaps are meaningful:

```
ARSET claim:99172 0 received
ARSET claim:99172 3 waiting:reviewer42
ARSET claim:99172 5 approved
ARGETRANGE claim:99172 0 5   # full workflow view, with NULLs for missing steps
ARSCAN claim:99172 0 5       # only steps that have a state
ARCOUNT claim:99172          # number of recorded steps
ARLEN claim:99172            # highest reached step + 1
```

**Skills knowledge base for agents.** Arrays are good at representing / grepping into Markdown files:

```
ARSET skill:metal_gpu 0 "...."
ARSET skill:metal_gpu 1 "...."
ARSET skill:metal_gpu 2 "...."
ARGREP skill:metal_gpu - + RE "M3|M4" WITHVALUES
```

ARGREP has EXACT, MATCH, GLOB, RE, you can have multiple predicates, can select AND or OR behavior.

**Bulk import results.** Sparse row annotations over millions of rows / CSV / ...:

```
ARSET import:job551 18552 ERR:bad_email
ARSCAN import:job551 0 1000000   # Provides only rows that have something
```

## TLDR

If the position is part of the meaning, use an Array. If you want to aggregate or grep remotely, use an Array.

Feedback welcome :)

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: Shubham S Taple <155555100+ShubhamTaple@users.noreply.github.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
Co-authored-by: Marc Gravell <marc.gravell@gmail.com> |
||
|
|
7cf63635f0
|
Pass size hint to jemalloc for faster deallocation (#15071)
This PR is based on https://github.com/valkey-io/valkey/pull/453 and https://github.com/valkey-io/valkey/pull/694

When jemalloc frees memory, it performs a lookup to find the allocation's size class. `sdallocx()` lets us skip this lookup by passing the size we already know. Introduced a new free function wrapper for this: `zfree_with_size()`.

Note: Impact of this optimization is only visible on hot paths, e.g. on repeated memory deallocations. For the initial phase, I integrated this at `sdsfree()` only. Over time, we may expand the usage of this new API to other performance-sensitive paths.

For testing, added jemalloc config `--enable-opt-size-checks` to the daily fortify build. This makes jemalloc validate that the size passed to `sdallocx()` matches the actual allocation's size class, aborting on mismatch.

----

Signed-off-by: Vadym Khoptynets <vadymkh@amazon.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: ranshid <ranshid@amazon.com> |
||
|
|
15cb40dac2
|
Fix command-docs and corrupt-dump-fuzzer of OBJ_GCRA type (#15055)
### Problem

While the new type `OBJ_GCRA` was added, several related code paths were not updated accordingly, leading to failures in the `reply-schemas-validator` CI job and `corrupt-dump-fuzzer.tcl`.

##### reply-schemas-validator

Failed CI: https://github.com/redis/redis/actions/runs/24485248057/job/71558533290#step:10:903

```shell
Traceback (most recent call last):
  File "/home/runner/work/redis/redis/./utils/req-res-log-validator.py", line 238, in process_file
    jsonschema.validate(instance=res.json, schema=req.schema, cls=schema_validator)
  File "/home/runner/.local/lib/python3.12/site-packages/jsonschema/validators.py", line 1121, in validate
    raise error
jsonschema.exceptions.ValidationError: 'rate_limit' is not valid under any of the given schemas

Failed validating 'oneOf' in schema['patternProperties']['^.*$']['properties']['group']:
    {'description': 'the functional group to which the command belongs',
     'oneOf': [{'const': 'bitmap'}, {'const': 'cluster'}, {'const': 'connection'},
               {'const': 'generic'}, {'const': 'geo'}, {'const': 'hash'},
               {'const': 'hyperloglog'}, {'const': 'list'}, {'const': 'module'},
               {'const': 'pubsub'}, {'const': 'scripting'}, {'const': 'sentinel'},
               {'const': 'server'}, {'const': 'set'}, {'const': 'sorted-set'},
               {'const': 'stream'}, {'const': 'string'}, {'const': 'transactions'}]}

On instance['gcrasetvalue']['group']:
    'rate_limit'
```

##### `corrupt-dump-fuzzer.tcl`

Also fixed `: Fuzzer corrupt restore payloads - sanitize_dump: yes in tests/integration/corrupt-dump-fuzzer.tcl`

Failed daily test: https://github.com/redis/redis/actions/runs/24485248057/job/71558533312#step:6:8652

```shell
Server crashed (by signal: 0, err: key "gcra" not known in dictionary), with payload:
"\x1C\x0A\x02\x5F\x37\xC0\x06\xC0\x00\x02\x5F\x39\xC0\x08\x02\x5F\x33\x02\x5F\x35\x02\x5F\x31\xC0\x02\xC0\x04\x0E\x00\xA9\x71\xBF\xEE\x6F\x46\xEF\xA6"
violating commands:
Done 1434 cycles in 600 seconds.
RESTORE: successful: 601, rejected: 833
Total commands sent in traffic: 1194776, crashes during traffic: 1 (0 by signal).
[: Fuzzer corrupt restore payloads - sanitize_dump: yes in tests/integration/corrupt-dump-fuzzer.tcl
Expected '1' to be equal to '0' (context: type eval line 155 cmd {assert_equal $stat_terminated_in_traffic 0} proc ::test)
[147/147 done]: integration/corrupt-dump-fuzzer (1201 seconds)
```

### Changed

This change completes the necessary updates across all relevant components to ensure consistent handling of the rate_limit group and restores CI stability. |
||
|
|
3cd464263b
|
Fix gen_write_load error on MOVED/ASK during atomic-slot-migration tests (#15016)
|
||
|
|
747dfe578e
|
Add XNACK command for releasing stream messages back to the group (#14797)
### Overview
This PR enhances Redis Streams consumer groups by adding a new `XNACK`
command that allows consumers to explicitly release pending messages
back to the group without acknowledging them. Released (NACKed) entries
become immediately available for re-delivery to other consumers,
eliminating the idle-timeout delay currently required for message
recovery. The command supports three modes — SILENT, FAIL, and FATAL —
giving consumers fine-grained control over delivery counter semantics to
handle graceful shutdowns, transient failures, and poison messages
respectively.
### Problem Statement
For developers using Redis Streams with consumer groups, there are
several common scenarios where a consumer needs to release a message it
has claimed without acknowledging it:
1. **Transient internal failures**: A consumer may fail to process a
message because of problems unrelated to the message itself — for
example, it cannot connect to an external service to fetch required
context. The message is perfectly valid and should be retried promptly
by another consumer.
2. **Resource pressure**: A consumer under resource stress (low CPU, low
memory) may be unable to handle a specific message (e.g., a complex or
large message) within acceptable QoS. It should leave the opportunity to
other consumers in the group, with minimal delay.
3. **Graceful shutdown**: A consumer about to shut down would like to
immediately release all unprocessed messages it has claimed, so they can
be picked up by remaining consumers without waiting for idle timeouts.
4. **Poison / malicious messages**: A consumer may detect or suspect
that a claimed message is invalid or malicious and wants to mark it as
permanently failed (for dead-letter queue processing when available).
**Currently, a consumer cannot NACK a message.** It can either:
- **XACK** it — marks it as "processed" and removes it from the PEL
entirely, losing the ability to redeliver it
- **Leave it pending** — requires other consumers to discover it via
`XPENDING` and claim it via `XCLAIM`/`XAUTOCLAIM` or `XREADGROUP CLAIM`
after the idle timeout expires, introducing a long, unnecessary delay
In all these cases, the logic that applications must implement
introduces **message handling delays**, **implementation complexity**,
and **code duplication** across consumer implementations.
### Solution
Introduces a new `XNACK` (Negative ACKnowledge) command that explicitly
releases pending messages from their owning consumer back to the group's
PEL, making them immediately claimable via `XCLAIM` and `XAUTOCLAIM`,
and prioritized for re-delivery in `XREADGROUP CLAIM`:
```
XNACK key group <SILENT|FAIL|FATAL> IDS numids id [id ...] [RETRYCOUNT count] [FORCE]
```
When executed, the command:
1. **Disassociates** the entry from its owning consumer (`consumer =
NULL`)
2. **Repositions** the entry to the head of the PEL time-ordered list
(`delivery_time = 0`), making it immediately claimable with any
`min-idle-time` threshold
3. **Adjusts the delivery counter** based on the specified mode, giving
consumers fine-grained control over retry semantics
4. **Returns** the count of successfully NACKed entries
**Mode** controls the delivery counter adjustment and communicates the
reason for the NACK:
| Mode | Delivery Counter Behavior | Use Case |
|----------|---------------------------------------------------|---------------------------------------------|
| `SILENT` | Decrement by 1 (undo the delivery increment) | Consumer shutdown / transient internal error: the delivery "didn't count" |
| `FAIL` | No change (keep the incremented value) | Message too complex for this consumer, but may work for others: count this as an attempt |
| `FATAL` | Set to `LLONG_MAX` | Invalid / suspected malicious message: mark as permanently failed |
The three modes map directly to the real-world scenarios above:
- **SILENT** for graceful shutdown or transient failures unrelated to
the message
- **FAIL** for resource-constrained consumers that cannot handle a
specific message
- **FATAL** for poison message detection and dead-letter queue
integration
**Optional parameters:**
- **`RETRYCOUNT count`**: Directly sets `delivery_count` to the
specified value, overriding the mode-based adjustment
- **`FORCE`**: Creates new unowned PEL entries for IDs that are not
already in the group PEL (the entry must exist in the stream). When
`FORCE` creates an entry, the delivery counter is set to `0` (or to
`RETRYCOUNT` if specified, or to `LLONG_MAX` if mode is `FATAL`). This
is used internally for AOF rewrite and replication.
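
The counter rules for an entry that is already in the PEL can be summarized in a few lines. This is a hedged Python sketch of the semantics described above, not the server code: the function name `nacked_delivery_count` is hypothetical, and two points are assumptions that the text does not spell out, namely that `RETRYCOUNT` takes precedence over every mode (including `FATAL`) and that `SILENT` never pushes the counter below zero.

```python
LLONG_MAX = 2**63 - 1  # value of C's LLONG_MAX on common 64-bit platforms


def nacked_delivery_count(current, mode, retrycount=None):
    """Return the delivery counter of an existing PEL entry after XNACK."""
    if retrycount is not None:
        # Assumption: RETRYCOUNT wins over every mode, including FATAL.
        return retrycount
    if mode == "SILENT":
        # Undo the delivery increment.
        # Assumption: the counter is clamped so it never drops below zero.
        return max(current - 1, 0)
    if mode == "FAIL":
        # Keep the incremented value: the delivery counts as an attempt.
        return current
    if mode == "FATAL":
        # Mark the entry as permanently failed.
        return LLONG_MAX
    raise ValueError(f"unknown XNACK mode: {mode!r}")
```

A consumer that reads an entry (counter becomes 1) and then releases it with `SILENT` leaves the counter at 0 in this model, as if the delivery had never happened.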
### Response Format
The command returns an integer — the number of messages successfully
NACKed (released back to the group PEL):
```
127.0.0.1:6379> XADD mystream 1-0 f v1
"1-0"
127.0.0.1:6379> XADD mystream 2-0 f v2
"2-0"
127.0.0.1:6379> XGROUP CREATE mystream grp 0
OK
127.0.0.1:6379> XREADGROUP GROUP grp c1 STREAMS mystream >
1) 1) "mystream"
   2) 1) 1) "1-0"
         2) 1) "f"
            2) "v1"
      2) 1) "2-0"
         2) 1) "f"
            2) "v2"
127.0.0.1:6379> XNACK mystream grp FAIL IDS 2 1-0 2-0
(integer) 2
```
After XNACK, the entries appear with an empty consumer in XPENDING
output:
```
127.0.0.1:6379> XPENDING mystream grp - + 10
1) 1) "1-0"
   2) ""
   3) (integer) -1
   4) (integer) 1
2) 1) "2-0"
   2) ""
   3) (integer) -1
   4) (integer) 1
```
### NACK Zone: Data Structure Extension
To support unowned PEL entries and ensure they are prioritized for
re-delivery, a **NACK zone** is introduced at the head of the existing
PEL time-ordered doubly-linked list. A new `pel_nack_tail` pointer is
added to the `streamCG` structure:
**PEL ordering:**
```
[pel_time_head] <-> ... <-> [pel_nack_tail] <-> [owned entries...] <-> [pel_time_tail]
|_____________ NACK zone ______________| |_______ normal PEL ________|
```
The head of the PEL contains all NACKed messages (FIFO-ordered),
followed by all delivered messages that were not NACKed (same order as
today). This ensures NACKed messages are always prioritized over idle
pending messages.
The delivery order for `XREADGROUP` is therefore:
1. If `CLAIM` was specified: first deliver NACKed messages, then deliver
due pending messages (current behavior)
2. Deliver new entries after the group's last-delivered-id (current
behavior)
**Structure Design:**
- NACKed entries occupy positions from `pel_time_head` to
`pel_nack_tail` in the time-ordered list
- Their `delivery_time` is set to `0`, ensuring they always appear
"oldest" and are immediately claimable
- Their `consumer` pointer is set to `NULL`, marking them as unowned
- `pel_nack_tail` is `NULL` when no NACKed entries exist
**Key Properties:**
- **O(1) insertion**: New NACKed entries are inserted right after
`pel_nack_tail` (or at the list head if the zone is empty)
- **FIFO ordering** among NACKed entries: entries stay in the order in
which they were released
- **Immediate claimability**: Since `delivery_time = 0`, NACKed entries
have maximum idle time and satisfy any `min-idle-time` threshold in
`XCLAIM` and `XAUTOCLAIM`. In `XREADGROUP CLAIM`, NACKed entries are
also prioritized over other pending entries due to their position at the
head of the PEL.
- **Zone integrity**: The `pelListInsertSorted` function is updated to
stop scanning at the `pel_nack_tail` boundary, ensuring owned entries
are never placed inside the NACK zone
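
The zone invariant can be illustrated with a toy model. This is a hedged Python sketch, not the Redis data structure: `PendingList` and its methods are hypothetical names, a Python list stands in for the doubly-linked list (so insertion here is not O(1)), and the integer `nack_count` plays the role of the `pel_nack_tail` boundary. Re-NACKing an already unowned entry is not covered by the description; this model simply moves it to the end of the zone.

```python
class PendingList:
    """Toy model of the time-ordered PEL with a NACK zone at its head."""

    def __init__(self):
        self.entries = []    # dicts with "id", "consumer", "delivery_time"
        self.nack_count = 0  # entries[:nack_count] form the NACK zone

    def insert_owned(self, entry_id, consumer, delivery_time):
        # Scan backwards from the tail and stop at the zone boundary,
        # so an owned entry can never land inside the NACK zone.
        pos = len(self.entries)
        while (pos > self.nack_count
               and self.entries[pos - 1]["delivery_time"] > delivery_time):
            pos -= 1
        self.entries.insert(pos, {"id": entry_id, "consumer": consumer,
                                  "delivery_time": delivery_time})

    def nack(self, entry_id):
        """Release an entry into the NACK zone; return 1 if found, else 0."""
        for pos, entry in enumerate(self.entries):
            if entry["id"] == entry_id:
                break
        else:
            return 0
        self.entries.pop(pos)
        if pos < self.nack_count:
            self.nack_count -= 1  # it was already inside the zone
        entry["consumer"] = None      # unowned
        entry["delivery_time"] = 0    # always looks "oldest"
        # Append at the end of the zone to keep FIFO order among NACKs.
        self.entries.insert(self.nack_count, entry)
        self.nack_count += 1
        return 1

    def ids(self):
        return [entry["id"] for entry in self.entries]
```

Reading `ids()` from left to right gives the order in which a claim pass would meet the entries in this model: released entries first, in release order, then owned entries by delivery time.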
### Impact on Existing Commands
All commands that interact with the PEL are updated to handle unowned
(`consumer = NULL`) entries:
- **XPENDING**: Shows NACKed entries with an empty consumer name
- **XCLAIM / XAUTOCLAIM**: Can claim NACKed entries (they satisfy any
min-idle-time since `delivery_time = 0`)
- **XREADGROUP CLAIM**: NACKed entries are picked up by the claim phase
- **XACK**: Works correctly on NACKed entries (removes from group PEL)
- **XINFO STREAM FULL**: Displays NACKed entries with an empty consumer
name
- **XGROUP DELCONSUMER**: Unaffected — NACKed entries are not in any
consumer's PEL
Propagation is also updated: when `XCLAIM` or `XAUTOCLAIM` encounters a
deleted stream entry for an unowned NACK, it propagates `XACK` (instead
of `XCLAIM`) to replicas and AOF, since there is no source consumer to
reference.
### Persistence
**RDB:**
- A new RDB type `RDB_TYPE_STREAM_LISTPACKS_5` (type 27) is introduced
- After saving consumer PEL entries, the NACK zone stream IDs are saved
separately (count + encoded IDs)
- On load, NACK zone entries are reconstructed by looking them up in the
group PEL, unlinking from their sorted position, and re-inserting into
the NACK zone via `pelListInsertNacked`
- Backward compatibility is preserved: old RDB types continue to load
with the existing validation (all entries must have consumers)
**AOF:**
- AOF rewrite emits `XNACK <key> <group> FAIL IDS <n> <id...> RETRYCOUNT
<cnt> FORCE` commands for entries in the NACK zone
- Consecutive entries with the same `delivery_count` are batched into a
single command (up to `AOF_REWRITE_ITEMS_PER_CMD` IDs per command)
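
The batching rule can be sketched as follows. This is a hedged Python illustration, not the rewrite code itself: the function name `xnack_rewrite_commands` and the `(entry_id, delivery_count)` input shape are assumptions, and the batch limit is passed in as a parameter rather than hard-coding the value of `AOF_REWRITE_ITEMS_PER_CMD`.

```python
def xnack_rewrite_commands(key, group, nack_zone, items_per_cmd):
    """Build XNACK commands for (entry_id, delivery_count) pairs.

    nack_zone is expected in NACK-zone order; only consecutive entries
    that share a delivery_count are merged into one command.
    """
    batches = []  # list of (delivery_count, [entry ids])
    for entry_id, delivery_count in nack_zone:
        last = batches[-1] if batches else None
        if (last is not None and last[0] == delivery_count
                and len(last[1]) < items_per_cmd):
            last[1].append(entry_id)
        else:
            batches.append((delivery_count, [entry_id]))
    return [
        ["XNACK", key, group, "FAIL", "IDS", str(len(ids)), *ids,
         "RETRYCOUNT", str(count), "FORCE"]
        for count, ids in batches
    ]
```

Because `RETRYCOUNT` sets the counter directly and `FORCE` recreates missing entries, replaying these commands in order would rebuild the zone with the same counters and the same FIFO order, under the assumptions above.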
### Defragmentation
The defragmentation logic is restructured to handle unowned entries:
- **`defragStreamCGPendingEntry`** (new): Walks the group-level PEL rax,
defragments each NACK, updates the doubly-linked list pointers
(`pel_prev`, `pel_next`), `pel_time_head`, `pel_time_tail`,
`pel_nack_tail`, and the consumer PEL back-pointer for owned entries
- **`defragStreamConsumerPendingEntry`** (simplified): Only fixes up
back-pointers to the possibly-relocated consumer and CG, since actual
defragmentation is now done at the group-level walk. Unowned (NACK zone)
entries have no consumer PEL walk, so the group-level pass is their only
chance
### Key Benefits
- **Immediate re-delivery**: NACKed entries are instantly claimable by
other consumers via `XCLAIM` and `XAUTOCLAIM` (since `delivery_time = 0`
satisfies any `min-idle-time`), and prioritized for re-delivery in
`XREADGROUP CLAIM`, eliminating idle-time delays that can range from
seconds to minutes
- **Explicit release semantics**: Consumers can release messages
intentionally, with fine-grained control over retry behavior — a
capability that exists in competing systems like RabbitMQ
- **Flexible retry control**: Three modes (SILENT, FAIL, FATAL) plus
RETRYCOUNT cover the full spectrum of failure handling strategies, from
graceful shutdown to poison message detection
- **Reduced application complexity**: Eliminates the need for
application-level workarounds involving XPENDING polling, arbitrary idle
timeouts, and manual XCLAIM orchestration
- **Dead-letter queue readiness**: FATAL mode + delivery count enables
straightforward poison message detection and future DLQ integration
- **Backward compatibility**: Fully optional new command with zero
breaking changes to existing behavior
|
||
|
|
e86882efe9
|
RED-184929: Auto-backups and restore test configuration (#14753)
# Automatic Test Configuration Restoration
This PR introduces an infrastructure improvement in the TCL testing
framework.
## Problem Statement
When running multiple tests under the same server, or in external server
mode, configuration parameters set by one test can leak into and pollute
the tests that follow.
This has caused us many problems in the past, and it's a nuisance having
to fix these when they occur. Alternatively, handling the backup and
restoration manually and explicitly when writing the test adds "noise"
that makes tests longer (lines of code wise) than necessary.
Note also that even if there is explicit configuration restoration code
at the end of a test, that code will not run in case the test fails and
exits before reaching the restoration code.
## Objective
Every test should be completely isolated (unless explicitly designed to
depend on previous tests): It should set up its environment from
scratch, and clean up after itself in case it might affect subsequent
tests. Preferably, this should happen automatically on behalf of the
test writer.
The purpose is to have a mechanism at the level of the individual test
that will perform automatic restoration of the configuration to what it
was before the test started.
This improvement should be an opt-in option. It shouldn't change
existing tests (there are existing cases in which there is a sequence of
tests, wherein the latter tests depend on changes made by the former
ones). It should allow explicit update of existing tests to use the new
mechanism, but this will occur only when explicitly triggered.
## The Solution
This PR introduces such a mechanism.
To trigger automatic restoration of changed configuration, a new
`config:restore` tag should be attached to the test. Once the tag is
attached, the configuration will be automatically restored to its state
at the beginning of the test.
### Usage Example
```tcl
test "Modify maxmemory temporarily" {
r config set maxmemory 100000000
# ... test that needs specific maxmemory ...
} {} {config:restore}
# maxmemory automatically restored to original value
```
## Covered Scenarios
- Basic Single-Server Tests
- Cluster Scenarios
- Multi-Server (Nested `start_server`) Scenarios
- Failure Handling (tests failing due to various reasons)
## Implementation Details
### Infrastructure Changes
**`save_server_configs`** (test.tcl) - Captures configuration state for
all servers in the `::servers` stack:
- Iterates through all active servers
- Executes `CONFIG GET *` on each server
- Returns a list of `[server_index, config_dict]` pairs
**`restore_server_configs`** (test.tcl) - Restores configurations using
diff-based detection:
- Compares current config to saved config
- Only restores parameters that actually changed (optimization)
- Checks server responsiveness before attempting restoration
**`test` procedure changes** (test.tcl):
- Detects `config:restore` tag in the tags list
- Calls `save_server_configs` before test execution
- Calls `restore_server_configs` after test completion (success path)
and before re-raising errors (failure path)
**`ping_server_with_timeout`** (server.tcl) - Non-blocking server
responsiveness check:
- Prevents restoration from hanging on unresponsive servers
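
The diff-based restoration step can be sketched independently of the Tcl framework. This is a hedged Python illustration of the idea only: `changed_settings`, `restore_config`, and the `config_set` callback are hypothetical names, and plain dicts stand in for the `CONFIG GET *` snapshots. The responsiveness check and the multi-server loop are left out.

```python
def changed_settings(saved, current):
    """Return {name: saved_value} for parameters that differ from the snapshot."""
    return {name: value for name, value in saved.items()
            if current.get(name) != value}


def restore_config(saved, current, config_set):
    """Restore only the parameters that changed.

    config_set is a caller-supplied callback taking (name, value); in a
    real test run it would issue CONFIG SET against the server.
    """
    restored = changed_settings(saved, current)
    for name, value in restored.items():
        config_set(name, value)
    return restored
```

Restoring only the differences keeps the number of `CONFIG SET` calls proportional to what the test actually touched.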
|
||
|
|
62059a2438
|
Chore complete Tcl 9 support and fix regressions in test suite (#14845)
## Problem
PR https://github.com/redis/redis/pull/14787 introduced **Tcl 9 support
for the test suite**, but it still fails on my machine (**macOS 26.3,
Tcl 9.0.3**). Some tests fail and the runner may hang.
Example:
```bash
$ tclsh <<<'puts $tcl_version'
9.0.3
$ make test
[err]: BITCOUNT against test vector #2
Expected [r bitcount str] == 4
```
This is caused by **behavior changes in Tcl 9**, including:
- `string length` returning **character count** instead of **byte
count**
- binary sockets rejecting characters with code points **>255**
- differences in `string is wideinteger`
Parts of the Redis Tcl test framework rely on **byte-oriented
behavior**, which breaks under Tcl 9.
## Changes
### 1. Fix IPC payload encoding in test runner
`tests/unit/memefficiency.tcl` contains a **non-ASCII quote character**:
|
||
|
|
707757e478
|
Support Tcl 9.0 in Redis test suite (#14787)
## Summary
This PR adds support for running the Redis test suite using Tcl 9.0.
## Changes
- **runtest**: Added `9.0` to the list of Tcl versions the script
searches for.
- **Version Requirements**: Updated `package require Tcl` from `8.5` to
`8.5-10` in key test files. In Tcl, a simple version requirement like
`8.5` is interpreted as "8.5 or higher within major version 8".
Specifying the range `8.5-10` allows Tcl 9.0 to be used if the code is
compatible.
- **Tcl Precision**: Wrapped `set tcl_precision 17` in a conditional
check `if {$tcl_version < 9.0}`. In Tcl 9.0, `tcl_precision` has been
removed as double-to-string conversions are now lossless by default.
- Adjusts the Tcl Redis client to preserve binary arguments under Tcl 9
by only UTF-8 converting strings that contain non-byte characters before
building the RESP command.
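
The last point, converting to UTF-8 only when a string contains non-byte characters, can be illustrated outside Tcl. This is a hedged Python analogue, not the Tcl client code: `encode_argument` and `build_resp_command` are hypothetical names, and treating a string whose code points are all at most 255 as raw bytes is modeled here with the `latin-1` codec.

```python
def encode_argument(arg: str) -> bytes:
    """Encode one command argument for the wire."""
    if all(ord(ch) <= 0xFF for ch in arg):
        # Every character fits in one byte: pass it through as raw bytes,
        # which keeps binary arguments intact.
        return arg.encode("latin-1")
    # At least one character is outside the byte range: send it as UTF-8.
    return arg.encode("utf-8")


def build_resp_command(*args: str) -> bytes:
    """Serialize arguments as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = encode_argument(arg)
        # The bulk length is the byte count, not the character count.
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)
```

The bulk-string length must be computed on the encoded bytes, which is the same byte-versus-character distinction that matters for the test suite under Tcl 9.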
## Testing
Verified that `tclsh9.0` successfully parses the updated `package
require` and handles the conditional `tcl_precision` assignment. This
allows the test suite to run on modern Linux distributions where Tcl 9.0
is the default version.
|
||
|
|
e3c38aab66
|
Handle primary/replica clients in IO threads (#14335)
# Problem

While introducing async IO threads (https://github.com/redis/redis/pull/13695), primary and replica clients were left to be handled inside the main thread due to data race and synchronization issues. This PR solves this issue, with the additional hope that it increases replication performance.

# Overview

## Moving the clients to IO threads

Since clients first go through a handshake and an RDB replication phase, it was decided to move them to an IO thread after RDB replication is done. For the primary client this was trivial, as the master client is created only after RDB sync (plus some additional checks one can see in `isClientMustHandledByMainThread`). Replica clients, though, are moved to IO threads immediately after connection (as are all clients), so currently in `unstable` replication happens while this client is in an IO thread. In this PR it is moved to the main thread after receiving the first `REPLCONF` message from the replica, but this is a bit hacky and we can remove it. I didn't find issues between the two versions.

## Primary client (replica node)

We have a few issues here:

- During `serverCron`, `replicationCron` is run, which periodically sends a `REPLCONF ACK` message to the master and also checks for a timed-out master. To prevent data races we utilize `IOThreadClientsCron`: the client is periodically sent to the main thread, and during `processClientsFromIOThread` it is checked whether it needs to run the replication cron behaviour.
- Data races with the main thread: specifically, the `lastinteraction` and `read_reploff` members of the primary client, which are written to in `readQueryFromClient`, could be accessed at the same time from the main thread during execution of `INFO REPLICATION` (`genRedisInfoString`). To solve this the members were duplicated: while the client is in an IO thread it writes to the duplicates, and they are synced with the original variables each time the client is sent to the main thread (which means `INFO REPLICATION` could potentially return stale values).
- During `freeClient` the primary client is fetched to the main thread, but when caching it (`replicationCacheMaster`) the thread id remains the id of the IO thread it came from. This creates problems when resurrecting the master client. Here the call to `unbindClientFromIOThreadEventLoop` in `freeClient` was rewritten to call `keepClientInMainThread`, which automatically fixes the problem.
- During `exitScriptTimedoutMode` the master is queued for reprocessing (specifically, to process any pending commands ASAP after it is unblocked). We do that by putting it in the `server.unblocked_clients` list, which is processed in the next `beforeSleep` cycle in the main thread. Since this would create contention between the main and IO threads, we skip this queueing in `unblocked_clients` and just queue the client to the main thread; `processClientsFromIOThread` will process the pending commands just as main would have.

## Replica clients (primary node)

We move the client after RDB replication is done and after the replication backlog is fed with its first message. We do that so that the client's reference to the first replication backlog node is initialized before it is read from the IO thread, hence no contention with the main thread on it.

### Shared replication buffer

Currently in unstable the replication buffer is shared amongst clients. This is done via clients holding references to the nodes inside the buffer. A node from the buffer can be trimmed once each replica client has read it and sent its contents. The reference is `client->ref_repl_buf_node`. The replication buffer is written to by the main thread in `feedReplicationBuffer`, and the refcounting is intrusive: it lives inside the replication-buffer nodes themselves. Since the replica client changes the refcount during `writeToClient` (it decreases the refcount of the node it has just read and increases the refcount of the next node it starts to read), we have a data race with the main thread when it feeds the replication buffer.
Moreover, main thread also updates the `used` size of the node - how much it has written to it, compared to its capacity which the replica client relies on to know how much to read. Obviously replica being in IO-thread creates another data race here. To mitigate these issues a few new variables were added to the client's struct: - `io_curr_repl_node` - starting node this replica is reading from inside IO-thread - `io_bound_repl_node` - the last node in the replication buffer the replica sees before being send to IO-thread. These values are only allowed to be updated in main thread. The client keeps track of how much it has read into the buffer via the old `ref_repl_buf_node`. Generally while in IO-thread the replica client will now keep refcount of the `io_curr_repl_node` until it's processed all the nodes up to `io_bound_repl_node` - at that point its returned to main thread which can safely update the refcounts. The `io_bound_repl_node` reference is there so the replica knows when to stop reading from the repl buffer - imagine that replica reads from the last node of the replication buffer while main thread feeds data to it - we will create a data race on the `used` value (`_writeToClientSlave`(IO-thread) vs `feedReplicationBuffer`(main)). That's why this value is updated just before the replica is being send to IO thread. *NOTE*, this means that when replicas are handled by IO threads they will hold more than one node at a time (i.e `io_curr_repl_node` up to `io_bound_repl_node`) meaning trimming will happen a bit less frequently. Tests show no significant problems with that. (tnx to @ShooterIT for the `io_curr_repl_node` and `io_bound_repl_node` mechanism as my initial implementation had similar semantics but was way less clear) Example of how this works: * Replication buffer state at time N: | node 0| ... 
| node M, used_size K | * replica caches `io_curr_repl_node`=0, `io_bound_repl_node`=M and `io_bound_block_pos`=K * replica moves to IO thread and processes all the data it sees * Replication buffer state at time N + 1: | node 0| ... | node M, used_size Full | |node M + 1| |node M + 2, used_size L|, where Full > K * replica moves to main thread at time N + 1, at which point the following happens - refcount to node 0 (io_curr_repl_node) is decreased - `ref_repl_buf_node` becomes node M (io_bound_repl_node) (we still have Full - K bytes to process from there) - refcount to node M is increased (now all nodes from 0 up to and including M-1 can be trimmed unless some other replica holds a reference to them) - And just before the replica is sent back to the IO thread the following are updated: - `io_bound_repl_node` ref becomes node M+2 - `io_bound_block_pos` becomes L Note that the replica client is only moved to main if it has processed all the data it knows about (i.e. up to `io_bound_repl_node` + `io_bound_block_pos`) ### Replica clients kept in main as much as possible During implementation an issue arose: how fast can the replica client learn about new data in the replication buffer, and how fast can it trim it? In order for that to happen ASAP, whenever a replica is moved to main it remains there until the replication buffer is fed new data. At that point it's put in the pending write queue and special-cased in `handleClientsWithPendingWrites` so that it's sent to the IO thread ASAP to write the new data to the replica. Also, since each time the replica writes all the repl data it knows about, after it's sent to the main thread `processClientsFromIOThread` is able to immediately update the refcounts and trim whatever it can. ### ACK messages from primary Replica clients need to periodically read `REPLCONF ACK` messages from their replicas. Since a replica client can remain in the main thread indefinitely if no DB change occurs, a new atomic `pending_read` was added during `readQueryFromClient`. 
If a replica client has a pending read it's returned to the IO thread in order to process the read even if there is no pending repl data to write. ### Replicas during shutdown During shutdown the main thread pauses write actions and periodically checks if all replicas have reached the same replication offset as the primary node. During `finishShutdown` that may or may not be the case. Either way, client data may be read from the replicas, and we may even try to write any pending data to them inside `flushSlavesOutputBuffers`. In order to prevent races, all the replicas in IO threads are moved to main via `fetchClientFromIOThread`. Cancelling the shutdown should be OK, since the mechanism employed by `handleClientsWithPendingWrites` should return the client to the IO thread when needed. ## Notes While adding new tests, timing issues with Tsan tests were found and fixed. Also, there is a data race caught by Tsan on the `last_error` member of the `client` struct. It happens when both an IO thread and the main thread make a syscall using a `client` instance. This can happen only for primary and replica clients, since their data can be accessed by commands sent from other clients. A specific example is the `INFO REPLICATION` command. Although other such races were fixed, as described above, this one is insignificant and it was decided to ignore it in `tsan.sup`. --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com> |
||
|
|
d2da5cca37
|
Fix timeout waiting for blocked clients in pause test (#14716)
To verify the pause duration, we need to wait for the client to be unpaused and the command to complete, so add `$rd read` to wait for the command to finish. The test failure was caused by `$rd` still being blocked and not closed in the previous test, so the next test saw 2 blocked clients instead of 1 and failed. |
||
|
|
82fbf213eb
|
fix test tag leakage that can result in skipping tests (#14572)
Some error handling paths didn't remove the tags they added, but most importantly, if the start_server proc is given the "tags" argument more than once, on exit it removed only the last one. This problem exists in start_cluster in list.tcl, and the result was that the "external:skip cluster modules" tags were not removed. |
||
|
|
32497c0a5f |
Fix MurmurHash64A overflow in HyperLogLog with 2GB+ entries
The MurmurHash64A function in hyperloglog.c used an int parameter for length, causing integer overflow when processing PFADD entries larger than 2GB. This could lead to server crashes. Changed the len parameter from int to size_t to properly handle large inputs up to SIZE_MAX in HyperLogLog operations. Refer to the implementation in facebook/mcrouter@2dbee3d/mcrouter/lib/fbi/hash.c#L54 |
||
|
|
2bc4e0299d
|
Add Atomic Slot Migration (ASM) support (#14414)
## <a name="overview"></a> Overview This PR is a joint effort with @ShooterIT . I’m just opening it on behalf of both of us. This PR introduces Atomic Slot Migration (ASM) for Redis Cluster — a new mechanism for safely and efficiently migrating hash slots between nodes. Redis Cluster distributes data across nodes using 16384 hash slots, each owned by a specific node. Sometimes slots need to be moved — for example, to rebalance after adding or removing nodes, or to mitigate a hot shard that’s overloaded. Before ASM, slot migration was non-atomic and client-dependent, relying on CLUSTER SETSLOT, GETKEYSINSLOT, MIGRATE commands, and client-side handling of ASK/ASKING replies. This process was complex, error-prone, slow and could leave clusters in inconsistent states after failures. Clients had to implement redirect logic, multi-key commands could fail mid-migration, and errors often resulted in orphaned keys or required manual cleanup. Several related discussions can be found in the issue list, some examples: https://github.com/redis/redis/issues/14300 , https://github.com/redis/redis/issues/4937 , https://github.com/redis/redis/issues/10370 , https://github.com/redis/redis/issues/4333 , https://github.com/redis/redis/issues/13122, https://github.com/redis/redis/issues/11312 Atomic Slot Migration (ASM) makes slot rebalancing safe, transparent, and reliable, addressing many of the limitations of the legacy migration method. Instead of moving keys one by one, ASM replicates the entire slot’s data plus live updates to the target node, then performs a single atomic handoff. Clients keep working without handling ASK/ASKING replies, multi-key operations remain consistent, failures don’t leave partial states, and replicas stay in sync. The migration process also completes significantly faster. Operators gain new commands (CLUSTER MIGRATION IMPORT, STATUS, CANCEL) for monitoring and control, while modules can hook into migration events for deeper integration. 
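To make the slot model mentioned above concrete, here is a minimal Python sketch of how a key maps to one of the 16384 hash slots. It follows the mapping described in the Redis Cluster specification (CRC16/XMODEM of the key, or of its hash tag, modulo 16384). The function names `crc16_xmodem` and `key_hash_slot` are chosen for illustration and are not Redis identifiers.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16/XMODEM (polynomial 0x1021, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: bytes) -> int:
    """Map a key to a slot in [0, 16383], honouring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384


if __name__ == "__main__":
    for k in (b"foo", b"{user1000}.following", b"{user1000}.followers"):
        print(k.decode(), key_hash_slot(k))
```

Keys sharing a hash tag land in the same slot, which is why multi-key commands depend on a slot being owned by exactly one node at any moment.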
### The problems of legacy method in detail Operators and developers ran into multiple issues with the legacy method, some of these issues in detail: 1. **Redirects and Client Complexity:** While a slot was being migrated, some keys were already moved while others were not. Clients had to handle `-ASK` and `-ASKING` responses, reissuing requests to the target node. Not all client libraries implemented this correctly, leading to failed commands or subtle bugs. Even when implemented, it increased latency and broke naive pipelines. 2. **Multi-Key Operations Became Unreliable:** Commands like `MGET key1 key2` could fail with `TRYAGAIN` if part of the slot was already migrated. This made application logic unpredictable during resharding. 3. **Risk of failure:** Keys were moved one-by-one (with MIGRATE command). If the source crashed, or the destination ran out of memory, the system could be left in an inconsistent state: some keys moved, others lost, slots partially migrated. Manual intervention was often needed, sometimes resulting in data loss. 4. **Replica and Failover Issues:** Replicas weren’t aware of migrations in progress. If a failover occurred mid-migration, manual intervention was required to clean up or resume the process safely. 5. **Operational Overhead:** Operators had to coordinate multiple commands (CLUSTER SETSLOT, MIGRATE, GETKEYSINSLOT, etc.) with little visibility into progress or errors, making rebalancing slow and error-prone. 6. **Poor performance:** Key-by-key migration was inherently slow and inefficient for large slot ranges. 7. **Large keys:** Large keys could fail to migrate or cause latency spikes on the destination node. ### How Atomic Slot Migration Fixes This Atomic Slot Migration (ASM) eliminates all of these issues by: 1. **Clients:** Clients no longer need to handle ASK/ASKING; the migration is fully transparent. 2. 
**Atomic ownership transfer:** The entire slot’s data (snapshot + live updates) is replicated and handed off in a single atomic step. 3. **Performance**: ASM completes migrations significantly faster by streaming slot data in parallel (snapshot + incremental updates) and eliminating key-by-key operations. 4. **Consistency guarantees:** Multi-key operations and pipelines continue to work reliably throughout migration. 5. **Resilience:** Failures no longer leave orphaned keys or partial states; migration tasks can be retried or safely cancelled. 6. **Replica awareness:** Replicas remain consistent during migration, and failovers will no longer leave partially imported keys. 7. **Operator visibility:** New CLUSTER MIGRATION subcommands (IMPORT, STATUS, CANCEL) provide clear observability and management for operators. ### ASM Diagram and Migration Steps ``` ┌─────────────┐ ┌────────────┐ ┌───────────┐ ┌───────────┐ ┌───────┐ │ │ │Destination │ │Destination│ │ Source │ │Source │ │ Operator │ │ master │ │ replica │ │ master │ │ Fork │ │ │ │ │ │ │ │ │ │ │ └──────┬──────┘ └─────┬──────┘ └─────┬─────┘ └─────┬─────┘ └───┬───┘ │ │ │ │ │ │ │ │ │ │ │CLUSTER MIGRATION IMPORT │ │ │ │ │ <start-slot> <end-slot>..│ │ │ │ ├───────────────────────────►│ │ │ │ │ │ │ │ │ │ Reply with <task-id> │ │ │ │ │◄───────────────────────────┤ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ CLUSTER SYNCSLOTS│SYNC │ │ │ CLUSTER MIGRATION STATUS │ <task-id> <start-slot> <end-slot>.│ │ Monitor │ ID <task-id> ├────────────────────────────────────►│ │ task ┌─►├───────────────────────────►│ │ │ │ state │ │ │ │ │ │ till │ │ Reply status │ Negotiation with multiple channels │ │ completed └─ │◄───────────────────────────┤ (i.e rdbchannel repl) │ │ │ │◄───────────────────────────────────►│ │ │ │ │ │ Fork │ │ │ │ ├──────────►│ ─┐ │ │ │ │ │ │ Slot snapshot as RESTORE commands │ │ │ │◄────────────────────────────────────────────────┤ │ │ Propagate │ │ │ │ ┌─────────────┐ ├─────────────────►│ │ │ │ │ │ │ │ │ │ │ Snapshot │ Client 
│ │ │ │ │ │ delivery │ │ │ Replication stream for slot range │ │ │ duration └──────┬──────┘ │◄────────────────────────────────────┤ │ │ │ │ Propagate │ │ │ │ │ ├─────────────────►│ │ │ │ │ │ │ │ │ │ │ SET key value1 │ │ │ │ │ ├─────────────────────────────────────────────────────────────────►│ │ │ │ +OK │ │ │ │ ─┘ │◄─────────────────────────────────────────────────────────────────┤ │ │ │ │ │ │ │ │ Drain repl stream │ ──┐ │ │ │◄────────────────────────────────────┤ │ │ │ SET key value2 │ │ │ │ │ ├─────────────────────────────────────────────────────────────────►│ │Write │ │ │ │ │ │pause │ │ │ │ │ │ │ │ │ Publish new config via cluster bus │ │ │ │ +MOVED ├────────────────────────────────────►│ ──┘ │ │◄─────────────────────────────────────────────────────────────────┤ ──┐ │ │ │ │ │ │ │ │ │ │ │ │Trim │ │ │ │ │ ──┘ │ │ SET key value2 │ │ │ │ ├───────────────────────────►│ │ │ │ │ +OK │ │ │ │ │◄───────────────────────────┤ │ │ │ │ │ │ │ │ │ │ │ │ │ ``` ### New commands introduced There are two new commands: 1. A command to start, monitor and cancel the migration operation: `CLUSTER MIGRATION <arg>` 2. An internal command to manage slot transfer between source and destination: `CLUSTER SYNCSLOTS <arg>` For more details, please refer to the [New Commands](#new-commands) section. Internal command messaging is mostly omitted in the diagram for simplicity. ### Steps 1. Slot migration begins when the operator sends `CLUSTER MIGRATION IMPORT <start-slot> <end-slot> ...` to the destination master. The process is initiated from the destination node, similar to REPLICAOF. This approach allows us to reuse the same logic and share code with the new replication mechanism (see https://github.com/redis/redis/pull/13732). The command can include multiple slot ranges. The destination node creates one migration task per source node, regardless of how many slot ranges are specified. Upon successfully creating the task, the destination node replies IMPORT command with the assigned task ID. 
The operator can then monitor progress using `CLUSTER MIGRATION STATUS ID <task-id>` . When the task’s state field changes to `completed`, the migration has finished successfully. Please see [New Commands](#new-commands) section for the output sample. 2. After creating the migration task, the destination node will request replication of slots by using the internal command `CLUSTER SYNCSLOTS`. 3. Once the source node accepts the request, the destination node establishes another separate connection(similar to rdbchannel replication) so snapshot data and incremental changes can be transmitted in parallel. 4. Source node forks, starts delivering snapshot content (as per-key RESTORE commands) from one connection and incremental changes from the other connection. The destination master starts applying commands from the snapshot connection and accumulates incremental changes. Applied commands are also propagated to the destination replicas via replication backlog. Note: Only commands of related slots are delivered to the destination node. This is done by writing them to the migration client’s output buffer, which serves as the replication stream for the migration operation. 5. Once the source node finishes delivering the snapshot and determines that the destination node has caught up (remaining repl stream to consume went under a configured limit), it pauses write traffic for the entire server. After pausing the writes, the source node forwards any remaining write commands to the destination node. 6. Once the destination consumes all the writes, it bumps up cluster config epoch and changes the configuration. New config is published via cluster bus. 7. When the source node receives the new configuration, it can redirect clients and it begins trimming the migrated slots, while also resuming write traffic on the server. ### Internal slots synchronization state machine  1. 
The destination node performs authentication using the cluster secret introduced in #13763 , and transmits its node ID information. 2. The destination node sends `CLUSTER SYNCSLOTS SYNC <task-id> <start-slot> <end-slot>` to initiate a slot synchronization request and establish the main channel. The source node responds with `+RDBCHANNELSYNCSLOTS`, indicating that the destination node should establish an RDB channel. 3. The destination node then sends `CLUSTER SYNCSLOTS RDBCHANNEL <task-id>` to establish the RDB channel, using the same task-id as in the previous step to associate the two connections as part of the same ASM task. The source node replies with `+SLOTSSNAPSHOT` and forks a child process to transfer the slot snapshot. 4. The destination node applies the slot snapshot data received over the RDB channel, while proxying the command stream to replicas. At the same time, the main channel continues to read and buffer incremental commands in memory. 5. Once the source node finishes sending the slot snapshot, it notifies the destination node using the `CLUSTER SYNCSLOTS SNAPSHOT-EOF` command. The destination node then starts applying the buffered commands while continuing to read and buffer incremental commands sent from the source. 6. The destination node periodically sends `CLUSTER SYNCSLOTS ACK <offset>` to inform the source of the applied data offset. When the offset gap meets the threshold, the source node pauses write operations. After all buffered data has been drained, it sends `CLUSTER SYNCSLOTS STREAM-EOF` to the destination node to hand off the slots. 7. Finally, the destination node takes over slot ownership, updates the slot configuration and bumps the epoch, then broadcasts the updates via the cluster bus. Once the source node detects the updated slot configuration, the slot migration process is complete. 
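The handshake above can be read as a small state machine on the destination side. The following Python sketch models only the message ordering described in the steps; the state names (`wait-sync-reply`, `applying-snapshot`, and so on) are hypothetical labels for illustration and do not correspond to identifiers in the Redis source.

```python
# Hypothetical destination-side states, keyed by (current state, message
# received from the source). Message names are taken from the steps above.
TRANSITIONS = {
    ("wait-sync-reply", "+RDBCHANNELSYNCSLOTS"): "wait-rdbchannel-reply",
    ("wait-rdbchannel-reply", "+SLOTSSNAPSHOT"): "applying-snapshot",
    ("applying-snapshot", "SNAPSHOT-EOF"): "streaming-buffer",
    ("streaming-buffer", "STREAM-EOF"): "takeover",
}


def advance(state: str, message: str) -> str:
    """Return the next state, or raise if the message is out of order."""
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"unexpected {message!r} in state {state!r}") from None


def run(messages, state: str = "wait-sync-reply") -> str:
    """Feed a sequence of source messages through the state machine."""
    for message in messages:
        state = advance(state, message)
    return state


if __name__ == "__main__":
    happy_path = ["+RDBCHANNELSYNCSLOTS", "+SLOTSSNAPSHOT",
                  "SNAPSHOT-EOF", "STREAM-EOF"]
    print(run(happy_path))
```

An out-of-order message (for example `STREAM-EOF` before `SNAPSHOT-EOF`) raises an error in this sketch; how the actual implementation reacts in that case is not stated in the description.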
### Error handling - If the connection between the source and destination is lost (due to disconnection, output buffer overflow, OOM, or timeout), the destination node automatically restarts the migration from the beginning. The destination node will retry the operation until it is explicitly cancelled using the `CLUSTER MIGRATION CANCEL ID <task-id>` command. - If a replica connection drops during migration, it can later resume with PSYNC, since the imported slot data is also written to the replication backlog. - During the write pause phase, the source node sets a timeout. If the destination node fails to drain the remaining replication data and update the config during that time, the source node assumes the destination has failed and automatically resumes normal writes for the migrating slots. - On any error, the destination node triggers a trim operation to discard any partially imported slot data. - If a node crashes while importing, unowned keys are deleted on startup. ### <a name="slot-snapshot-format-considerations"></a> Slot Snapshot Format Considerations When the source node forks to deliver slot content, in theory, there are several possible formats for transmitting the snapshot data: **Mini RDB**: A compact RDB file containing only the keys from the migrating slots. This format is efficient for transmission, but it cannot be easily forwarded to destination-side replicas. **AOF format**: The source node can generate commands in AOF form (e.g., SET x y, HSET h f v) and stream them. Individual commands are easily appended to the replication stream and propagated to replicas. Large keys can also be split into multiple commands (incrementally reconstructing the value), similar to the AOF rewrite process. **RESTORE commands**: Each key is serialized and sent as a `RESTORE` command. These can be appended directly to the destination’s replication stream, though very large keys may make serialization and transmission less efficient. 
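As an illustration of the AOF-style option, where a large key is split into several commands that incrementally rebuild the value, here is a minimal Python sketch for a hash. The helper name `hash_to_hset_commands` and the `max_fields_per_cmd` parameter are hypothetical; this shows the general idea only and is not how the ASM code is written.

```python
from itertools import islice
from typing import Dict, Iterator, List


def hash_to_hset_commands(key: str, fields: Dict[str, str],
                          max_fields_per_cmd: int = 64) -> Iterator[List[str]]:
    """Yield HSET commands that rebuild `fields` under `key` in chunks.

    Each command carries at most `max_fields_per_cmd` field/value pairs,
    so no single command has to hold the whole value.
    """
    if max_fields_per_cmd < 1:
        raise ValueError("max_fields_per_cmd must be >= 1")
    items = iter(fields.items())
    while True:
        chunk = list(islice(items, max_fields_per_cmd))
        if not chunk:
            return
        command = ["HSET", key]
        for field, value in chunk:
            command += [field, value]
        yield command


if __name__ == "__main__":
    big_hash = {f"f{i}": str(i) for i in range(5)}
    for cmd in hash_to_hset_commands("h", big_hash, max_fields_per_cmd=2):
        print(" ".join(cmd))
```

Because each chunk is an ordinary write command, it can be appended to a replication stream one piece at a time, which is the property the AOF-format paragraph relies on.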
We chose the `RESTORE` command as default approach for the following reasons: - It can be easily propagated to replicas. - It is more efficient than AOF for most cases, and some module keys do not support the AOF format. - For large **non-module** keys that are not string, ASM automatically switches to the AOF-based key encoding as an optimization when the key’s cardinality exceeds 512. This approach allows the key to be transferred in chunks rather than as a single large payload, reducing memory pressure and improving migration efficiency. In future versions, the RESTORE command may be enhanced to handle large keys more efficiently. Some details: - For RESTORE commands, normally by default Redis compresses keys. We disable compression while delivering RESTORE commands as compression comes with a performance hit. Without compression, replication is several times faster. - For string keys, we still prefer AOF format, e.g. SET commands as it is currently more efficient than RESTORE, especially for big keys. ### <a name="trimming-the-keys"></a> Trimming the keys When a migration completes successfully, the source node deletes the migrated keys from its local database. Since the migrated slots may contain a large number of keys, this trimming process must be efficient and non-blocking. In cluster mode, Redis maintains per-slot data structures for keys, expires, and subexpires. This organization makes it possible to efficiently detach all data associated with a given slot in a single step. During trimming, these slot-specific data structures are handed off to a background I/O (BIO) thread for asynchronous cleanup—similar to how FLUSHALL or FLUSHDB operate. This mechanism is referred to as background trimming, and it is the preferred and default method for ASM, ensuring that the main thread remains unblocked. However, unlike Redis itself, some modules may not maintain per-slot data structures and therefore cannot drop related slots data in a single operation. 
To support these cases, Redis introduces active trimming, where key deletion occurs in the main thread instead. This is not a blocking operation, trimming runs concurrently in the main thread, periodically removing keys during the cron loop. Each deletion triggers a keyspace notification so that modules can react to individual key removals. While active trim is less efficient, it ensures backward compatibility for modules during the transition period. Before starting the trim, Redis checks whether any module is subscribed to newly added `REDISMODULE_NOTIFY_KEY_TRIMMED` keyspace event. If such subscribers exist, active trimming is used; otherwise, background trimming is triggered. Going forward, modules are expected to adopt background trimming to take advantage of its performance and scalability benefits, and active trimming will be phased out once modules migrate to the new model. Redis also prefers active trimming if there is any client that is using client tracking feature (see [client-side caching](https://redis.io/docs/latest/develop/reference/client-side-caching/)). In the current client tracking protocol, when a database is flushed (e.g., via the FLUSHDB command), a null value is sent to tracking clients to indicate that they should invalidate all locally cached keys. However, there is currently no mechanism to signal that only specific slots have been flushed. Iterating over all keys in the slots to be trimmed would be a blocking operation. To avoid this, if there is any client that is using client tracking feature, Redis automatically switches to active trimming mode. In the future, the client tracking protocol can be extended to support slot-based invalidation, allowing background trimming to be used in these cases as well. Finally, trimming may also be triggered after a migration failure. 
In such cases, the operation ensures that any partially imported or inconsistent slot data is cleaned up, maintaining cluster consistency and preventing stale keys from remaining in the source or destination nodes. Note about active trim: Subsequent migrations can complete while a prior trim is still running. In that case, the new migration’s trim job is queued and will start automatically after the current trim finishes. This does not affect slot ownership or client traffic; it only serializes the background cleanup. ### <a name="replica-handling"></a> Replica handling - During importing, new keys are propagated to the destination-side replicas. A replica checks slot ownership before replying to commands such as SCAN, KEYS, and DBSIZE, so that these unowned keys are not included in the reply. Also, when an import operation begins, the master now propagates an internal command through the replication stream, allowing replicas to recognize that an ASM operation is in progress. This is done by the internal `CLUSTER SYNCSLOTS CONF ASM-TASK` command in the replication stream. This enables replicas to trigger the relevant module events so that modules can adapt their behavior, for example by filtering out unowned keys from read-only requests during ASM operations. To support full sync with RDB delivery scenarios, a new AUX field is also added to the RDB: `cluster-asm-task`. Its value is a string in the format `task_id:source_node:dest_node:operation:state:slot_ranges`. - After a successful migration or on a failed import, the master trims the keys. In that case, the master propagates a new command to the replica: `TRIMSLOTS RANGES <numranges> <start-slot> <end-slot> ... ` . The replica starts trimming once this command is received. ### <a name="propagating-data-outside-the-keyspace"></a> Propagating data outside the keyspace When the destination node is newly added to the cluster, certain data outside the keyspace may need to be propagated first. 
A common example is functions. Previously, redis-cli handled this by transferring functions when a new node was added. With ASM, Redis now automatically dumps and sends functions to the destination node using `FUNCTION RESTORE ..REPLACE` command — done purely for convenience to simplify setup. Additionally, modules may also need to propagate their own data outside the keyspace. To support this, a new API has been introduced: `RM_ClusterPropagateForSlotMigration()`. See the [Module Support](#module-support) section for implementation details. ### Limitations 1. Single migration at a time: Only one ASM migration operation is allowed at a time. This limitation simplifies the current design but can be extended in the future. 2. Large key handling: For large keys, ASM switches to AOF encoding to deliver key data in chunks. This mechanism currently applies only to non-module keys. In the future, the RESTORE command may be extended to support chunked delivery, providing a unified solution for all key types. See [Slot Snapshot Format Considerations](#slot-snapshot-format-considerations) for details. 3. There are several cases that may cause an Atomic Slot Migration (ASM) to be aborted (can be retried afterwards): - FLUSHALL / FLUSHDB: These commands introduce complexity during ASM. For example, if executed on the migrating node, they must be propagated only for the migrating slots. However, when combined with active trimming, their execution may need to be deferred until it is safe to proceed, adding further complexity to the process. - FAILOVER: The replica cannot resume the migration process. Migration should start from the beginning. - Module propagates cross-slot command during ASM via RM_Replicate(): If this occurs on the migrating node, Redis cannot split the command to propagate only the relevant slots to the ASM destination. To keep the logic simple and consistent, ASM is cancelled in this case. Modules should avoid propagating cross-slot commands during migration. 
- CLIENT PAUSE: The import task cannot progress during a write pause, as doing so would violate the guarantee that no writes occur during migration. To keep things simple, the ASM task is aborted when CLIENT PAUSE is active. - Manual Slot Configuration Changes: If slot configuration is modified manually during ASM (for example, when legacy migration methods are mixed with ASM), the process is aborted. Note: This situation is highly unexpected — users should not combine ASM with legacy migration methods. 4. When active trimming is enabled, a node must not re-import the same slots while trimming for those slots is still in progress. Otherwise, it can’t distinguish newly imported keys from pre-existing ones, and the trim cron might delete the incoming keys by mistake. In this state, the node rejects IMPORT operation for those slots until trimming completes. If the master has finished trimming but a replica is still trimming, master may still start the import operation for those slots. So, the replica checks whether the master is sending commands for those slots; if so, it blocks the master’s client connection until trimming finishes. This is a corner case, but we believe the behavior is reasonable for now. In the worst case, the master may drop the replica (e.g., buffer overrun), triggering a new full sync. # API Changes ## <a name="new-commands"></a> New Commands ### Public commands 1. **Syntax:** `CLUSTER MIGRATION IMPORT <start-slot> <end-slot> [<start-slot> <end-slot>]...` **Args:** Slot ranges **Reply:** - String task ID - -ERR <message> on failure (e.g. invalid slot range) **Description:** Executes on the destination master. Accepts multiple slot ranges and triggers atomic migration for the specified ranges. Returns a task ID that can be used to monitor the status of the task. In CLUSTER MIGRATION STATUS output, “state” field will be `completed` on a successful operation. 2. 
**Syntax:** `CLUSTER MIGRATION CANCEL [ID <id> | ALL]` **Args:** Task ID or ALL **Reply:** Number of cancelled tasks **Description:** Cancels an ongoing migration task by its ID or cancels all tasks if ALL is specified. Note: Cancelling a task on the source node does not stop the migration on the destination node, which will continue retrying until it is also cancelled there. 3. **Syntax:** `CLUSTER MIGRATION STATUS [ID <id> | ALL]` **Args:** Task ID or ALL - **ID:** If provided, returns the status of the specified migration task. - **ALL:** Lists the status of all migration tasks. **Reply:** - A list of migration task details (both ongoing and completed ones). - Empty list if the given task ID does not exist. **Description:** Displays the status of all current and completed atomic slot migration tasks. If a specific task ID is provided, it returns detailed information for that task only. **Sample output:** ``` 127.0.0.1:5001> cluster migration status all 1) 1) "id" 2) "24cf41718b20f7f05901743dffc40bc9b15db339" 3) "slots" 4) "0-1000" 5) "source" 6) "1098d90d9ba2d1f12965442daf501ef0b6667bec" 7) "dest" 8) "b3b5b426e7ea6166d1548b2a26e1d5adeb1213ac" 9) "operation" 10) "migrate" 11) "state" 12) "completed" 13) "last_error" 14) "" 15) "retries" 16) "0" 17) "create_time" 18) "1759694528449" 19) "start_time" 20) "1759694528449" 21) "end_time" 22) "1759694528464" 23) "write_pause_ms" 24) "10" ``` ### Internal commands 1. **Syntax:** `CLUSTER SYNCSLOTS <arg> ...` **Args:** Internal messaging operations **Reply:** +OK or -ERR <message> on failure (e.g. invalid slot range) **Description:** Used for internal communication between source and destination nodes. e.g. handshaking, establishing multiple channels, triggering handoff. 2. 
**Syntax:** `TRIMSLOTS RANGES <numranges> <start-slot> <end-slot> ...` **Args:** Slot ranges to trim **Reply:** +OK **Description:** The master propagates it to its replicas so that they can trim unowned keys after a successful migration or a failed import. ## New configs - `cluster-slot-migration-max-archived-tasks`: Redis keeps the last n migration tasks in memory so that they can be listed in `CLUSTER MIGRATION STATUS ALL` output. This config controls the maximum number of archived ASM tasks. Default value: 32, used as a hidden config - `cluster-slot-migration-handoff-max-lag-bytes`: After the slot snapshot is completed, if the remaining replication stream size falls below this threshold, the source node pauses writes to hand off slot ownership. A higher value may trigger the handoff earlier but can lead to a longer write pause, since more data remains to be replicated. A lower value can result in a shorter write pause, but it may be harder to reach the threshold if there is a steady flow of incoming writes. Default value: 1MB - `cluster-slot-migration-write-pause-timeout`: The maximum duration (in milliseconds) that the source node pauses writes during ASM handoff. After pausing writes, if the destination node fails to take over the slots within this timeout (for example, due to a cluster configuration update failure), the source node assumes the migration has failed and resumes writes to prevent indefinite blocking. Default value: 10 seconds - `cluster-slot-migration-sync-buffer-drain-timeout`: Timeout in milliseconds for the sync buffer to be drained during ASM. After the destination applies the accumulated buffer, the source continues sending commands for migrating slots. The destination keeps applying them, but the gap may remain above the acceptable limit (see `cluster-slot-migration-handoff-max-lag-bytes`), which may cause endless synchronization. A timeout check is required to handle this case. 
The timeout is calculated as **the maximum of two values**: - A configurable timeout (`cluster-slot-migration-sync-buffer-drain-timeout`) to avoid false positives. - A dynamic timeout based on the time that the destination took to apply the slot snapshot and the accumulated buffer during slot snapshot delivery. The destination should be able to drain the remaining sync buffer in less time than this. We multiply it by 2 to be more conservative. Default value: 60000 milliseconds, used as a hidden config ## New flag in CLIENT LIST - The client responsible for importing slots is marked with the `o` flag. - The client responsible for migrating slots is marked with the `g` flag. ## New INFO fields - `mem_cluster_slot_migration_output_buffer`: Memory usage of the migration client’s output buffer. Redis writes incoming changes to this buffer during the migration process. - `mem_cluster_slot_migration_input_buffer`: Memory usage of the accumulated replication stream buffer on the importing node. - `mem_cluster_slot_migration_input_buffer_peak`: Peak size of the accumulated replication stream buffer on the importing node. ## New CLUSTER INFO fields - `cluster_slot_migration_active_tasks`: Number of in-progress ASM tasks. Currently, it will be 1 or 0. - `cluster_slot_migration_active_trim_running`: Number of active trim jobs in progress and scheduled. - `cluster_slot_migration_active_trim_current_job_keys`: Number of keys scheduled for deletion in the current trim job. - `cluster_slot_migration_active_trim_current_job_trimmed`: Number of keys already deleted in the current trim job. - `cluster_slot_migration_stats_active_trim_started`: Total number of trim jobs that have started since the process began. - `cluster_slot_migration_stats_active_trim_completed`: Total number of trim jobs completed since the process began. - `cluster_slot_migration_stats_active_trim_cancelled`: Total number of trim jobs cancelled since the process began. 
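The drain-timeout rule described for `cluster-slot-migration-sync-buffer-drain-timeout` (the maximum of the configured timeout and twice the observed apply time) can be sketched as follows. This is only an illustration: the function name and parameters are hypothetical and are not taken from the Redis source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not Redis code: returns the effective sync-buffer
 * drain timeout in milliseconds.
 *   configured_ms    - value of the drain-timeout config
 *   apply_elapsed_ms - time the destination took to apply the slot snapshot
 *                      and the buffer accumulated during snapshot delivery */
int64_t asm_effective_drain_timeout_ms(int64_t configured_ms,
                                       int64_t apply_elapsed_ms) {
    assert(configured_ms >= 0 && apply_elapsed_ms >= 0);
    /* Doubling the observed apply time gives a conservative dynamic bound. */
    int64_t dynamic_ms = apply_elapsed_ms * 2;
    return dynamic_ms > configured_ms ? dynamic_ms : configured_ms;
}
```

With the default of 60000 ms, the dynamic bound only takes effect once applying the snapshot and buffer took longer than 30 seconds.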
## Changes in RDB format A new aux field is added to RDB: `cluster-asm-task`. When an import operation begins, the master now propagates an internal command through the replication stream, allowing replicas to recognize that an ASM operation is in progress. This enables replicas to trigger the relevant module events so that modules can adapt their behavior (for example, filtering out unowned keys from read-only requests during ASM operations). To be able to support RDB delivery scenarios, a new field is added to the RDB. See [replica handling](#replica-handling) ## Bug fix - Fix memory leak when processing forgetting node type message - Fix data race of writing reply to replica client directly when enabling multi-threading We don't plan to backport them to old versions, since they are very rare cases. ## Keys visibility When performing atomic slot migration, during key importing on the destination node or key trimming on the source/destination, these keys will be filtered out in the following commands: - KEYS - SCAN - RANDOMKEY - CLUSTER GETKEYSINSLOT - DBSIZE - CLUSTER COUNTKEYSINSLOT The only command that will reflect the increasing number of keys is: - INFO KEYSPACE ## <a name="module-support"></a> Module Support **NOTE:** Please read the [trimming](#trimming-the-keys) section to see how ASM chooses the trimming method when modules are in use. ### New notification: ```c #define REDISMODULE_NOTIFY_KEY_TRIMMED (1<<17) ``` When a key is deleted by the active trim operation, this notification will be sent to subscribed modules. Also, ASM will automatically choose the trimming method depending on whether there are any subscribers to this new event. 
Please see further details here: [trimming](#trimming-the-keys) ### New struct in the API: ```c typedef struct RedisModuleSlotRange { uint16_t start; uint16_t end; } RedisModuleSlotRange; typedef struct RedisModuleSlotRangeArray { int32_t num_ranges; RedisModuleSlotRange ranges[]; } RedisModuleSlotRangeArray; ``` ### New Events #### 1. REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION (RedisModuleEvent_ClusterSlotMigration) These events notify modules about different stages of Atomic Slot Migration (ASM) operations such as when import or migration starts, fails, or completes. Modules can use these notifications to track cluster slot movements or perform custom logic during ASM transitions. ```c #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_STARTED 0 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_FAILED 1 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_IMPORT_COMPLETED 2 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_STARTED 3 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_FAILED 4 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_COMPLETED 5 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE 6 ``` Parameter to these events: ```c typedef struct RedisModuleClusterSlotMigrationInfo { uint64_t version; /* Not used since this structure is never passed from the module to the core right now. Here for future compatibility. */ char source_node_id[REDISMODULE_NODE_ID_LEN + 1]; char destination_node_id[REDISMODULE_NODE_ID_LEN + 1]; const char *task_id; RedisModuleSlotRangeArray* slots; } RedisModuleClusterSlotMigrationInfoV1; #define RedisModuleClusterSlotMigrationInfo RedisModuleClusterSlotMigrationInfoV1 ``` #### 2. REDISMODULE_EVENT_CLUSTER_SLOT_MIGRATION_TRIM (RedisModuleEvent_ClusterSlotMigrationTrim) These events inform modules about the lifecycle of ASM key trimming operations. 
Modules can use them to detect when trimming starts, completes, or is performed asynchronously in the background. ```c #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_STARTED 0 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_COMPLETED 1 #define REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_TRIM_BACKGROUND 2 ``` Parameter to these events: ```c typedef struct RedisModuleClusterSlotMigrationTrimInfo { uint64_t version; /* Not used since this structure is never passed from the module to the core right now. Here for future compatibility. */ RedisModuleSlotRangeArray* slots; } RedisModuleClusterSlotMigrationTrimInfoV1; #define RedisModuleClusterSlotMigrationTrimInfo RedisModuleClusterSlotMigrationTrimInfoV1 ``` ### New functions ```c /* Returns 1 if keys in the specified slot can be accessed by this node, 0 otherwise. * * This function returns 1 in the following cases: * - The slot is owned by this node or by its master if this node is a replica * - The slot is being imported under the old slot migration approach (CLUSTER SETSLOT <slot> IMPORTING ..) * - Not in cluster mode (all slots are accessible) * * Returns 0 for: * - Invalid slot numbers (< 0 or >= 16384) * - Slots owned by other nodes */ int RM_ClusterCanAccessKeysInSlot(int slot); /* Propagate commands along with slot migration. * * This function allows modules to add commands that will be sent to the * destination node before the actual slot migration begins. It should only be * called during the REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE event. * * This function can be called multiple times within the same event to * replicate multiple commands. All commands will be sent before the * actual slot data migration begins. * * Note: This function is only available in the fork child process just before * slot snapshot delivery begins. 
* * On success REDISMODULE_OK is returned, otherwise * REDISMODULE_ERR is returned and errno is set to the following values: * * * EINVAL: function arguments or format specifiers are invalid. * * EBADF: not called in the correct context, e.g. not called in the REDISMODULE_SUBEVENT_CLUSTER_SLOT_MIGRATION_MIGRATE_MODULE_PROPAGATE event. * * ENOENT: command does not exist. * * ENOTSUP: command is cross-slot. * * ERANGE: command contains keys that are not within the migrating slot range. */ int RM_ClusterPropagateForSlotMigration(RedisModuleCtx *ctx, const char *cmdname, const char *fmt, ...); /* Returns the locally owned slot ranges for the node. * * An optional `ctx` can be provided to enable auto-memory management. * If cluster mode is disabled, the array will include all slots (0–16383). * If the node is a replica, the slot ranges of its master are returned. * * The returned array must be freed with RM_ClusterFreeSlotRanges(). */ RedisModuleSlotRangeArray *RM_ClusterGetLocalSlotRanges(RedisModuleCtx *ctx); /* Frees a slot range array returned by RM_ClusterGetLocalSlotRanges(). * Pass the `ctx` pointer only if the array was created with a context. */ void RM_ClusterFreeSlotRanges(RedisModuleCtx *ctx, RedisModuleSlotRangeArray *slots); ``` ## ASM API for alternative cluster implementations Following https://github.com/redis/redis/pull/12742, Redis cluster code was restructured to support alternative cluster implementations. Redis uses cluster_legacy.c implementation by default. This PR adds a generic ASM API so alternative implementations can initiate and coordinate Atomic Slot Migration (ASM) while Redis executes the data movement and emits state changes. Documentation rests in `cluster.h`: ```c There are two new functions: /* Called by cluster implementation to request an ASM operation. (cluster impl --> redis) */ int clusterAsmProcess(const char *task_id, int event, void *arg, char **err); /* Called when an ASM event occurs to notify the cluster implementation. 
(redis --> cluster impl) */ int clusterAsmOnEvent(const char *task_id, int event, void *arg); ``` ```c /* API for alternative cluster implementations to start and coordinate * Atomic Slot Migration (ASM). * * These two functions drive ASM for alternative cluster implementations. * - clusterAsmProcess(...) impl -> redis: initiates/advances/cancels ASM operations * - clusterAsmOnEvent(...) redis -> impl: notifies state changes * * Generic steps for an alternative implementation: * - On destination side, implementation calls clusterAsmProcess(ASM_EVENT_IMPORT_START) * to start an import operation. * - Redis calls clusterAsmOnEvent() when an ASM event occurs. * - On the source side, Redis will call clusterAsmOnEvent(ASM_EVENT_HANDOFF_PREP) * when slots are ready to be handed off and the write pause is needed. * - Implementation stops the traffic to the slots and calls clusterAsmProcess(ASM_EVENT_HANDOFF) * - On the destination side, Redis calls clusterAsmOnEvent(ASM_EVENT_TAKEOVER) * when destination node is ready to take over the slot, waiting for ownership change. * - Cluster implementation updates the config and calls clusterAsmProcess(ASM_EVENT_DONE) * to notify Redis that the slots ownership has changed. * * Sequence diagram for import: * - Note: shows only the events that cluster implementation needs to react. 
* * ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ * │ Destination │ │ Destination │ │ Source │ │ Source │ * │ Cluster impl │ │ Master │ │ Master │ │ Cluster impl │ * └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ * │ │ │ │ * │ ASM_EVENT_IMPORT_START │ │ │ * ├─────────────────────────────►│ │ │ * │ │ CLUSTER SYNCSLOTS <arg> │ │ * │ ├────────────────────────►│ │ * │ │ │ │ * │ │ SNAPSHOT(restore cmds) │ │ * │ │◄────────────────────────┤ │ * │ │ Repl stream │ │ * │ │◄────────────────────────┤ │ * │ │ │ ASM_EVENT_HANDOFF_PREP │ * │ │ ├────────────────────────────►│ * │ │ │ ASM_EVENT_HANDOFF │ * │ │ │◄────────────────────────────┤ * │ │ Drain repl stream │ │ * │ │◄────────────────────────┤ │ * │ ASM_EVENT_TAKEOVER │ │ │ * │◄─────────────────────────────┤ │ │ * │ │ │ │ * │ ASM_EVENT_DONE │ │ │ * ├─────────────────────────────►│ │ ASM_EVENT_DONE │ * │ │ │◄────────────────────────────┤ * │ │ │ │ */ #define ASM_EVENT_IMPORT_START 1 /* Start a new import operation (destination side) */ #define ASM_EVENT_CANCEL 2 /* Cancel an ongoing import/migrate operation (source and destination side) */ #define ASM_EVENT_HANDOFF_PREP 3 /* Slot is ready to be handed off to the destination shard (source side) */ #define ASM_EVENT_HANDOFF 4 /* Notify that the slot can be handed off (source side) */ #define ASM_EVENT_TAKEOVER 5 /* Ready to take over the slot, waiting for config change (destination side) */ #define ASM_EVENT_DONE 6 /* Notify that import/migrate is completed, config is updated (source and destination side) */ #define ASM_EVENT_IMPORT_PREP 7 /* Import is about to start, the implementation may reject by returning C_ERR */ #define ASM_EVENT_IMPORT_STARTED 8 /* Import started */ #define ASM_EVENT_IMPORT_FAILED 9 /* Import failed */ #define ASM_EVENT_IMPORT_COMPLETED 10 /* Import completed (config updated) */ #define ASM_EVENT_MIGRATE_PREP 11 /* Migrate is about to start, the implementation may reject by returning C_ERR */ #define 
ASM_EVENT_MIGRATE_STARTED 12 /* Migrate started */ #define ASM_EVENT_MIGRATE_FAILED 13 /* Migrate failed */ #define ASM_EVENT_MIGRATE_COMPLETED 14 /* Migrate completed (config updated) */ ``` ------ Co-authored-by: Yuan Wang <yuan.wang@redis.com> --------- Co-authored-by: Yuan Wang <yuan.wang@redis.com> |
||
|
|
b1eb9ba861
|
Change ps command options to work on Solaris (#14351)
Fixes https://github.com/redis/redis/issues/14304. The `ps` command options in the Tcl tests are adjusted to work on both Linux and Solaris. |
||
|
|
60adba48aa
|
Introduce DEBUG_DEFRAG compilation option to allow run test with activedefrag when allocator is not jemalloc (#14326)
This PR is based on https://github.com/valkey-io/valkey/pull/1303

This PR introduces a DEBUG_DEFRAG compilation option that enables activedefrag functionality even when the allocator is not jemalloc, and always forces defragmentation regardless of the amount or ratio of fragmentation.

## Using
```
make SANITIZER=address DEBUG_DEFRAG=<force|fully>
./runtest --debug-defrag
```
* DEBUG_DEFRAG=force
  * Ignore the threshold for defragmentation to ensure that defragmentation is always triggered.
  * Always reallocate pointers to probe for correctness issues in pointer reallocation.
* DEBUG_DEFRAG=fully
  * Includes everything in the option `force`.
  * Additionally performs a full defrag on every defrag cycle, which is significantly slower but more accurate.

---------
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: oranagra <oran@redislabs.com> |
||
|
|
fe3f0aa252
|
Fix some daily CI issues (#14217)
1) Fix the timeout of `Active defrag big keys: standalone`.
   Using a pipe to write commands may cause the write to block if the read buffer becomes full.
2) Fix the failure of the `Main db not affected when fail to diskless load` test.
   If the master was killed in a slow environment, then after `cluster-node-timeout` (3s in our test), running keyspace commands on the replica gets a CLUSTERDOWN error.
3) Fix the failure of the `Test shutdown hook` test.
   ASAN can intercept a signal, so I guess that when we send SIGCONT after SIGTERM to kill the server, it might start doing some work again, causing the process to close very slowly. |
||
|
|
0d8e750883
|
Add CLUSTER SLOT-STATS command (#14039)
Add the CLUSTER SLOT-STATS command, which currently reports key count,
CPU time and network I/O per slot.
The command has the following syntax
CLUSTER SLOT-STATS SLOTSRANGE start-slot end-slot
or
CLUSTER SLOT-STATS ORDERBY metric [LIMIT limit] [ASC/DESC]
where metric can currently be one of the following
key-count -- Number of keys in a given slot
cpu-usec -- Amount of CPU time (in microseconds) spent on a given slot
network-bytes-in -- Amount of network ingress (in bytes) received for
a given slot
network-bytes-out -- Amount of network egress (in bytes) sent out for
a given slot
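As a rough client-side illustration of what `ORDERBY key-count LIMIT limit DESC` selects, the sketch below sorts per-slot records by key count and keeps the first `limit` entries. It is not the server implementation; the `SlotStat` struct, the function names, and the tie-break by slot number are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-slot record; field names are illustrative only. */
typedef struct {
    int slot;
    uint64_t key_count;
} SlotStat;

/* qsort comparator: larger key_count first; ties broken by lower slot
 * number (the tie-break rule is an assumption of this sketch). */
static int cmp_key_count_desc(const void *a, const void *b) {
    const SlotStat *x = a, *y = b;
    if (x->key_count != y->key_count)
        return x->key_count < y->key_count ? 1 : -1;
    return x->slot - y->slot;
}

/* Sorts `stats` in place by key count, descending, and returns how many
 * leading entries fall within `limit`. */
size_t slot_stats_top_by_key_count(SlotStat *stats, size_t n, size_t limit) {
    assert(stats != NULL || n == 0);
    if (n > 0) qsort(stats, n, sizeof(*stats), cmp_key_count_desc);
    return n < limit ? n : limit;
}
```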
This PR is based on:
valkey-io/valkey#351
valkey-io/valkey#709
valkey-io/valkey#710
valkey-io/valkey#720
valkey-io/valkey#840
Co-authored-by: Kyle Kim <kimkyle@amazon.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Harkrishn Patro <harkrisp@amazon.com>
|
||
|
|
fa040a72c0
|
Add XDELEX and XACKDEL commands for stream (#14130)
## Summary and detailed design for new stream command ## XDELEX ### Syntax ``` XDELEX key [KEEPREF | DELREF | ACKED] IDS numids id [id ...] ``` ### Description The `XDELEX` command extends the Redis Streams `XDEL` command, offering enhanced control over message entry deletion with respect to consumer groups. It accepts an optional `KEEPREF`, `DELREF` or `ACKED` parameter to modify its behavior: - **KEEPREF:** Deletes the specified entries from the stream, but preserves existing references to these entries in all consumer groups' PEL. This behavior is similar to XDEL. - **DELREF:** Deletes the specified entries from the stream and also removes all references to these entries from all consumer groups' pending entry lists, effectively cleaning up all traces of the messages. - **ACKED:** Only trims entries that were read and acknowledged by all consumer groups. **Note:** The `IDS` block can appear at any position in the command, consistent with other commands. ### Reply Array reply, for each `id`: - `-1`: No such `id` exists in the provided stream `key`. - `1`: Entry was deleted from the stream. - `2`: Entry was not deleted because there are still dangling references. (ACKED option) ## XACKDEL ### Syntax ``` XACKDEL key group [KEEPREF | DELREF | ACKED] IDS numids id [id ...] ``` ### Description The `XACKDEL` command combines `XACK` and `XDEL` functionalities in Redis Streams. It acknowledges specified message IDs in the given consumer group and attempts to delete corresponding stream entries. It accepts an optional `KEEPREF`, `DELREF` or `ACKED` parameter: - **KEEPREF:** Acknowledges the messages in the specified consumer group and deletes the entries from the stream, but preserves existing references to these entries in all consumer groups' PEL. 
- **DELREF:** Acknowledges the messages in the specified consumer group, deletes the entries from the stream, and also removes all references to these entries from all consumer groups' pending entry lists, effectively cleaning up all traces of the messages. - **ACKED:** Acknowledges the messages in the specified consumer group and only trims entries that were read and acknowledged by all consumer groups. ### Reply Array reply, for each `id`: - `-1`: No such `id` exists in the provided stream `key`. - `1`: Entry was acknowledged and deleted from the stream. - `2`: Entry was acknowledged but not deleted because there are still dangling references. (ACKED option) # Redis Streams Commands Extension ## XTRIM ### Syntax ``` XTRIM key <MAXLEN | MINID> [= | ~] threshold [LIMIT count] [KEEPREF | DELREF | ACKED] ``` ### Description The `XTRIM` command trims a stream by removing entries based on specified criteria, extended to include an optional `KEEPREF`, `DELREF` or `ACKED` parameter for consumer group handling: - **KEEPREF:** Trims the stream according to the specified strategy (MAXLEN or MINID) regardless of whether entries are referenced by any consumer groups, but preserves existing references to these entries in all consumer groups' PEL. - **DELREF:** Trims the stream according to the specified strategy and also removes all references to the trimmed entries from all consumer groups' PEL. - **ACKED:** Only trims entries that were read and acknowledged by all consumer groups. ### Reply No change. ## XADD ### Syntax ``` XADD key [NOMKSTREAM] [<MAXLEN | MINID> [= | ~] threshold [LIMIT count]] [KEEPREF | DELREF | ACKED] <* | id> field value [field value ...] 
``` ### Description The `XADD` command appends a new entry to a stream and optionally trims it in the same operation, extended to include an optional `KEEPREF`, `DELREF` or `ACKED` parameter for trimming behavior: - **KEEPREF:** When trimming, removes entries from the stream according to the specified strategy (MAXLEN or MINID), regardless of whether they are referenced by any consumer groups, but preserves existing references to these entries in all consumer groups' PEL. - **DELREF:** When trimming, removes entries from the stream according to the specified strategy and also removes all references to these entries from all consumer groups' PEL. - **ACKED:** When trimming, only removes entries that were read and acknowledged by all consumer groups. Note that if the number of referenced entries is bigger than MAXLEN, we will still stop. ### Reply No change. ## Key implementation Since we currently have no simple way to track the association between an entry and consumer groups without iterating over all groups, we introduce two mechanisms to establish this link. This allows us to determine whether an entry has been seen by all consumer groups, and to identify which groups are referencing it. With these links, we can break the association when the entry is either acknowledged or deleted. 1) Added reference tracking between stream messages and consumer groups using `cgroups_ref` The cgroups_ref is implemented as a rax that maps stream message IDs to lists of consumer groups that reference those messages, and streamNACK stores the corresponding nodes of this list, so that the corresponding groups can be deleted during `ACK`. In this way, we can determine whether an entry has been seen but not acknowledged. 2) Store a cached minimum last_id in the stream structure. The reason for doing this is that there is a situation where an entry has never been seen by the consumer group. In this case, we think this entry has not been consumed either. 
If there is an "ACKED" option, we cannot directly delete this entry either. When a consumer group updates its last_id, we don’t immediately update the cached minimum last_id. Instead, we check whether the group’s previous last_id was equal to the current minimum, or whether the new last_id is smaller than the current minimum (when using `XGROUP SETID`). If either is true, we mark the cached minimum last_id as invalid, and defer the actual update until the next time it’s needed. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: moticless <moticless@github.com> Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> Co-authored-by: Slavomir Kaslev <slavomir.kaslev@gmail.com> Co-authored-by: Yuan Wang <yuan.wang@redis.com> |
||
|
|
b7c6755b1b
|
Add thread sanitizer run to daily CI (#13964)
Add thread sanitizer run to daily CI.

A few tests are skipped in tsan runs for two reasons:
* Stack trace producing tests (oom, `unit/moduleapi/crash`, etc) are tagged `tsan:skip` because redis calls `backtrace()` in a signal handler, which turns out to be signal-unsafe since it might allocate memory (e.g. glibc 2.39 does it through a call to `_dl_map_object_deps()`).
* A few tests become flaky with thread sanitizer builds and don't finish within the expected deadlines because of the additional tsan overhead. Instead of skipping those tests, this can be improved in the future by allowing more iterations when waiting for tsan builds.

Deadlock detection is disabled for now because of a tsan limitation where at most 64 locks can be taken at once. There is one outstanding (false-positive?) race in jemalloc which is suppressed in `tsan.sup`.

Fix a few races reported by thread sanitizer that have to do with writes from signal handlers. Since in a multi-threaded setting signal handlers might be called on any thread (modulo pthread_sigmask) while the main thread is running, the `volatile sig_atomic_t` type is not sufficient and atomics are used instead. |
||
|
|
871d4c4004
|
Test: check always for memory leaks on MacOS. (#14060)
When running the Redis test suite on macOS, the test detects that the
operating system is able to use "leaks" to test for memory leaks and
executes this check after every spawned server is terminated.
While we have the ability to run the test in environments able to detect
memory issues, the fact that it is possible to check for leaks at every run
basically for free is very valuable, and allows leaks to be fixed
immediately on your laptop before submitting a PR.
However, the feature skipped the leaks check when no test was run: this
check was added in the early stage of Redis, when all the tests were
like:
server {
test { ... }
}
So the check counts the number of tests run, and if no test is
executed, no leak detection is performed. However, we now have certain
tests of the form:
test {
server { ... }
}
For instance, a test that just loads a corrupted RDB or the like. In this
case, the leaks test is not executed. This commit removes the check so that
the leaks test is always executed.
|
||
|
|
fdbf88032c
|
Add MSan and integrate it with CI (#13916)
## Description Memory sanitizer (MSAN) is used to detect use-of-uninitialized memory issues. While Address Sanitizer catches a wide range of memory safety issues, it doesn't specifically detect uninitialized memory usage. Therefore, Memory Sanitizer complements Address Sanitizer. This PR adds MSAN run to the daily build, with the possibility of incorporating it into the ci.yml workflow in the future if needed. Changes in source files fix false-positive issues and they should not introduce any runtime implications. Note: Valgrind performs similar checks to both ASAN and MSAN but sanitizers run significantly faster. ## Limitations - Memory sanitizer is only supported by Clang. - MSAN documentation states that all dependencies, including the standard library, must be compiled with MSAN. However, it also mentions there are interceptors for common libc functions, so compiling the standard library with the MSAN flag is not strictly necessary. Therefore, we are not compiling libc with MSAN. --------- Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com> |
||
|
|
a46624e10e
|
[Vector sets] RDB IO errors handling (#13978)
This PR adds support for REDISMODULE_OPTIONS_HANDLE_IO_ERRORS, and tests for short reads and corrupted RESTORE payloads. Please note that I also removed the comment about async loading support since we should already be covered: there is no manipulation of global data structures in Vector Sets, except for the unique ID used to create new vector sets with different IDs. |
||
|
|
d65102861f
|
Adding AGPLv3 as a license option to Redis! (#13997)
Read more about [the new license option](http://redis.io/blog/agplv3/) and [the Redis 8 release](http://redis.io/blog/redis-8-ga/). |
||
|
|
3cdb8c6046
|
Improve replication buffering on replica and fix a related bug (#13904)
With RDB channel replication, we introduced parallel replication stream and RDB delivery to the replica during a full sync. Currently, after the replica loads the RDB and begins streaming the accumulated buffer to the database, it does not read from the master connection during this period. Although streaming the local buffer is generally a fast operation, it can take some time if the buffer is large. This PR introduces buffering during the streaming of the local buffer. One important consideration is ensuring that we consume more than we read during this operation; otherwise, it could run indefinitely. To guarantee that it will eventually complete, we limit the read to at most half of what we consume, e.g. read at most 1 MB once we consume at least 2 MB. **Additional changes** **Bug fix** - Currently, when the replica starts draining the accumulated buffer, we call protectClient() for the master client as we occasionally yield back to the event loop via processEventsWhileBlocked(). So, it prevents freeing the master client. While we are in this loop, if the replica receives a "replicaof newmaster" command, we call replicaSetMaster() which expects to free the master client and trigger a new connection attempt. As the client object is protected, its destruction will happen asynchronously. However, a new connection attempt to the new master is made immediately. Later, when the replication buffer is drained, we realize the master client was marked as CLOSE_ASAP, and freeing the master client triggers another connection attempt to the new master. In most cases, we realize something is wrong in the replication state machine and abort the second attempt later. So, the bug may go undetected. The fix is to not call protectClient() for the master client and instead detect whether the master client is disconnected during processEventsWhileBlocked(); if so, we break the loop immediately. 
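The read limit described above (read at most half of what has been consumed from the local buffer) can be sketched as a small budget calculation. This is a minimal illustration under assumed names; `repl_read_budget` and its parameters are hypothetical and do not appear in the Redis source.

```c
#include <stddef.h>

/* Hypothetical helper, not Redis code: returns how many more bytes may be
 * read from the master connection while the local buffer is being drained.
 *   consumed_bytes - bytes applied from the accumulated buffer so far
 *   read_bytes     - bytes already read from the master in this phase
 * Reads are capped at half of what was consumed, so the backlog shrinks
 * and the drain loop is guaranteed to finish. */
size_t repl_read_budget(size_t consumed_bytes, size_t read_bytes) {
    size_t allowed = consumed_bytes / 2;
    return allowed > read_bytes ? allowed - read_bytes : 0;
}
```

For example, after consuming 2 MB with nothing read yet, the budget is 1 MB, which matches the figure quoted in the description.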
**Related improvement:** - Currently, the replication buffer is a linked list of buffers, each of which is 1 MB in size. While consuming the buffer, we process one buffer at a time and check if we need to yield back to `processEventsWhileBlocked()`. However, if `loading-process-events-interval-bytes` is set to less than 1 MB, this approach doesn't handle it well. To improve this, I've modified the code to process 16KB at a time and check `loading-process-events-interval-bytes` more frequently. This way, depending on the configuration, we may yield back to networking more often. - In replication.c, `disklessLoadingRio` will be set before a call to `emptyData()`. This change should not introduce any behavioral change but it is logically more correct as emptyData() may yield to networking and we may need to call rioAbort() on disklessLoadingRio. Otherwise, failure of main channel may go undetected until a failure on rdb channel on a corner case. **Config changes** - The default value for the `loading-process-events-interval-bytes` configuration is being lowered from 2MB to 512KB. This configuration primarily used for testing and controls the frequency of networking during the loading phase, specifically when loading the RDB or applying accumulated buffers during a full sync on the replica side. Before the introduction of RDB channel replication, the 2MB value was sufficient for occasionally yielding to networking, mainly to reply -loading to the clients. However, with RDB channel replication, during a full sync on the replica side (either while loading the RDB or applying the accumulated buffer), we need to yield back to networking more frequently to continue accumulating the replication stream. If this doesn’t happen often enough, the replication stream can accumulate on the master side, which is undesirable. To address this, we’ve decided to lower the default value to 512KB. 
One concern with frequent yielding to networking is the potential performance impact, as each call to processEventsWhileBlocked() involves 4 syscalls, which could slow down the RDB loading phase. However, benchmarking with various configuration values has shown that using 512KB or higher does not negatively impact RDB loading performance. Based on these results, 512KB is now selected as the default value. **Test changes** - Added improved version of a replication test which checks memory usage on master during full sync. --------- Co-authored-by: Oran Agra <oran@redislabs.com> |
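The termination argument for buffering while draining can be checked with a small model. This is only a sketch, not the code in replication.c: the function name, the byte counters, and the fixed per-round chunk are assumptions made for illustration. It shows that if each round reads at most half of what it just consumed, the backlog shrinks every round even when the master never stops sending.

```python
MB = 1024 * 1024


def drain_with_bounded_reads(buffered: int, incoming: int, chunk: int = 2 * MB) -> int:
    """Return how many rounds it takes to empty the local buffer.

    buffered: bytes already accumulated locally
    incoming: bytes still waiting on the master connection
    chunk:    hypothetical number of bytes applied to the database per round
    """
    rounds = 0
    while buffered > 0:
        consumed = min(chunk, buffered)
        buffered -= consumed
        # Read at most half of what was just consumed, so every round
        # shrinks the backlog by at least consumed - consumed // 2 >= 1 byte.
        read = min(incoming, consumed // 2)
        incoming -= read
        buffered += read
        rounds += 1
    return rounds
```

With no incoming data, a 4 MB backlog drains in two rounds of 2 MB. With an effectively unlimited incoming stream the loop still ends, because the net change per round is always negative.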
||
|
|
7f5f588232
|
AOF offset info (#13773)
### Background
AOF is often used as an effective data recovery method, but if we have two AOFs from different nodes, it is hard to tell which one has the latest data. Generally, we determine whose data is more up-to-date by reading the last modification time of the AOF file. However, because of replication delay, even if both master and replica write to the AOF at the same time, the data in the master is more up-to-date (there are commands that haven't arrived at the replica yet, or a large number of commands have accumulated on the replica side), so we may make the wrong decision.
### Solution
The replication offset always increments when AOF is enabled, even if there is no replica. We think the replication offset is a better way to determine which AOF has more up-to-date data: whoever has the larger offset has the newer data. So we add the start replication offset info for the AOF, as below.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224
```
And if the AOF file is closed gracefully rather than by a crash, such as on `shutdown`, `kill signal 15` or `config set appendonly no`, we also add the end replication offset, as below.
```
file appendonly.aof.2.base.rdb seq 2 type b
file appendonly.aof.2.incr.aof seq 2 type i startoffset 224 endoffset 532
```
#### Things to pay attention to
- For the BASE AOF, we do not add `startoffset` and `endoffset` info, since we cannot know the start replication offset of its data, and it would not help us determine which AOF has more up-to-date data.
- For AOFs from an old version, we also don't add `startoffset` and `endoffset` info, since we don't know their start replication offset either. If we assumed a start offset of 0, we might make the judgment even less accurate. For example, if the master has just rewritten the AOF, its INCR AOF will inevitably be very small, whereas if the replica has not rewritten its AOF for a long time, its INCR AOF might be much larger. Using that method we might make incorrect decisions, so in this case we still just check the timestamp instead of adding offset info.
- If the last INCR AOF has `startoffset` or `endoffset`, we need to restore `server.master_repl_offset` from them, to avoid the `startoffset` of the next INCR AOF rolling back. If it has `endoffset`, we use this value as `server.master_repl_offset`; it is very important to then remove this information from the manifest file, so that the next load of the manifest does not see a wrong `endoffset`. If it only has `startoffset`, we calculate `server.master_repl_offset` as `startoffset` plus the file size.
### How to determine which one has more up-to-date data
The AOF with the larger replication offset has more up-to-date data. To get the AOF offset, read the AOF manifest file to obtain information about **the last INCR AOF**:
1. If the last INCR AOF has an `endoffset` field, we can directly use `endoffset` as the replication offset of the AOF.
2. If there is no `endoffset` (for example because redis crashed), but the last INCR AOF has a `startoffset` field, we can get the replication offset of the AOF as `startoffset` plus the file size.
3. Finally, if the AOF has neither `startoffset` nor `endoffset` (it may come from an old version, and the new version of redis has not rewritten the AOF yet), we still need to check the modification timestamp of the last INCR AOF.
### TODO
Fix ping causing inconsistency between AOF size and replication offset in a future PR. Because we increment the replication offset when sending PING/REPLCONF to the replica but do not write data to the AOF file, the starting offset of the AOF file plus its size might be inconsistent with the actual replication offset.
|
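The three-step lookup described in this commit message can be sketched as a small helper. This is an illustration and not Redis code: the function names are hypothetical, the parser assumes manifest lines are plain space-separated key/value pairs as in the examples in the message (file names containing spaces would need real parsing), and the caller is assumed to supply the size of the last INCR file.

```python
from typing import Dict, Optional


def parse_manifest_line(line: str) -> Dict[str, str]:
    """Turn 'file X seq 2 type i startoffset 224' into a dict of fields."""
    tokens = line.split()
    return dict(zip(tokens[0::2], tokens[1::2]))


def aof_replication_offset(manifest_text: str, last_incr_size: int) -> Optional[int]:
    """Return the AOF replication offset, or None if only the timestamp can be used."""
    entries = [parse_manifest_line(l) for l in manifest_text.splitlines() if l.strip()]
    incr = [e for e in entries if e.get("type") == "i"]
    if not incr:
        return None
    last = incr[-1]
    if "endoffset" in last:            # step 1: graceful close recorded the end
        return int(last["endoffset"])
    if "startoffset" in last:          # step 2: crash, derive from the file size
        return int(last["startoffset"]) + last_incr_size
    return None                        # step 3: fall back to the mtime check
```

Given two nodes, the one whose helper returns the larger value would be treated as more up-to-date; a `None` on either side means falling back to comparing modification timestamps.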
||
|
|
870b6bd487
|
Added a shared secret over Redis cluster. (#13763)
This PR introduces a new secret that is shared by all the nodes in the Redis cluster. The main idea is to leverage the cluster bus to share a secret between all the nodes, so that later the nodes can authenticate using this secret and send internal commands to each other (see #13740 for more information about internal commands).
The shared secret is chosen as follows:
1. Each node, when it starts, randomly generates its own internal secret.
2. Each node shares its internal secret over the cluster ping messages.
3. If a node gets a ping message with a secret smaller than its current secret, it adopts that secret.
4. Eventually all nodes should adopt the minimal secret.
The convergence of the secret is as good as the convergence of the topology.
To extend the ping messages to contain the secret, we leverage the extension mechanism. Nodes that run an older Redis version will just ignore those extensions.
Specific tests were added to verify that eventually all nodes see the secrets. In addition, a verification of the secret was added to the test infra in `cluster_config_consistent` and `assert_cluster_state`.
|
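The "adopt the smaller secret" rule can be simulated in a few lines. This is a toy model under stated assumptions, not the cluster bus implementation: nodes are list entries, a "ping" is a direct comparison, the ring-shaped ping order is invented so that the run is deterministic, and secrets are compared as byte strings.

```python
from typing import List


def gossip_round(secrets: List[bytes]) -> bool:
    """Each node pings its ring neighbour once; return True if anything changed."""
    changed = False
    n = len(secrets)
    for sender in range(n):
        receiver = (sender + 1) % n
        # A node that sees a smaller secret in a ping adopts it.
        if secrets[sender] < secrets[receiver]:
            secrets[receiver] = secrets[sender]
            changed = True
    return changed


def converge(secrets: List[bytes], max_rounds: int = 100) -> int:
    """Run rounds until no node changes its secret; return the rounds used."""
    for round_no in range(max_rounds):
        if not gossip_round(secrets):
            return round_no
    raise RuntimeError("did not converge")
```

Because a secret is only ever replaced by a smaller one, the minimum can never be lost, and every node that is reachable through pings ends up holding it.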
||
|
|
73a9b916c9
|
Rdb channel replication (#13732)
This PR is based on: https://github.com/redis/redis/pull/12109 https://github.com/valkey-io/valkey/pull/60 Closes: https://github.com/redis/redis/issues/11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. 
As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. 
``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. 
Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use 
cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: https://github.com/valkey-io/valkey/pull/837 https://github.com/valkey-io/valkey/pull/1173 https://github.com/valkey-io/valkey/pull/804 https://github.com/valkey-io/valkey/pull/945 https://github.com/valkey-io/valkey/pull/989 --------- Co-authored-by: Yuan Wang <wangyuancode@163.com> Co-authored-by: debing.sun <debing.sun@redis.com> Co-authored-by: Moti Cohen <moticless@gmail.com> Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: Amit Nagler <58042354+naglera@users.noreply.github.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Co-authored-by: Ping Xie <pingxie@outlook.com> Co-authored-by: Ran Shidlansik <ranshid@amazon.com> Co-authored-by: ranshid <88133677+ranshid@users.noreply.github.com> Co-authored-by: xbasel <103044017+xbasel@users.noreply.github.com> |
||
|
|
64a40b20d9
|
Async IO Threads (#13695)
## Introduction
Redis introduced IO Thread in 6.0, allowing IO threads to handle client
request reading, command parsing and reply writing, thereby improving
performance. The current IO thread implementation has a few drawbacks.
- The main thread is blocked during IO thread read/write operations and
must wait for all IO threads to complete their current tasks before it
can continue execution. In other words, the entire process is
synchronous. This prevents the efficient utilization of multi-core CPUs
for parallel processing.
- When the number of clients and requests increases moderately, it
causes all IO threads to reach full CPU utilization due to the busy wait
mechanism used by the IO threads. This makes it challenging for us to
determine which part of Redis has reached its bottleneck.
- When IO threads are enabled with TLS and io-threads-do-reads, a
disconnection of a connection with pending data may result in it being
assigned to multiple IO threads simultaneously. This can cause race
conditions and trigger assertion failures. Related issue:
redis#12540
Therefore, we designed an asynchronous IO threads solution. The IO
threads adopt an event-driven model, with the main thread dedicated to
command processing, meanwhile, the IO threads handle client read and
write operations in parallel.
## Implementation
### Overall
As before, we did not change the fact that all client commands must be
executed on the main thread, because Redis was originally designed to be
single-threaded, and processing commands in a multi-threaded manner
would inevitably introduce numerous race and synchronization issues. But
now each IO thread has its own event loop, so IO threads can use a
multiplexing approach to handle client read and write operations,
eliminating the CPU overhead caused by busy-waiting.
The execution process can be briefly described as follows:
the main thread assigns clients to IO threads after accepting
connections. IO threads notify the main thread when clients
finish reading and parsing queries. The main thread then processes
the queries from the IO threads and generates replies. After receiving
the client list back from the main thread, IO threads write the replies
to clients and then continue to handle client read and write events.
### Each IO thread has independent event loop
We now assign each IO thread its own event loop. This approach
eliminates the need for the main thread to perform the costly
`epoll_wait` operation for handling connections (except for specific
ones). Instead, the main thread processes requests from the IO threads
and hands them back once completed, fully offloading read and write
events to the IO threads.
Additionally, all TLS operations, including handling pending data, have
been moved entirely to the IO threads. This resolves the issue where
io-threads-do-reads could not be used with TLS.
### Event-notified client queue
To facilitate communication between the IO threads and the main thread,
we designed an event-notified client queue. Each IO thread and the main
thread have two such queues to store clients waiting to be processed.
These queues are also integrated with the event loop to enable handling.
We use `pthread_mutex` to ensure the safety of queue operations, as well
as data visibility and ordering. Race conditions are minimized because
each IO thread and the main thread operate on independent queues, which
avoids thread suspension due to lock contention. We also implemented an
event notifier based on `eventfd` or `pipe` to support event-driven
handling.
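The idea of a mutex-protected queue that wakes an event loop through a file descriptor can be sketched as follows. This is a minimal model and not the Redis data structure: the class and function names are hypothetical, a `pipe` is used as the notifier (the text mentions `eventfd` as the alternative), and a POSIX system is assumed so that pipes can be polled.

```python
import os
import selectors
import threading
from collections import deque


class EventNotifiedQueue:
    """Mutex-protected queue whose read end can be registered in an event loop."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = deque()
        self._rfd, self._wfd = os.pipe()
        os.set_blocking(self._rfd, False)

    def fileno(self) -> int:
        return self._rfd

    def push(self, item) -> None:
        with self._lock:
            was_empty = not self._items
            self._items.append(item)
        if was_empty:
            os.write(self._wfd, b"x")   # wake the consumer's event loop

    def drain(self) -> list:
        try:
            os.read(self._rfd, 4096)    # clear pending notifications
        except BlockingIOError:
            pass
        with self._lock:
            items = list(self._items)
            self._items.clear()
        return items

    def close(self) -> None:
        os.close(self._rfd)
        os.close(self._wfd)


def wait_and_drain(queue: EventNotifiedQueue, timeout: float = 1.0) -> list:
    """Block in the selector until the queue is notified, then take everything."""
    with selectors.DefaultSelector() as sel:
        sel.register(queue, selectors.EVENT_READ)
        if sel.select(timeout):
            return queue.drain()
    return []
```

Notifying only on the empty-to-non-empty transition keeps the pipe from filling up. A spurious wakeup is possible and harmless, since the consumer then drains an empty list.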
### Thread safety
Since the main thread and IO threads can execute in parallel, we must
handle data race issues carefully.
**client->flags**
The primary tasks of IO threads are reading and writing, i.e.
`readQueryFromClient` and `writeToClient`. However, IO threads and the
main thread may concurrently modify or access `client->flags`, leading
to potential race conditions. To address this, we introduced an io-flags
variable to record operations performed by IO threads, thereby avoiding
race conditions on `client->flags`.
**Pause IO thread**
The main thread may need to operate on data owned by IO threads, for
example to uninstall an event handler, access or modify a query/output
buffer, or resize an event loop, and it needs a clean and safe context
to do so. We pause the IO thread in `IOThreadBeforeSleep`, do the work,
and then resume it. To avoid suspending the thread, we use busy waiting
to confirm the target status. We also use an atomic variable to ensure
memory visibility and ordering. We introduced the following functions to
pause and resume IO threads.
```
pauseIOThread, resumeIOThread
pauseAllIOThreads, resumeAllIOThreads
pauseIOThreadsRange, resumeIOThreadsRange
```
Testing has shown that `pauseIOThread` is highly efficient, allowing the
main thread to execute nearly 200,000 operations per second during
stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can
handle up to nearly 56,000 operations per second. But operations
performed between pausing and resuming IO threads must be quick;
otherwise, they could cause the IO threads to reach full CPU
utilization.
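The pause handshake can be modelled with a shared state variable and busy waiting on both sides. This sketch is an assumption-laden illustration, not the Redis implementation: the class, the state names, and the helper functions are hypothetical, and a plain Python attribute protected by the GIL stands in for the atomic variable.

```python
import threading
import time

ACTIVE, PAUSING, PAUSED = range(3)


class IOThreadSketch:
    def __init__(self):
        self.state = ACTIVE      # stand-in for an atomic variable
        self.running = True
        self.ticks = 0           # counts simulated event-loop iterations
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        while self.running:
            self.ticks += 1      # stand-in for handling read/write events
            self._before_sleep()

    def _before_sleep(self):
        # The IO thread acknowledges a pause request only at this point, so
        # it is never stopped in the middle of handling a client.
        if self.state == PAUSING:
            self.state = PAUSED
            while self.state == PAUSED and self.running:
                time.sleep(0)    # busy-wait, yielding the GIL


def pause_io_thread(t: IOThreadSketch) -> None:
    t.state = PAUSING
    while t.state != PAUSED:     # busy-wait until the IO thread confirms
        time.sleep(0)


def resume_io_thread(t: IOThreadSketch) -> None:
    t.state = ACTIVE
```

Between `pause_io_thread` and `resume_io_thread` the main thread can touch the IO thread's clients safely. The IO thread spins for that whole window, which is why the work done in between has to be short.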
**freeClient and freeClientAsync**
The main thread may need to terminate a client currently running on an
IO thread, for example, due to ACL rule changes, reaching the output
buffer limit, or evicting a client. In such cases, we need to pause the
IO thread to safely operate on the client.
**maxclients and maxmemory-clients updating**
When adjusting `maxclients`, we need to resize the event loop for all IO
threads. Similarly, when modifying `maxmemory-clients`, we need to
traverse all clients to calculate their memory usage. To ensure safe
operations, we pause all IO threads during these adjustments.
**Client info reading**
The main thread may need to read a client’s fields to generate a
descriptive string, such as for the `CLIENT LIST` command or logging
purposes. In such cases, we need to pause the IO thread handling that
client. If information for all clients needs to be displayed, all IO
threads must be paused.
**Tracking redirect**
Redis supports the tracking feature and can even send invalidation
messages to a connection with a specified ID. But the target client may
be running on an IO thread. Directly manipulating the client’s output
buffer is not thread-safe, and the IO thread may not be aware that the
client requires a response. In such cases, we pause the IO thread
handling the client, modify the output buffer, and install a write event
handler to ensure proper handling.
**clientsCron**
In the `clientsCron` function, the main thread needs to traverse all
clients to perform operations such as timeout checks, verifying whether
they have reached the soft output buffer limit, resizing the
output/query buffer, or updating memory usage. To safely operate on a
client, the IO thread handling that client must be paused.
If we were to pause the IO thread for each client individually, the
efficiency would be very low. Conversely, pausing all IO threads
simultaneously would be costly, especially when there are many IO
threads, as clientsCron is invoked relatively frequently.
To address this, we adopted a batched approach for pausing IO threads.
At most, 8 IO threads are paused at a time. The operations mentioned
above are only performed on clients running in the paused IO threads,
significantly reducing overhead while maintaining safety.
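The batching described above can be sketched as a loop over ranges of thread ids. The helper names and callback parameters here are hypothetical and the real `clientsCron` does more work per call. The sketch only shows the shape: pause a range of at most 8 threads, visit the clients that belong to that range, then resume it.

```python
from typing import Callable, Iterator, List, Sequence

MAX_PAUSED_BATCH = 8


def thread_batches(n_threads: int, batch_size: int = MAX_PAUSED_BATCH) -> Iterator[range]:
    """Yield consecutive ranges of thread ids, each at most batch_size long."""
    for start in range(0, n_threads, batch_size):
        yield range(start, min(start + batch_size, n_threads))


def clients_cron_sketch(
    clients_by_thread: Sequence[List[str]],
    pause_range: Callable[[int, int], None],
    resume_range: Callable[[int, int], None],
    visit: Callable[[str], None],
) -> None:
    for batch in thread_batches(len(clients_by_thread)):
        first, last = batch.start, batch.stop - 1
        pause_range(first, last)
        try:
            # Only clients owned by the currently paused threads are touched.
            for tid in batch:
                for client in clients_by_thread[tid]:
                    visit(client)
        finally:
            resume_range(first, last)
```

With 20 threads this pauses ids 0-7, then 8-15, then 16-19, so no more than 8 threads are ever stalled at once.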
### Observability
In the current design, the main thread always assigns clients to the IO
thread with the least clients. To clearly observe the number of clients
handled by each IO thread, we added the new section in INFO output. The
`INFO THREADS` section can show the client count for each IO thread.
```
# Threads
io_thread_0:clients=0
io_thread_1:clients=2
io_thread_2:clients=2
```
Additionally, in the `CLIENT LIST` output, we also added a field to
indicate the thread to which each client is assigned.
`id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name=
lib-ver= io-thread=1`
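As a sketch of how the new section could be consumed, the snippet below parses the `# Threads` lines shown above and picks the thread with the fewest clients. The function names are hypothetical. Excluding thread 0 from the choice and breaking ties by lowest id are assumptions of this sketch, based only on the statement elsewhere in this message that the main thread is counted as the first IO thread.

```python
from typing import Dict


def parse_info_threads(text: str) -> Dict[int, int]:
    """Map IO thread id to its client count from an INFO THREADS section."""
    counts: Dict[int, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("io_thread_"):
            continue
        name, _, fields = line.partition(":")
        thread_id = int(name[len("io_thread_"):])
        kv = dict(field.split("=", 1) for field in fields.split(","))
        counts[thread_id] = int(kv["clients"])
    return counts


def pick_io_thread(counts: Dict[int, int]) -> int:
    """Choose the least-loaded thread, skipping id 0 (assumed main thread)."""
    candidates = {tid: n for tid, n in counts.items() if tid != 0}
    return min(candidates, key=lambda tid: (candidates[tid], tid))
```

For the sample output above, the parsed counts are `{0: 0, 1: 2, 2: 2}`, and under the stated assumptions the next client would go to thread 1.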
## Trade-off
### Special Clients
For certain special types of clients, keeping them running on IO threads
would result in severe race issues that are difficult to resolve.
Therefore, we chose not to offload these clients to the IO threads.
For replica, monitor, subscribe, and tracking clients, the main thread
may write a reply to them directly when certain conditions are met. The
resulting races are difficult to resolve, so these clients are processed
in the main thread. This also applies to Lua debug clients, since we may
operate on the connection directly.
For a blocking client, after the IO thread reads and parses a command
and hands it over to the main thread, the client remains in the main
thread if it is identified as blocking. Once the blocking operation
completes and the reply is generated, the client is transferred back to
the IO thread, which sends the reply and waits for further events.
### Clients Eviction
To support client eviction, it is necessary to update each client’s
memory usage promptly during operations such as read, write, or command
execution. However, when a client operates on an IO thread, it is not
feasible to update the memory usage immediately due to the risk of data
races. As a result, memory usage can only be updated either in the main
thread while processing commands or periodically in `clientsCron`.
The downside of this approach is that updates might experience a delay
of up to one second, which could impact the precision of memory
management for eviction.
To avoid incorrectly evicting clients, we adopted a best-effort
compensation: when we decide to evict a client, we update its memory
usage again just before evicting. If the client's memory usage has not
decreased, or its memory usage bucket has not changed, we evict it;
otherwise, we do not.
However, we have not completely solved this problem. Due to the delay in
memory usage updates, it may lead us to make incorrect decisions about
the need to evict clients.
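The re-check before eviction can be written as a single predicate. This is a sketch of the rule as stated above, not the Redis code: the function names are hypothetical, and the power-of-two bucketing is an assumption chosen only to make the example concrete.

```python
def usage_bucket(usage_bytes: int) -> int:
    """Hypothetical bucketing: clients are grouped by power-of-two size class."""
    return usage_bytes.bit_length()


def should_still_evict(cached_usage: int, fresh_usage: int) -> bool:
    """Decide after refreshing a client's memory usage whether to evict it.

    cached_usage: the possibly stale value that made the client a candidate
    fresh_usage:  the value recomputed just before evicting
    """
    not_decreased = fresh_usage >= cached_usage
    same_bucket = usage_bucket(fresh_usage) == usage_bucket(cached_usage)
    return not_decreased or same_bucket
```

A client whose usage fell from 100000 to 90000 bytes stays in the same size class here and is still evicted. One that fell to 1000 bytes has left the class and is spared.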
### Defragment
In the majority of cases we do NOT use the data from argv directly in
the db.
1. key names
We store a copy that we allocate in the main thread, see `sdsdup()` in
`dbAdd()`.
2. hash key and value
We store key as hfield and store value as sds, see `hfieldNew()` and
`sdsdup()` in `hashTypeSet()`.
3. other datatypes
They don't even use SDS, so there are no reference issues.
But in some cases the data from argv may be retained by the main
thread.
As a result, during fragmentation cleanup, we need to move allocations
from the IO thread’s arena to the main thread’s arena. We always
allocate new memory in the main thread’s arena, but the memory released
by IO threads may not yet have been reclaimed. This ultimately causes
the fragmentation rate to be higher compared to creating and allocating
entirely within a single thread.
The following cases lead to memory allocated by the IO thread
being kept by the main thread.
1. string related command: `append`, `getset`, `mset` and `set`.
If `tryObjectEncoding()` does not change argv, we will keep it directly
in the main thread, see the code in `tryObjectEncoding()`(specifically
`trimStringObjectIfNeeded()`)
2. block related command.
the key names will be kept in `c->db->blocking_keys`.
3. watch command
the key names will be kept in `c->db->watched_keys`.
4. [s]subscribe command
channel name will be kept in `serverPubSubChannels`.
5. script load command
script will be kept in `server.lua_scripts`.
6. some module APIs: `RM_RetainString`, `RM_HoldString`
Those issues will be handled in other PRs.
## Testing
### Functional Testing
With IO threads enabled, the commit passes all TCL tests, but we made
some changes:
**Client query buffer**: In the original code, when using a reusable
query buffer, ownership of the query buffer would be released after the
command was processed. However, with IO threads enabled, the client
transitions from an IO thread to the main thread for processing. This
causes the ownership release to occur earlier than the command
execution. As a result, when IO threads are enabled, the client's
information will never indicate that a shared query buffer is in use.
Therefore, we skip the corresponding query buffer tests in this case.
**Defragment**: Add a new defragmentation test to verify the effect of
io threads on defragmentation.
**Command delay**: For deferred clients in TCL tests, due to clients
being assigned to different threads for execution, delays may occur. To
address this, we introduced conditional waiting: the process proceeds to
the next step only when the `client list` contains the corresponding
commands.
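The conditional waiting described for deferred clients amounts to polling until a condition holds. The test suite itself is written in TCL, so the Python below is only an illustration of the pattern: the helper names, the timeout values, and the way the `client list` text is obtained are all assumptions.

```python
import time
from typing import Callable


def wait_for(predicate: Callable[[], bool], timeout: float = 5.0, interval: float = 0.01) -> bool:
    """Poll predicate until it is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def client_is_running(client_list_output: str, cmd: str) -> bool:
    """True if any line of CLIENT LIST output reports the given command."""
    wanted = "cmd=" + cmd
    return any(wanted in line.split() for line in client_list_output.splitlines())
```

A test would call something like `wait_for(lambda: client_is_running(fetch_client_list(), "blpop"))` before moving on, where `fetch_client_list` is a placeholder for however the test obtains the `client list` output.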
### Sanitizer Testing
The commit passed all TCL tests and reported no errors when compiled
with the `-fsanitize=thread` and `-fsanitize=address` options enabled.
We made one modification: we suppressed the sanitizer warnings for
clients with watched keys when updating `client->flags`. IO threads read
`client->flags` but never modify it and never read the
`CLIENT_DIRTY_CAS` bit, while the main thread modifies only this bit, so
we believe there is no actual data race.
## Others
### IO thread number
In the new multi-threaded design, the main thread is primarily focused
on command processing to improve performance. Typically, the main thread
does not handle regular client I/O operations but is responsible for
clients such as replication and tracking clients. To avoid breaking
changes, we still consider the main thread as the first IO thread.
When the io-threads configuration is set to a low value (e.g., 2),
performance does not show a significant improvement compared to a
single-threaded setup for simple commands (such as SET or GET), as the
main thread does not consume much CPU for these simple operations. This
results in underutilized multi-core capacity. However, for more complex
commands, having a low number of IO threads may still be beneficial.
Therefore, it’s important to adjust the `io-threads` based on your own
performance tests.
Additionally, you can clearly monitor the CPU utilization of the main
thread and IO threads using `top -H -p $redis_pid`. This allows you to
easily identify where the bottleneck is. If the IO thread is the
bottleneck, increasing the `io-threads` will improve performance. If the
main thread is the bottleneck, the overall performance can only be
scaled by increasing the number of shards or replicas.
---------
Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: oranagra <oran@redislabs.com>
|
||
|
|
79fd255828
|
Add Lua VM memory to memory overhead, now that it's part of zmalloc (#13660)
This complements the work done in #13133, which made the script VMs' memory count as part of zmalloc. That means it should also be counted as part of the non-value overhead. This commit contains some refactoring to make variable names and function names less confusing. It also adds a new field named `script.VMs` to the `MEMORY STATS` command. Additionally, scripts and stats are now cleared between tests in external mode (which is related to how this issue was discovered). |
||
|
|
6c5e263d7b
|
Temporarily hide the new SFLUSH command by marking it as experimental (#13600)
- Add a new 'EXPERIMENTAL' command flag, which causes the command generator to skip over it and make the command to be unavailable for execution - Skip experimental tests by default - Move the SFLUSH tests from the old framework to the new one --------- Co-authored-by: YaacovHazan <yaacov.hazan@redislabs.com> |
||
|
|
3fcddfb61f
|
testsuite --dump-logs works on servers started before the test (#13500)
So far, `./runtest --dump-logs` only worked for servers started within the test proc. Now it also works for servers started outside the test proc scope. The downside is that these logs can be huge if they served many tests and not just the failing one, but for some rare failures we would rather have that than nothing. This feature isn't enabled by default, but it is used by our GH actions. |
||
|
|
e750c619b2
|
Fix some test failures caused by key being deleted due to premature expiration (#13453)
1. Fix fuzzer test failure when the key was deleted due to expiration
before sending random traffic for the key.
After HFE, when all fields in a hash are expired, the hash might be
deleted due to expiration.
If the key expires between the `RESTORE` command and sending random
traffic, the `fuzzer` test will fail in the following code because `TYPE
key` returns `none`, which then throws an exception because it cannot be
found in `$commands`
|
||
|
|
a331978583
|
Fix external test hang in redis-cli test when run in a certain order (#13423)
When the tests are run against an external server in this order: `--single unit/introspection --single unit/moduleapi/blockonbackground --single integration/redis-cli` the test would hang when the "ASK redirect test" test attempts to create a listening socket (it fails, and then redis-cli itself hangs waiting for a non-responsive socket created by the introspection test). the reasons are: 1. the blockedbackground test includes util.tcl and resets the `::last_port_attempted` variable 2. the test in introspection didn't close the listening server, so it's still alive. 3. find_available_port doesn't properly detect the busy port, and it thinks that the port is free even though it's busy. fixing all 3 of these problems, even though fixing just one would be enough to let the test pass. |
||
|
|
fa46aa4d85
|
Test infra adjustments for external CI runs (#13421)
- when uploading server logs, make sure they don't overwrite each other. - sort the test units to get consistent order between them (following #13220) - backup and restore the entire server configuration, to protect one unit from config changes another unit performs |
||
|
|
69b480cb7a
|
Hide user data from log (#13400)
This PR is based on the commits from PR #11747. In the event of an assertion failure, hide command arguments from the operator. In some cases, private client information can be unintentionally exposed when a redis instance crashes due to an assertion failure. This commit prevents unintentional client info exposure. Operators can still access the hidden data, but they must actively request it. All of the client info commands remain unchanged. ### Config Add a new config `hide-user-data-from-log` to turn this feature on and off, default off. --------- Co-authored-by: naglera <anagler123@gmail.com> Co-authored-by: naglera <58042354+naglera@users.noreply.github.com> |
||
|
|
a84cc20aef
|
HFE - Fix statistic to count also lazy expired and rename INFO params (#13372)
* INFO command : rename `hashes_with_expiry_fields` to `subexpiry` * INFO command : rename `expired_hash_fields` to `expired_subkeys` * Fix statistic of `expired_subkeys` to also count lazily expired fields * Remove leftover TODO comments in TCL * Fix potential flaky test of rdb load of hash-field-expiration |
||
|
|
5a3534f9b5
|
dynamically list test files (#13220)
**Related issue** https://github.com/redis/redis/issues/13219 **Motivation** Currently we have to manually update the all_tests variable when introducing new test files. **Modification** I have modified it to list test files dynamically, but instead of adding all test files, it only adds test files from the following 4 paths - unit - unit/type - unit/cluster - integration so that it doesn't deviate too much from what we already do **Result** - dynamically list test files to all_tests variable - close issue https://github.com/redis/redis/issues/13219 **Additional information** - removed `list-common.tcl` file and added `generate_largevalue_test_array` proc in `util.tcl`, because `list-common.tcl` is not a test file - There is an order dependency, so I added code to the "Is a ziplist encoded Hash promoted on big payload?" test that resets hash-max-listpack-value to the default (64). --------- Signed-off-by: jonghoonpark <dev@jonghoonpark.com> Co-authored-by: debing.sun <debing.sun@redis.com> |
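The listing logic described in this commit can be sketched in Python as follows. This is only an illustration under stated assumptions: the real implementation is Tcl, `list_test_files` and `TEST_DIRS` are hypothetical names, and sorting the result is an assumption made here for deterministic output.

```python
from pathlib import Path
from typing import List

# The four directories named in the commit message; scanned non-recursively.
TEST_DIRS = ["unit", "unit/type", "unit/cluster", "integration"]


def list_test_files(tests_root: str) -> List[str]:
    """Return test names such as 'unit/type/hash' found under tests_root."""
    root = Path(tests_root)
    found = []
    for rel_dir in TEST_DIRS:
        directory = root / rel_dir
        if not directory.is_dir():
            continue
        # glob("*.tcl") matches direct children only, so other
        # subdirectories of 'unit' are not picked up implicitly.
        for path in directory.glob("*.tcl"):
            found.append(f"{rel_dir}/{path.stem}")
    # Sorting is an assumption of this sketch, not stated by the commit.
    return sorted(found)
```

Restricting the scan to an explicit directory list keeps helper files and other test trees out of the default run, which matches the stated goal of not deviating much from the previous hand-maintained list.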
||
|
|
33fc0fbfae
|
HFE to support AOF and replicas (#13285)
* For the replica's sake, rewrite commands `H*EXPIRE*`, `HSETF`, `HGETF` to have absolute unix time in msec. * On active-expiration of a field, propagate HDEL to the replica (`propagateHashFieldDeletion()`) * On lazy-expiration, propagate HDEL to the replica (`hashTypeGetValue()` now calls `hashTypeDelete()`. It also takes care to call `propagateHashFieldDeletion()`). * Fix `H*EXPIRE*` command such that if it gets flag `LT` and the field doesn't have any expiration, the condition is considered valid. Note: replicas don't perform any active expiration, and should avoid lazy expiration. On a replica, `hashTypeGetValue()` doesn't check expiration (as long as the master didn't request to delete the field, it is valid) TODO: * Attach `dbid` to HASH metadata. See [here](https://github.com/redis/redis/pull/13209#discussion_r1593385850) --------- Co-authored-by: debing.sun <debing.sun@redis.com> |
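The first bullet, rewriting expiration commands to an absolute unix time in milliseconds, can be illustrated with a short Python sketch. `to_absolute_ms` is a hypothetical helper, not Redis code; it assumes the usual unit conventions of the four `H*EXPIRE*` variants (relative or absolute, seconds or milliseconds) and does not cover `HSETF` or `HGETF`.

```python
def to_absolute_ms(command: str, when: int, now_ms: int) -> int:
    """Convert the time argument of a hash-field expire command
    into an absolute unix timestamp in milliseconds.

    Hypothetical helper for illustration only.
    """
    name = command.upper()
    if name == "HEXPIRE":       # relative, seconds
        return now_ms + when * 1000
    if name == "HPEXPIRE":      # relative, milliseconds
        return now_ms + when
    if name == "HEXPIREAT":     # absolute, seconds
        return when * 1000
    if name == "HPEXPIREAT":    # absolute, milliseconds
        return when
    raise ValueError(f"unsupported command: {command}")
```

A relative TTL applied later on a replica, or replayed from an AOF, would produce a different deadline than on the master; an absolute timestamp gives the same deadline regardless of when the command is applied.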
||
|
|
36c3cec6d1
|
Fix hfe RDB tests by adding FIELDS keyword to hexpire commands (#13277)
FIELDS keyword was added as part of [#13270](https://github.com/redis/redis/pull/13270). It was missing in [#13243](https://github.com/redis/redis/pull/13243) |
||
|
|
323be4d699
|
Hfe serialization listpack (#13243)
Add RDB de/serialization for HFE This PR adds two new RDB types: `RDB_TYPE_HASH_METADATA` and `RDB_TYPE_HASH_LISTPACK_TTL` to save HFE data. When the hash RAM encoding is dict, it will be saved in the former, and when it is listpack it will be saved in the latter. Both formats just add the TTL value for each field after the data that was previously saved, i.e. HASH_METADATA will save the number of entries and, for each entry, key, value and TTL, whereas listpack is saved as a blob. On read, the usual dict <--> listpack conversion takes place if required. In addition, when reading a hash that was saved as a dict, fields are actively expired if expiry is due. Currently this also holds for listpack encoding, but it is supposed to be removed. TODO: Remove active expiry on load when loading from listpack format (unless we'll decide to keep it) |
||
|
|
0b34396924
|
Change license from BSD-3 to dual RSALv2+SSPLv1 (#13157)
[Read more about the license change here](https://redis.com/blog/redis-adopts-dual-source-available-licensing/) Live long and prosper 🖖 |
||
|
|
3c2ea1ea95
|
Fix watched client test timing issue caused by late close (#13062)
There is a timing issue in the test: the close may arrive late, or freeClientAsync may free the client asynchronously, which will lead to errors in the watching_clients statistics, since we only unwatch all keys when we truly freeClient. Add a wait here to avoid this problem. Also fixed some outdated comments I saw. The test was introduced in #12966. |
||
|
|
b3aaa0a136
|
When one shard, sole primary node marks potentially failed replica as FAIL instead of PFAIL (#12824)
Fixes issue where a single primary cannot mark a replica as failed in a single-shard cluster. |
||
|
|
8bb9a2895e
|
Address some failures with new tests for improving debug report (#12915)
Fix a daily test failure because alpine doesn't support stack traces and add in an extra assertion related to making sure the stack trace was printed twice. |
||
|
|
c85a9b7896
|
Fix delKeysInSlot server events are not executed inside an execution unit (#12745)
This is a follow-up fix to #12733. We need to apply the same changes to delKeysInSlot. Refer to #12733 for more details. This PR contains some other minor cleanups / improvements to the test suite and docs. It uses the postnotifications test module in a cluster mode test which revealed a leak in the test module (fixed). |
||
|
|
0270abda82
|
Replace cluster metadata with slot specific dictionaries (#11695)
This is an implementation of https://github.com/redis/redis/issues/10589 that eliminates 16 bytes per entry in cluster mode, which are currently used to create a linked list between entries in the same slot. The main idea is splitting the main dictionary into 16k smaller dictionaries (one per slot), so we can perform all slot specific operations, such as iteration, without any additional info in the `dictEntry`. For Redis cluster, the expectation is that there will be a larger number of keys, so the fixed overhead of 16k dictionaries will be The expire dictionary is also split up so that each slot is logically decoupled, so that in subsequent revisions we will be able to atomically flush a slot of data. ## Important changes * Incremental rehashing - one big change here is that it's not one, but rather up to 16k dictionaries that can be rehashing at the same time. In order to keep track of them, we introduce a separate queue for dictionaries that are rehashing. Also, instead of rehashing a single dictionary, the cron job will now try to rehash as many as it can in 1ms. * getRandomKey - now needs to not only select a random key from a random bucket, but also needs to select a random dictionary. Fairness is a major concern here, as it's possible that keys can be unevenly distributed across the slots. To address this we introduced a binary index tree. With that data structure we are able to efficiently find a random slot using binary search in O(log^2(slot count)) time. * Iteration efficiency - when iterating a dictionary with a lot of empty slots, we want to skip them efficiently. We can do this using the same binary index that is used for random key selection; this index allows us to find a slot for a specific key index. For example, if there are 10 keys in slot 0, then we can quickly find the slot that contains the 11th key using binary search on top of the binary index tree. 
* scan API - in order to perform a scan across the entire DB, the cursor now needs to save not only the position within the dictionary but also the slot id. In this change we append the slot id into the LSB of the cursor so it can be passed around between the client and the server. This has an interesting side effect: you can now start scanning a specific slot by simply providing the slot id as a cursor value. The plan is to not document this as defined behavior, however. It's also worth noting that the SCAN API is now technically incompatible with previous versions, although practically we don't believe it's an issue. * Checksum calculation optimizations - During command execution, we know that all of the keys are from the same slot (outside of a few notable exceptions such as cross slot scripts and modules). We don't want to compute the checksum multiple times, hence we rely on the cached slot id in the client during command execution. All operations that access random keys should either pass in the known slot or recompute the slot. * Slot info in RDB - in order to resize individual dictionaries correctly while loading RDB, it's not enough to know the total number of keys (of course we could approximate the number of keys per slot, but it won't be precise). To address this issue, we've added additional metadata into RDB that contains the number of keys in each slot, which can be used as a hint during loading. * DB size - besides the `DBSIZE` API, we need to know the size of the DB in many places. In order to avoid scanning all dictionaries and summing up their sizes in a loop, we've introduced a new field into `redisDb` that keeps track of `key_count`. This way we can keep the DBSIZE operation O(1). This is also kept for O(1) expires computation as well. ## Performance This change improves SET performance in cluster mode by ~5%; most of the gains come from not having to maintain linked lists for keys in a slot. Non-cluster mode has the same performance. 
For workloads that rely on evictions, the performance is similar because of the extra overhead for finding keys to evict. RDB loading performance is slightly reduced, as the slot of each key needs to be computed during the load. ## Interface changes * Removed `overhead.hashtable.slot-to-keys` from `MEMORY STATS` * Scan API will now require 64 bits to store the cursor, even on 32 bit systems, as the slot information will be stored. * New RDB version to support the new op code for SLOT information. --------- Co-authored-by: Vitaly Arbuzov <arvit@amazon.com> Co-authored-by: Harkrishn Patro <harkrisp@amazon.com> Co-authored-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com> |
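The slot lookup described under "getRandomKey" and "Iteration efficiency" can be sketched with a binary indexed (Fenwick) tree over per-slot key counts. The following Python sketch is an illustration, not the Redis implementation (which is in C); the class and method names are hypothetical. It uses a binary search over prefix sums, which gives the O(log^2(slot count)) bound mentioned in the commit message.

```python
import random


class SlotKeyIndex:
    """Binary indexed (Fenwick) tree over per-slot key counts (sketch)."""

    def __init__(self, num_slots: int) -> None:
        self.num_slots = num_slots
        self.tree = [0] * (num_slots + 1)  # 1-based internal indexing

    def add(self, slot: int, delta: int) -> None:
        """Adjust the key count of `slot` (0-based) by `delta`."""
        i = slot + 1
        while i <= self.num_slots:
            self.tree[i] += delta
            i += i & -i

    def prefix(self, slot_count: int) -> int:
        """Total number of keys in slots [0, slot_count)."""
        total, i = 0, slot_count
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def total(self) -> int:
        return self.prefix(self.num_slots)

    def find_slot(self, key_index: int) -> int:
        """Return the slot holding the key with 1-based global index."""
        if not 1 <= key_index <= self.total():
            raise IndexError("key index out of range")
        lo, hi = 0, self.num_slots - 1
        while lo < hi:  # smallest slot whose prefix sum reaches key_index
            mid = (lo + hi) // 2
            if self.prefix(mid + 1) >= key_index:
                hi = mid
            else:
                lo = mid + 1
        return lo

    def random_slot(self) -> int:
        """Pick a slot with probability proportional to its key count."""
        return self.find_slot(random.randint(1, self.total()))
```

With 10 keys in slot 0 and 3 keys in slot 5, `find_slot(11)` returns 5, skipping the empty slots in between without visiting them, and `random_slot()` never returns an empty slot.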
||
|
|
f0c1c730d4
|
test suite: clean server pids after server crashed (#12639)
when a server in the test suite crashes and is restarted by restart_server, we didn't clean its pid from the list. we can see that when the corrupt-dump-fuzzer hangs, it has a long list of servers to clean, but in fact they're all already dead. |
||
|
|
2e0f6724e0
|
Stabilization and improvements around aof tests (#12626)
In some tests, the code manually searches for a log message, and it uses tail -1 with a delay of 1 second, which can miss the expected line. Also, because the aof tests use start_server_aof and not start_server, the test name isn't logged in the server log. To fix the above, I made the following changes: - Change the start_server_aof to wrap the start_server. This will add the created aof server to the servers list, and make srv() and wait_for_log_messages() available for the tests. - Introduce a new option for start_server. 'wait_ready' - an option to let the caller start the test code without waiting for the server to be ready. Useful for tests on a server that is expected to exit on startup. - Create a new start_server_aof_ex. The new proc also accepts options as an argument and makes use of the new 'short_life' option for tests that are expected to exit on startup because of some error in the aof file(s). Because of the above, I had to change many lines and replace every local srv variable (a server config) usage with the srv(). |
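The weakness of checking only the last log line can be illustrated with a Python sketch of the alternative: poll the whole log from a known starting line until a match appears. This is a hypothetical helper written for illustration, not the suite's Tcl `wait_for_log_messages`; the parameter names and the use of regular expressions are assumptions.

```python
import re
import time


def wait_for_log_message(path, pattern, from_line=0, retries=50, delay=0.1):
    """Poll the log at `path` until a line at or after `from_line` matches.

    Returns the 0-based index of the first matching line and raises
    TimeoutError if nothing matches within `retries` attempts.
    """
    regex = re.compile(pattern)
    for _ in range(retries):
        try:
            with open(path, encoding="utf-8", errors="replace") as log:
                for index, line in enumerate(log):
                    if index >= from_line and regex.search(line):
                        return index
        except FileNotFoundError:
            pass  # the server may not have created its log yet
        time.sleep(delay)
    raise TimeoutError(f"no log line matching {pattern!r} in {path}")
```

Because every line from `from_line` onward is inspected on each attempt, the expected message is still found when the server writes further lines after it, which is the case a `tail -1` check misses.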
||
|
|
6abb3c4038
|
change log match to line match in tcl sanitizer_errors_from_file. (#12446)
In the tcl foreach loop, the function should compare each line rather than the whole file. |
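The difference between matching per line and matching the whole file can be shown with a small Python sketch. The real function is Tcl; `sanitizer_errors` is a hypothetical stand-in, and the specific strings it looks for are assumptions chosen for illustration.

```python
def sanitizer_errors(log_text: str) -> str:
    """Return the whole log if any single line reports a sanitizer error,
    otherwise return an empty string."""
    for line in log_text.splitlines():
        # Assumed benign warning that should be skipped on its own line.
        if "WARNING: AddressSanitizer failed to allocate" in line:
            continue
        # Assumed error markers, checked against this line only.
        if "runtime error" in line or "Sanitizer" in line:
            return log_text
    return ""
```

If the checks inside the loop were applied to the whole log instead of `line`, a single ignorable warning anywhere in the file would cause every iteration to be skipped, hiding a real error reported on another line.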