Commit graph

164 commits

Author SHA1 Message Date
debing.sun
fa040a72c0
Add XDELEX and XACKDEL commands for stream (#14130)
## Summary and detailed design for new stream command

## XDELEX

### Syntax
```
XDELEX key [KEEPREF | DELREF | ACKED] IDS numids id [id ...]
```

### Description
The `XDELEX` command extends the Redis Streams `XDEL` command, offering
enhanced control over message entry deletion with respect to consumer
groups. It accepts optional `DELREF` or `ACKED` parameters to modify its
behavior:

- **KEEPREF:** Deletes the specified entries from the stream, but
preserves existing references to these entries in all consumer groups'
PEL. This behavior is similar to XDEL.
- **DELREF:** Deletes the specified entries from the stream and also
removes all references to these entries from all consumer groups'
pending entry lists, effectively cleaning up all traces of the messages.
- **ACKED:** Only trims entries that were read and acknowledged by all
consumer groups.

**Note:** The `IDS` block can appear at any position in the command,
consistent with other commands.

### Reply
Array reply, for each `id`:
- `-1`: No such `id` exists in the provided stream `key`.
- `1`: Entry was deleted from the stream.
- `2`: Entry was not deleted, but there are still dangling references.
(ACKED option)

## XACKDEL

### Syntax
```
XACKDEL key group [KEEPREF | DELREF | ACKED] IDS numids id [id ...]
```

### Description
The `XACKDEL` command combines `XACK` and `XDEL` functionalities in
Redis Streams. It acknowledges specified message IDs in the given
consumer group and attempts to delete corresponding stream entries. It
accepts optional `DELREF` or `ACKED` parameters:

- **KEEPREF:** Acknowledges the messages in the specified consumer group
and deletes the entries from the stream, but preserves existing
references to these entries in all consumer groups' PEL.
- **DELREF:** Acknowledges the messages in the specified consumer group,
deletes the entries from the stream, and also removes all references to
these entries from all consumer groups' pending entry lists, effectively
cleaning up all traces of the messages.
- **ACKED:** Acknowledges the messages in the specified consumer group
and only trims entries that were read and acknowledged by all consumer
groups.


### Reply
Array reply, for each `id`:
- `-1`: No such `id` exists in the provided stream `key`.
- `1`: Entry was acknowledged and deleted from the stream.
- `2`: Entry was acknowledged but not deleted, but there are still
dangling references. (ACKED option)

# Redis Streams Commands Extension

## XTRIM

### Syntax
```
XTRIM key <MAXLEN | MINID> [= | ~] threshold [LIMIT count] [KEEPREF | DELREF | ACKED]
```

### Description
The `XTRIM` command trims a stream by removing entries based on
specified criteria, extended to include optional `DELREF` or `ACKED`
parameters for consumer group handling:

- **KEEPREF:** Trims the stream according to the specified strategy
(MAXLEN or MINID) regardless of whether entries are referenced by any
consumer groups, but preserves existing references to these entries in
all consumer groups' PEL.
- **DELREF:** Trims the stream according to the specified strategy and
also removes all references to the trimmed entries from all consumer
groups' PEL.
- **ACKED:** Only trims entries that were read and acknowledged by all
consumer groups.

### Reply
No change.

## XADD

### Syntax
```
XADD key [NOMKSTREAM] [<MAXLEN | MINID> [= | ~] threshold [LIMIT count]] [KEEPREF | DELREF | ACKED] <* | id> field value [field value ...]
```

### Description
The `XADD` command appends a new entry to a stream and optionally trims
it in the same operation, extended to include optional `DELREF` or
`ACKED` parameters for trimming behavior:

- **KEEPREF:** When trimming, removes entries from the stream according
to the specified strategy (MAXLEN or MINID), regardless of whether they
are referenced by any consumer groups, but preserves existing references
to these entries in all consumer groups' PEL.
- **DELREF:** When trimming, removes entries from the stream according
to the specified strategy and also removes all references to these
entries from all consumer groups' PEL.
- **ACKED:** When trimming, only removes entries that were read and
acknowledged by all consumer groups. Note that if the number of
referenced entries is bigger than MAXLEN, we will still stop.

### Reply
No change.

## Key implementation

Since we currently have no simple way to track the association between
an entry and consumer groups without iterating over all groups, we
introduce two mechanisms to establish this link. This allows us to
determine whether an entry has been seen by all consumer groups, and to
identify which groups are referencing it. With this links, we can break
the association when the entry is either acknowledged or deleted.

1) Added reference tracking between stream messages and consumer groups
using `cgroups_ref`
The cgroups_ref is implemented as a rax that maps stream message IDs to
lists of consumer groups that reference those messages, and streamNACK
stores the corresponding nodes of this list, so that the corresponding
groups can be deleted during `ACK`.
In this way, we can determine whether an entry has been seen but not
ack.
2) Store a cache minimum last_id in the stream structure.
The reason for doing this is that there is a situation where an entry
has never been seen by the consume group. In this case, we think this
entry has not been consumed either. If there is an "ACKED" option, we
cannot directly delete this entry either.
When a consumer group updates its last_id, we don’t immediately update
the cached minimum last_id. Instead, we check whether the group’s
previous last_id was equal to the current minimum, or whether the new
last_id is smaller than the current minimum (when using `XGROUP SETID`).
If either is true, we mark the cached minimum last_id as invalid, and
defer the actual update until the next time it’s needed.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: moticless <moticless@github.com>
Co-authored-by: Ozan Tezcan <ozantezcan@gmail.com>
Co-authored-by: Slavomir Kaslev <slavomir.kaslev@gmail.com>
Co-authored-by: Yuan Wang <yuan.wang@redis.com>
2025-07-01 21:00:42 +08:00
debing.sun
5ff81f68a3
Fix XPENDING reply schema for empty reply (#14129)
When the PEL is empty, the reply of `XPENDING` without `start` option
will be:
```
1) (integer) 0
2) (nil)
3) (nil)
4) (nil)
```

It is not an empty array, so we need to create an individual reply
schema for it.
2025-07-01 17:35:09 +08:00
Mincho Paskalev
8dfb823c51
Implement DIFF, DIFF1, ANDOR and ONE for BITOP (#13898)
This PR adds 4 new operators to the `BITOP` command - `DIFF`, `DIFF1`,
`ANDOR` and `ONE`. They enable redis clients to atomically do
non-trivial logical operations that are useful for checking membership
of a bitmap against a group of bitmaps.
 
* **DIFF**
    `BITOP DIFF dest srckey1 srckey2 [key...]`

    **Description**
DIFF(*X*, *A1*, *A2*, *...*, *AN*) = *X* ∧ ¬(*A1* ∨ *A2* ∨ *...* ∨
*AN*), i.e the set bits of *X* that are not set in any of *A1*, *A2*,
*…*, *AN*

    **NOTE**
    Command expects at least 2 source keys.

* **DIFF1**
    `BITOP DIFF1 dest srckey1 srckey2 [key...]`

    **Description**
DIFF1(*X*, *A1*, *A2*, *...*, *AN*) = ¬*X* ∧ (*A1* ∨ *A2* ∨ *...* ∨
*AN*), i.e the bits set in one or more of *A1*, *A2*, *…*, *AN* that are
not set in *X*

    **NOTE**
    Command expects at least 2 source keys.

* **ANDOR**
    `BITOP ANDOR dest srckey1 srckey2 [key...]`

    **Description**
ANDOR(*X*, *A1*, *A2*, *...*, *AN*) = *X* ∧ (*A1* ∨ *A2* ∨ *...* ∨
*AN*), i.e the set bits of X that are also set in *A1*, *A2*, *…*, *AN*

    **NOTE**
    Command expects at least 2 source keys.

* **ONE**
    `BITOP ONE dest key [key...]`

    **Description**
    ONE(*A1*, *A2*, *...*, *AN*) = *X*, where 
if *X[i]* is the *i*-th bit of *X* then *X[i] = 1* if and only if there
is m such that *A_m[i] = 1* and *An[i] = 0* for all *n != m*, i.e bit
*X[i]* is set only if it set in exactly one of *A1*, *A2*, *...*, *AN*

**Return value**
As in all other `BITOP` operators return value for all the new ones is
the number bytes of the longest key.

EDIT:
Besides adding the new commands couple more changes were made:
- Added AVX2 path for more optimized computation of the BITOP operations
(including the new ones)
- Removed the hard limit of max 16 source keys for the fast path to be
used - now no matter the number of keys we can enter the fast path given
keys are long enough.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-05-20 10:45:50 +03:00
debing.sun
658424fc83
Revert "Update history for ban-list propagation (#13749)" (#13827)
As discussed in
https://github.com/redis/redis/pull/13749#issuecomment-2673612941.
After #10398 we should record only the arguments and output changes in
the command history, while placing all others in the redis-doc, so
revert #13749.
2025-02-24 17:40:25 +08:00
Ozan Tezcan
6c202f495c
Remove DENYOOM flag from hexpire command (#13800)
Remove DENYOOM flag from hexpire / hexpireat / hpexpire / hpexpireat
commands.

h(p)expire(at) commands may allocate some memory but it is not that big.
Similary, we don't have DENYOOM flag for EXPIRE command. This change
will align EXPIRE and HEXPIRE commands in this manner.
2025-02-16 20:07:29 +03:00
Ozan Tezcan
e2608478b6
Add HGETDEL, HGETEX and HSETEX hash commands (#13798)
This PR adds three new hash commands: HGETDEL, HGETEX and HSETEX. These
commands enable user to do multiple operations in one step atomically
e.g. set a hash field and update its TTL with a single command.
Previously, it was only possible to do it by calling hset and hexpire
commands subsequently.

- **HGETDEL command**

  ```
  HGETDEL <key> FIELDS <numfields> field [field ...]
  ```
  
  **Description**  
  Get and delete the value of one or more fields of a given hash key
  
  **Reply**  
Array reply: list of the value associated with each field or nil if the
field doesn’t exist.

- **HGETEX command**

  ```
   HGETEX <key>  
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | PERSIST]
     FIELDS <numfields> field [field ...]
  ```

  **Description**
Get the value of one or more fields of a given hash key, and optionally
set their expiration

  **Options:**
  EX seconds: Set the specified expiration time, in seconds.
  PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
  PERSIST: Remove the time to live associated with the field.

  **Reply** 
Array reply: list of the value associated with each field or nil if the
field doesn’t exist.

- **HSETEX command**

  ```
  HSETEX <key>
     [FNX | FXX]
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | KEEPTTL]
     FIELDS <numfields> field value [field value...]
  ```
  **Description**
Set the value of one or more fields of a given hash key, and optionally
set their expiration

  **Options:**
  FNX: Only set the fields if all do not already exist.
  FXX: Only set the fields if all already exist.

  EX seconds: Set the specified expiration time, in seconds.
  PX milliseconds: Set the specified expiration time, in milliseconds.
EXAT timestamp-seconds: Set the specified Unix time at which the field
will expire, in seconds.
PXAT timestamp-milliseconds: Set the specified Unix time at which the
field will expire, in milliseconds.
  KEEPTTL: Retain the time to live associated with the field.

  
Note: If no option is provided, any associated expiration time will be
discarded similar to how SET command behaves.

  **Reply**
  Integer reply: 0 if no fields were set
  Integer reply: 1 if all the fields were set
2025-02-14 17:13:35 +03:00
Ozan Tezcan
09f8a2f374
Start AOFRW before streaming repl buffer during fullsync (#13758)
During fullsync, before loading RDB on the replica, we stop aof child to
prevent copy-on-write disaster.
Once rdb is loaded, aof is started again and it will trigger aof
rewrite. With https://github.com/redis/redis/pull/13732 , for rdbchannel
replication, this behavior was changed. Currently, we start aof after
replication buffer is streamed to db. This PR changes it back to start
aof just after rdb is loaded (before repl buffer is streamed)

Both approaches may have pros and cons. If we start aof before streaming
repl buffers, we may still face with copy-on-write issues as repl
buffers potentially include large amount of changes. If we wait until
replication buffer drained, it means we are delaying starting aof
persistence.

Additional changes are introduced as part of this PR:

- Interface change:
Added `mem_replica_full_sync_buffer` field to the `INFO MEMORY` command
reply. During full sync, it shows total memory consumed by accumulated
replication stream buffer on replica. Added same metric to `MEMORY
STATS` command reply as `replica.fullsync.buffer` field.
  
  
- Fixes: 
- Count repl stream buffer size of replica as part of 'memory overhead'
calculation for fields in "INFO MEMORY" and "MEMORY STATS" outputs.
Before this PR, repl buffer was not counted as part of memory overhead
calculation, causing misreports for fields like `used_memory_overhead`
and `used_memory_dataset` in "INFO STATS" and for `overhead.total` field
in "MEMORY STATS" command reply.
- Dismiss replication stream buffers memory of replica in the fork to
reduce COW impact during a fork.
- Fixed a few time sensitive flaky tests, deleted a noop statement,
fixed some comments and fail messages in rdbchannel tests.
2025-02-04 21:40:18 +03:00
Mason
f5e046a730
Update history for ban-list propagation (#13749)
Update CLUSTER FORGET docs for changes in
https://github.com/redis/redis/pull/10869

Docs PR:
https://github.com/redis/docs/pull/1057

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2025-01-27 21:05:37 +08:00
Yuan Wang
64a40b20d9
Async IO Threads (#13695)
## Introduction
Redis introduced IO Thread in 6.0, allowing IO threads to handle client
request reading, command parsing and reply writing, thereby improving
performance. The current IO thread implementation has a few drawbacks.
- The main thread is blocked during IO thread read/write operations and
must wait for all IO threads to complete their current tasks before it
can continue execution. In other words, the entire process is
synchronous. This prevents the efficient utilization of multi-core CPUs
for parallel processing.

- When the number of clients and requests increases moderately, it
causes all IO threads to reach full CPU utilization due to the busy wait
mechanism used by the IO threads. This makes it challenging for us to
determine which part of Redis has reached its bottleneck.

- When IO threads are enabled with TLS and io-threads-do-reads, a
disconnection of a connection with pending data may result in it being
assigned to multiple IO threads simultaneously. This can cause race
conditions and trigger assertion failures. Related issue:
redis#12540

Therefore, we designed an asynchronous IO threads solution. The IO
threads adopt an event-driven model, with the main thread dedicated to
command processing, meanwhile, the IO threads handle client read and
write operations in parallel.

## Implementation
### Overall
As before, we did not change the fact that all client commands must be
executed on the main thread, because Redis was originally designed to be
single-threaded, and processing commands in a multi-threaded manner
would inevitably introduce numerous race and synchronization issues. But
now each IO thread has independent event loop, therefore, IO threads can
use a multiplexing approach to handle client read and write operations,
eliminating the CPU overhead caused by busy-waiting.

the execution process can be briefly described as follows:
the main thread assigns clients to IO threads after accepting
connections, IO threads will notify the main thread when clients
finish reading and parsing queries, then the main thread processes
queries from IO threads and generates replies, IO threads handle
writing reply to clients after receiving clients list from main thread,
and then continue to handle client read and write events.

### Each IO thread has independent event loop
We now assign each IO thread its own event loop. This approach
eliminates the need for the main thread to perform the costly
`epoll_wait` operation for handling connections (except for specific
ones). Instead, the main thread processes requests from the IO threads
and hands them back once completed, fully offloading read and write
events to the IO threads.

Additionally, all TLS operations, including handling pending data, have
been moved entirely to the IO threads. This resolves the issue where
io-threads-do-reads could not be used with TLS.

### Event-notified client queue
To facilitate communication between the IO threads and the main thread,
we designed an event-notified client queue. Each IO thread and the main
thread have two such queues to store clients waiting to be processed.
These queues are also integrated with the event loop to enable handling.
We use pthread_mutex to ensure the safety of queue operations, as well
as data visibility and ordering, and race conditions are minimized, as
each IO thread and the main thread operate on independent queues,
avoiding thread suspension due to lock contention. And we implemented an
event notifier based on `eventfd` or `pipe` to support event-driven
handling.

### Thread safety
Since the main thread and IO threads can execute in parallel, we must
handle data race issues carefully.

**client->flags**
The primary tasks of IO threads are reading and writing, i.e.
`readQueryFromClient` and `writeToClient`. However, IO threads and the
main thread may concurrently modify or access `client->flags`, leading
to potential race conditions. To address this, we introduced an io-flags
variable to record operations performed by IO threads, thereby avoiding
race conditions on `client->flags`.

**Pause IO thread**
In the main thread, we may want to operate data of IO threads, maybe
uninstall event handler, access or operate query/output buffer or resize
event loop, we need a clean and safe context to do that. We pause IO
thread in `IOThreadBeforeSleep`, do some jobs and then resume it. To
avoid thread suspended, we use busy waiting to confirm the target
status. Besides we use atomic variable to make sure memory visibility
and ordering. We introduce these functions to pause/resume IO Threads as
below.
```
pauseIOThread, resumeIOThread
pauseAllIOThreads, resumeAllIOThreads
pauseIOThreadsRange, resumeIOThreadsRange
```
Testing has shown that `pauseIOThread` is highly efficient, allowing the
main thread to execute nearly 200,000 operations per second during
stress tests. Similarly, `pauseAllIOThreads` with 8 IO threads can
handle up to nearly 56,000 operations per second. But operations
performed between pausing and resuming IO threads must be quick;
otherwise, they could cause the IO threads to reach full CPU
utilization.

**freeClient and freeClientAsync**
The main thread may need to terminate a client currently running on an
IO thread, for example, due to ACL rule changes, reaching the output
buffer limit, or evicting a client. In such cases, we need to pause the
IO thread to safely operate on the client.

**maxclients and maxmemory-clients updating**
When adjusting `maxclients`, we need to resize the event loop for all IO
threads. Similarly, when modifying `maxmemory-clients`, we need to
traverse all clients to calculate their memory usage. To ensure safe
operations, we pause all IO threads during these adjustments.

**Client info reading**
The main thread may need to read a client’s fields to generate a
descriptive string, such as for the `CLIENT LIST` command or logging
purposes. In such cases, we need to pause the IO thread handling that
client. If information for all clients needs to be displayed, all IO
threads must be paused.

**Tracking redirect**
Redis supports the tracking feature and can even send invalidation
messages to a connection with a specified ID. But the target client may
be running on IO thread, directly manipulating the client’s output
buffer is not thread-safe, and the IO thread may not be aware that the
client requires a response. In such cases, we pause the IO thread
handling the client, modify the output buffer, and install a write event
handler to ensure proper handling.

**clientsCron**
In the `clientsCron` function, the main thread needs to traverse all
clients to perform operations such as timeout checks, verifying whether
they have reached the soft output buffer limit, resizing the
output/query buffer, or updating memory usage. To safely operate on a
client, the IO thread handling that client must be paused.
If we were to pause the IO thread for each client individually, the
efficiency would be very low. Conversely, pausing all IO threads
simultaneously would be costly, especially when there are many IO
threads, as clientsCron is invoked relatively frequently.
To address this, we adopted a batched approach for pausing IO threads.
At most, 8 IO threads are paused at a time. The operations mentioned
above are only performed on clients running in the paused IO threads,
significantly reducing overhead while maintaining safety.

### Observability
In the current design, the main thread always assigns clients to the IO
thread with the least clients. To clearly observe the number of clients
handled by each IO thread, we added the new section in INFO output. The
`INFO THREADS` section can show the client count for each IO thread.
```
# Threads
io_thread_0:clients=0
io_thread_1:clients=2
io_thread_2:clients=2
```

Additionally, in the `CLIENT LIST` output, we also added a field to
indicate the thread to which each client is assigned.

`id=244 addr=127.0.0.1:41870 laddr=127.0.0.1:6379 ... resp=2 lib-name=
lib-ver= io-thread=1`

## Trade-off
### Special Clients
For certain special types of clients, keeping them running on IO threads
would result in severe race issues that are difficult to resolve.
Therefore, we chose not to offload these clients to the IO threads.

For replica, monitor, subscribe, and tracking clients, main thread may
directly write them a reply when conditions are met. Race issues are
difficult to resolve, so we have them processed in the main thread. This
includes the Lua debug clients as well, since we may operate connection
directly.

For blocking client, after the IO thread reads and parses a command and
hands it over to the main thread, if the client is identified as a
blocking type, it will be remained in the main thread. Once the blocking
operation completes and the reply is generated, the client is
transferred back to the IO thread to send the reply and wait for event
triggers.

### Clients Eviction
To support client eviction, it is necessary to update each client’s
memory usage promptly during operations such as read, write, or command
execution. However, when a client operates on an IO thread, it is not
feasible to update the memory usage immediately due to the risk of data
races. As a result, memory usage can only be updated either in the main
thread while processing commands or in the `ClientsCron` periodically.
The downside of this approach is that updates might experience a delay
of up to one second, which could impact the precision of memory
management for eviction.

To avoid incorrectly evicting clients. We adopted a best-effort
compensation solution, when we decide to eviction a client, we update
its memory usage again before evicting, if the memory used by the client
does not decrease or memory usage bucket is not changed, then we will
evict it, otherwise, not evict it.

However, we have not completely solved this problem. Due to the delay in
memory usage updates, it may lead us to make incorrect decisions about
the need to evict clients.

### Defragment
In the majority of cases we do NOT use the data from argv directly in
the db.
1. key names
We store a copy that we allocate in the main thread, see `sdsdup()` in
`dbAdd()`.
2. hash key and value
We store key as hfield and store value as sds, see `hfieldNew()` and
`sdsdup()` in `hashTypeSet()`.
3. other datatypes
   They don't even use SDS, so there is no reference issues.

But in some cases client the data from argv may be retain by the main
thread.
As a result, during fragmentation cleanup, we need to move allocations
from the IO thread’s arena to the main thread’s arena. We always
allocate new memory in the main thread’s arena, but the memory released
by IO threads may not yet have been reclaimed. This ultimately causes
the fragmentation rate to be higher compared to creating and allocating
entirely within a single thread.
The following cases below will lead to memory allocated by the IO thread
being kept by the main thread.
1. string related command: `append`, `getset`, `mset` and `set`.
If `tryObjectEncoding()` does not change argv, we will keep it directly
in the main thread, see the code in `tryObjectEncoding()`(specifically
`trimStringObjectIfNeeded()`)
2. block related command.
    the key names will be kept in `c->db->blocking_keys`.
3. watch command
    the key names will be kept in `c->db->watched_keys`.
4. [s]subscribe command
    channel name will be kept in `serverPubSubChannels`.
5. script load command
    script will be kept in `server.lua_scripts`.
7. some module API: `RM_RetainString`, `RM_HoldString`

Those issues will be handled in other PRs.

## Testing
### Functional Testing
The commit with enabling IO Threads has passed all TCL tests, but we did
some changes:
**Client query buffer**: In the original code, when using a reusable
query buffer, ownership of the query buffer would be released after the
command was processed. However, with IO threads enabled, the client
transitions from an IO thread to the main thread for processing. This
causes the ownership release to occur earlier than the command
execution. As a result, when IO threads are enabled, the client's
information will never indicate that a shared query buffer is in use.
Therefore, we skip the corresponding query buffer tests in this case.
**Defragment**: Add a new defragmentation test to verify the effect of
io threads on defragmentation.
**Command delay**: For deferred clients in TCL tests, due to clients
being assigned to different threads for execution, delays may occur. To
address this, we introduced conditional waiting: the process proceeds to
the next step only when the `client list` contains the corresponding
commands.

### Sanitizer Testing
The commit passed all TCL tests and reported no errors when compiled
with the `fsanitizer=thread` and `fsanitizer=address` options enabled.
But we made the following modifications: we suppressed the sanitizer
warnings for clients with watched keys when updating `client->flags`, we
think IO threads read `client->flags`, but never modify it or read the
`CLIENT_DIRTY_CAS` bit, main thread just only modifies this bit, so
there is no actual data race.

## Others
### IO thread number
In the new multi-threaded design, the main thread is primarily focused
on command processing to improve performance. Typically, the main thread
does not handle regular client I/O operations but is responsible for
clients such as replication and tracking clients. To avoid breaking
changes, we still consider the main thread as the first IO thread.

When the io-threads configuration is set to a low value (e.g., 2),
performance does not show a significant improvement compared to a
single-threaded setup for simple commands (such as SET or GET), as the
main thread does not consume much CPU for these simple operations. This
results in underutilized multi-core capacity. However, for more complex
commands, having a low number of IO threads may still be beneficial.
Therefore, it’s important to adjust the `io-threads` based on your own
performance tests.

Additionally, you can clearly monitor the CPU utilization of the main
thread and IO threads using `top -H -p $redis_pid`. This allows you to
easily identify where the bottleneck is. If the IO thread is the
bottleneck, increasing the `io-threads` will improve performance. If the
main thread is the bottleneck, the overall performance can only be
scaled by increasing the number of shards or replicas.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
Co-authored-by: oranagra <oran@redislabs.com>
2024-12-23 14:16:40 +08:00
Oran Agra
79fd255828
Add Lua VM memory to memory overhead, now that it's part of zmalloc (#13660)
To complement the work done in #13133.
it added the script VMs memory to be counted as part of zmalloc, but
that means they
should be also counted as part of the non-value overhead.

this commit contains some refactoring to make variable names and
function names less confusing.
it also adds a new field named `script.VMs` into the `MEMORY STATS`
command.

additionally, clear scripts and stats between tests in external mode
(which is related to how this issue was discovered)
2024-11-21 08:22:17 +02:00
YaacovHazan
6c5e263d7b
Temporarily hide the new SFLUSH command by marking it as experimental (#13600)
- Add a new 'EXPERIMENTAL' command flag, which causes the command
generator to skip over it and make the command to be unavailable for
execution
- Skip experimental tests by default
- Move the SFLUSH tests from the old framework to the new one

---------

Co-authored-by: YaacovHazan <yaacov.hazan@redislabs.com>
2024-10-15 11:02:51 +03:00
Moti Cohen
d092d64d7a
Add new SFLUSH command to cluster for slot-based FLUSH (#13564)
This PR introduces a new `SFLUSH` command to cluster mode that allows
partial flushing of nodes based on specified slot ranges. Current
implementation is designed to flush all slots of a shard, but future
extensions could allow for more granular flushing.

**Command Usage:**
`SFLUSH <start-slot> <end-slot> [<start-slot> <end-slot>]* [SYNC|ASYNC]`

This command removes all data from the specified slots, either
synchronously or asynchronously depending on the optional SYNC/ASYNC
argument.

**Functionality:**
Current imp of `SFLUSH` command verifies that the provided slot ranges
are valid and cover all of the node's slots before proceeding. If slots
are partially or incorrectly specified, the command will fail and return
an error, ensuring that all slots of a node must be fully covered for
the flush to proceed.

The function supports both synchronous (default) and asynchronous
flushing. In addition, if possible, SFLUSH SYNC will be run as blocking
ASYNC as an optimization.
2024-09-29 09:13:21 +03:00
Filipe Oliveira (Redis)
00a8e72cfc
Created specific SMEMBERS command logic which avoids sinterGenericCommand, and minimizes processing and memory overhead (#13499)
This PR introduces a dedicated implementation for the SMEMBERS command
that avoids using the more generalized sinterGenericCommand function.
By tailoring the logic specifically for SMEMBERS, we reduce unnecessary
processing and memory overheads that were previously incurred by
handling more complex cases like set intersections.

---------

Co-authored-by: debing.sun <debing.sun@redis.com>
2024-09-03 18:32:43 +08:00
Moti Cohen
4dd8b1faa9
Fix HTTL/HPTTL to be NONDETERMINISTIC_OUTPUT (#13461)
H[P]TTL should be marked as NONDETERMINISTIC_OUTPUT just like [P]TTL.
2024-08-04 17:42:50 +03:00
Jo
871c985919
Update FIELDS argument to block type for HFE commands schema (#13339)
I reviewed `XREAD` command syntax:
```
XREAD [COUNT count] [BLOCK milliseconds] STREAMS key [key ...] id [id ...]
```

Here’s the structure for `XREAD`:
```json
"arguments": [
            {
                "token": "COUNT",
                "name": "count",
                "type": "integer",
                "optional": true
            },
            {
                "token": "BLOCK",
                "name": "milliseconds",
                "type": "integer",
                "optional": true
            },
            {
                "name": "streams",
                "token": "STREAMS",
                "type": "block",
                "arguments": [
                    {
                        "name": "key",
                        "type": "key",
                        "key_spec_index": 0,
                        "multiple": true
                    },
                    {
                        "name": "ID",
                        "type": "string",
                        "multiple": true
                    }
                ]
            }
]
```

Now, consider the `HEXPIRE` syntax:
```
HEXPIRE key seconds [NX | XX | GT | LT] FIELDS numfields field [field ...]
```

Since the `FIELDS` token functions similarly to `STREAMS`, and given that `STREAMS` is defined as a block, I believe the `FIELDS` in `hepxire` should also be defined as a block.
2024-06-14 13:51:49 +08:00
debing.sun
7b9e960690
Hash Field Expiration (#13303)
## Background

This PR introduces support for field-level expiration in Redis hashes. Previously, Redis supported expiration only at the key level, but this enhancement allows setting expiration times for individual fields within a hash.

## New commands
* HEXPIRE
* HEXPIREAT
* HEXPIRETIME
* HPERSIST
* HPEXPIRE
* HPEXPIREAT
* HPEXPIRETIME
* HPTTL
* HTTL

## Short example
from @moticless
```sh
127.0.0.1:6379>  hset myhash f1 v1 f2 v2 f3 v3                                                   
(integer) 3
127.0.0.1:6379>  hpexpire myhash 10000 NX fields 2 f2 f3                                         
1) (integer) 1
2) (integer) 1
127.0.0.1:6379>  hpttl myhash fields 3 f1 f2 f3                                                                                                                                                                         
1) (integer) -1
2) (integer) 9997
3) (integer) 9997
127.0.0.1:6379>  hgetall myhash  
1) "f3"
2) "v3"
3) "f2"
4) "v2"
5) "f1"
6) "v1"

... after 10 seconds ...

127.0.0.1:6379>  hgetall myhash  
1) "f1"
2) "v1"
127.0.0.1:6379>
```

## Expiration strategy
1. Integrate active
    Redis periodically performs active expiration and deletion of hash keys that contain expired fields, with a maximum attempt limit.
3. Lazy expiration
    When a client touches fields within a hash, Redis checks if the fields are expired. If a field is expired, it will be deleted. However, we do not delete expired fields during a traversal, we implicitly skip over them.

## RDB changes
Add two new rdb type s`RDB_TYPE_HASH_METADATA` and `RDB_TYPE_HASH_LISTPACK_EX`.

## Notification
1. Add `hpersist` notification for `HPERSIST` command.
5. Add `hexpire` notification for `HEXPIRE`, `HEXPIREAT`, `HPEXPIRE` and `HPEXPIREAT` commands.

## Internal
1. Add new data structure `ebuckets`, which is used to store TTL and keys, enabling quick retrieval of keys based on TTL.
2. Add new data structure `mstr` like sds, which is used to store a string with TTL.

This work was done by @moticless, @tezc, @ronen-kalish, @sundb, I just release it.
2024-05-30 15:26:19 +08:00
Ozan Tezcan
f0389f2823
Fix position of numfields in H(P)EXPIRE json files (#13301)
Fix position of numfields in H(P)EXPIRE json files
2024-05-29 16:35:47 +03:00
Ozan Tezcan
e2918705c8
Fix hfe reply schemas (#13295)
In https://github.com/redis/redis/pull/13291, we've changed that hfe
commands to return empty array if the key does not exist.
Forgot to update json schemas.
2024-05-27 16:07:01 +03:00
debing.sun
2d1bb42cba
Update version references from 8.0 to 7.4 for upcoming release (#13294) 2024-05-27 16:47:23 +08:00
Ozan Tezcan
2f34f6f0b9
Delete hsetf and hgetf (#13291)
Changes:
- Delete hsetf and hgetf commands
- Hfe commands will return empty array instead of nil.
---------

Co-authored-by: Moti Cohen <moticless@gmail.com>
2024-05-26 13:30:45 +03:00
Moti Cohen
71676513dd
Fix commands H*EXPIRE* and H*TTL to include FIELDS constant (#13270)
The same goes to: HPEXPIRE, HEXPIREAT, HPEXPIREAT, HEXPIRETIME,
HPEXPIRETIME, HPTTL, HTTL, HPERSIST
2024-05-16 19:35:58 +03:00
Ozan Tezcan
5066e6e9cd
Fix hgetf/hsetf reply type by returning string (#13263)
If encoding is listpack, hgetf and hsetf commands reply field value type
as integer.
This PR fixes it by returning string.

Problematic cases:
```
127.0.0.1:6379> hset hash one 1
(integer) 1
127.0.0.1:6379> hgetf hash fields 1 one
1) (integer) 1
127.0.0.1:6379> hsetf hash GETOLD fvs 1 one 2
1) (integer) 1
127.0.0.1:6379> hsetf hash DOF GETNEW fvs 1 one 2
1) (integer) 2
```

Additional fixes:
- hgetf/hsetf command description text

Fixes #13261, #13262
2024-05-13 11:09:49 +03:00
debing.sun
7010f41c96
Add notification support for HFE (#13237)
1. Add `hpersist` notification for `hpersist` command.
2. Add `pexpire` notification for `hexpire`, `hexpireat` and `hpexpire`.
2024-05-09 22:23:00 +08:00
Ozan Tezcan
ca4ed48db6
Add listpack support, hgetf and hsetf commands (#13209)
**Changes:**
- Adds listpack support to hash field expiration 
- Implements hgetf/hsetf commands

**Listpack support for hash field expiration**

We keep field name and value pairs in listpack for the hash type. With
this PR, if one of hash field expiration command is called on the key
for the first time, it converts listpack layout to triplets to hold
field name, value and ttl per field. If a field does not have a TTL, we
store zero as the ttl value. Zero is encoded as two bytes in the
listpack. So, once we convert listpack to hold triplets, for the fields
that don't have a TTL, it will be consuming those extra 2 bytes per
item. Fields are ordered by ttl in the listpack to find the field with
minimum expiry time efficiently.

**New command implementations as part of this PR:** 

- HGETF command

For each specified field get its value and optionally set the field's
expiration time in sec/msec /unix-sec/unix-msec:
  ```
  HGETF key 
    [NX | XX | GT | LT]
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | PERSIST]
    <FIELDS count field [field ...]>
  ```

- HSETF command

For each specified field value pair: set field to value and optionally
set the field's expiration time in sec/msec /unix-sec/unix-msec:
  ```
  HSETF key 
    [DC] 
    [DCF | DOF] 
    [NX | XX | GT | LT] 
    [GETNEW | GETOLD] 
[EX seconds | PX milliseconds | EXAT unix-time-seconds | PXAT
unix-time-milliseconds | KEEPTTL]
    <FVS count field value [field value …]>
  ```

Todo:
- Performance improvement.
- rdb load/save
- aof
- defrag
2024-05-08 23:11:32 +03:00
debing.sun
03cd525ffa
Fix reply schema for hfe related commands (#13238) 2024-05-03 11:11:41 +08:00
Moti Cohen
c18ff05665
Hash Field Expiration - Basic support
- Add ebuckets & mstr data structures
- Integrate active & lazy expiration
- Add most of the commands 
- Add support for dict (listpack is missing)
TODOs:  RDB, notification, listpack, HSET, HGETF, defrag, aof
2024-04-18 16:06:30 +03:00
Chen Tianjie
4cae99e785
Add overhead of all DBs and rehashing dict count to info. (#12913)
Sometimes we need to make fast judgement about why Redis is suddenly
taking more memory. One of the reasons is main DB's dicts doing
rehashing.

We may use `MEMORY STATS` to monitor the overhead memory of each DB, but
there still lacks a total sum to show an overall trend. So this PR adds
the total overhead of all DBs to `INFO MEMORY` section, together with
the total count of rehashing DB dicts, providing some intuitive metrics
about main dicts rehashing.

This PR adds the following metrics to INFO MEMORY
* `mem_overhead_db_hashtable_rehashing` - only size of ht[0] in
dictionaries we're rehashing (i.e. the memory that's gonna get released
soon)

and a similar ones to MEMORY STATS:
* `overhead.db.hashtable.lut` (complements the existing
`overhead.hashtable.main` and `overhead.hashtable.expires` which also
counts the `dictEntry` structs too)
* `overhead.db.hashtable.rehashing` - temporary rehashing overhead.
* `db.dict.rehashing.count` - number of top level dictionaries being
rehashed.

---------

Co-authored-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2024-03-01 13:41:24 +08:00
Binbin
f17381a38d
Fix propagation of entries_read by calling streamPropagateGroupID unconditionally (#12898)
In XREADGROUP ACK, because streamPropagateXCLAIM does not propagate
entries-read, entries-read will be inconsistent between master and
replicas.
I.e. if no entries were claimed, it would have propagated correctly, but
if some
were claimed, then the entries-read field would be inconsistent on the
replica.

The fix was suggested by guybe7, call streamPropagateGroupID
unconditionally,
so that we will normalize entries_read on the replicas. In the past, we
would
only set propagate_last_id when NOACK was specified. And in #9127,
XCLAIM did
not propagate entries_read in ACK, which would cause entries_read to be
inconsistent between master and replicas.

Another approach is add another arg to XCLAIM and let it propagate
entries_read,
but we decided not to use it. Because we want minimal damage in case
there's an
old target and new source (in the worst case scenario, the new source
doesn't
recognize XGROUP SETID ... ENTRIES READ and the lag is lost. If we
change XCLAIM,
the damage is much more severe).

In this patch, now if the user uses XREADGROUP .. COUNT 1 there will be
an additional
overhead of MULTI, EXEC and XGROUPSETID. We assume the extra command in
case of
COUNT 1 (4x factor, changing from one XCLAIM to
MULTI+XCLAIM+XSETID+EXEC), is probably
ok since reading just one entry is in any case very inefficient (a
client round trip
per record), so we're hoping it's not a common case.

Issue was introduced in #9127.
2024-02-29 09:48:20 +02:00
guybe7
820a4e45f1
Edit the history field of xinfo-consumers (#13078)
Now it matches the information in xinfo-stream.json
2024-02-22 09:44:29 +02:00
Binbin
5b9fc46523
Add new allocator.muzzy field to memory-stats reply schema (#13076)
This field was added in #12996 but forgot to add it in json file.
This also causes reply-schemas-validator to fail.
2024-02-21 08:35:10 +02:00
Binbin
ca5cac998e
xinfo-stream add minimum to seen-time, skip logreqres in fuzzer (#13056)
Recently I saw in CI that reply-schemas-validator fails here:
```
Failed validating 'minimum' in schema[1]['properties']['groups']['items']['properties']['consumers']['items']['properties']['active-time']:
    {'description': 'Last time this consumer was active (successful '
                    'reading/claiming).',
     'minimum': 0,
     'type': 'integer'}

On instance['groups'][0]['consumers'][0]['active-time']:
    -1729380548878722639
```

The reason is that in fuzzer, we may restore corrupted active-time,
which will cause the reply schema CI to fail.

The fuzzer can cause corrupt the state in many places, which will
bugs that mess up the reply, so we decided to skip logreqres.

Also, seen-time is the same type as active-time, adding the minimum.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
2024-02-20 12:21:10 +02:00
guybe7
6df42df291
Adds a README to the command JSON files (#13066)
Add readme about the command json folder, what it does, and who should
(not) use it.
see discussion
https://github.com/redis/redis/issues/9359#issuecomment-1936420698

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
2024-02-19 18:49:31 +02:00
Daz
02a87885e6
Add missing structural API changes to JSON file (#12434)
The JSON file lacks the following structural API changes:

- GEORADIUSBYMEMBER: add the ANY option for COUNT since 6.2.0.
- GEORADIUSBYMEMBER_RO: add the ANY option for COUNT since 6.2.0.
- GEORADIUS_RO: Added support for uppercase unit names since 7.0.0.
- GEORADIUSBYMEMBER_RO: Added support for uppercase unit names since
7.0.0.

---------

Signed-off-by: daz-3ux <daz-3ux@proton.me>
Co-authored-by: bodong.ybd <bodong.ybd@alibaba-inc.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Co-authored-by: yangpengda.333 <yangpengda.333@bytedance.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2024-02-04 08:42:15 +02:00
Chen Tianjie
f469dd8ca6
Add novalues option to command HSCAN. (#12765)
Add a way to HSCAN a hash key, and get only the filed names.
Command syntax is now:
```
HSCAN key cursor [MATCH pattern] [COUNT count] [NOVALUES]
```
when `NOVALUES` is on, the command will only return keys in the hash.

---------

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
2024-01-30 20:32:58 +02:00
Slava Koyfman
24f6d08b3f
Implement CLIENT KILL MAXAGE <maxage> (#12299)
Adds an ability to kill clients older than a specified age.

Also, fixed the age calculation in `catClientInfoString` to use
`commandTimeSnapshot`
instead of the old `server.unixtime`, and added missing documentation
for
`CLIENT KILL ID` to output of `CLIENT help`.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
2024-01-30 20:24:36 +02:00
Binbin
85c31e0cff
Allow running WAITAOF in scripts, remove NOSCRIPT flag (#12977)
In #11568 we removed the NOSCRIPT flag from commands, e.g. removing
NOSCRIPT flag from WAIT. Aiming to allow them in scripts and let them
implicitly behave in the non-blocking way.

This PR remove NOSCRIPT flag from WAITAOF just like WAIT (to be
symmetrical)).
And this PR also add BLOCKING flag for WAIT and WAITAOF.
2024-01-23 15:19:41 +02:00
debing.sun
4730563e93
Change destination key's key-spec flag from RW to OW for SINTERSTORE command (#12917)
In #10122, we set the destination key's flag of SINTERSTORE to `RW`, 
however, this command doesn't actually read or modify the destination
key, just overwrites it.
Therefore, we change it to `OW` similarly to all other *STORE commands.
2024-01-08 10:17:13 +02:00
Binbin
7410d985bc
Remove overhead.hashtable.slot-to-keys from memory-stats reply_schema (#12784)
overhead.hashtable.slot-to-keys was added in 7.0 in #10017, then removed
in #11695. Now remove it from reply_schema.
2023-12-10 09:46:21 +02:00
zhaozhao.zz
77a65e82b2
support XREAD[GROUP] with BLOCK option in scripts (#12596)
In #11568 we removed the NOSCRIPT flag from commands and keep the BLOCKING flag.
Aiming to allow them in scripts and let them implicitly behave in the non-blocking way.

In that sense, the old behavior was to allow LPOP and reject BLPOP, and the new behavior,
is to allow BLPOP too, and fail it only in case it ends up blocking.
So likewise, so far we allowed XREAD and rejected XREAD BLOCK, and we will now allow
that too, and only reject it if it ends up blocking.
2023-10-12 10:54:50 +03:00
Binbin
8d92f7f2b7
Support NO ONE block in REPLICAOF command json (#12633)
The current commands.json doesn't mention the special NO ONE arguments.
This change is also applied to SLAVEOF
2023-10-10 11:10:40 +03:00
Binbin
4031a18732
Fix that slot return in CLUSTER SHARDS should be integer (#12561)
An unintentional change was introduced in #10536, we used
to use addReplyLongLong and now it is addReplyBulkLonglong, 
revert it back the previous behavior.
2023-09-09 23:33:00 -07:00
nihohit
90e9fc387c
Update command tips on more admin / configuration commands (#12545)
Updated the command tips for ACL SAVE / SETUSER / DELUSER, CLIENT SETNAME / SETINFO, and LATENCY RESET.
The tips now match CONFIG SET, since there's a similar behavior for all of these commands - the
user expects to update the various configurations & states on all nodes, not only on a single, random node.
For LATENCY RESET the response tip is now agg_sum.

Co-authored-by: Shachar Langbeheim <shachlan@amazon.com>
2023-09-04 21:30:42 +03:00
Binbin
9ce8c54d74
Update sort_ro reply_schema to mention the null reply (#12534)
Also added a test to cover this case, so this can
cover the reply schemas check.
2023-08-31 06:36:35 +03:00
nihohit
4b281ce519
Align CONFIG RESETSTAT/REWRITE tips with SET. (#12530)
Since the three commands have similar behavior (change config, return
OK), the tips that govern how they should behave should be similar.

Co-authored-by: Shachar Langbeheim <shachlan@amazon.com>
2023-08-30 21:49:02 +03:00
Binbin
f4549d1cf4
Fix CLUSTER REPLICAS time complexity, should be O(N) (#12477)
We iterate over all replicas to get the result, the time complexity
should be O(N), like CLUSTER NODES complexity is O(N).
2023-08-14 20:57:55 -07:00
Binbin
7af9f4b36e
Fix GEOHASH / GEODIST / GEOPOS time complexity, should be O(1) (#12445)
GEOHASH / GEODIST / GEOPOS use zsetScore to get the score, in skiplist encoding,
we use dictFind to get the score, which is O(1), same as ZSCORE command.
It is not clear why these commands had O(Log(N)), and O(N) until now.
2023-08-05 07:29:24 +03:00
nihohit
9f512017aa
Update request/response policies. (#12417)
changing the response and request policy of a few commands,
see https://redis.io/docs/reference/command-tips

1. RANDOMKEY used to have no response policy, which means
  that when sent to multiple shards, the responses should be aggregated.
  this normally applies to commands that return arrays, but since RANDOMKEY
  replies with a simple string, it actually requires a SPECIAL response policy
  (for the client to select just one)
2. SCAN used to have no response policy, but although the key names part of
  the response can be aggregated, the cursor part certainly can't.
3. MSETNX had a request policy of MULTI_SHARD and response policy of AGG_MIN,
  but in fact the contract with MSETNX is that when one key exists, it returns 0
  and doesn't set any key, routing it to multiple shards would mean that if one failed
  and another succeeded, it's atomicity is broken and it's impossible to return a valid
  response to the caller.

Co-authored-by: Shachar Langbeheim <shachlan@amazon.com>
Co-authored-by: Oran Agra <oran@redislabs.com>
2023-07-25 10:21:23 +03:00
Binbin
d306d86146
Fix ZRANK/ZREVRANK reply_schema description (#12331)
The parameter name is WITHSCORE instead of WITHSCORES.
2023-06-20 11:15:40 +03:00
Binbin
b510624978
Optimize PSUBSCRIBE and PUNSUBSCRIBE from O(N*M) to O(N) (#12298)
In the original implementation, the time complexity of the commands
is actually O(N*M), where N is the number of patterns the client is
already subscribed and M is the number of patterns to subscribe to.
The docs are all wrong about this.

Specifically, because the original client->pubsub_patterns is a list,
so we need to do listSearchKey which is O(N). In this PR, we change it
to a dict, so the search becomes O(1).

At the same time, both pubsub_channels and pubsubshard_channels are dicts.
Changing pubsub_patterns to a dictionary improves the readability and
maintainability of the code.
2023-06-19 16:31:18 +03:00
Harkrishn Patro
a9e32767f7
Allow cluster slots/shards api to respond during loading (#12269)
It would be helpful for clients to get cluster slots/shards information during a node failover and is loading data.
2023-06-13 18:16:32 +03:00