BN_num_bits() returns 0 when passed NULL and a negative value on
internal error. The OpenSSL wrappers stored the result in a size_t,
so a 0 return falsely satisfied the bit-length check and a negative
return wrapped to a huge value. Capture the int return, reject
non-positive values, then compare against the limit.
SipHash-2-4 was designed as a conservative PRF/MAC with extra rounds
against future attacks. For hash tables, where outputs are never
exposed, SipHash-1-3 provides sufficient collision resistance with
fewer rounds. As the SipHash author noted: "I would be very surprised
if SipHash-1-3 introduced weaknesses for hash tables."
DNS cookies continue to use SipHash-2-4 since cookie values are sent
on the wire and must resist online attacks.
Until now, the dispatcher silently dropped UDP responses from the
expected peer that carried the wrong DNS message id and kept listening
for the correct id to arrive within the read timeout. An off-path
attacker who knows the destination address and source port of an
outgoing fetch could exploit that quiet retry window to flood the
resolver with guessed responses; with a gigabit link the per-query
success probability grows linearly with the number of guesses that
arrive before the legitimate answer or the timeout.
Treat any such mismatch as a possible spoofing attempt and let the
resolver immediately retry the same query over TCP, the same control
path the truncation handler already uses.
Add a resolver statistics counter - exposed as 'queries retried over TCP
after a response with mismatched query id' in rndc stats and
'MismatchTCP' in the statistics channel
Assisted-by: Claude:claude-opus-4-7
After the send callback completes, the UV request is freed but
the HTTP/2 socket's write buffer still points to the freed memory.
If nghttp2 subsequently needs to send frames (e.g. SETTINGS ACK),
the server_read_callback reads from the dangling buffer.
Clear the write buffer before freeing the UV request.
Replace the hysteretic hi_water/lo_water switch with a stochastic
check: always false below lo_water, always true at or above hi_water,
linearly ramped probability in between. This spreads cache cleaning
across many inserts instead of triggering a thundering herd once the
hi_water mark is crossed (which causes every addrdataset to enter the
LRU purge path simultaneously and serializes lookups behind the node
write locks).
The is_overmem atomic and its stores are no longer needed and are
removed. The existing tests that asserted specific hysteretic state
transitions are simplified to check only the deterministic boundaries.
isc__ratelimiter_tick() and isc_ratelimiter_shutdown() each pulled
events out of rl->pending into a function-local list, dropped the
mutex, and then iterated. ISC_LIST_APPEND leaves the link in the
LINKED state, so a concurrent isc_ratelimiter_dequeue() saw an
event as still queued, called ISC_LIST_UNLINK against rl->pending —
which patched the prev/next of the local list — and freed the
event before dispatch finished, producing either an INSIST in the
unlink macro or a use-after-free in the dispatch loop.
isc_async_run() is a non-blocking wfcq enqueue, so there is no
benefit to dropping the mutex around it. Unlink each event and
hand it to isc_async_run() while still holding rl->lock; the
existing ISC_LINK_LINKED check in dequeue then correctly
distinguishes "still queued and cancellable" from "already taken".
Assisted-by: Claude:claude-opus-4-7
The function existence-checked the target with stat() and then opened
the same path without O_NOFOLLOW, so a symlink at the target path
passed the regular-file test against the link's destination and the
open() that followed truncated and wrote through the link.
rndc-confgen -a is typically run as root and writes the keyfile under
a directory that service accounts may have write access to, so a stray
symlink there would silently redirect the truncate, fchown, and
overwrite to whatever file the link pointed at.
Switch the existence check to lstat() and use S_ISREG() so a symlink's
S_IFLNK mode is detected directly (a plain bitmask of S_IFREG matches
both, since S_IFLNK shares its high bit). Add O_NOFOLLOW to both
open() flag sets to close the lstat/open TOCTOU window. Hardening
against unexpected symlinks on intermediate path components is out of
scope.
Assisted-by: Claude:claude-opus-4-7
OPENSSL_cleanup() in OpenSSL 4 doesn't free the memory, and that is
not compatible with BIND 9's memory leak detection code. Don't use
custom allocation/deallocation functions for OpenSSL's internal memory
management in the ossl3.c module.
See https://github.com/openssl/openssl/pull/29721
Enable jemalloc background threads and reduce dirty page decay time from
10s to 1s so that unused memory is returned to the OS sooner. As an
additional safety net, trigger a decay pass after every 16 MiB of frees
(rate-limited to once per second) to handle bursts that the background
thread might not catch in time. On glibc, fall back to malloc_trim(0)
with the same volume-based trigger.
The <sys/endian.h> header has existed in macOS since around ~26. This
causes the `htobeNN`/`htoleNN` macros to be redefined in <isc/endian.h>
in terms of <libkern/OSByteOrder.h> when other system headers include
<sys/endian.h>.
Fix this issue by using checking for the existence of <sys/endian.h> in
meson and including it according to the probe result.
Add an `isc_netaddrlink_t` type wrapping an `isc_netaddr_t` and an
`ISC_LINK`. This enable to build list of `isc_netaddr_t` without
increasing the memory footprint of existing usages of `isc_netaddr_t`
(which doesn't require to be linked).
In the current statistics counter implementation, the statistics are
backed by an array of counters, which are updated via atomic operations.
This leads to contention, especially on high core count
machines.
This commit introduces a new isc_statsmulti_t counter that keeps a
separate array per thread. These counters are then aggregated only when
statistics are queried, shifting work off the critical path.
These changes lead to a ~2% improvement in perflab.
A helper macro that returns the current value of a pointer and sets
it to NULL in one expression, useful for transferring ownership in
designated initializers.
The liburcu rcu_cmpxchg_pointer() uses CMM_RELAXED ordering on the CAS
failure path. When a thread loses the CAS and gets another thread's
pointer back, reading fields through that pointer is a data race on
weakly-ordered architectures (ARM, POWER) because the failing load has
no acquire semantics.
Override rcu_cmpxchg_pointer() and rcu_xchg_pointer() to use standard
__atomic builtins with __ATOMIC_ACQ_REL (success) and __ATOMIC_ACQUIRE
(failure) ordering. This fixes the race on all architectures and is
natively visible to ThreadSanitizer.
isc_buffer_init() is given MAX_DNS_MESSAGE_SIZE (65535) as capacity but
only h2->content_length bytes are allocated. This makes the buffer
believe it has more space than actually allocated. A secondary bounds
check (new_bufsize <= h2->content_length) prevents actual overflow, but
the buffer invariant is violated.
Pass h2->content_length as the capacity to match the allocation.
The previous code was incorrectly clearing errno after calling
strtol but before testing the result rather than clearing it and
then calling strtol so that changes to errno can be correctly
determined.
We return DNS_R_NOVALIDSIG if we detected a deadlock. Then in
'validate_async_done()', this result value is used to check if we
need to fall back to insecure. As part of that we create a new fetch
but that fails because of the detected deadlock. This results in a loop
of deadlock detected, fallback to insecure, deadlock detected, ...
Add a new result value, ISC_R_DEADLOCK, and return this instead when
we have detected a deadlock. This will be treated as a generic error,
as there is no special handling for this result value.
Starting from OpenSSL 4 the the X509_get_subject_name() function
returns a 'const' pointer to a name instead of a regular pointer.
Duplicate the name before operating on it, then free it.
The INSIST in isc_radix_insert() checks node->data[RADIX_V4] and
node->node_num[RADIX_V4] twice due to a copy-paste error, never
verifying the RADIX_V6 fields.
Fix the second pair to check RADIX_V6.
Add a REQUIRE(isc_loop() == loop) assertion to isc_work_enqueue()
to strictly enforce that work is enqueued from the loop it is
assigned to. This loudly prohibits cross-thread queue manipulation
before it inevitably turns into a concurrency debugging nightmare.
Use reference counting for isc_histomulti module so that it's
possible to attach/detach to/from the objects when used in the
statistics channel in the coming commits.
NSEC3 hashes are required to fit within a single DNS label. Since there
are 5 bits per label byte without pad characters, the maximum hash size
is floor(63*5/8) (39 bytes).
This patch enforces this maximum length for unknown algorithms, while
strictly enforcing the exact expected digest length for known algorithms
like SHA-1.
For Linux >= 6.8:
Since 2023, Linux has introduced a change to the IP_LOCAL_PORT_RANGE
socket option that eliminates the need for the random window
shifting (implemented as a fallback in the next commit).
By setting IP_LOCAL_PORT_RANGE option, we tell the kernel to use better
approach to the source port selection.
For Linux << 6.8:
This implement selecting port by random shifting range leveraging the
IP_LOCAL_PORT_RANGE socket option. The network manager is initialized
with the ephemeral port range (on startup and on reconfig) and then for
every outgoing TCP connection, we define a custom port range (1000
ports) and then randomly shift the custom range within the system range.
This helps the kernel to reduce the search space to the custom window
between <random_offset, random_offset + 1000>.
Reference:
https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance/#kernel
Since 2015, Linux has introduced a new socket option to overcome TCP
limitations: When an application needs to force a source IP on an active
TCP socket it has to use bind(IP, port=x). As most applications do not
want to deal with already used ports, x is often set to 0, meaning the
kernel is in charge to find an available port. But kernel does not know
yet if this socket is going to be a listener or be connected. This
IP_BIND_ADDRESS_NO_PORT socket option ask the kernel to ignore the 0
port provided by application in bind(IP, port=0) and only remember the
given IP address. The port will be automatically chosen at connect()
time, in a way that allows sharing a source port as long as the 4-tuples
are unique.
Enable IP_BIND_ADDRESS_NO_PORT on the outgoing TCP sockets to overcome
this TCP limitation.
A lingering `sizeof` from the prototype era of !11094 caused the
key-wipe in `isc_hmac_key_destroy` to use `sizeof(key->len)` instead of
`key->len` for the length argument of `isc_safe_memwipe`.
This results in a buffer overflow of zero bytes in HMAC keys that are
less than 4 bytes. As such, the overflow can only be visibile in keys
that are less than 32-bits, which is beyond broken and creating such
keys are only possible in testing.
Therefore, this change is *not* a security fix since the conditions are
never reachable in any imaginable deployment scenario.
Builds that use OpenSSL >=3.0 are unaffected as the `sizeof` was only
remaining in pre-3.0 builds.
This is a bit of a namespace convention violation but it fits the spirit of
this header since it is exposing OpenSSL-isms to others.
Further work is needed to make sure the exposed EVP_MD isn't needed
anymore.
Using `EVP_SIGNATURE` explicit algoritms for signatures have been added
in OpenSSL 3.4 and so is skipped for the initial OpenSSL version
specific code splitting.
Using `EVP_SIGNATURE` explicit algoritms for signatures have been added
in OpenSSL 3.4 and so is skipped for the initial OpenSSL version
specific code splitting.
While being the best place at the time, the tlserr2result doesn't belong
inside TLS code since it is generic to OpenSSL and mostly used in the
dst interface. The newly created ossl_wrap interface is the idea place
for flushing the OpenSSL thread error queue.
Instead of the `EVP_MD_CTX` based functions, use either the new
`EVP_MAC` or the old `HMAC_CTX` based functions.
`EVP_MAC` is the recommended way using using MAC functions in post-3.0
while `HMAC_CTX` is used internally by `EVP_MD_CTX`, making the latter
redundant.
Get rid of the OpenSSL-isms that plague the codebase where the hash type
is `EVP_MD *`
By using a proper enum, alongside the cleanup, we also get the ability
to use constants for known hash sizes instead of having a function call
every time.
`EVP_MD_CTX_get0_md` has been removed instead of being adapted since it
wasn't used anymore.
Dealing with OpenSSL has been rapidly turning into an unwieldy situation
as post-3.0 changes turn the library into a different beast.
Start treating pre and post-3.0 versions differently for easier
maintenance.
This adds the following enum isc_one_or_more and isc_zero_or_more
which specify if one or more or zeror or more bytes are required
when reading the unbounded base64 / hex encoded data.