Commit graph

45664 commits

Author SHA1 Message Date
Michal Nowak
2420b9364b fix: test: Make deleg cleanuptests memory assertions 32-bit-safe
Each address entry stored by dns_delegset_addaddr() is an
isc_netaddrlink_t, whose size depends on sizeof(void *) via the
ISC_LINK macro (24 bytes of address + two prev/next pointers): 40
bytes on 64-bit, 32 bytes on 32-bit. The hardcoded 4 MB / 8 MB
ranges only held on 64-bit, so dns_deleg_cleanuptests failed on
armv7l with isc_mem_inuse() returning ~3.2 MB.

Express the expected ranges in terms of sizeof(isc_netaddrlink_t)
so they scale with pointer width, and pull the 99999 entry count
out into a NENTRIES macro.

Close isc-projects/bind9#6012

Merge branch 'mnowak/armv7l-fix-dns_deleg_cleanuptests' into 'main'

See merge request isc-projects/bind9!12061
2026-05-20 18:55:34 +02:00
Michal Nowak
4623873e58 Make deleg cleanuptests memory assertions 32-bit-safe
Each address entry stored by dns_delegset_addaddr() is an
isc_netaddrlink_t, whose size depends on sizeof(void *) via the
ISC_LINK macro (24 bytes of address + two prev/next pointers): 40
bytes on 64-bit, 32 bytes on 32-bit. The hardcoded 4 MB / 8 MB
ranges only held on 64-bit, so dns_deleg_cleanuptests failed on
armv7l with isc_mem_inuse() returning ~3.2 MB.

Express the expected ranges in terms of sizeof(isc_netaddrlink_t)
so they scale with pointer width, and pull the 99999 entry count
out into a NENTRIES macro.

Assisted-by: Claude:claude-opus-4-7
2026-05-20 13:29:22 +00:00
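The sizing arithmetic described in 4623873e58 can be sketched as follows. This is a minimal illustration, not the actual test code: `entry_t` is a hypothetical stand-in for isc_netaddrlink_t (24 bytes of address plus two list pointers), and the factor of two between the lower and upper bound is an assumption that mirrors the old 4 MB / 8 MB pair.

```c
#include <stdbool.h>
#include <stddef.h>

#define NENTRIES 99999 /* entry count named in the commit message */

/*
 * Hypothetical stand-in for isc_netaddrlink_t: a 24-byte address
 * payload followed by the prev/next pointers a list link adds.
 */
typedef struct entry {
	unsigned char addr[24];
	struct entry *prev;
	struct entry *next;
} entry_t;

/* Lower bound: the memory held by the entries themselves. */
static size_t
expected_min(void) {
	return (size_t)NENTRIES * sizeof(entry_t);
}

/* Upper bound: assumed to be twice the lower bound (illustrative). */
static size_t
expected_max(void) {
	return 2 * expected_min();
}

/* True when a measured in-use figure falls inside the scaled range. */
static bool
inuse_in_range(size_t inuse) {
	return inuse >= expected_min() && inuse <= expected_max();
}
```

On a 64-bit build sizeof(entry_t) is typically 40, so the lower bound is about 4 MB; on a 32-bit build such as armv7l it is typically 32, giving about 3.2 MB, which matches the isc_mem_inuse() figure quoted above.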
Andoni Duarte
6cae1d10ca Merge tag 'v9.21.22'
2026-05-20 10:26:28 +00:00
Ondřej Surý
61b1e53a70 fix: nil: Properly handle BN_num_bits() return value
BN_num_bits() returns 0 on NULL input and a negative value on internal
error.  The error return value is now properly handled.

Merge branch 'ondrej/fix-BN_num_bits-return-value' into 'main'

See merge request isc-projects/bind9!12057
2026-05-19 21:05:59 +02:00
Ondřej Surý
965995c66a
Properly handle the return value of BN_num_bits()
BN_num_bits() returns 0 when passed NULL and a negative value on
internal error.  The OpenSSL wrappers stored the result in a size_t,
so a 0 return falsely satisfied the bit-length check and a negative
return wrapped to a huge value.  Capture the int return, reject
non-positive values, then compare against the limit.
2026-05-19 19:21:49 +02:00
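The failure mode described in 965995c66a can be illustrated without OpenSSL. In this sketch the `num_bits_ret` parameter stands in for the int returned by BN_num_bits(); the function names and the shape of the limit check are assumptions made for illustration, not the actual wrapper code.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Old shape (illustrative): the int result is converted to size_t
 * before any check.  A return of 0 passes the limit test, and a
 * negative return wraps to a huge unsigned value.
 */
static bool
bits_ok_unchecked(int num_bits_ret, size_t max_bits) {
	size_t bits = (size_t)num_bits_ret;
	return bits <= max_bits;
}

/*
 * Fixed shape: keep the int, reject non-positive values first, and
 * only then compare against the limit.
 */
static bool
bits_ok_checked(int num_bits_ret, size_t max_bits) {
	if (num_bits_ret <= 0) {
		return false;
	}
	return (size_t)num_bits_ret <= max_bits;
}
```

With a limit of 4096, the unchecked variant accepts a return value of 0, while the checked variant rejects both 0 and any negative value before the comparison happens.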
Ondřej Surý
78ececa6bd fix: usr: Reject RRSIG records covering meta-types
A recursive resolver could accept and cache an RRSIG record whose
Type-Covered field names a meta-type (ANY, AXFR, IXFR, MAILA, MAILB),
even though no real RRset of those types ever exists. Such records
are now rejected by the DNS message parser.

Closes #6002

Merge branch '6002-reject-rrsig-covering-meta-types' into 'main'

See merge request isc-projects/bind9!12048
2026-05-19 15:00:39 +02:00
Ondřej Surý
c28ba9c3c6
Reject malformed RRSIG records
A signature cannot cover a meta-type (NONE, ANY, AXFR, IXFR, MAILB,
MAILA, OPT, TSIG, TKEY); previously such records were cached by the
recursive resolver and collided with negative-cache entries on the
same owner name, corrupting the QP-trie cache.

Assisted-by: Claude:claude-opus-4-7
2026-05-19 13:21:48 +02:00
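A minimal sketch of the check described in c28ba9c3c6, assuming the covered type is available as a 16-bit integer. The numeric values are the IANA-assigned RR type codes; the function names are hypothetical and the real parser may structure the test differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for the meta-types listed in the commit message. */
static bool
is_meta_type(uint16_t type) {
	switch (type) {
	case 0:	  /* NONE (reserved, never a real RRset) */
	case 41:  /* OPT */
	case 249: /* TKEY */
	case 250: /* TSIG */
	case 251: /* IXFR */
	case 252: /* AXFR */
	case 253: /* MAILB */
	case 254: /* MAILA */
	case 255: /* ANY */
		return true;
	default:
		return false;
	}
}

/* An RRSIG is acceptable only if its Type-Covered field is a data type. */
static bool
rrsig_covered_type_ok(uint16_t covered) {
	return !is_meta_type(covered);
}
```

For example, an RRSIG covering A (1) or AAAA (28) passes, while one covering ANY (255) or OPT (41) is rejected.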
Matthijs Mekking
3b45b43600 fix: dev: Don't remove corresponding RRSIG in the same loop
Calling `dns_db_deleterdataset()` to remove the corresponding signature within the iterator loop was wrong, because it mutates an rdataset that is not the current one.  This has been fixed.

Merge branch 'matthijs-fix-evict-cname-other' into 'main'

See merge request isc-projects/bind9!12047
2026-05-19 09:48:15 +00:00
Matthijs Mekking
1abd977f43
Don't remove corresponding RRSIG in the same loop
Calling dns_db_deleterdataset() to remove the corresponding signature
within the iterator loop is wrong, because it mutates an rdataset
that is not the current one.
2026-05-19 11:19:47 +02:00
Ondřej Surý
e90a828307 fix: usr: Fix TCP fallback after repeated UDP timeouts
When an authoritative server failed to respond to two consecutive
UDP queries in a fetch, named was supposed to retry the next attempt
over TCP but in fact still sent it over UDP.  The resolver now
properly switches the transport to TCP on the third attempt to
the same server.

Closes #5529

Merge branch '5529-fix-tcp-fallback-after-udp-timeouts' into 'main'

See merge request isc-projects/bind9!12022
2026-05-19 11:19:04 +02:00
Ondřej Surý
08295d004e
Skip EDNS UDP-size hint on TCP retries
The hint feeds the EDNS OPT UDP-size field, which has no effect on TCP
transport.  Avoid the dns_adb_getudpsize() lookup when the query is
already pinned to TCP.

Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:30 +02:00
Ondřej Surý
db28b2127a
Raise the per-server recursive-clients ceiling in fetchlimit
With the resolver now legitimately escalating to TCP after repeated
UDP timeouts to the same authoritative, each lame-server lookup
takes ~50% longer to fail.  The recursive-client backlog therefore
peaks a little higher before the fetches-per-server auto-tune drops
the quota below 200.

Bump the upper bound for the burst-against-lame-server and recovery
steps from 200 to 250 to absorb that extra latency.  The lower bound
and the final post-recovery target (clients <= 20) are unchanged.

Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:30 +02:00
Ondřej Surý
0c0e905615
Add pytest serve_stale TCP-fallback regression tests
The serve_stale shell suite uses a UDP-only perl mock as its
authoritative server.  Now that the resolver escalates to TCP after
repeated UDP timeouts, three steps in serve_stale/tests.sh that
exercise resolver-query-timeout behaviour no longer reach the
timeout — the TCP fallback short-circuits to SERVFAIL via
`connection refused` on the perl mock.

Move those scenarios to a new system test directory
`bin/tests/system/serve_stale_tcp/` that uses a
ControllableAsyncDnsServer mock listening on both UDP and TCP, so
the resolver's TCP path is exercised end-to-end and the original
timing semantics are preserved.  Remove the corresponding shell
steps from serve_stale/tests.sh.

Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:30 +02:00
Ondřej Surý
308c370796
Allow either UDP or TCP queries in flight in statistics test
The "active sockets" and "queries in progress" assertions previously
required exactly one extra UDP/IPv4 socket and exactly one UDP query in
progress, with no TCP counterpart.  That shape held only because the
broken TCP-fallback path left the resolver retrying UDP indefinitely.

With the fix in place, after two UDP timeouts to the same authority the
resolver legitimately escalates to TCP, and a stats snapshot taken
during recursion may catch the in-flight query on either transport.
Count the UDP and TCP counters together so the test reflects the new
correct behaviour.

Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:30 +02:00
Ondřej Surý
a0db3d6505
Tighten serve_stale dig timeouts and inter-step sleeps
With the TCP fallback now actually firing after repeated UDP timeouts,
the resolver covers more retry transitions in the same wall-clock
window, and the original 3-second budgets in two steps of the
serve_stale test left no margin: the dig client at +timeout=3 and the
"sleep 3" before re-enabling the upstream both straddled the moment at
which the resolver switched transport, making the asserted outcome
race-prone.

Drop the dig timeout to 2s and the sleep to 1s so each step lands
firmly on one side of the transport switch.

Co-authored-by: Evan Hunt <each@isc.org>
Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:30 +02:00
Ondřej Surý
a9283c08c2
Emit EDE 22 when the resolver runs out of usable addresses
Two exits from fctx_try() landed at DNS_R_SERVFAIL without attaching
DNS_EDE_NOREACHABLEAUTH: when fctx_getaddresses() returned a non-success,
non-wait status, and when every candidate addrinfo was unusable
(over-quota or filtered) after a restart.

With the new TCP fallback actually firing, those paths are now reached
by serve-stale and similar scenarios in which the auth is unreachable.
Attach the EDE so SERVFAIL responses keep carrying the same operator
signal that the timeout-based exit paths already produce.

Co-authored-by: Evan Hunt <each@isc.org>
Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:30 +02:00
Ondřej Surý
1af37e24b2
Open the stale-refresh-time window on any resolver failure
The TCP-fallback fix in the previous commits means a query that would
previously have timed out on UDP now actually escalates to TCP, and a
TCP-side failure surfaces a non-ISC_R_TIMEDOUT result code to
query_usestale().  The trigger for DNS_DBFIND_STALESTART was previously
narrowed to ISC_R_TIMEDOUT, so the stale-refresh-time window stopped
opening for those clients.

Broaden the condition to any failure that has already cleared the
upstream DUPLICATE/DROP filtering in query_usestale() — the spirit of
the window is "the resolver tried and could not get a fresh answer",
not "the resolver timed out specifically".

Co-authored-by: Evan Hunt <each@isc.org>
Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:30 +02:00
Ondřej Surý
59c00a6f31
Force TCP after repeated UDP timeouts to the same authoritative
Make the decision in fctx_query() before the dispatch is bound so the
chosen transport and the DNS_FETCHOPT_TCP flag agree.  The previous
location in resquery_send() ran after the UDP dispatch had already been
attached, so the flag flip had no effect on the wire.

Moving the decision earlier also means FCTX_ADDRINFO_NOEDNS0 servers,
previously exempt, now escalate to TCP too.  TCP works regardless of
EDNS state, so this is the intended behaviour.

Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:30 +02:00
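The ordering point made in 59c00a6f31 (decide the transport before the dispatch is bound) can be sketched as a pure function. The names, the threshold constant, and the parameters below are hypothetical; the commit only states that the third attempt to the same server goes over TCP after two UDP timeouts.

```c
#include <stdbool.h>

typedef enum { TRANSPORT_UDP, TRANSPORT_TCP } transport_t;

/* Assumed threshold: two consecutive UDP timeouts to the same server. */
#define UDP_TIMEOUTS_BEFORE_TCP 2

/*
 * Choose the transport for the next attempt.  Calling this before any
 * socket or dispatch is set up keeps the "TCP" flag and the transport
 * actually used in agreement.
 */
static transport_t
choose_transport(unsigned int udp_timeouts, bool tcp_already_requested) {
	if (tcp_already_requested ||
	    udp_timeouts >= UDP_TIMEOUTS_BEFORE_TCP)
	{
		return TRANSPORT_TCP;
	}
	return TRANSPORT_UDP;
}
```

Under these assumptions the first two attempts use UDP and the third switches to TCP, regardless of the server's EDNS state.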
Ondřej Surý
01523a078a
Temporarily remove TCP fallback after UDP timeouts
The retry path in resquery_send() that flipped DNS_FETCHOPT_TCP on a
query whose dispatch had already been bound as UDP in fctx_query() had
no effect on the transport actually used, but did leave a stale TCP
bit visible to downstream consumers (dnstap framing, cookie checks,
the AUTHORITY-NS spoofability guard).

The ineffective code has been removed from resquery_send().  The
TCP fallback functionality will be corrected and restored in the next
commit.

Assisted-by: Claude:claude-opus-4-7
2026-05-19 11:18:16 +02:00
Ondřej Surý
54f5210463 chg: usr: named could crash on concurrent TKEY DELETE for the same key
On a server configured with tkey-gssapi-keytab (or tkey-gssapi-credential),
an authenticated peer could crash named by sending two TKEY DELETE requests
for the same dynamic key in rapid succession.  This has been fixed.

Closes #6001

Merge branch '6001-tsig-tkey-delete-uaf' into 'main'

See merge request isc-projects/bind9!12041
2026-05-18 06:48:58 +02:00
Ondřej Surý
5c8dcd4419
Fix use-after-free in concurrent dns_tsigkey_delete()
Two TSIG-authenticated TKEY DELETE queries for the same dynamic key,
arriving on different worker loops, could each enter
dns_tsigkey_delete() and over-decrement the key refcount.

This has been fixed by making dns_tsigkey_delete() idempotent.
2026-05-17 17:14:08 +02:00
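One common way to make a delete operation idempotent, as 5c8dcd4419 describes, is to guard the reference drop with an atomic flag. The sketch below uses C11 atomics and hypothetical names; it shows the general technique only and is not taken from the BIND source.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical key object: a reference count plus a "deleted" marker. */
typedef struct sketch_key {
	atomic_uint refs;
	atomic_bool deleted;
} sketch_key_t;

static void
sketch_key_init(sketch_key_t *key, unsigned int refs) {
	atomic_init(&key->refs, refs);
	atomic_init(&key->deleted, false);
}

/*
 * Drop the reference owned by the key table exactly once.  The
 * atomic exchange lets only the first caller through; any later or
 * concurrent caller sees `deleted` already set and does nothing.
 */
static bool
sketch_key_delete(sketch_key_t *key) {
	if (atomic_exchange(&key->deleted, true)) {
		return false;
	}
	atomic_fetch_sub(&key->refs, 1);
	return true;
}
```

Calling sketch_key_delete() twice on the same key decrements the count once, so two racing DELETE requests cannot drive the count below what the remaining holders expect.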
Matthijs Mekking
9f84037814 fix: usr: The resolver now removes other RRsets at the same name when caching a CNAME
When an RRset was in the stale cache and the authoritative server changed the record type to CNAME, the resolver failed to refresh the stale entry. This has been fixed.

Closes #5302

Merge branch '5302-serve-stale-cname-to-a' into 'main'

See merge request isc-projects/bind9!11758
2026-05-17 09:56:20 +00:00
Matthijs Mekking
69a560fff1 When caching names, check for CNAME RRsets
CNAME and other record types cannot coexist. DNSSEC records are the
exceptions to this rule.

If the answer contains a name with a CNAME, remove existing RRsets at
the same name from the cache.

If the answer contains a name without a CNAME, remove the CNAME RRset
at the same name from the cache.
2026-05-17 08:42:05 +00:00
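The coexistence rule in 69a560fff1 can be expressed as a small predicate. This is a sketch under assumptions: the function names are hypothetical, and treating RRSIG and NSEC as the DNSSEC exceptions is an illustrative choice, since the commit message does not enumerate them.

```c
#include <stdbool.h>
#include <stdint.h>

enum { T_CNAME = 5, T_RRSIG = 46, T_NSEC = 47 }; /* IANA RR type codes */

/* Types assumed here to be allowed alongside a CNAME at the same name. */
static bool
may_coexist_with_cname(uint16_t type) {
	return type == T_CNAME || type == T_RRSIG || type == T_NSEC;
}

/*
 * Decide whether a cached RRset of type `cached` must be removed when
 * an RRset of type `incoming` is being cached at the same owner name.
 */
static bool
must_evict(uint16_t incoming, uint16_t cached) {
	if (incoming == T_CNAME) {
		return !may_coexist_with_cname(cached);
	}
	if (cached == T_CNAME) {
		return !may_coexist_with_cname(incoming);
	}
	return false;
}
```

Caching a CNAME therefore evicts an existing A RRset, caching an A RRset evicts an existing CNAME, and signatures are left alone in both directions.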
Matthijs Mekking
4ee526cb6d Add serve-stale test case for CNAME to A
Add a serve-stale system test case where the authority changes a
CNAME RRset to A (at cname2.stale.test). The CNAME that is in the
cache is stale and should be refreshed. The target A record (at
a2.stale.test) has a longer TTL and is also still in the cache. The
next query should return the refreshed A RRset to the client.

Then the authority changes the A RRset back to CNAME. The A RRset
has become stale and should be refreshed. The next query should
return the refreshed CNAME RRset plus the already cached
a2.stale.test A record.

This test requires ns1 to allow dynamic updates to stale.test, and
prefetch to be disabled. The latter is to ensure the record is not
prefetched, but only refreshed when stale (and logs the expected
"an attempt to refresh the RRset" messages).
2026-05-17 08:42:05 +00:00
Matthijs Mekking
c95128ed47 Remove duplicate check in serve-stale test
2026-05-17 08:42:05 +00:00
Ondřej Surý
abe0369436 fix: nil: More changes to PR-Agent CI job
Merge branch 'ondrej/use-claude-opus-4-6' into 'main'

See merge request isc-projects/bind9!12037
2026-05-17 00:04:49 +02:00
Ondřej Surý
ee5e933933
Add both Claude 4.6 and ChatGPT in two separate job pipelines
2026-05-16 22:26:05 +02:00
Ondřej Surý
dae0820f80
Allow failure to not block pipelines for the PR-Agent CI job
2026-05-16 17:53:20 +02:00
Ondřej Surý
99194aec84
Change the PR-Agent configuration to use Claude 4.6
2026-05-16 17:50:42 +02:00
Ondřej Surý
a44b91eae1 fix: nil: Properly use other_checks_jobs template for pr-agent CI job
Merge branch 'ondrej/pr-agent-other-check' into 'main'

See merge request isc-projects/bind9!12035
2026-05-16 13:29:30 +02:00
Ondřej Surý
5550fb84ae
Properly use other_checks_jobs template for pr-agent CI job
2026-05-16 13:22:45 +02:00
Ondřej Surý
be727bd5e2 fix: dev: Run PR-Agent only when manually triggered
Merge branch 'ondrej/run-pr-agent-only-manually' into 'main'

See merge request isc-projects/bind9!12033
2026-05-16 12:50:19 +02:00
Ondřej Surý
4257454262
Run PR-Agent only when manually triggered
2026-05-16 12:49:54 +02:00
Ondřej Surý
9755fb6455 new: dev: Enable PR-Agent reviews on merge requests
Adds a CI job that runs PR-Agent against each merge request opened
from the canonical repository, posting an automated review and
code-improvement suggestions as MR comments. The job is gated to
same-project source branches so the OpenAI key and personal access
token are not exposed to fork pipelines.

Merge branch 'ondrej/add-pr-agent' into 'main'

See merge request isc-projects/bind9!12032
2026-05-16 12:30:01 +02:00
Ondřej Surý
07345b25d9
Add PR-Agent job to GitLab CI for merge-request review
Run PR-Agent's `review` and `improve` commands against each merge
request from the canonical repository, posting an automated review
and code-improvement suggestions as MR comments. The rule restricts
the job to MRs whose source project matches CI_PROJECT_PATH so the
OpenAI key and GitLab personal access token are never exposed to
fork pipelines.
2026-05-16 12:14:33 +02:00
Ondřej Surý
fce9f32367 chg: dev: Allow any valid DNS name as a TSIG/RNDC key name
The key-generation tools (tsig-keygen, rndc-confgen) now accept any valid DNS name for key names.

Merge branch 'ondrej/allow-all-valid-keynames' into 'main'

See merge request isc-projects/bind9!12029
2026-05-15 11:00:31 +02:00
Ondřej Surý
85f854b076 Allow any valid DNS name as a key name
TSIG key names need to be any valid DNS name so that update-policy
"self" rules work with arbitrary names.  Replace the
alnum+'.'+'-'+'_' charset filter in the key-generation tools with a
dns_name_fromstring() validity check.
2026-05-15 10:14:46 +02:00
Ondřej Surý
c708d694fe chg: dev: Use SipHash-1-3 for hash tables, keep SipHash-2-4 for cookies
SipHash-2-4 was designed as a conservative PRF/MAC with extra rounds
against future attacks.  For hash tables, where outputs are never
exposed, SipHash-1-3 provides sufficient collision resistance with
fewer rounds.  As the SipHash author noted: "I would be very surprised
if SipHash-1-3 introduced weaknesses for hash tables."

DNS cookies continue to use SipHash-2-4 since cookie values are sent
on the wire and must resist online attacks.

Merge branch 'ondrej/siphash-1-3' into 'main'

See merge request isc-projects/bind9!11787
2026-05-15 09:33:09 +02:00
Ondřej Surý
6175577210
Use SipHash-1-3 for hash tables, keep SipHash-2-4 for cookies
SipHash-2-4 was designed as a conservative PRF/MAC with extra rounds
against future attacks.  For hash tables, where outputs are never
exposed, SipHash-1-3 provides sufficient collision resistance with
fewer rounds.  As the SipHash author noted: "I would be very surprised
if SipHash-1-3 introduced weaknesses for hash tables."

DNS cookies continue to use SipHash-2-4 since cookie values are sent
on the wire and must resist online attacks.
2026-05-15 08:15:59 +02:00
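The difference between the two variants named in 6175577210 is only the round counts: SipHash-c-d runs c compression rounds per message word and d finalization rounds. The following is a compact sketch of the published SipHash algorithm with both counts as parameters; it is an illustration, not the BIND implementation, and the function names are chosen here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROTL64(x, b) (((x) << (b)) | ((x) >> (64 - (b))))

/* One SipRound over the four-word internal state. */
static void
sipround(uint64_t v[4]) {
	v[0] += v[1]; v[1] = ROTL64(v[1], 13); v[1] ^= v[0];
	v[0] = ROTL64(v[0], 32);
	v[2] += v[3]; v[3] = ROTL64(v[3], 16); v[3] ^= v[2];
	v[0] += v[3]; v[3] = ROTL64(v[3], 21); v[3] ^= v[0];
	v[2] += v[1]; v[1] = ROTL64(v[1], 17); v[1] ^= v[2];
	v[2] = ROTL64(v[2], 32);
}

/* SipHash-c-d with a 128-bit key (k0, k1) and a 64-bit output. */
static uint64_t
siphash(int c, int d, uint64_t k0, uint64_t k1, const uint8_t *in,
	size_t len) {
	uint64_t v[4] = {
		k0 ^ UINT64_C(0x736f6d6570736575),
		k1 ^ UINT64_C(0x646f72616e646f6d),
		k0 ^ UINT64_C(0x6c7967656e657261),
		k1 ^ UINT64_C(0x7465646279746573),
	};
	uint64_t last = (uint64_t)len << 56; /* length byte in the top */
	size_t full = len - (len % 8);

	assert(len == 0 || in != NULL);

	/* Compression: absorb each full little-endian 64-bit word. */
	for (size_t i = 0; i < full; i += 8) {
		uint64_t m = 0;
		for (int j = 0; j < 8; j++) {
			m |= (uint64_t)in[i + j] << (8 * j);
		}
		v[3] ^= m;
		for (int r = 0; r < c; r++) {
			sipround(v);
		}
		v[0] ^= m;
	}

	/* Final word: the trailing bytes plus the length byte. */
	for (size_t i = full; i < len; i++) {
		last |= (uint64_t)in[i] << (8 * (i - full));
	}
	v[3] ^= last;
	for (int r = 0; r < c; r++) {
		sipround(v);
	}
	v[0] ^= last;

	/* Finalization: d rounds, then fold the state. */
	v[2] ^= 0xff;
	for (int r = 0; r < d; r++) {
		sipround(v);
	}
	return v[0] ^ v[1] ^ v[2] ^ v[3];
}
```

Calling siphash(2, 4, ...) gives the conservative variant kept for cookies, and siphash(1, 3, ...) gives the cheaper variant used for hash tables; per message word the latter does half the compression work.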
Ondřej Surý
62f1672609 fix: test: Fix flaky reclimit test
The max-types-per-name cache eviction tests were flaky because two test steps were missing a sleep between queries, causing TTL-based cache verification to fail when both queries completed within the same second.

Merge branch 'ondrej/fix-flaky-reclimit' into 'main'

See merge request isc-projects/bind9!11782
2026-05-15 08:03:16 +02:00
Ondřej Surý
80f04a9ee5
Fix flaky reclimit test by adding missing sleep
The cache verification in steps 11 and 15 checks that the TTL has
decreased from its initial value to confirm the response was served
from cache, but the sleep between the two queries was missing. Both
queries could complete within the same second, leaving the TTL
unchanged and causing the test to incorrectly conclude the entry was
not cached.
2026-05-15 08:02:56 +02:00
Ondřej Surý
c7b53348fc chg: dev: Skip in-domain nameservers that have no glue
A referral that names a nameserver inside the delegated zone but
provides no address for it leaves the resolver unable to reach that
server. named now logs "missing mandatory glue for <name>" at notice
level and skips the nameserver.

Merge branch 'ondrej/dont-store-missing-in-domain-glue-ns' into 'main'

See merge request isc-projects/bind9!11971
2026-05-15 07:48:26 +02:00
Ondřej Surý
28483b3b73
Drop in-domain NS without glue from the delegation set
Pull the dns_message_findname() lookups into cache_delegglue() and
cache_delegglue6() so each helper now owns its glue lookup and returns
the number of addresses cached.  cache_delegns() splits referrals into
two cases: in-domain (the NS name is below the delegation point) and
sibling/in-bailiwick.

An in-domain NS without glue is unresolvable by definition - the
resolver would have to ask the very server it's trying to find.  Log
"missing mandatory glue" at notice level and skip the deleg entirely
rather than leaving an unusable entry in the set.  A new
dns_delegset_freedeleg() undoes a fresh dns_delegset_allocdeleg() so
the rest of the delegation set is preserved.
2026-05-15 07:26:38 +02:00
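The in-domain test from 28483b3b73 can be sketched on plain strings. This is an illustration under assumptions: names are given in presentation format without a trailing dot, "in-domain" is taken to mean at or below the delegation point, and the helper names are hypothetical (the real code works on dns_name_t values).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if `name` equals `zone` or ends in "." followed by `zone`. */
static bool
name_at_or_below(const char *name, const char *zone) {
	size_t nlen = strlen(name);
	size_t zlen = strlen(zone);
	const char *suffix;

	if (zlen > nlen) {
		return false;
	}
	suffix = name + (nlen - zlen);
	/* DNS names compare case-insensitively. */
	for (size_t i = 0; i < zlen; i++) {
		if (tolower((unsigned char)suffix[i]) !=
		    tolower((unsigned char)zone[i]))
		{
			return false;
		}
	}
	/* Require a label boundary so "badexample.com" does not match. */
	return nlen == zlen || suffix[-1] == '.';
}

/*
 * An in-domain NS with no glue addresses cannot be resolved without
 * asking the server it names, so it is skipped.
 */
static bool
skip_nameserver(const char *nsname, const char *delegation,
		size_t glue_count) {
	return glue_count == 0 && name_at_or_below(nsname, delegation);
}
```

A nameserver outside the delegated zone is kept even without glue, because its address can be looked up independently.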
Ondřej Surý
ef405bfa6d chg: usr: Fall back to TCP on a UDP response with a mismatched query id
BIND used to wait silently for the correct DNS message id on a UDP fetch
even after receiving a response from the expected server with the wrong
id, leaving room for off-path spoofing attempts to keep guessing within
that window.  The resolver now retries the fetch over TCP on the first
such response, and a new MismatchTCP statistics counter tracks how
often the fallback fires.

Closes #5449

Merge branch '5449-immediate-tcp-fallback-on-id-mismatch' into 'main'

See merge request isc-projects/bind9!12023
2026-05-15 06:57:00 +02:00
Ondřej Surý
11bca1051f
Switch UDP fetches to TCP on the first response with a wrong query id
Until now, the dispatcher silently dropped UDP responses from the
expected peer that carried the wrong DNS message id and kept listening
for the correct id to arrive within the read timeout.  An off-path
attacker who knows the destination address and source port of an
outgoing fetch could exploit that quiet retry window to flood the
resolver with guessed responses; with a gigabit link the per-query
success probability grows linearly with the number of guesses that
arrive before the legitimate answer or the timeout.

Treat any such mismatch as a possible spoofing attempt and let the
resolver immediately retry the same query over TCP, the same control
path the truncation handler already uses.

Add a resolver statistics counter, exposed as 'queries retried over TCP
after a response with mismatched query id' in rndc stats and as
'MismatchTCP' in the statistics channel.

Assisted-by: Claude:claude-opus-4-7
2026-05-14 15:56:18 +02:00
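The response handling described in 11bca1051f can be sketched as a classification step. The enum, the function name, and the counter parameter are hypothetical; only the behaviour (retry over TCP on the first wrong id from the expected peer, and count it) comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
	RESP_ACCEPT,	/* id matches: process the response */
	RESP_IGNORE,	/* not from the expected peer: drop it */
	RESP_RETRY_TCP, /* expected peer, wrong id: retry over TCP */
} resp_action_t;

/*
 * Classify a UDP response.  `mismatch_tcp` stands in for the
 * MismatchTCP statistics counter and is bumped on each fallback.
 */
static resp_action_t
classify_udp_response(bool from_expected_peer, uint16_t expected_id,
		      uint16_t received_id, unsigned int *mismatch_tcp) {
	if (!from_expected_peer) {
		return RESP_IGNORE;
	}
	if (received_id != expected_id) {
		(*mismatch_tcp)++;
		return RESP_RETRY_TCP;
	}
	return RESP_ACCEPT;
}
```

Under this scheme an off-path attacker gets a single guess per UDP fetch before the query moves to TCP, instead of a whole read-timeout window.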
Ondřej Surý
29f0b07e8c fix: dev: Fix data race during rndc dumpdb or zone load
Both 'rndc dumpdb' against a server with zones and asynchronous zone
loading had a timing window in which the operation's completion could
fire before the server had finished registering the operation,
occasionally leading to a crash.  The completion is now
delivered after the registration is in place.

Closes #5952

Merge branch '5952-fix-masterdump-async-ctx-race' into 'main'

See merge request isc-projects/bind9!11991
2026-05-14 08:52:58 +02:00
Ondřej Surý
8ae464d552
Fix data race in async master dump/load context publication
Bouncing the offload itself to the target loop let the after-work
callback fire on the target thread and run the user's done callback
before the calling thread had published *dctxp / *lctxp.  Enqueue on
the calling loop and bounce only the done callback instead, so the
publish is sequenced before the cross-thread hand-off by construction
and cannot be reintroduced by reordering the entry-point body.
2026-05-14 08:51:39 +02:00
Mark Andrews
2091d703ac fix: usr: Disable output escaping in bind9.xsl
The statistics charts were not displaying on some browsers.
This has been fixed.

Closes #5990

Merge branch '5990-disable-output-escaping-in-bind9-xsl' into 'main'

See merge request isc-projects/bind9!12018
2026-05-14 12:06:41 +10:00
Mark Andrews
9b6c018425 Disable output escaping in bind9.xsl
The statistics charts were not displaying on some browsers (e.g. Chrome)
due to '>' being escaped as '&gt;'.  Use disable-output-escaping="yes" to
turn this off.
2026-05-14 10:00:21 +10:00
Colin Vidal
b4e8e431eb fix: test: Fix cyclic glues (again)
The previous fix `ed90d578b3a98f45eb8bc09966e9c4ab870a156d` used
`wait_for_line()` by mistake, whereas the test needs to wait for two log
lines to be printed before continuing.

In principle, `wait_for_all()` should do, but `running` should always be
printed first, so `wait_for_sequence()` seems to be the right fit here.

Merge branch 'colin/fix-cyclic-glues-again' into 'main'

See merge request isc-projects/bind9!12013
2026-05-13 22:31:32 +02:00