A `dns_qmpulti_t` no longer needs to know about its loopmgr. We no
longer keep a linked list of `dns_qpmulti_t` that have reclamation
work, and we no longer mark chunks with the phase in which they are to
be reclaimed. Instead, empty chunks are listed in an array in a
`qp_rcu_t`, which is passed to call_rcu().
If the resolver received a FORMERR response to a request with
an DNS COOKIE option present that echoes the option back, resend
the request without an DNS COOKIE option present.
After the dns_xfrin was changed to use network manager, the maximum
global (max-transfer-time-in) and idle (max-transfer-idle-in) times for
incoming transfers were turned inoperational because of missing
implementation.
Restore this functionality by implementing the timers for the incoming
transfers.
It should be floor(DNS_NAME_MAXWIRE / 2) + 1 == 128
The mistake was introduced in c6bf51492d because:
* I was refactoring an existing `DNS_MAX_LABELS` defined as 127
* There was a longstanding bug in `dns_name_isvalid()` which
checked the number of labels against 127U instead of 128
* I mistakenly thought `dns_name_isvalid()` was correct and
`dns_name_countlabels()` was incorrect, but the reverse was true.
After this commit, occurrances of `DNS_NAME_MAXLABELS` with value
128 are consistent with the use of 127 or 128 before commit
c6bf51492d except for the mistake in `dns_name_isvalid()`.
This commit adds a test case that checks the MAXLABELS case
in `dns_name_fromtext()` and `dns_name_isvalid()`.
This change makes the zone table lock-free for reads. Previously, the
zone table used a red-black tree, which is not thread safe, so the hot
read path acquired both the per-view mutex and the per-zonetable
rwlock. (The double locking was to fix to cleanup races on shutdown.)
One visible difference is that zones are not necessarily shut down
promptly: it depends on when the qp-trie garbage collector cleans up
the zone table. The `catz` system test checks several times that zones
have been deleted; the test now checks for zones to be removed from
the server configuration, instead of being fully shut down. The catz
test does not churn through enough zones to trigger a gc, so the zones
are not fully detached until the server exits.
After this change, it is still possible to improve the way we handle
changes to the zone table, for instance, batching changes, or better
compaction heuristics.
Revert refcount debug tracing (commit a8b29f0365), there are better
ways to do it.
Use the dns_qpmethods_t typedef where appropriate.
Some stylistic improvements.
It is sometimes necessary to access a qp-trie outside an isc_loop,
such as in tests or an isc_work callback. The best option was to use
a `dns_qpmulti_write()` transaction, but that has overheads that are
not necessary for read-only access, such as committing a new version
of the trie even when nothing changed.
So this commit adds a `dns_qpmulti_read()` transaction, which is
nearly as lightweight as a query transaction, but it takes the mutex
like a write transaction.
This is the first of the "fancy" searches that know how the DNS
namespace maps on to the structure of a qp-trie. For example, it will
find the closest enclosing zone in the zone tree.
The dns_adbentry_overquota() was violating the layers accessing the
adbentry struct members directly. Change it to dns_adb_overquota() to
match the dns_adb API.
use the ISC_REFCOUNT implementation for dns_zone_attach() and
_detach(). (this applies only to external zone references, not
to dns_zone_iattach() and dns_zone_idetach().)
use dns_zone_ref() where previously a dummy zone object had been
used to increment the reference count.
My original idea had been that the core qp-trie code would be mostly
independent of the storage for keys, so I did not make it check at run
time that key lengths are sensible. However, the qp-trie search
routines need to get keys out of leaf objects, for which they provide
storage on the stack, which is particularly dangerous for unchecked
buffer overflows. So this change checks that key lengths are in bounds
at the API boundary between the qp-trie code and the rest of BIND, and
there is no more pretence that keys might be longer.
Add a new configuration option to set how the checkds method should
work. Acceptable values are 'yes', 'no', and 'explicit'.
When set to 'yes', the checkds method is to lookup the parental agents
by querying the NS records of the parent zone.
When set to 'no', no checkds method is enabled. Users should run
the 'rndc checkds' command to signal that DS records are published and
withdrawn.
When set to 'explicit', the parental agents are explicitly configured
with the 'parental-agents' configuration option.
This should have no functional effects.
The message size stats are specified by RSSAC002 so it's best not
to mess around with how they appear in the statschannel. But it's
worth changing the implementation to use general-purpose histograms,
to reduce code size and benefit from sharded counters.
Cleanup the remnants of MS Compiler bits from <isc/refcount.h>, printing
the information in named/main.c, and cleanup some comments about Windows
that no longer apply.
The bits in picohttpparser.{h,c} were left out, because it's not our
code.
The isc_fsaccess API was created to hide the implementation details
between POSIX and Windows APIs. As we are not supporting the Windows
APIs anymore, it's better to drop this API used in the DST part.
Moreover, the isc_fsaccess was setting the permissions in an insecure
manner - it operated on the filename, and not on the file descriptor
which can lead to all kind of attacks if unpriviledged user has read (or
even worse write) access to key directory.
Replace the code that operates on the private keys with code that uses
mkstemp(), fchmod() and atomic rename() at the end, so at no time the
private key files have insecure permissions.
The only place where dns_name_hash() was being used is the old hash
table in the dns_badcache unit. Squash the dns_name_fullhash() and
dns_name_hash() into single dns_name_hash() function that's always
case-insensitive as it doesn't make to do case-sensitive hashing of the
domain names and we were not using this anywhere.
Instead of marking the unused entities with UNUSED(x) macro in the
function body, use a `ISC_ATTR_UNUSED` attribute macro that expans to
C23 [[maybe_unused]] or __attribute__((__unused__)) as fallback.
Change the isc_job_run() to not-make any allocations. The caller must
make sure that it allocates isc_job_t - usually as part of the argument
passed to the callback.
For simple jobs, using isc_async_run() is advised as it allocates its
own separate isc_job_t.
for testing purposes, we need to be able to specify a library path from
which to load the dnsrps implementation. this can now be done with the
"dnsrps-library" option.
DNSRPS can now be enabled in configure regardless of whether librpz.so
is currently installed on the system.
the new dns_view_addtrustedkey() function allows a view's trust
anchors to be updated directly. this code was formerly in
dns_client_addtrustedkey(), which is now a wrapper around
dns_view_addtrustedkey().
stop and restart the server in the 'tsiggss' test, in order
to confirm that GSS negotiated TSIG keys are saved and restored
when named loads.
added logging to dns_tsigkey_createfromkey() to indicate whether
a key has been statically configured, generated via GSS negotiation,
or restored from a file.
REQUIRE that rdata->type is dns_rdatatype_svcb to detect when
dns_rdata_checksvcb is called with the wrong rdata type. There are
no code paths that currently pass the wrong rdata to dns_rdata_checksvcb.
This was found by GCC 12 static analysis.
without diffie-hellman TKEY negotiation, some other code is
now effectively dead or unnecessary, and can be cleaned up:
- the rndc tsig-list and tsig-delete commands.
- a nonoperational command-line option to dnssec-keygen that
was documented as being specific to DH.
- the section of the ARM that discussed TKEY/DH.
- the functions dns_tkey_builddeletequery(), processdeleteresponse(),
and tkey_processgssresponse(), which are unused.
Completely remove the TKEY Mode 2 (Diffie-Hellman Exchanged Keying) from
BIND 9 (from named, named.conf and all the tools). The TKEY usage is
fringe at best and in all known cases, GSSAPI is being used as it should.
The draft-eastlake-dnsop-rfc2930bis-tkey specifies that:
4.2 Diffie-Hellman Exchanged Keying (Deprecated)
The use of this mode (#2) is NOT RECOMMENDED for the following two
reasons but the specification is still included in Appendix A in case
an implementation is needed for compatibility with old TKEY
implementations. See Section 4.6 on ECDH Exchanged Keying.
The mixing function used does not meet current cryptographic
standards because it uses MD5 [RFC6151].
RSA keys must be excessively long to achieve levels of security
required by current standards.
We might optionally implement Elliptic Curve Diffie-Hellman (ECDH) key
exchange mode 6 if the draft ever reaches the RFC status. Meanwhile the
insecure DH mode needs to be removed.
There are leftovers from the previous refactoring effort, which left
some function declarations and comments in the header file unchanged.
Finish the renaming.
This implements node reference tracing that passes all the internal
layers from dns_db API (and friends) to increment_reference() and
decrement_reference().
It can be enabled by #defining DNS_DB_NODETRACE in <dns/trace.h> header.
The output then looks like this:
incr:node:check_address_records:rootns.c:409:0x7f67f5a55a40->references = 1
decr:node:check_address_records:rootns.c:449:0x7f67f5a55a40->references = 0
incr:nodelock:check_address_records:rootns.c:409:0x7f67f5a55a40:0x7f68304d7040->references = 1
decr:nodelock:check_address_records:rootns.c:449:0x7f67f5a55a40:0x7f68304d7040->references = 0
There's associated python script to find the missing detach located at:
https://gitlab.isc.org/isc-projects/bind9/-/snippets/1038
Now that we can configure a different digest type, update the code
to honor the configuration. Update 'dns_dnssec_syncupdate' so that
the correct CDS record is published, and also when deleting CDS records,
ensure that all possible CDS records are removed from the zone.
In general, it's better to do one thorough compaction when a batch of
work is complete, which is the way that `update` transactions work.
Conversely, `write` transactions are designed so that lots of little
transactions are not too inefficient, but they need explicit
compaction. This changes `dns_qp_compact()` so that it is easier to
compact any time that makes sense, if there isn't a better way to
schedule compaction. And `dns_qpmulti_commit()` only recycles garbage
when there is enough to make it worthwhile.
Add some qp-trie tracing macros which can be enabled by a
developer. These print a message when a leaf is attached or
detached, indicating which part of the qp-trie implementation
did so. The refcount methods must now return the refcount value
so it can be printed by the trace macros.
The first working multi-threaded qp-trie was stuck with an unpleasant
trade-off:
* Use `isc_rwlock`, which has acceptable write performance, but
terrible read scalability because the qp-trie made all accesses
through a single lock.
* Use `liburcu`, which has great read scalability, but terrible
write performance, because I was relying on `rcu_synchronize()`
which is rather slow. And `liburcu` is LGPL.
To get the best of both worlds, we need our own scalable read side,
which we now have with `isc_qsbr`. And we need to modify the write
side so that it is not blocked by readers.
Better write performance requires an async cleanup function like
`call_rcu()`, instead of the blocking `rcu_synchronize()`. (There
is no blocking cleanup in `isc_qsbr`, because I have concluded
that it would be an attractive nuisance.)
Until now, all my multithreading qp-trie designs have been based
around two versions, read-only and mutable. This is too few to
work with asynchronous cleanup. The bare minimum (as in epoch
based reclamation) is three, but it makes more sense to support an
arbitrary number. Doing multi-version support "properly" makes
fewer assumptions about how safe memory reclamation works, and it
makes snapshots and rollbacks simpler.
To avoid making the memory management even more complicated, I
have introduced a new kind of "packed reader node" to anchor the
root of a version of the trie. This is simpler because it re-uses
the existing chunk lifetime logic - see the discussion under
"packed reader nodes" in `qp_p.h`.
I have also made the chunk lifetime logic simpler. The idea of a
"generation" is gone; instead, chunks are either mutable or
immutable. And the QSBR phase number is used to indicate when a
chunk can be reclaimed.
Instead of the `shared_base` flag (which was basically a one-bit
reference count, with a two version limit) the base array now has a
refcount, which replaces the confusing ad-hoc lifetime logic with
something more familiar and systematic.
Adjust the dns_qp_memusage() and dns_qp_compact() functions
to be more informative and flexible about handling fragmentation.
Avoid wasting space in runt chunks.
Switch from twigs_mutable() to cells_immutable() because that is the
sense we usually want.
Drop the redundant evacuate() function and rename evacuate_twigs() to
evacuate(). Move some chunk test functions closer to their point of
use.
Clarify compact_recursive(). Some small cleanups to comments.
Use isc_time_monotonic() for qp-trie timing stats.
Use #define constants to control debug logging.
Set up DNS name label offsets in dns_qpkey_fromname() so it is easier
to use in cases where the name is not fully hydrated.
A qp-trie is a kind of radix tree that is particularly well-suited to
DNS servers. I invented the qp-trie in 2015, based on Dan Bernstein's
crit-bit trees and Phil Bagwell's HAMT. https://dotat.at/prog/qp/
This code incorporates some new ideas that I prototyped using
NLnet Labs NSD in 2020 (optimizations for DNS names as keys)
and 2021 (custom allocator and garbage collector).
https://dotat.at/cgi/git/nsd.git
The BIND version of my qp-trie code has a number of improvements
compared to the prototype developed for NSD.
* The main omission in the prototype was the very sketchy outline of
how locking might work. Now the locking has been implemented,
using a reader/writer lock and a mutex. However, it is designed to
benefit from liburcu if that is available.
* The prototype was designed for two-version concurrency, one
version for readers and one for the writer. The new code supports
multiversion concurrency, to provide a basis for BIND's dbversion
machinery, so that updates are not blocked by long-running zone
transfers.
* There are now two kinds of transaction that modify the trie: an
`update` aims to support many very small zones without wasting
memory; a `write` avoids unnecessary allocation to help the
performance of many small changes to the cache.
* There is also a single-threaded interface for situations where
concurrent access is not necessary.
* The API makes better use of types to make it more clear which
operations are permitted when.
* The lookup table used to convert a DNS name to a qp-trie key is
now initialized by a run-time constructor instead of a programmer
using copy-and-paste. Key conversion is more flexible, so the
qp-trie can be used with keys other than DNS names.
* There has been much refactoring and re-arranging things to improve
the terminology and order of presentation in the code, and the
internal documentation has been moved from a comment into a file
of its own.
Some of the required functionality has been stripped out, to be
brought back later after the basics are known to work.
* Garbage collector performance statistics are missing.
* Fancy searches are missing, such as longest match and
nearest match.
* Iteration is missing.
* Search for update is missing, for cases where the caller needs to
know if the value object is mutable or not.
Some qp-trie operations will need to know the maximum number of labels
in a name, so I wanted a standard macro definition with the right
value.
Replace DNS_MAX_LABELS from <dns/resolver.h with DNS_NAME_MAXLABELS in
<dns/name.h>, and add its counterpart DNS_NAME_LABELLEN.
Use these macros in `name.c` and `resolver.c`.
Fix an off-by-one error in an assertion in `dns_name_countlabels()`.
Instead of holding the catzs->lock the whole time we process the catz
update, only hold it for hash table lookup and then release it. This
should unblock any other threads that might be processing updates to
catzs triggered by extra incoming transfer.
Offload catalog zone processing so that the network manager threads
are not interrupted by a large catalog zone update.
Introduce a new 'updaterunning' state alongside with 'updatepending',
like it is done in the RPZ module.
Note that the dns__catz_update_cb() function currently holds the
catzs->lock during the whole process, which is far from being optimal,
but the issue is going to be addressed separately.
This change should make sure that catalog zone update processing
doesn't happen when the catalog zone is being shut down. This
should help avoid races when offloading the catalog zone updates
in the follow-up commit.
* Change 'dns_catz_new_zones()' function's prototype (the order of the
arguments) to synchronize it with the similar function in rpz.c.
* Rename 'refs' to 'references' in preparation of ISC_REFCOUNT_*
macros usage for reference tracking.
* Unify dns_catz_zone_t naming to catz, and dns_catz_zones_t naming to
catzs, following the logic of similar changes in rpz.c.
* Use C compound literals for structure initialization.
* Synchronize the "new zone version came too soon" log message with the
one in rpz.c.
* Use more of 'sizeof(*ptr)' style instead of the 'sizeof(type_t)' style
expressions when allocating or freeing memory for 'ptr'.
`libirs` used to be a reference implementation of `getaddrinfo` and
related modern resolver APIs. It was stripped down in BIND 9.18
leaving only the `irs_resconf` module, which parses
`/etc/resolv.conf`. I have kept its include path and namespace prefix,
so it remains a little fragment of libirs now embedded in libdns.
the dns_xfrin module was still using the network manager directly to
manage TCP connections and send and receive messages. this commit
changes it to use the dispatch manager instead.
use ISC_REFCOUNT_IMPL for dns_xfrin_ctx_t (which has been renamed
to dns_xfrin_t to keep the function names dns_xfrin_attach() and
dns_xfrin_detach() unchanged).
the 'dispatchmgr' member of the resolver object is used by both
the dns_resolver and dns_request modules, and may in the future
be used by others such as dns_xfrin. it doesn't make sense for it
to live in the resolver object; this commit moves it into dns_view.
removed references in code comments, doc/dev documentation, etc, to
isc_task, isc_timer_reset(), and isc_timertype_inactive. also removed a
coccinelle patch related to isc_timer_reset() that was no longer needed.
move all dns_sdb code into bin/named/builtin.c, which is the
only place from which it's called.
(note this is temporary: later we'll refactor builtin so that it's a
standalone dns_db implementation on its own instead of using SDB
as a wrapper.)
move database attach/detach functions to db.c, instead of
requiring them to be implemented for every database type.
instead, they must implement a 'destroy' function that is
called when references go to zero.
this enables us to use ISC_REFCOUNT_IMPL for databases,
with detailed tracing enabled by setting DNS_DB_TRACE to 1.
SDB is currently (and foreseeably) only used by the named
builtin databases, so it only needs as much of its API as
those databases use.
- removed three flags defined for the SDB API that were always
set the same by builtin databases.
- there were two different types of lookup functions defined for
SDB, using slightly different function signatures. since backward
compatibility is no longer a concern, we can eliminate the 'lookup'
entry point and rename 'lookup2' to 'lookup'.
- removed the 'allnodes' entry point and all database iterator
implementation code
- removed dns_sdb_putnamedrr() and dns_sdb_putnamedrdata() since
they were never used.
some dns_db functions would have crashed if the DB implementation failed
to implement them, requiring the implementations to add functions that
did nothing but return ISC_R_NOTIMPLEMENTED or some obvious default
value. we can just have the dns_db wrapper functions themselves return
those values, and clean up the implementations accordingly.
make the private isc__rdatalist_* functions public dns_rdatalist
functions so that all the rdatalist primitives can be used by
callers to libdns. (this will be needed later for moving SDB and
SDLZ out of libdns.)
as every validator function is loop-synchronized, it should no longer be
necessary to use a validator lock.
calling dns_validator_send(), dns_validator_cancel() or
dns_validator_destroy() from a thread other than the one on which the
validator is running will now cause an assertion failure; this should be
fine since the validator and resolver are tightly coupled, and the fetch
contexts and validators run in the same loops.
refactor validator so that the validation status object (previously
called dns_valstatus_t, which was derived from dns_validatorevent_t), is
now part of the dns_validator object. when calling validator callbacks,
the validator itself is now sent as the argument.
(note: this necessitates caution in the callback functions that are
internal to validator.c validators spawn other validators, and it can be
confusing at times whether we need to be looking at val, val->subvalidator,
or val->parent.)
as there is no further use of isc_task in BIND, this commit removes
it, along with isc_taskmgr, isc_event, and all other related types.
functions that accepted taskmgr as a parameter have been cleaned up.
as a result of this change, some functions can no longer fail, so
they've been changed to type void, and their callers have been
updated accordingly.
the tasks table has been removed from the statistics channel and
the stats version has been updated. dns_dyndbctx has been changed
to reference the loopmgr instead of taskmgr, and DNS_DYNDB_VERSION
has been udpated as well.
change functions using isc_taskmgr_beginexclusive() to use
isc_loopmgr_pause() instead.
also, removed an unnecessary use of exclusive mode in
named_server_tcptimeouts().
most functions that were implemented as task events because they needed
to be running in a task to use exclusive mode have now been changed
into loop callbacks instead. (the exception is catz, which is being
changed in a separate commit because it's a particularly complex change.)
callback events from dns_resolver_createfetch() are now posted
using isc_async_run.
other modules which called the resolver and maintained task/taskmgr
objects for this purpose have been cleaned up.
The callbacks from dns_abd_createfind() are now posted using
isc_async_run() instead of isc_task_send(). ADB event types
have been replaced with a new dns_adbstatus_t type which is
included as find->status.
(The ADB still uses a task for dns_resolver_createfetch().)
dns_request_create() and _createraw() now take a 'loop' parameter
and run the callback event on the specified loop.
as the task manager is no longer used, it has been removed from
the dns_requestmgr structure. the dns_resolver_taskmgr() function
is also no longer used and has been removed.
- removed documentation of -S option from named man page
- removed documentation of reserved-sockets from ARM
- simplified documentation of dnssec-secure-to-insecure - it
now just says it's obsolete rather than describing what it
doesn't do anymore
- marked three formerly obsolete options as ancient:
parent-registration-delay, reserved-sockets, and
suppress-initial-notify
isc_bind9 was a global bool used to indicate whether the library
was being used internally by BIND or by an external caller. external
use is no longer supported, but the variable was retained for use
by dyndb, which needed it only when being built without libtool.
building without libtool is *also* no longer supported, so the variable
can go away.
Instead of using an extra rarely-used paramater to dns_clientinfo_init()
to set ECS information for a client, this commit adds a function
dns_clientinfo_setecs() which can be called only when ECS is needed.
* rbt node chains were sized to allow for bitstring labels, so they
had 256 levels; but in the absence of bistrings, 128 is enough.
* dns_byaddr_createptrname() had a redundant options argument,
and a very outdated doc comment.
* A number of comments referred to bitstring labels in a way that is
no longer helpful. (A few informative comments remain.)
Previously, the zone loading and dumping was effectively serialized by
the dns_io_t mechanism. In theory, more IO operations could be run in
parallel, but the zone manager .iolimit was set to 1 and never increased
as dns_zonemgr_setiolimit() was never ever called.
As the dns_master asynchronous load and dump was already offloaded to
non-worker threads with isc_work mechanism, drop the whole dns_io_t
and just rely on the isc_work to do the load and dump scheduling.
Instead of dumping the signed zone contents node by node during the
signing, dump the entire zone at the end. This was already done for the
raw zone format, but it shows that the IO is better utilized when the
zone dump is done in one single write rather than in small chunks.
A side effect of dumping node by node was that all names were printed
relative to the zone origin rather than being grouped under different
$ORIGINs as would normally be the case when dumping a zone. Also, state
was not maintained from one node to the next regarding whether the CLASS
has already been printed, so it was always included with the first
record of each node.
Since dnssec-signzone uses the dns_master_style_explicittl text format
style, and is the only application that does so, we can revise that
style and add a new DNS_STYLEFLAG_CLASS_PERNAME flag to get the output
back to what it was before this change.
Add support for loading and validating the 'tls' parameter from
the forwarders' configuration.
This prepares ground for adding support to forward queries to
DoT-enabled upstream servers.
DSCP has not been fully working since the network manager was
introduced in 9.16, and has been completely broken since 9.18.
This seems to have caused very few difficulties for anyone,
so we have now marked it as obsolete and removed the
implementation.
To ensure that old config files don't fail, the code to parse
dscp key-value pairs is still present, but a warning is logged
that the feature is obsolete and should not be used. Nothing is
done with configured values, and there is no longer any
range checking.
With 'stale-answer-enable yes;' and 'stale-answer-client-timeout off;',
consider the following situation:
A CNAME record and its target record are in the cache, then the CNAME
record expires, but the target record is still valid.
When a new query for the CNAME record arrives, and the query fails,
the stale record is used, and then the query "restarts" to follow
the CNAME target. The problem is that the query's multiple stale
options (like DNS_DBFIND_STALEOK) are not reset, so 'query_lookup()'
treats the restarted query as a lookup following a failed lookup,
and returns a SERVFAIL answer when there is no stale data found in the
cache, even if there is valid non-stale data there available.
With this change, query_lookup() now considers non-stale data in the
cache in the first place, and returns it if it is available.
Remove parsing the configuration options 'alt-transfer-source',
'alt-transfer-source-v6', and 'use-alt-transfer-source', and remove
the corresponding code that implements the feature.
The dns_remote_t structure is intended to replace the variables in
the structure that deals with remote server communication to primaries,
parental agents, forwarders, etc.
The dispatches are not thread-bound, and used freely between various
threads (see the dns_resolver and dns_request units for details).
This refactoring make sure that all non-const dns_dispatch_t and
dns_dispentry_t members are accessed under a lock, and both object now
track their internal state (NONE, CONNECTING, CONNECTED, CANCELED)
instead of guessing the state from the state of various struct members.
During the refactoring, the artificial limit DNS_DISPATCH_SOCKSQUOTA on
UDP sockets per dispatch was removed as the limiting needs to happen and
happens on in dns_resolver and limiting the number of UDP sockets
artificially in dispatch could lead to unpredictable behaviour in case
one dispatch has the limit exhausted by others are idle.
The TCP artificial limit of DNS_DISPATCH_MAXREQUESTS makes even less
sense as the TCP connections are only reused in the dns_request API
that's not a heavy user of the outgoing connections.
As a side note, the fact that UDP and TCP dispatch pretends to be same
thing, but in fact the connected UDP is handled from dns_dispentry_t and
dns_dispatch_t acts as a broker, but connected TCP is handled from
dns_dispatch_t and dns_dispatchmgr_t acts as a broker doesn't really
help the clarity of this unit.
This refactoring kept to API almost same - only dns_dispatch_cancel()
and dns_dispatch_done() were merged into dns_dispatch_done() as we need
to cancel active netmgr handles in any case to not leave dangling
connections around. The functions handling UDP and TCP have been mostly
split to their matching counterparts and the dns_dispatch_<function>
functions are now thing wrappers that call <udp|tcp>_dispatch_<function>
based on the socket type.
More debugging-level logging was added to the unit to accomodate for
this fact.
The dns_adb_getcookie() doesn't use the 'adb' parameter, remove it.
Refactor the dns_adb_getcookie() function to just return the size of
the cookie when the caller passes 'NULL' as the 'cookie' argument.
'DNS_DB_STALEOK' returns stale rdatasets as well as current rdatasets.
'DNS_DB_EXPIREDOK' returns expired rdatasets as well as current
rdatasets. This option is currently only set when DNS_DB_STALEOK is
also set.
dns_db_updatenotify_unregister needed to be called earlier to ensure
that listener->onupdate_arg always points to a valid object. The
existing lazy cleanup in rbtdb_free did not ensure that.
This is second in the series of fixing the usage of hashtables in the
dns_adb and the dns_resolver units.
Currently, the fetch buckets (used to hold the fetch context) and zone
buckets (used to hold per-domain counters) would never get cleaned from
the memory. Combined with the fact that the hashtable now grows as
needed (instead of using hashtable as buckets), the memory usage in the
resolver can just grow and it never drops down.
In this commit, the usage of hashtables (hashmaps) has been completely
rewritten, so there are no "buckets" and all the matching conditions are
directly mapped into the hashtable key:
1. For per-domain counter hashtable, this is simple as the lowercase
domain name is used directly as a counter.
2. For fetch context hashtable, this requires copying some extra flags
back and forth in the key.
As we don't hold the "buckets" forever, the cleaning mechanism has been
rewritten as well:
1. For per-domain counter hashtable, this is again much simpler, as we
only need to check whether the usage counter is still zero under the
lock and bail-out on cleaning if the counter is in use.
2. For fetch context hashtable, this is more complicated as the fetch
context cannot be reused after it has been finished. The algorithm
is different, the fetch context is always removed from the
hashtable, but if we find the fetch context that has been marked
as finished in the lookup function, we help with the cleaning from
the hashtable and try again.
Couple of additional changes have been implemented in this refactoring
as those were needed for correct functionality and could not be split
into individual commits (or would not make sense as seperate commits):
1. The dns_resolver_createfetch() has an option to create "unshared"
fetch. The "unshared" fetch will never get matched, so there's
little point in storing the "unshared" fetch in the hashtable.
Therefore the "unshared" fetches are now detached from the
hashtable and live just on their own.
2. Replace the custom reference counting with ISC_REFCOUNT_DECL/IMPL
macros for better tracing.
3. fctx_done_detach() is idempotent, it makes the "final" detach (the
one matching the create function) only once. But that also means
that it has to be called before the detach that kept the fetch
context alive in the callback. A new macro fctx_done_unref() has
been added to allow this code flow:
fctx_done_unref(fctx, result);
fctx_detach(&fctx);
Doing this the other way around could cause fctx to get destroyed in
the fctx_unref() first and fctx_done_detach() would cause UAF.
4. The resume_qmin() and resume_dslookup() callbacks have been
refactored for more readability and simpler code paths. The
validated() callback has also received some of the simplifications,
but it should be refactored in the future as it is bit of spaghetti
now.
The mechanism for associating a worker task to a database now
uses loops rather than tasks.
For this reason, the parameters to dns_cache_create() have been
updated to take a loop manager rather than a task manager.
After change 5995, zone transfers were using a small
compression context that only had space for the first
few dozen names in each message. They now use a large
compression context with enough space for every name.
The dns_adb unit has been refactored to be much simpler. Following
changes have been made:
1. Simplify the ADB to always allow GLUE and hints
There were only two places where dns_adb_createfind() was used - in
the dns_resolver unit where hints and GLUE addresses were ok, and in
the dns_zone where dns_adb_createfind() would be called without
DNS_ADBFIND_HINTOK and DNS_ADBFIND_GLUEOK set.
Simplify the logic by allowing hint and GLUE addresses when looking
up the nameserver addresses to notify. The difference is negligible
and would cause a difference in the notified addresses only when
there's mismatch between the parent and child addresses and we
haven't cached the child addresses yet.
2. Drop the namebuckets and entrybuckets
Formerly, the namebuckets and entrybuckets were used to reduced the
lock contention when accessing the double-linked lists stored in each
bucket. In the previous refactoring, the custom hashtable for the
buckets has been replaced with isc_ht/isc_hashmap, so only a single
item (mostly, see below) would end up in each bucket.
Removing the entrybuckets has been straightforward, the only matching
was done on the isc_sockaddr_t member of the dns_adbentry.
Removing the zonebuckets required GLUEOK and HINTOK bits to be
removed because the find could match entries with-or-without the bits
set, and creating a custom key that stores the
DNS_ADBFIND_STARTATZONE in the first byte of the key, so we can do a
straightforward lookup into the hashtable without traversing a list
that contains items with different flags.
3. Remove unassociated entries from ADB database
Previously, the adbentries could live in the ADB database even after
unlinking them from dns_adbnames. Such entries would show up as
"Unassociated entries" in the ADB dump. The benefit of keeping such
entries is little - the chance that we link such entry to a adbname
is small, and it's simpler to evict unlinked entries from the ADB
cache (and the hashtable) than create second LRU cleaning mechanism.
Unlinked ADB entries are now directly deleted from the hash
table (hashmap) upon destruction.
4. Cleanup expired entries from the hash table
When buckets were still in place, the code would keep the buckets
always allocated and never shrink the hash table (hashmap). With
proper reference counting in place, we can delete the adbnames from
the hash table and the LRU list.
5. Stop purging the names early when we hit the time limit
Because the LRU list is now time ordered, we can stop purging the
names when we find a first entry that doesn't fullfil our time-based
eviction criteria because no further entry on the LRU list will meet
the criteria.
Future work:
1. Lock contention
In this commit, the focus was on correctness of the data structure,
but in the future, the lock contention in the ADB database needs to
be addressed. Currently, we use simple mutex to lock the hash
tables, because we almost always need to use a write lock for
properly purging the hashtables. The ADB database needs to be
sharded (similar to the effect that buckets had in the past). Each
shard would contain own hashmap and own LRU list.
2. Time-based purging
The ADB names and entries stay intact when there are no lookups.
When we add separate shards, a timer needs to be added for time-based
cleaning in case there's no traffic hashing to the inactive shard.
3. Revisit the 30 minutes limit
The ADB cache is capped at 30 minutes. This needs to be revisited,
and at least the limit should be configurable (in both directions).
The dns_adb would serialize all fetches on a single task. Create a
per-thread task, so the fetches will stay local to the thread that
initiated the fetch.
The dns_rpz_zones structure was using .refs and .irefs for strong and
weak reference counting. Rewrite the unit to use just a single
reference counting + shutdown sequence (dns_rpz_destroy_rpzs) that must
be called by the creator of the dns_rpz_zones_t object. Remove the
reference counting from the dns_rpz_zone structure as it is not needed
because the zone objects are fully embedded into the dns_rpz_zones
structure and dns_rpz_zones_t object must never be destroyed before all
dns_rpz_zone_t objects.
The dns_rps_zones_t reference counting uses the new ISC_REFCOUNT_TRACE
capability - enable by defining DNS_RPZ_TRACE in the dns/rpz.h header.
Additionally, add magic numbers to the dns_rpz_zone and dns_rpz_zones
structures.
The dns_cache API contained a cache cleaning mechanism that would be
disabled for 'rbt' based cache. As named doesn't have any other cache
implementations, remove the cache cleaning mechanism from dns_cache API.
The aim is to do less work per byte:
* Check the bounds for each label, instead of checking the
bounds for each character.
* Instead of copying one character at a time from the wire to
the name, copy entire runs of sequential labels using memmove()
to make the most of its fast loop.
* To remember where the name ends, we only need to set the end
marker when we see a compression pointer or when we reach the
root label. There is no need to check if we jumped back and
conditionally update the counter for every character.
* To parse a compression pointer, we no longer take a diversion
around the outer loop in between reading the upper byte of the
pointer and the lower byte.
* The parser state machine is now implicit in the instruction
pointer, instead of being an explicit variable. Similarly,
when we reach the root label we break directly out of the loop
instead of setting a second state machine variable.
* DNS_NAME_DOWNCASE is never used with dns_name_fromwire() so
that option is no longer supported.
I have removed this comment which dated from January 1999 when
dns_name_fromwire() was first introduced:
/*
* Note: The following code is not optimized for speed, but
* rather for correctness. Speed will be addressed in the future.
*/
No functional change, apart from removing support for the unused
DNS_NAME_DOWNCASE option. The new code is about 2x faster than the
old code: best case 11x faster, worst case 1.4x faster.
There were a number of places where the zone table should have been
locked, but wasn't, when dns_zt_apply was called.
Added a isc_rwlocktype_t type parameter to dns_zt_apply and adjusted
all calls to using it. Removed locks in callers.
Instead of doing incremental zone loading with fixed quantum - 100
loaded lines per event, move the zone loading process to the offloaded
libuv threads using isc_work_enqueue() API.
This has the advantage that the thread scheduling is given back to the
operating system that understands blocking operations, and the zone
loading operation doesn't block the networking threads directly.
Incremental file loads now use loopmgr events instead of task events.
The dns_master_loadstreaminc(), _loadbufferinc(), _loadlexer() and
_loadlexerinc() functions were not used in BIND, and have been removed.
dns_rdata_checksvcb performs data entry checks on SVCB records.
In particular that _dns SVBC record have an 'alpn' and if that 'alpn'
parameter indicates HTTP is in use that 'dophath' is present.
All we need for compression is a very small hash set of compression
offsets, because most of the information we need (the previously added
names) can be found in the message using the compression offsets.
This change combines dns_compress_find() and dns_compress_add() into
one function dns_compress_name() that both finds any existing suffix,
and adds any new prefix to the table. The old split led to performance
problems caused by duplicate names in the compression context.
Compression contexts are now either small or large, which the caller
chooses depending on the expected size of the message. There is no
dynamic resizing.
There is a behaviour change: compression now acts on all the labels in
each name, instead of just the last few.
A small benchmark suggests this is about 2x faster.
sizeof(dns_name_t) did not change but the boolean attributes are now
separated as one-bit structure members. This allows debuggers to
pretty-print dns_name_t attributes without any special hacks, plus we
got rid of manual bit manipulation code.
Originally RBT node stored three lowest bits from dns_name_t attributes.
This had a curious side-effect noticed by Tony Finch:
If you create an rbt node from a DYNAMIC name then the flag will be
propagated through dns_rbt_namefromnode() ... if you subsequently call
dns_name_free() it will try to isc_mem_put() a piece of an rbt node ...
but dns_name_free() REQUIRE()s that the name is dynamic so in the usual
case where rbt nodes are created from non-dynamic names, this kind of
code will fail an assertion.
This is a bug it dates back to june 1999 when NAMEATTR_DYNAMIC was
invented.
Apparently it does not happen often :-)
I'm planning to get rid of DNS_NAMEATTR_ definitions and bit operations,
so removal of this "three-bit-subset" assignment is a first step.
We can keep only the ABSOLUTE flag in RBT node and nothing else because
names attached to rbt nodes are always readonly: The internal node_name()
function always sets the NAMEATTR_READONLY when making a dns_name that
refers to the node's name, so the READONLY flag will be set in the name
returned by dns_rbt_namefromnode().
Co-authored-by: Tony Finch <fanf@isc.org>
Getting the recorded value of 'edns-udp-size' from the resolver requires
strong attach to the dns_view because we are accessing `view->resolver`.
This is not the case in places (f.e. dns_zone unit) where `.udpsize` is
accessed. By moving the .udpsize field from `struct dns_resolver` to
`struct dns_view`, we can access the value directly even with weakly
attached dns_view without the need to lock the view because `.udpsize`
can be accessed after the dns_view object has been shut down.
The dns_view implements weak and strong reference counting. When strong
reference counting reaches zero, the adb, ntatable and resolver objects
are shut down and detached.
In dns_zone and dns_nta the dns_view was weakly attached, but the
view->resolver reference was accessed directly leading to dereferencing
the NULL pointer.
Add dns_view_getresolver() method which attaches to view->resolver
object under the lock (if it still exists) ensuring the dns_resolver
will be kept referenced until not needed.
The HMACs and GSSAPI are just using unallocated values.
Moving them around shouldn't cause issues.
Only the dnssec system test knew the internal number in use for hmacmd5.
When fuzzing it is useful for all signing operations to happen
at a specific time for reproducability. Add two variables to
the message structure (fuzzing and fuzztime) to specify if a
fixed time should be used and the value of that time.
Instead of always creating the trust anchor timer (dns_nta_t) on the
main loop, create the timer on the current loop and associate each
dns_nta_t object to the loop it was created on. This simplifies the
timer handling as everything is run on the associated loop.
During the change, the dns_nta_t structure was renamed to dns__nta_t
and changed to be fully internal to the nta.c compilation unit, and the
dns_ntatable_t structure was made opaque. This required no change to
code using the API as dns_nta_t never had any external users and the
dns_ntatable_t was properly accessed only by using function calls.
Instead of creating the response policy zone deferred update timer when
creating the response policy zone object, create it on demand on the
current loop and destroy it as soon as the timer has finished its job.
There's a side-effect - the processing of the response policy zone
update is now done on the current loop - previously, it was always on
the main loop.
Instead of creating the catalog zone deferred update timer when creating
the catalog zone object, create it on demand on the current loop and
destroy it as soon as the timer has finished its job. There's a
side-effect - the processing of the catalog zone update is now done on
the current loop - previously, it was always on the main loop.
This change prepares ground for sending DNS requests using DoT,
which, in particular, will be used for forwarding dynamic updates
to TLS-enabled primaries.
In order to make xfrin.c:get_create_tlsctx() reusable, move the function
into transport.c, and make changes into its prototype to not use the
'dns_xfrin_ctx_t' type, thus making it more universal.
This change prepares ground for adding transport support into the
dispatch manager.
Also, move the typedefs for 'dns_transport_t' and 'dns_transport_list_t'
from transport.h into types.h.
If there was a collision of key id across algorithms it was not
possible to determine where counter applies to which algorithm for
xml statistics while for json only one of the values was emitted.
The key names are now "<algorithm-number>+<id>" (e.g. "8+54274").
dns_request_create() was a front-end to dns_request_createvia() that
was only used by test binaries. dns_request_createvia() has been
renamed to dns_request_create(), and the test programs that formerly
used dns_request_create() have been updated to use the new parameters.
It is possible to bypass Response Rate Limiting (RRL)
`responses-per-second` limitation using specially crafted wildcard
names, because the current implementation, when encountering a found
DNS name generated from a wildcard record, just strips the leftmost
label of the name before making a key for the bucket.
While that technique helps with limiting random requests like
<random>.example.com (because all those requests will be accounted
as belonging to a bucket constructed from "example.com" name), it does
not help with random names like subdomain.<random>.example.com.
The best solution would have been to strip not just the leftmost
label, but as many labels as necessary until reaching the suffix part
of the wildcard record from which the found name is generated, however,
we do not have that information readily available in the context of RRL
processing code.
Fix the issue by interpreting all valid wildcard domain names as
the zone's origin name concatenated to the "*" name, so they all will
be put into the same bucket.
Implement the configuration option with its checking and parsing parts.
The option should be later used by BIND to set an extended error
code (EDE) for the queries modified in the result of RPZ processing.
Previously:
* applications were using isc_app as the base unit for running the
application and signal handling.
* networking was handled in the netmgr layer, which would start a
number of threads, each with a uv_loop event loop.
* task/event handling was done in the isc_task unit, which used
netmgr event loops to run the isc_event calls.
In this refactoring:
* the network manager now uses isc_loop instead of maintaining its
own worker threads and event loops.
* the taskmgr that manages isc_task instances now also uses isc_loopmgr,
and every isc_task runs on a specific isc_loop bound to the specific
thread.
* applications have been updated as necessary to use the new API.
* new ISC_LOOP_TEST macros have been added to enable unit tests to
run isc_loop event loops. unit tests have been updated to use this
where needed.
* isc_timer was rewritten using the uv_timer, and isc_timermgr_t was
completely removed; isc_timer objects are now directly created on the
isc_loop event loops.
* the isc_timer API has been simplified. the "inactive" timer type has
been removed; timers are now stopped by calling isc_timer_stop()
instead of resetting to inactive.
* isc_manager now creates a loop manager rather than a timer manager.
* modules and applications using isc_timer have been updated to use the
new API.
When doing a dnssec-policy reconfiguration from a zone with NSEC only
keys to a zone that uses NSEC3, figure out to wait with building the
NSEC3 chain.
Previously, BIND 9 would attempt to sign such a zone, but failed to
do so because the NSEC3 chain conflicted with existing DNSKEY records
in the zone that were not compatible with NSEC3.
There exists logic for detecting such a case in the functions
dnskey_sane() (in lib/dns/zone.c) and check_dnssec() (in
lib/ns/update.c). Both functions look very similar so refactor them
to use the same code and call the new function (called
dns_zone_check_dnskey_nsec3()).
Also update the dns_nsec_nseconly() function to take an additional
parameter 'diff' that, if provided, will be checked whether an
offending NSEC only DNSKEY will be deleted from the zone. If so,
this key will not be considered when checking the zone for NSEC only
DNSKEYs. This is needed to allow a transition from an NSEC zone with
NSEC only DNSKEYs to an NSEC3 zone.
Clean up dns_rdatalist_tordataset() and dns_rdatalist_fromrdataset()
functions by making them return void, because they cannot fail.
Clean up other functions that subsequently cannot fail.
"rndc fetchlimit" now also prints a list of domain names that are
currently rate-limited by "fetches-per-zone".
The "fetchlimit" system test has been updated to use this feature
to check that domain limits are applied correctly.
this command runs dns_adb_dumpquota() to display all servers
in the ADB that are being actively fetchlimited by the
fetches-per-server controls (i.e, servers with a nonzero average
timeout ratio or with the quota having been reduced from the
default value).
the "fetchlimit" system test has been updated to use the
new command to check quota values instead of "rndc dumpdb".
We do this by adding callbacks for when a node is added or deleted
from the keytable. dns_keytable_add and dns_keytable_delete where
extended to take a callback. dns_keytable_deletekey does not remove
the node so it was not extended.
When namespace is grafted on, the DNSSEC proofs for non existance
need to come from that namespace and not a higher namespace. We
add 3 function dns_view_sfd_add, dns_view_sfd_del and dns_view_sfd_find
to add, remove and find the namespace that should be used when
checking NSEC records.
dns_view_sfd_add adds a name to a tree, creating the tree if needed.
If the name already existed in the tree the reference count is
increased otherwise it is initalised to 1.
dns_view_sfd_del removes a reference to a name in the tree, if the
count goes to 0 the node is removed.
dns_view_sfd_find returns the namespace to be used to entered name.
If there isn't an enclosing name in the tree, or the tree does not
yet exist, the root name is returned.
Access to the tree is controlled by a read/write lock.
The "glue-cache" option was marked as deprecated by commit
5ae33351f2 (first released in BIND 9.17.6,
back in October 2020), so now obsolete that option, removing all code
and documentation related to it.
Note: this causes the glue cache feature to be permanently enabled, not
disabled.
Update the defaultconf with the built-in policies. These will now be
printed with "named -C".
Change the defines in kasp.h to be strings, so they can be concatenated
in the defaultconf. This means when creating a kasp structure, we no
longer initialize the defaults (this is fine because only kaspconf.c
uses dns_kasp_create() and it inherits from the default policy).
In kaspconf.c, the default values now need to be parsed from string.
Introduce some variables so we don't need to do get_duration multiple
times on the same configuration option.
Finally, clang-format-14 decided to do some random formatting changes.
The conversion of `DNS_R_PARTIALMATCH` into `DNS_R_NOTFOUND` is done
in the `dns_rbt_deletename()` function so there is no need to do that
in `dns_fwdtable_delete()`.
Add a possible return value of `ISC_R_NOSPACE` into the header file's
function description comment.
When processing a catalog zone member zone make sure that there is no
configured pre-existing forward zone with that name.
Refactor the `dns_fwdtable_find()` function to not alter the
`DNS_R_PARTIALMATCH` result (coming from `dns_rbt_findname()`) into
`DNS_R_SUCCESS`, so that now the caller can differentiate partial
and exact matches. Patch the calling sites to expect and process
the new return value.
It is simply called "compression" now, without any qualifiers. Also,
improve some variable names in dns_name_towire2() so they are not two
letter abbreviations for global something.
It's wasteful to use 20 bytes and a pointer indirection to represent
two bits of information, so turn the struct into an enum. And change
the names of the enumeration constants to make the intent more clear.
This change introduces some inline functions into another header,
which confuses `gcovr` when it is trying to collect code coverage
statistics. So, in the CI job, copy more header files into a directory
where `gcovr` looks for them.
The aim is to get rid of the obsolete term "GLOBAL14" and instead just
refer to DNS name compression.
This is mostly mechanically renaming
from dns_(de)compress_(get|set)methods()
to dns_(de)compress_(get|set)permitted()
and replacing the related enum by a simple flag, because compression
is either on or off.
There was a proposal in the late 1990s that it might, but it turned
out to be unworkable. See RFC 6891, Extension Mechanisms for
DNS (EDNS(0)), section 5, Extended Label Types.
The remnants of the code that supported this in BIND are redundant.
Since the fctx hash table is now self-resizing, and resolver tasks are
selected to match the thread that created the fetch context, there
shouldn't be any significant advantage to having multiple tasks per CPU;
a single task per thread should be sufficient.
Additionally, the fetch context is always pinned to the calling netmgr
thread to minimize the contention just to coalesced fetches - if two
threads starts the same fetch, it will be pinned to the first one to get
the bucket.
The dns_message_gettempname(), dns_message_gettemprdata(),
dns_message_gettemprdataset(), and dns_message_gettemprdatalist() always
succeeds because the memory allocation cannot fail now. Change the API
to return void and cleanup all the use of aforementioned functions.
the request manager has no direct dependency on the
view, so there's no need for a weak reference. remove the
dns_requestmgr_whenshutdown() mechanism since it is no longer
used.
weakly attaching and detaching when creating and destroying the
resolver obviates the need to have a callback event to do the weak
detach. remove the dns_resolver_whenshutdown() mechanism, as it is
now unused.
weakly attaching and detaching the view when creating or destroying
the ADB obviates the need for a whenshutdown callback event to do
the detaching. remove the dns_adb_whenshutdown() mechanism, since
it is no longer needed.
for better object separation, ADB and resolver statistics counters
are now stored in the ADB and resolver objects themsevles, rather than
in the associated view.
it's not necessary to use view attributes to determine whether
the ADB, resolver and requestmgr need to be shut down; checking
whether they're NULL is sufficient.
- eliminate dns_view_flushanddetach(), which was only called from
one place; instead, we now call a function dns_view_flushonshutdown()
which sets the view up to flush zones when it is detached normally
with dns_view_detach().
- cleaned up code in dns_view_create().
Add a new parameter to the dst_key structure, mark a key modified if
dst_key_(un)set[bool,num,state,time] is called. Only write out key
files during a keymgr run if the metadata has changed.
The isc_task_onshutdown() was used to post event that should be run when
the task is being shutdown. This could happen explicitly in the
isc_test_shutdown() call or implicitly when we detach the last reference
to the task and there are no more events posted on the task.
This whole task onshutdown mechanism just makes things more complicated,
and it's easier to post the "shutdown" events when we are shutting down
explicitly and the existing code already always knows when it should
shutdown the task that's being used to execute the onshutdown events.
Replace the isc_task_onshutdown() calls with explicit calls to execute
the shutdown tasks.
The rbtnode->rpz flag was left behind when rbt and rpz were disentangled
by CHANGES #4576. Removing it makes the comment above correct again.
This reduces the flags so they fit in a 32 bit word again. On 64
bit systems there is still padding so it doesn't change the size
of an rbtnode. On 32 bit systems it reduces an rbtnode by 4 bytes.
dns_message_findname and dns_message_sectiontotext incorrectly accepted
DNS_SECTION_ANY. If DNS_SECTION_ANY was passed the section array could
be incorrectly accessed at (-1).
dns_message_pseudosectiontotext and dns_message_pseudosectiontoyaml
incorrectly accepted DNS_PSEUDOSECTION_ANY. These functions are
designed to process a single section.
Catalog zones change of ownership is special mechanism to facilitate
controlled migration of a member zone from one catalog to another.
It is implemented using catalog zones property named "coo" and is
documented in DNS catalog zones draft version 5 document.
Implement the feature using a new hash table in the catalog zone
structure, which holds the added "coo" properties for the catalog zone
(containing the target catalog zone's name), and the key for the hash
table being the member zone's name for which the "coo" property is being
created.
Change some log messages to have consistent zone name quoting types.
Update the ARM with change of ownership documentation and usage
examples.
Add tests which check newly the added features.
Update the function that synchronizes the CDS and CDNSKEY DELETE
records. It now allows for the possibility that the CDS DELETE record
is published and the CDNSKEY DELETE record is not, and vice versa.
Also update the code in zone.c how 'dns_dnssec_syncdelete()' is called.
With KASP, we still maintain the DELETE records our self. Otherwise,
we publish the CDS and CDNSKEY DELETE record only if they are added
to the zone. We do still check if these records can be signed by a KSK.
This change will allow users to add a CDS and/or CDNSKEY DELETE record
manually, without BIND removing them on the next zone sign.
Note that this commit removes the check whether the key is a KSK, this
check is redundant because this check is also made in
'dst_key_is_signing()' when the role is set to DST_BOOL_KSK.
Previously, the RPZ updates ran quantized on the main nm_worker loops.
As the quantum was set to 1024, this might lead to service
interruptions when large RPZ update was processed.
Change the RPZ update process to run as the offloaded work. The update
and cleanup loops were refactored to do as little locking of the
maintenance lock as possible for the shortest periods of time and the db
iterator is being paused for every iteration, so we don't hold the rbtdb
tree lock for prolonged periods of time.
Previously dns_rpz_add() were passed dns_rpz_zones_t and index to .zones
array. Because we actually attach to dns_rpz_zone_t, we should be using
the local pointer instead of passing the index and "finding" the
dns_rpz_zone_t again.
Additionally, dns_rpz_add() and dns_rpz_delete() were used only inside
rpz.c, so make them static.
Do a general cleanup of lib/dns/rpz.c style:
* Removed deprecated and unused functions
* Unified dns_rpz_zone_t naming to rpz
* Unified dns_rpz_zones_t naming to rpzs
* Add and use rpz_attach() and rpz_attach_rpzs() functions
* Shuffled variables to be more local (cppcheck cleanup)
In order to modify the .localhost and .localnets members of the
dns_aclenv, all other processing on the netmgr loops needed to be
stopped using the task exclusive mode. Add the isc_rwlock to the
dns_aclenv, so any modifications to the .localhost and .localnets can be
done under the write lock.
Instead of passing the number of worker to the dns_zonemgr manually,
get the number of nm threads using the new isc_nm_getnworkers() call.
Additionally, remove the isc_pool API and manage the array of memory
context, zonetasks and loadtasks directly in the zonemgr.
After switching to per-thread resources in the zonemgr, the performance
was decreased because the memory context, zonetask and loadtask was
picked from the pool at random.
Pin the zone to single threadid (.tid) and align the memory context,
zonetask and loadtask to be the same, this sets the hard affinity of the
zone to the netmgr thread.
Previously, the zonemgr created 1 task per 100 zones and 1 memory
context per 1000 zones (with minimum 10 tasks and 2 memory contexts) to
reduce the contention between threads.
Instead of reducing the contention by having many resources, create a
per-nm_thread memory context, loadtask and zonetask and spread the zones
between just per-thread resources.
Note: this commit alone does decrease performance when loading the zone
by couple seconds (in case of 1M zone) and thus there's more work in
this whole MR fixing the performance.
The ADB previously used separate reference counters for internal
and external references, plus additional counters for ABD find
and namehook objects, and used all these counters to coordinate
its shutdown process, which was a multi-stage affair involving
a sequence of control events.
It also used a complex interlocking set of static functions for
referencing, deferencing, linking, unlinking, and cleaning up various
internal objects; these functions returned boolean values to their
callers to indicate what additional processing was needed.
The changes in the previous two commits destabilized this fragile
system in a way that was difficult to recover from, so in this commit
we refactor all of it. The dns_adb and dns_adbentry objects now use
conventional attach and detach functions for reference counting, and
the shutdown process is much more straightforward. Instead of
handling shutdown asynchronously, we can just destroy the ADB when
references reach zero
In addition, ADB locking has been simplified. Instead of a
single `find_{name,entry}_and_lock()` function which searches for
a name or entry's hash bucket, locks it, and then searches for the
name or entry in the bucket, we now use one function to find the
bucket (leaving it to the caller to do the locking) and another
find the name or entry. Instead of locking the entire ADB when
modifying hash tables, we now use read-write locks around the
specific hash table. The only remaining need for adb->lock
is when modifying the `whenshutdown` list.
Comments throughout the module have been improved.
Replace adb->{names,entries} and related arrays (indexed by hashed
bucket) with a isc_ht hash tables storing the new struct
adb{name,entry}bucket_t that wraps all the variables that were
originally stored in arrays indexed by "bucket" number stored directly
in the struct dns_adb.
Previously, the task exclusive mode has been used to grow the internal
arrays used to store the named and entries objects. The isc_ht hash
tables are now protected by the isc_rwlock instead and thus the usage of
the task exclusive mode has been removed from the dns_adb.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
The use of isc_appctx_t in dns_client was used to wait for
dns_client_startresolve() to finish the processing (the resolve_done()
task callback).
This has been replaced with standard bool+cond+lock combination removing
the need of isc_appctx_t altogether.
Previously, the function(s) in the commit subject could fail for various
reasons - mostly allocation failures, or other functions returning
different return code than ISC_R_SUCCESS. Now, the aforementioned
function(s) cannot ever fail and they would always return ISC_R_SUCCESS.
Change the function(s) to return void and remove the extra checks in
the code that uses them.
mem_maybedup() calls isc_mem_allocate() if an mctx is supplied,
but that can no longer fail, so now the only way mem_maybedup()
could return NULL is if it was given a NULL source address by the
caller. this commit adds a REQUIRE to prevent that scenario, and
cleans up all the calling code that previously checked for NULL
return values.
this function is mostly used in rdata tostruct() implementations, so
the documentation for dns_rdata_tostruct() has been updated to
remove 'ISC_R_NOMEMORY' as a possible return value.
this brings DNS_CLIENTINFO_VERSION into line with the subscription
branch so that fixes applied to clientinfo processing can also be
applied to the main branch without diverging.
This commit converts the license handling to adhere to the REUSE
specification. It specifically:
1. Adds used licnses to LICENSES/ directory
2. Add "isc" template for adding the copyright boilerplate
3. Changes all source files to include copyright and SPDX license
header, this includes all the C sources, documentation, zone files,
configuration files. There are notes in the doc/dev/copyrights file
on how to add correct headers to the new files.
4. Handle the rest that can't be modified via .reuse/dep5 file. The
binary (or otherwise unmodifiable) files could have license places
next to them in <foo>.license file, but this would lead to cluttered
repository and most of the files handled in the .reuse/dep5 file are
system test files.
Commit 308bc46a59 introduced a change to
the view_flushanddetach() function which makes the latter access
view->zonetable without holding view->lock. As confirmed by TSAN, this
enables races between threads for view->zonetable accesses.
Swap the view->zonetable pointer under view lock and then detach the
local swapped dns_zt_t later when the view lock is already unlocked.
This commit also changes the dns_zt interfaces, so the setting the
zonetable "flush" flag is separate operation to dns_zt_detach,
e.g. instead of doing:
if (view->flush) {
dns_zt_flushanddetach(&zt);
} else {
dns_zt_detach(&zt);
}
the code is now:
if (view->flush) {
dns_zt_flush(zt);
}
dns_zt_detach(&zt);
making the code more consistent with how we handle flushing and
detaching dns_zone_t pointers from the view.
This commit enables client-side TLS contexts re-use for zone transfers
over TLS. That, in turn, makes it possible to use the internal session
cache associated with the contexts, allowing the TLS connections to be
established faster and requiring fewer resources by not going through
the full TLS handshake procedure.
Previously that would recreate the context on every connection, making
TLS session resumption impossible.
Also, this change lays down a foundation for Strict TLS (when the
client validates a server certificate), as the TLS context cache can
be extended to store additional data required for validation (like
intermediates CA chain).
A number of DNS implementation produce NSEC records with bad type
maps that don't contain types that exist at the name leading to
NODATA responses being synthesize instead of the records in the
zone. NSEC records with these bad type maps often have the NSEC
NSEC field set to '\000.QNAME'. We look for the first label of
this pattern.
e.g.
example.com NSEC \000.example.com SOA NS NSEC RRSIG
example.com RRRSIG NSEC ...
example.com SOA ...
example.com RRRSIG SOA ...
example.com NS ...
example.com RRRSIG NS ...
example.com A ...
example.com RRRSIG A ...
A is missing from the type map.
This introduces a temporary option 'reject-000-label' to control
this behaviour.
'server <prefix> { broken-nsec yes; };' can now be used to stop
NSEC records from negative responses from servers in the given
prefix being cached and hence available to synth-from-dnssec.
dns_db_nodecount can now be used to get counts from the auxilary
rbt databases. The existing node count is returned by
tree=dns_dbtree_main. The nsec and nsec3 node counts by dns_dbtree_nsec
and dns_dbtree_nsec3 respectively.
This commit adds support for client-side TLS parameters to XoT.
Prior to this commit all client-side TLS contexts were using default
parameters only, ignoring the options from the BIND's configuration
file.
Currently, the following 'tls' parameters are supported:
- protocols;
- ciphers;
- prefer-server-ciphers.
This commit completes the integration of the new, extended ACL syntax
featuring 'port' and 'transport' options.
The runtime presentation and ACL loading code are extended to allow
the syntax to be used beyond the 'allow-transfer' option (e.g. in
'acl' definitions and other 'allow-*' options) and can be used to
ultimately extend the ACL support with transport-only
ACLs (e.g. 'transport-acl tls-acl port 853 transport tls'). But, due
to fundamental nature of such a change, it has not been completed as a
part of 9.17.X release series due to it being close to 9.18 stable
release status. That means that we do not have enough time to fully
test it.
The complete integration is planned as a part of 9.19.X release
series.
The code was manually verified to work as expected by temporarily
enabling the extended syntax for 'acl' statements and 'allow-query'
options, including ACL merging, negated ACLs.
The following scenario triggers a "named" crash:
1. Configure a catalog zone.
2. Start "named".
3. Comment out the "catalog-zone" clause.
4. Run `rndc reconfig`.
5. Uncomment the "catalog-zone" clause.
6. Run `rndc reconfig` again.
Implement the required cleanup of the in-memory catalog zone during
the first `rndc reconfig`, so that the second `rndc reconfig` could
find it in an expected state.
Add a new parameter to 'ns_client_t' to store potential extended DNS
error. Reset when the client request ends, or is put back.
Add defines for all well-known info-codes.
Update the number of DNS_EDNSOPTIONS that we are willing to set.
Create a new function to set the extended error for a client reply.
This commit makes BIND set the "max-age" value of the "Cache-Control"
HTTP header to the minimal TTL from the Answer section for positive
answers, as RFC 8484 advises in section 5.1.
We calculate the minimal TTL as a side effect of rendering the
response DNS message, so it does not change the code flow much, nor
should it have any measurable negative impact on the performance.
For negative answers, the "max-age" value is set using the TTL and
SOA-minimum values from an SOA record in the Authority section.
The new rules compare the target name in PTR and SRV records against
the machine name embedded in the kerberos principal. This can be
used to further restrict what PTR and SRV records can be added or
deleted via dynamic updates if desired.
The librpz.h defined LIRPZ_LIKELY() and LIBRPZ_UNLIKELY() macros that
were actually unused in the code. Remove the macros and the autoconf
check for __builtin_expect().
Unify the header guard style and replace the inconsistent include guards
with #pragma once.
The #pragma once is widely and very well supported in all compilers that
BIND 9 supports, and #pragma once was already in use in several new or
refactored headers.
Using simpler method will also allow us to automate header guard checks
as this is simpler to programatically check.
For reference, here are the reasons for the change taken from
Wikipedia[1]:
> In the C and C++ programming languages, #pragma once is a non-standard
> but widely supported preprocessor directive designed to cause the
> current source file to be included only once in a single compilation.
>
> Thus, #pragma once serves the same purpose as include guards, but with
> several advantages, including: less code, avoidance of name clashes,
> and sometimes improvement in compilation speed. On the other hand,
> #pragma once is not necessarily available in all compilers and its
> implementation is tricky and might not always be reliable.
1. https://en.wikipedia.org/wiki/Pragma_once
With isc_mem_get() and dns_name_dup() no longer being able to fail, some
functions can now only return ISC_R_SUCCESS. Change the return type to
void for the following function(s):
* dns_zone_setprimaries()
* dns_zone_setparentals()
* dns_zone_setparentals()
* dns_zone_setalsonotify()
With isc_mem_get() and dns_name_dup() no longer being able to fail, some
functions can now only return ISC_R_SUCCESS. Change the return type to
void for the following function(s):
* dns_view_adddelegationonly()
* dns_view_excludedelegationonly()
With isc_mem_get() and dns_name_dup() no longer being able to fail, some
functions can now only return ISC_R_SUCCESS. Change the return type to
void for the following function(s):
* dns_ssutable_addrule()
* dns_ssutable_create()
* dns_ssutable_createdlz()
With isc_mem_get() and dns_name_dup() no longer being able to fail, some
functions can now only return ISC_R_SUCCESS. Change the return type to
void for the following function(s):
* dns_resolver_addalternate()
With isc_mem_get() and dns_name_dup() no longer being able to fail, some
functions can now only return ISC_R_SUCCESS. Change the return type to
void for the following function(s):
* dns_catz_options_copy()
* dns_catz_options_setdefault()
* dns_catz_entry_new()
* dns_catz_entry_copy()
Replace some "master/slave" terminology in the code with the preferred
"primary/secondary" keywords. This also changes user output such as
log messages, and fixes a typo ("seconary") in cfg_test.c.
There are still some references to "master" and "slave" for various
reasons:
- The old syntax can still be used as a synonym.
- The master syntax is kept when it refers to master files and formats.
- This commit replaces mainly keywords that are local. If "master" or
"slave" is used in for example a structure that is all over the
place, it is considered out of scope for the moment.
Originally, the hash table used in RBT database would be resized when it
reached certain number of elements (defined by overcommit). This was
causing resolution brownouts for busy resolvers, because the rehashing
could take several seconds to complete. This was mitigated by
pre-allocating the hash table in the RBT database used for caching to be
large-enough as determined by max-cache-size. The downside of this
solution was that the pre-allocated hash table could take a significant
chunk of the memory even when the resolver cache would be otherwise
empty because the default value for max-cache-size is 90% of available
memory.
Implement incremental resizing[1] to perform the rehashing gradually:
1. During the resize, allocate the new hash table, but keep the old
table unchanged.
2. In each lookup or delete operation, check both tables.
3. Perform insertion operations only in the new table.
4. At each insertion also move r elements from the old table to the new
table.
5. When all elements are removed from the old table, deallocate it.
To ensure that the old table is completely copied over before the new
table itself needs to be enlarged, it is necessary to increase the
size of the table by a factor of at least (r + 1)/r during resizing.
In our implementation r is equal to 1.
The downside of this approach is that the old table and the new table
could stay in memory for longer when there are no new insertions into
the hash table for prolonged periods of time as the incremental
rehashing happens only during the insertions.
The upside of this approach is that it's no longer necessary to
pre-allocate large hash table, because the RBT hash table rehashing
doesn't cause resolution brownouts anymore and thus we can use the
memory as needed.
1. https://en.m.wikipedia.org/wiki/Hash_table#Dynamic_resizing
Remove the dynamic registration of result codes. Convert isc_result_t
from unsigned + #defines into 32-bit enum type in grand unified
<isc/result.h> header. Keep the existing values of the result codes
even at the expense of the description and identifier tables being
unnecessary large.
Additionally, add couple of:
switch (result) {
[...]
default:
break;
}
statements where compiler now complains about missing enum values in the
switch statement.
Renamed some functions for clarity and readability:
- dns_dispatch_addresponse() -> dns_dispatch_add()
- dns_dispatch_removeresponse() -> dns_dispatch_done()
The dns_dispatch_cancel() function now calls dns_dispatch_done()
directly, so it is no longer ever necessary to call both functions.
dns_dispatch_cancel() is used to terminate dispatch connections
that are still pending, while dns_dispatch_done() is used when they
are complete.
as libdns is no longer exported, it's not necessary to have
init and shutdown functions. the only purpose they served
was to create a private mctx and run dst_lib_init(), which
can be called directly instead.
- startrecv() and getnext() have been rewritten.
- Don't set TCP flag when connecting a UDP dispatch.
- Prevent TCP connections from trying to connect twice.
- dns_dispatch_gettcp() can now find a matching TCP dispatch that has
not yet fully connected, and attach to it. when the connection is
completed, the connect callbacks are run for all of the pending
entries.
- An atomic 'state' variable is now used for connection state instead of
attributes.
- When dns_dispatch_cancel() is called on a TCP dispatch entry, only
that one entry is canceled. the dispatch itself should not be shut
down until there are no dispatch entries left associated with it.
- Other incidental cleanup, including removing DNS_DISPATCHATTR_IPV4 and
_IPV6 (they were being set in the dispatch attributes but never used),
cleaning up dns_requestmgr_create(), and renaming dns_dispatch_read()
to the more descriptive dns_dispatch_resume().
- It is no longer necessary to pass a 'timeout' callback to
dns_dispatch_addresponse(); timeouts are handled directly by the
'response' callback instead.
- The netmgr handle is no longer passed to dispatch callbacks, since
they don't (and can't) use it. instead, dispatch_cb_t now takes a
result code, region, and argument.
- Cleaned up timeout-related tests in dispatch_test.c
- Responses received by the dispatch are no longer sent to the caller
via a task event, but via a netmgr-style recv callback. the 'action'
parameter to dns_dispatch_addresponse() is now called 'response' and
is called directly from udp_recv() or tcp_recv() when a valid response
has been received.
- All references to isc_task and isc_taskmgr have been removed from
dispatch functions.
- All references to dns_dispatchevent_t have been removed and the type
has been deleted.
- Added a task to the resolver response context, to be used for fctx
events.
- When the caller cancels an operation, the response handler will be
called with ISC_R_CANCELED; it can abort immediately since the caller
will presumably have taken care of cleanup already.
- Cleaned up attach/detach in resquery and request.
Remove the debugging printfs. (leaving this as a separate commit rather
than squashing it so we can restore it in the future if we ever need it
again.)
Since every dispsock was associated with a dispentry anyway (though not
always vice versa), the members of dispsock have been combined into
dispentry, which is now reference-counted. dispentry objects are now
attached before connecting and detached afterward to prevent races
between the connect callback and dns_dispatch_removeresponse().
Dispatch and dispatchmgr objects are now reference counted as well, and
the shutdown process has been simplified. reference counting of
resquery and request objects has also been cleaned up significantly.
dns_dispatch_cancel() now flags a dispentry as having been canceled, so
that if the connect callback runs after cancellation, it will not
initiate a read.
The isblackholed() function has been simplified.
- The `timeout_action` parameter to dns_dispatch_addresponse() been
replaced with a netmgr callback that is called when a dispatch read
times out. this callback may optionally reset the read timer and
resume reading.
- Added a function to convert isc_interval to milliseconds; this is used
to translate fctx->interval into a value that can be passed to
dns_dispatch_addresponse() as the timeout.
- Note that netmgr timeouts are accurate to the millisecond, so code to
check whether a timeout has been reached cannot rely on microsecond
accuracy.
- If serve-stale is configured, then a timeout received by the resolver
may trigger it to return stale data, and then resume waiting for the
read timeout. this is no longer based on a separate stale timer.
- The code for canceling requests in request.c has been altered so that
it can run asynchronously.
- TCP timeout events apply to the dispatch, which may be shared by
multiple queries. since in the event of a timeout we have no query ID
to use to identify the resp we wanted, we now just send the timeout to
the oldest query that was pending.
- There was some additional refactoring in the resolver: combining
fctx_join() and fctx_try_events() into one function to reduce code
duplication, and using fixednames in fetchctx and fetchevent.
- Incidental fix: new_adbaddrinfo() can't return NULL anymore, so the
code can be simplified.
The flow of operations in dispatch is changing and will now be similar
for both UDP and TCP queries:
1) Call dns_dispatch_addresponse() to assign a query ID and register
that we'll be listening for a response with that ID soon. the
parameters for this function include callback functions to inform the
caller when the socket is connected and when the message has been
sent, as well as a task action that will be sent when the response
arrives. (later this could become a netmgr callback, but at this
stage to minimize disruption to the calling code, we continue to use
isc_task for the response event.) on successful completion of this
function, a dispatch entry object will be instantiated.
2) Call dns_dispatch_connect() on the dispatch entry. this runs
isc_nm_udpconnect() or isc_nm_tcpdnsconnect(), as needed, and begins
listening for responses. the caller is informed via a callback
function when the connection is established.
3) Call dns_dispatch_send() on the dispatch entry. this runs
isc_nm_send() to send a request.
4) Call dns_dispatch_removeresponse() to terminate listening and close
the connection.
Implementation comments below:
- As we will be using netmgr buffers now. code to send the length in
TCP queries has also been removed as that is handled by the netmgr.
- TCP dispatches can be used by multiple simultaneous queries, so
dns_dispatch_connect() now checks whether the dispatch is already
connected before calling isc_nm_tcpdnsconnect() again.
- Running dns_dispatch_getnext() from a non-network thread caused a
crash due to assertions in the netmgr read functions that appear to be
unnecessary now. the assertions have been removed.
- fctx->nqueries was formerly incremented when the connection was
successful, but is now incremented when the query is started and
decremented if the connection fails.
- It's no longer necessary for each dispatch to have a pool of tasks, so
there's now a single task per dispatch.
- Dispatch code to avoid UDP ports already in use has been removed.
- dns_resolver and dns_request have been modified to use netmgr callback
functions instead of task events. some additional changes were needed
to handle shutdown processing correctly.
- Timeout processing is not yet fully converted to use netmgr timeouts.
- Fixed a lock order cycle reported by TSAN (view -> zone-> adb -> view)
by by calling dns_zt functions without holding the view lock.
We now use dns_dispatch_cancel() for this purpose. NOTE: The caller
still has to track whether there are pending send or connect events in
the dispatch or dispatch entry; later this should be moved into the
dispatch module as well.
Also removed some public dns_dispatch_*() API calls that are no longer
used outside dispatch itself.
dns_dispatch_connect() connects a dispatch socket (for TCP) or a
dispatch entry socket (for UDP). This is the next step in moving all
uses of the isc_socket code into the dispatch module.
This API is temporary; it needs to be cleaned up further so that it can
be called the same way for both TCP and UDP.
Continuing the effort to move all uses of the isc_socket API into
dispatch.c, this commit removes the dns_tcpmsg module entirely, as
dispatch was its only caller, and moves the parts of its functionality
that were being used into the dispatch module.
This code will be removed when we switch to using netmgr TCPDNS.
Previously, creation of TCP dispatches differed from UDP in that a TCP
dispatch was created to attach to an existing socket, whereas a UDP
dispatch would be created in a vacuum and sockets would be opened on
demand when a transaction was initiated.
We are moving as much socket code as possible into the dispatch module,
so that it can be replaced with a netmgr version as easily as
possible. (This will also have the side effect of making TCP and UDP
dispatches more similar.)
As a step in that direction, this commit changes
dns_dispatch_createtcp() so that it creates the TCP socket.