Is there a time when new_qp(c|z)node() would not be followed by
assignment of the namespace? No, so let's add the assignment to the
function that creates the node.
Now that we have to code working, rename 'dns_qp_lookup2' back to
'dns_qp_lookup' and adjust all remaining 'dns_qp_lookup' occurrences
to take a space=0 parameter.
For now we only allow DNS_DB_NSEC_* values so it makes sense to change
the type to an enum.
Rename 'denial' to the more intuitive 'space', indicating the namespace
of the keyvalue pair.
The dbiterator can take three modes: full, nsec3only and nonsec3.
Previously, in full mode the dbiterator requires special logic to jump
from one qp-trie to the other. Now everything is in one trie, other
special logic is needed.
The qp-trie is now sorted in such a way that all the normal nodes come
first, followed by NSEC nodes, and finally the NSEC3 nodes. NSEC nodes
are empty nodes and need to be skipped when iterating.
We add an additional auxiliary node to the trie, an NSEC origin, so
we can easily find the point in the trie where we need to continue
iterating.
In preparation to merge the three qp tries (tree, nsec, nsec3) into
one, add the piece of information into the qpkey. This is the most
significant bit of information, so prepend the denial type to the qpkey.
This means we need to pass on the denial type when constructing the
qpkey from a name, or doing a lookup.
Reuse the the DNS_DB_NSEC_* values. Most qp tries in the code we just
pass on 0 (nta, rpz, zt, etc.), because there is no need for denial of
existence, but for qpzone and qpcache we must pass the right value.
Change the code, so that node->nsec no longer can have the value
DNS_DB_NSEC_HAS_NSEC, instead track this in a new attribute 'havensec'.
Since we use node->nsec to convert names to keys, the value MUST be set
before inserting the node into the qp-trie.
Update the fuzzing and unit tests accordingly. This only adds a few
extra test cases, more are needed.
In the qp_test.c we can remove test code for empty keys as this is
no longer possible.
RRset ordering is now an enum inside struct rdataset attributes. This
was done to keep size to of the structure to its original value before
this MR.
I expect zero performance impact but it should be easier to deal with
attributes in debuggers and language servers.
As mentioned in the comments block before the changed code block,
the dropped or slipped responses should be logged in the query
category (or rather query-errors category as done in lib/ns/client.c),
so that requests are not silently lost.
Also fix a couple of errors/typos in the code comments.
The ns_client_t struct is reset and zero-ed out on every query,
but some fields (query, message, manager) are preserved.
We observe two things:
- The sendbuf field is going to be overwritten anyway, there's
no need to zero it out.
- The fields are copied out when the struct is zero-ed out, and
then copied back in. For the query field (which is 896 bytes)
this is very inefficient.
This commit makes the reset more efficient avoiding to unnecessary
zero-ing and copy.
Replace the read-write locked isc_hashmap with lock-free cds_lfht
hashtable and replace the singular LRU tables for ADB names and entries
with a per-thread LRU tables. These changes allowed to remove all the
read-write locking on the names and entries tables.
Instead of having hand crafted attach/detach/destroy functions, replace
them with the standard ISC_REFCOUNT macro. This also have advantage
that delayed netmgr detach (from dns_dispatch) now doesn't cause
assertion failure. This can happen with delayed (call_rcu) shutdown of
dns_adb.
The dns_adb cleaning is little bit muddled as it mixes the "TTL"
based cleaning (.expire_v4 and .expire_v6 for adbname, .expires for
adbentry) with overmem cleaning.
Rewrite the LRU based cleaning to use SIEVE algorithm and to be overmem
cleaning only with a requirement to always cleanup at least 2-times the
size of the newly added entry.
Qpzone employs a locking strategy where rwlocks are grouped into
buckets, and each zone gets 17 buckets.
This strategy is suboptimal in two ways:
- If named is serving a single zone or a zone is the majority of the
traffic, this strategy pretty much guarantees contention when using
more than a dozen threads.
- If named is serving many small zones, it causes substantial memory
usage.
This commit switches the locking to a global table initialized at start
time. This should have three effects:
- Performance should improve in the single zone case, since now we are
selecting from a bigger pool of locks.
- Memory consumption should go down significantly in the many zone
cases.
- Performance should not degrade substantially in the many zone cases.
The reason for this is that, while we could have substantially more
zones than locks, we can query/edit only O(num threads) at the same
time. So by making the global table much bigger than the expected
number of threads, we can limit contention.
In the current implementation, the resigning heap is part of the zone
database. This leads to a cycle, as the database has a reference to its
nodes, but each node needs a reference to the database.
This MR splits the resigning heap into its own separate struct, in order
to help breaking the cycle.
Recovering the node lock from a pointer to the header and a pointer to
the db is a common operation. This commit abstracts it away into a
function, so that the node lock selection logic may be modified more
easily.
Print optionally a bit more details not passed to event in case
dns_view_findzonecut returns unexpected result. Result would be
visible later in foundevent, but found fname would be lost. Print it
into the log.
When authoritative zone is loaded when query minimization query for the
same zone is already pending, it might receive unexpected result codes.
Normally DNS_R_CNAME would follow to query_cname after processing sent
events, but dns_view_findzonecut does not fill CNAME target into
event->foundevent. Usual lookup via query_lookup would always have that
filled.
Ideally we would restart the query with unmodified search name, if
unexpected change from recursing to local zone cut were detected. Until
dns_view_findzonecut is modified to export zone/cache source of the cut,
at least fail queries which went into unexpected state.
The purge_stale_names()/purge_stale_entries() is opportunistic even when
we are under memory pressure (overmem). Split the opportunistic LRU
cleaning and overmem cleaning. This makes the stale purging much
simpler as we don't have to try that hard and makes the overmem cleaning
always cleanup double the amount of the newly allocated ADB name/entry.
There are three adbname flags that are used to identify different
types of adbname lookups when hashing rather than using multiple
hash tables. Separate these to their own structure element as these
need to be able to be read without locking the adbname structure.
The .inuse member was causing a lot of contention between threads using
the same memory context. Scather the .inuse and .overmem members of
isc_mem_t structure to be an per-tid array of variables to reduce the
contention as the writes are now independent of each other.
The array uses one tad bit nasty trick, as ISC_TID_UNKNOWN is now -1,
the array has been sized to fit the unknown tid with [-1] index into the
array accomplished with `ctx->stat = &ctx->stat_s[1];`. It will not win
a beauty contest, but it works seamlessly by just passing `isc_tid()` as
an index into the array.
The caveat here is that gathering the real inuse value requires walking
the whole array for all registered tid values (isc_tid_count()). The
gather part happens only when statistics are being gathered or when
isc_mem_isovermem() is called. As the isc_mem_isovermem() call happens
only when new data is being added to cache or ADB, it doesn't happen on
the hottest (read-only) path and according to the measurements, it
doesn't slow down neither the cold cache nor the hot cache latency.
As POSIX guarantees only that the type ssize_t shall be capable of
storing values at least in the range [-1, {SSIZE_MAX}], it can't be used
to calculate the difference between two memory sizes. Change the logic
for junk filling to test whether the new size is larger than old size
and then use size_t as the result will be always positive.
The jemalloc arena in isc_mem was added to solve runaway memory problem
for outgoing TCP connections. In the end, this was a red herring and
the jemalloc arena code is now unused (via e28266bf). Remove the
support for jemalloc memory arenas as we can restore this at any time if
we need it ever again, but right now it's just a dead code.
In jemalloc_shim.h, we relied on including <isc/overflow.h> implicitly
instead of explicitly and same was happening inside isc/overflow.h - the
stdbool.h (for bool type) was being included implicitly instead of
explicitly.
Change the internal type used for isc_tid unit to isc_tid_t to hide the
specific integer type being used for the 'tid'. Internally, the signed
integer type is being used. This allows us to have negatively indexed
arrays that works both for threads with assigned tid and the threads
with unassigned tid. This should be used only in specific situations.
Per pspacek, currently qp and qpcache logs are too verbose and enabled at a
level too low compared to how often the logging is useful.
This commit increases the logging level, while keeping it configurable
via a define.
The RAD/agent domain is a functionality from RFC 9567 that provides
a suffix for reporting error messages. On every query context reset,
we need to check if a RAD is configured and, if so, copy it.
Since we allow the RAD to be changed by reconfiguring the zone,
this access is currently protected by a mutex, which causes contention.
This commit replaces the mutex with RCU to reduce contention. The
change results in a 3% performance improvement in the 1M delegation
test.
We need to turn off clang-format to preserve the brackets as
'attribute' can be an expression and we need it to be evaluated
first.
Similarly we need the entire result to be evaluated independent of
the adjoining code.
This happens because old key is purged by one zone view, then the other
is freaking out about it.
Keys that are unused or being purged should not be taken into account
when verifying key files are available.
The keyring is maintained per zone. So in one zone, a key in the
keyring is being purged. The corresponding key file is removed.
The key maintenance is done for the other zone view. The key in that
keyring is not yet set to purge, but its corresponding key file is
removed. This leads to "some keys are missing" log errors.
We should not check the purge variable at this point, but the
current time and purge-keys duration.
This commit fixes this erroneous logic.
Use the existing RSASHA256 and RSASHA512 implementation to provide
working PRIVATEOID example implementations. We are using the OID
values normally associated with RSASHA256 (1.2.840.113549.1.1.11)
and RSASHA512 (1.2.840.113549.1.1.13).
Add support for proposed DS digest types that encode the private
algorithm identifier at the start of the DS digest as is done for
DNSKEY and RRSIG. This allows a DS record to identify the specific
DNSSEC algorithm, rather than a set of algorithms, when the algorithm
field is set to PRIVATEDNS or PRIVATEOID.
- dns_zone_cdscheck() has been extended to extract the key algorithms
from DNSKEY data when the CDS algorithm is PRIVATEOID or PRIVATEDNS.
- dns_zone_signwithkey() has been extended to support signing with
PRIVATEDNS and PRIVATEOID algorithms. The signing record (type 65534)
added at the zone apex to indicate the current state of automatic zone
signing can now contain an additional two-byte field for the DST
algorithm value, when the DNS secalg value isn't enough information.
dns_resolver_algorithm_supported() has been extended so in addition to
an algorithm number, it can also take a pointer to an RRSIG signature
field in which key information is encoded.
DST algorithm and DNSSEC algorithm values are not necessarily the same
anymore: if the DNSSEC algorithm value is PRIVATEOID or PRIVATEDNS, then
the DST algorithm will be mapped to something else. The conversion is
now done correctly where necessary.
The algorithm values PRIVATEDNS and PRIVATEOID are placeholders,
signifying that the actual algorithm identifier is encoded into the
key data. Keys using this mechanism are now supported.
- The algorithm values PRIVATEDNS and PRIVATEOID cannot be used to
build a key file name; dst_key_buildfilename() will assert if
they are used.
- The DST key values for private algorithms are higher than 255.
Since DST_ALG_MAXALG now exceeds 256, algorithm arrays that were
previously hardcoded to size 256 have been resized.
- New mnemonic/text conversion functions have been added.
dst_algorithm_{fromtext,totext,format} can handle algorithm
identifiers encoded in PRIVATEDNS and PRIVATEOID keys, as well
as the traditional algorithm identifiers. (Note: The existing
dns_secalg_{fromtext,totext,format} functions are similar, but
do *not* support PRIVATEDNS and PRIVATEOID. In most cases, the
new functions have taken the place of the old ones, but in a few
cases the old version is still appropriate.)
- dns_private{oid,dns}_{fromtext,totext,format} converts between
DST algorithm values and the mnemonic strings for algorithms
implemented using PRIVATEDNS or PRIVATEOID. (E.g., "RSASHA256OID").
- dst_algorithm_tosecalg() returns the DNSSEC algorithm identifier
that applies for a given DST algorithm. For PRIVATEDNS- or
PRIVATEOID- based algorithms, the result will be PRIVATEDNS or
PRIVATEOID, respectively.
- dst_algorithm_fromprivatedns() and dst_algorithm_fromprivateoid()
return the DST algorithm identifier for an encoded algorithm in
wire format, represented as in DNS name or an object identifier,
respectively.
- dst_algorithm_fromdata() is a front-end for the above; it extracts
the private algorithm identifier encoded at the begining of a
block of key or signature data, and returns the matching DST
algorithm number.
- dst_key_fromdns() and dst_key_frombuffer() now work with keys
that have PRIVATEDNS and PRIVATEOID algorithm identifiers at the
beginning.
The header file dns/rdatastruct.h was not being rebuilt when the
rdata type header files where modified.
Removed proforma.c from the list. It is a starting point for new
types.
The "keyopts" field of the dns_zone object was added to support
"auto-dnssec"; at that time the "options" field already had most of
its 32 bits in use by other flags, so it made sense to add a new
field.
Since then, "options" has been widened to 64 bits, and "auto-dnssec"
has been obsoleted and removed. Most of the DNS_ZONEKEY flags are no
longer needed. The one that still seems useful (_FULLSIGN) has been
moved into DNS_ZONEOPT and the rest have been removed, along with
"keyopts" and its setter/getter functions.
"type delegation-only" has been obsolete for some time
(see #3953) but the zone type flag for it was still defined
in libisccfg. It has now been removed.
Meson is a modern build system that has seen a rise in adoption and some
version of it is available in almost every platform supported.
Compared to automake, meson has the following advantages:
* Meson provides a significant boost to the build and configuration time
by better exploiting parallelism.
* Meson is subjectively considered to be better in readability.
These merits alone justify experimenting with meson as a way of
improving development time and ergonomics. However, there are some
compromises to ensure the transition goes relatively smooth:
* The system tests currently rely on various files within the source
directory. Changing this requirement is a non-trivial task that can't
be currently justified. Currently the last compiled build directory
writes into the source tree which is in turn used by pytest.
* The minimum version supported has been fixed at 0.61. Increasing this
value will require choosing a baseline of distributions that can
package with meson. On the contrary, there will likely be an attempt
to decrease this value to ensure almost universal support for building
BIND 9 with meson.