The attacker that controls DNSSEC-signed zone can trigger a memory leak
in the addnoqname() and/or addclosest() by creating more than
max-records-per-type RRSIG for any NSEC records. The memory leaks have
been fixed.
(cherry picked from commit a854a5c83d)
This is a new seek function for dbiterator that is meant to find an
NSEC3 node in a zone database. The difference with dns_dbiterator_seek
is that if the node does not exist, this seek function will point the
iterator to the next NSEC3 name.
(cherry picked from commit 41159e9062)
dns_rbtnodechain_prev requires the current node to still be valid
which was not always the case after dereference_iter_node was called.
Move the call to dereference_iter_node to after the dns_rbtnodechain_prev
to preserve the node.
previously, there were over 40 separate definitions of CHECK macros, of
which most used "goto cleanup", and the rest "goto failure" or "goto
out". there were another 10 definitions of RETERR, of which most were
identical to CHECK, but some simply returned a result code instead of
jumping to a cleanup label.
this has now been standardized throughout the code base: RETERR is for
returning an error code in the case of an error, and CHECK is for jumping
to a cleanup tag, which is now always called "cleanup". both macros are
defined in isc/util.h.
(cherry picked from commit 52bba5cc34)
The code to test whether to store the RRSIGs on DNS_R_UNCHANGED
with CD=1 was failing because the comparison methods of the two
rdatatset instances were not compatible. Move the testing into
dns_db_addrdataset(), and request it by setting the DNS_ADD_EQUALOK
option. If the option is set and the old and new rrsets compare
as equal, dns_db_addrdataset() returns ISC_R_SUCCESS instead of
DNS_R_UNCHANGED.
(cherry picked from commit b954a1df43)
Under certain circumstances, cache entries with equivalent rdataset
might not get replaced. Previously such entry would get preserved
regardless of the new TTL and expire time on the existing header would
get updated when the expire time was less than the expire time on the
existing header. Change the logic to preserve the existing header only
if the new expire time is larger than the existing one and replace the
existing cache entry when the new expire time is less than the existing
one.
Co-authored-by: Jinmei Tatuya <jtatuya@infoblox.com>
(cherry picked from commit 9f7ba584cf)
Previously, BIND 9 would drop the ZEROTTL attribute when updating
previously cached NS entry with ZEROTTL attribute set.
Co-authored-by: Jinmei Tatuya <jtatuya@infoblox.com>
(cherry picked from commit 982ca161c2)
prio_type was being used in the wrong place to optimize cname_and_other.
We have to first exclude and accepted types and we also have to
determine that the record exists before we can check if we are at
a point where a later CNAME cannot appear.
(cherry picked from commit 5e49a9e4ae)
In #1870, the expiration time of ANCIENT records were printed, but
actually the ancient records are very short lived, and the information
carries a little value.
Instead of printing the expiration of ANCIENT records, print the
expiration time of STALE records.
(cherry picked from commit 355fc48472)
When the mark_ancient() helper function was introduced, couple of places
with duplicate (or almost duplicate) code was missed. Move the
mark_ancient() function closer to the top of the file, and correctly use
it in places that mark the header as ANCIENT.
(cherry picked from commit 58179e6a19)
If we know that the header has ZEROTTL set, the server should never send
stale records for it and the TTL should never be anything else than 0.
The comment was already there, but the code was not matching the
comment.
(cherry picked from commit cfee6aa565)
When the header has been marked as ANCIENT, but the ttl hasn't been
reset (this happens in couple of places), the rdataset TTL would be
set to the header timestamp instead to a reasonable TTL value.
Since this header has been already expired (ANCIENT is set), set the
rdataset TTL to 0 and don't reuse this field to print the expiration
time when dumping the cache. Instead of printing the time, we now
just print 'expired (awaiting cleanup'.
(cherry picked from commit 1bbb57f81b)
Change the names of the node reference counting functions
and add comments to make the mechanism easier to understand:
- dns__rbtdb_newref() and dns__rbtdb_decref() are now called
dns__rbtnode_acquire() and dns__rbtnode_release()
respectively; this reflects the fact that they modify both
the internal and external reference counters for a node.
- rbtnode_newref() and rbtnode_decref are now called
rbtnode_erefs_increment() and rbtnode_erefs_decrement(),
to reflect that they only increase and decrease the node's
external reference counters, not internal.
Refactor the pattern in the newref() and decref() functions in rbtdb.c
following the pattern, so it follows the similar pattern we already have
for QPDB.
Previously, this function always acquires a node write lock if it
might need node cleanup in case the reference decrements to 0. In
fact, the lock is unnecessary if the reference is larger than 1 and it
can be optimized as an "easy" case. This optimization could even be
"necessary". In some extreme cases, many worker threads could repeat
acquring and releasing the reference on the same node, resulting in
severe lock contention for nothing (as the ref wouldn't decrement to 0
in most cases). This change would prevent noticeable performance
drop like query timeout for such cases.
Co-authored-by: JINMEI Tatuya <jtatuya@infoblox.com>
Co-authored-by: Ondřej Surý <ondrej@isc.org>
(cherry picked from commit 7f4471594d)
This is a second attempt to rewrite the GLUE cache to not use per
database version hash table. Instead of keeping a hash table indexed by
the node, use a directly linked list of GLUE records for each
slabheader. This was attempted before, but there was a data race caused
by the fact that the thread cleaning the GLUE records could be slower
than accessing the slab headers again and reinitializing the wait-free
stack.
The improved design builds on the previous design, but adds a new
dns_gluelist structure that has a pointer to the database version.
If a dns_gluelist belonging to a different (old) version is detected, it
is just detached from the slabheader and left for the closeversion() to
clean it up later.
(cherry picked from commit 29bde687b5)
The new log message is emitted when adding or updating an RRset
fails due to exceeding the max-records-per-type limit. The log includes
the owner name and type, corresponding zone name, and the limit value.
It will be emitted on loading a zone file, inbound zone transfer
(both AXFR and IXFR), handling a DDNS update, or updating a cache DB.
It's especially helpful in the case of zone transfer, since the
secondary side doesn't have direct access to the offending zone data.
It could also be used for max-types-per-name, but this change
doesn't implement it yet as it's much less likely to happen
in practice.
(cherry picked from commit 4156995431)
When adding glue to the header, we add header to the wait-free stack to
be cleaned up later which sets wfc_node->next to non-NULL value. When
the actual cleaning happens we would only cleanup the .glue_list, but
since the database isn't locked for the time being, the headers could be
reused while cleaning the existing glue entries, which creates a data
race between database versions.
Revert the code back to use per-database-version hashtable where keys
are the node pointers. This allows each database version to have
independent glue cache table that doesn't affect nodes or headers that
could already "belong" to the future database version.
(cherry picked from commit 5beae5faf9)
Instead of outright refusing to add new RR types to the cache, be a bit
smarter:
1. If the new header type is in our priority list, we always add either
positive or negative entry at the beginning of the list.
2. If the new header type is negative entry, and we are over the limit,
we mark it as ancient immediately, so it gets evicted from the cache
as soon as possible.
3. Otherwise add the new header after the priority headers (or at the
head of the list).
4. If we are over the limit, evict the last entry on the normal header
list.
Add HTTPS, SVCB, SRV, PTR, NAPTR, DNSKEY and TXT records to the list of
the priority types that are put at the beginning of the slabheader list
for faster access and to avoid eviction when there are more types than
the max-types-per-name limit.
Previously, the number of RR types for a single owner name was limited
only by the maximum number of the types (64k). As the data structure
that holds the RR types for the database node is just a linked list, and
there are places where we just walk through the whole list (again and
again), adding a large number of RR types for a single owner named with
would slow down processing of such name (database node).
Add a configurable limit to cap the number of the RR types for a single
owner. This is enforced at the database (rbtdb, qpzone, qpcache) level
and configured with new max-types-per-name configuration option that
can be configured globally, per-view and per-zone.
Previously, the number of RRs in the RRSets were internally unlimited.
As the data structure that holds the RRs is just a linked list, and
there are places where we just walk through all of the RRs, adding an
RRSet with huge number of RRs inside would slow down processing of said
RRSets.
Add a configurable limit to cap the number of the RRs in a single RRSet.
This is enforced at the database (rbtdb, qpzone, qpcache) level and
configured with new max-records-per-type configuration option that can
be configured globally, per-view and per-zone.
dns_db_addrdataset() enforces a requirement that version can only
be NULL for a cache database. code that checks for zone semantics
and version == NULL can never be reached.
Reconstruct the variant of the prune_tree() parent cleaning to consider
all elibible parents in a single loop as we were doing before all the
changes that led to this commit.
Update code comments so that they more precisely describe what the
relevant bits of code actually do.
It was discovered that the TTL-based cleaning could build up
a significant backlog of the rdataset headers during the periods where
the top of the TTL heap isn't expired yet. Make the TTL-based cleaning
more aggressive by cleaning more headers from the heap when we are
adding new header into the RBTDB.
It was discovered that an expired header could sit on top of the heap
a little longer than desireable. Remove expired headers (headers with
rdh_ttl set to 0) from the heap completely, so they don't block the next
TTL-based cleaning.
Instead of juggling with node locks in a cycle, cleanup the node we are
just pruning and send any the parent that's also subject to the pruning
to the prune tree via normal way (e.g. enqueue pruning on the parent).
This simplifies the code and also spreads the pruning load across more
event loop ticks which is better for lock contention as less things run
in a tight loop.
The log message for commit 24381cc36d
explained:
In some older BIND 9 branches, the extra queuing overhead eliminated by
this change could be remotely exploited to cause excessive memory use.
Due to architectural shift, this branch is not vulnerable to that issue,
but applying the fix to the latter is nevertheless deemed prudent for
consistency and to make the code future-proof.
However, it turned out that having a single queue for the nodes to be
pruned increased lock contention to a level where cleaning up nodes from
the RBTDB took too long, causing the amount of memory used by the cache
to grow indefinitely over time.
This commit reverts the change to the pruning mechanism introduced by
commit 24381cc36d as BIND branches newer
than 9.16 were not affected by the excessive event queueing overhead
issue mentioned in the log message for the above commit.
- the DNS_DB_NSEC3ONLY and DNS_DB_NONSEC3 flags are mutually
exclusive; it never made sense to set both at the same time.
to enforce this, it is now a fatal error to do so. the
dbiterator implementation has been cleaned up to remove
code that treated the two as independent: if nonsec3 is
true, we can be certain nsec3only is false, and vice versa.
- previously, iterating a database backwards omitted
NSEC3 records even if DNS_DB_NONSEC3 had not been set. this
has been corrected.
- when an iterator reaches the origin node of the NSEC3 tree, we
need to skip over it and go to the next node in the sequence.
the NSEC3 origin node is there for housekeeping purposes and
never contains data.
- the dbiterator_test unit test has been expanded, several
incorrect expectations have been fixed. (for example, the
expected number of iterations has been reduced by one; we were
previously counting the NSEC3 origin node and we should not
have been doing so.)
- create_node() in rbt.c cannot fail
- the dns_rbt_*name() functions, which are wrappers around
dns_rbt_[add|find|delete]node(), were never used except in tests.
this change isn't really necessary since RBT is likely to go away
eventually anyway. but keeping the API as simple as possible while it
persists is a good thing, and may reduce confusion while QPDB is being
developed from RBTDB code.
these values pertain to whether a node is in the main, nsec, or nsec3
tree of an RBTDB. they need to be moved to a more generic location so
they can also be used by QPDB.
(this is in db.h rather than db_p.h because rbt.c needs access to it.
technically, that's a layer violation, but it's a long-existing one;
refactoring to get rid of it would be a large hassle, and eventually
we expect to remove rbt.c anyway.)
when the QPDB is implemented, we will need to have both qpdb_p.h and
rbtdb_p.h. in order to prevent name collisions or code duplication,
this commit adds a generic private header file, db_p.h, containing
structures and macros that will be used by both databases.
some functions and structs have been renamed to more specifically refer
to the RBT database, in order to avoid namespace collision with similar
things that will be needed by the QPDB later.
Stop the cname_and_other_data processing if we already know that the
result is true. Also, we know that CNAME will be placed in the priority
headers, so we can stop looking for CNAME if we haven't found CNAME and
we are past the priority headers.
Mark the infrastructure RRTypes as "priority" types and place them at
the beginning of the rdataslab header data graph. The non-priority
types either go right after the priority types (if any).
Instead of issuing a separate isc_async_run() call for every RBTDB node
that triggers tree pruning, maintain a list of nodes from which tree
pruning can be started from and only issue an isc_async_run() call if
pruning has not yet been triggered by another RBTDB node.
In some older BIND 9 branches, the extra queuing overhead eliminated by
this change could be remotely exploited to cause excessive memory use.
Due to architectural shift, this branch is not vulnerable to that issue,
but applying the fix to the latter is nevertheless deemed prudent for
consistency and to make the code future-proof.
Only cleanup headers that are less than equal to the rbt's last_used
time. Adjust the rbt's last_used time when the target cleaning was
not achieved to the oldest value of the remaining set of headers.
When updating delegating NS and glue records last_used was not being
updated when it should have been.
When adding zero TTL records to the tail of the LRU lists set
last_used to rbtdb->last_used + 1 rather than now. This appoximately
preserves the lists LRU order.
The updatenotify mechanism in dns_db relied on unlocked ISC_LIST for
adding and removing the "listeners". The mechanism relied on the
exclusive mode - it should have been updated only during reconfiguration
of the server. This turned not to be true anymore in the dns_catz - the
updatenotify list could have been updated during offloaded work as the
offloaded threads are not subject to the exclusive mode.
Change the update_listeners to be cds_lfht (lock-free hash-table), and
slightly refactor how register and unregister the callbacks - the calls
are now idempotent (the register call already was and the return value
of the unregister function was mostly ignored by the callers).
With ThreadSanitizer support added to the Userspace RCU, we no longer
need to wrap the call_rcu and caa_container_of with
__tsan_{acquire,release} hints. Remove the direct calls to
__tsan_{acquire,release} and the isc_urcu_{container,cleanup} macros.
The isc_stats_create() can no longer return anything else than
ISC_R_SUCCESS. Refactor isc_stats_create() and its variants in libdns,
libns and named to just return void.
ultimately we want the slab implementation of dns_rdataset to
be usable by more database implementaions than just rbtdb. this
commit moves rdataset_methods to rdataslab.c, renamed
dns_rdataslab_rdatasetmethods.
new database methods have been added: locknode, unlocknode,
addglue, expiredata, and deletedata, allowing external functions to
perform functions that previously required internal access to the
database implementation.
database and heap pointers are now stored in the dns_slabheader object
so that header is the only thing that needs to be passed to some
functions; this will simplify moving functions that process slabheaders
out of rbtdb.c so they can be used by other database implementations.