The attacker that controls DNSSEC-signed zone can trigger a memory leak
in the addnoqname() and/or addclosest() by creating more than
max-records-per-type RRSIG for any NSEC records. The memory leaks have
been fixed.
(cherry picked from commit a854a5c83d)
This is a new seek function for dbiterator that is meant to find an
NSEC3 node in a zone database. The difference with dns_dbiterator_seek
is that if the node does not exist, this seek function will point the
iterator to the next NSEC3 name.
(cherry picked from commit 41159e9062)
previously, there were over 40 separate definitions of CHECK macros, of
which most used "goto cleanup", and the rest "goto failure" or "goto
out". there were another 10 definitions of RETERR, of which most were
identical to CHECK, but some simply returned a result code instead of
jumping to a cleanup label.
this has now been standardized throughout the code base: RETERR is for
returning an error code in the case of an error, and CHECK is for jumping
to a cleanup tag, which is now always called "cleanup". both macros are
defined in isc/util.h.
(cherry picked from commit 52bba5cc34)
The code to test whether to store the RRSIGs on DNS_R_UNCHANGED
with CD=1 was failing because the comparison methods of the two
rdatatset instances were not compatible. Move the testing into
dns_db_addrdataset(), and request it by setting the DNS_ADD_EQUALOK
option. If the option is set and the old and new rrsets compare
as equal, dns_db_addrdataset() returns ISC_R_SUCCESS instead of
DNS_R_UNCHANGED.
(cherry picked from commit b954a1df43)
Under certain circumstances, cache entries with equivalent rdataset
might not get replaced. Previously such entry would get preserved
regardless of the new TTL and expire time on the existing header would
get updated when the expire time was less than the expire time on the
existing header. Change the logic to preserve the existing header only
if the new expire time is larger than the existing one and replace the
existing cache entry when the new expire time is less than the existing
one.
Co-authored-by: Jinmei Tatuya <jtatuya@infoblox.com>
(cherry picked from commit 9f7ba584cf)
Previously, BIND 9 would drop the ZEROTTL attribute when updating
previously cached NS entry with ZEROTTL attribute set.
Co-authored-by: Jinmei Tatuya <jtatuya@infoblox.com>
(cherry picked from commit 982ca161c2)
qp-tries allocate their nodes (twigs) in chunks to reduce allocator
pressure and improve memory locality. The choice of chunk size presents
a tradeoff: larger chunks benefit qp-tries with many values (as seen
in large zones and resolvers) but waste memory in smaller use cases.
Previously, our fixed chunk size of 2^10 twigs meant that even an
empty qp-trie would consume 12KB of memory, while reducing this size
would negatively impact resolver performance.
This commit implements an adaptive chunking strategy that:
- Tracks the size of the most recently allocated chunk.
- Doubles the chunk size for each new allocation until reaching a
predefined maximum.
This approach effectively balances memory efficiency for small tries
while maintaining the performance benefits of larger chunk sizes for
bigger data structures.
This commit also splits the callback freeing qpmultis into two
phases, one that frees the underlying qptree, and one that reclaims
the qpmulti memory. In order to prevent races between the qpmulti
destructor and chunk garbage collection jobs, the second phase is
protected by reference counting.
(cherry picked from commit 70b1777d8a)
dns_qp_lookup was returning ISC_R_NOTFOUND rather than DNS_R_PARTIALMATCH
when there wasn't a parent with a NSEC record in the cache. This was
causing find_coveringnsec to fail rather than returing the covering NSEC.
(cherry picked from commit 7de4207cb6)
The isc_queue_t was missing in the calculation of the required
padding size inside the qpcache bucket structure.
(cherry picked from commit 3ef9b09620)
Acquire the database refernce in the detachnode() to prevent the last
reference to be release while the NODE_LOCK being locked. The NODE_LOCK
is locked/unlocked inside the RCU critical section, thus it is most
probably this should not pose a problem as the database uses call_rcu
memory reclamation, but this it is still safer to acquire the reference
before releasing the node.
(cherry picked from commit d1ef6a93c1)
Reduce the number of qpzone_ref() and qpzone_unref() calls in
qpzone_detachnode() by relying on the call_rcu to delay
the destruction of the lock buckets.
(cherry picked from commit 1fa5219fdf)
Instead of having many node_lock_count * sizeof(<member>) arrays, pack
all the members into a qpcache_bucket_t struct that is cacheline aligned
and have a single array of those.
Additionaly, make both the head and the tail of isc_queue_t padded, not
just the head, to prevent false sharing of the lock-free structure with
the lock that follows it.
(cherry picked from commit c602d76c1f)
In #1870, the expiration time of ANCIENT records were printed, but
actually the ancient records are very short lived, and the information
carries a little value.
Instead of printing the expiration of ANCIENT records, print the
expiration time of STALE records.
(cherry picked from commit 355fc48472)
When the mark_ancient() helper function was introduced, couple of places
with duplicate (or almost duplicate) code was missed. Move the
mark_ancient() function closer to the top of the file, and correctly use
it in places that mark the header as ANCIENT.
(cherry picked from commit 58179e6a19)
If we know that the header has ZEROTTL set, the server should never send
stale records for it and the TTL should never be anything else than 0.
The comment was already there, but the code was not matching the
comment.
(cherry picked from commit cfee6aa565)
When the header has been marked as ANCIENT, but the ttl hasn't been
reset (this happens in couple of places), the rdataset TTL would be
set to the header timestamp instead to a reasonable TTL value.
Since this header has been already expired (ANCIENT is set), set the
rdataset TTL to 0 and don't reuse this field to print the expiration
time when dumping the cache. Instead of printing the time, we now
just print 'expired (awaiting cleanup'.
(cherry picked from commit 1bbb57f81b)
the search for the deepest known zone cut in the cache could
improperly reject a node containing stale data, even if the
NS rdataset wasn't the data that was stale.
this change also improves the efficiency of the search by
stopping it when both NS and RRSIG(NS) have been found.
(cherry picked from commit 1f095b902c)
Change the names of the node reference counting functions
and add comments to make the mechanism easier to understand:
- newref() and decref() are now called qpcnode_acquire()/
qpznode_acquire() and qpcnode_release()/qpznode_release()
respectively; this reflects the fact that they modify both
the internal and external reference counters for a node.
- qpcnode_newref() and qpznode_newref() are now called
qpcnode_erefs_increment() and qpznode_erefs_increment(), and
qpcnode_decref() and qpznode_decref() are now called
qpcnode_erefs_decrement() and qpznode_erefs_decrement(),
to reflect that they only increase and decrease the node's
external reference counters, not internal.
(cherry picked from commit d4f791793e)
This removes the db_nodelock_t structure and changes the node_locks
array to be composed only of isc_rwlock_t pointers. The .reference
member has been moved to qpdb->references in addition to
common.references that's external to dns_db API users. The .exiting
members has been completely removed as it has no use when the reference
counting is used correctly.
(cherry picked from commit 431513d8b3)
The origin_node in qpcache was always NULL, so we can remove the
getoriginode() function and origin_node pointer as the
dns_db_getoriginnode() correctly returns ISC_R_NOTFOUND when the
function is not implemented.
(cherry picked from commit 36a26bfa1a)
Cleanup the pattern in the decref() functions in both qpcache.c and
qpzone.c, so it follows the similar patter as we already have in
newref() function.
(cherry picked from commit 814b87da64)
Previously, this function always acquires a node write lock if it
might need node cleanup in case the reference decrements to 0. In
fact, the lock is unnecessary if the reference is larger than 1 and it
can be optimized as an "easy" case. This optimization could even be
"necessary". In some extreme cases, many worker threads could repeat
acquring and releasing the reference on the same node, resulting in
severe lock contention for nothing (as the ref wouldn't decrement to 0
in most cases). This change would prevent noticeable performance
drop like query timeout for such cases.
Co-authored-by: JINMEI Tatuya <jtatuya@infoblox.com>
Co-authored-by: Ondřej Surý <ondrej@isc.org>
(cherry picked from commit 7f4471594d)
The new log message is emitted when adding or updating an RRset
fails due to exceeding the max-records-per-type limit. The log includes
the owner name and type, corresponding zone name, and the limit value.
It will be emitted on loading a zone file, inbound zone transfer
(both AXFR and IXFR), handling a DDNS update, or updating a cache DB.
It's especially helpful in the case of zone transfer, since the
secondary side doesn't have direct access to the offending zone data.
It could also be used for max-types-per-name, but this change
doesn't implement it yet as it's much less likely to happen
in practice.
(cherry picked from commit 4156995431)
when the QP cache was adapted from the RBT database, some names
weren't changed. this could be confusing, so let's change them now.
also, we no longer need to include rbt.h.
(cherry picked from commit 5a444838db)
Instead of outright refusing to add new RR types to the cache, be a bit
smarter:
1. If the new header type is in our priority list, we always add either
positive or negative entry at the beginning of the list.
2. If the new header type is negative entry, and we are over the limit,
we mark it as ancient immediately, so it gets evicted from the cache
as soon as possible.
3. Otherwise add the new header after the priority headers (or at the
head of the list).
4. If we are over the limit, evict the last entry on the normal header
list.
Add HTTPS, SVCB, SRV, PTR, NAPTR, DNSKEY and TXT records to the list of
the priority types that are put at the beginning of the slabheader list
for faster access and to avoid eviction when there are more types than
the max-types-per-name limit.
Previously, the number of RR types for a single owner name was limited
only by the maximum number of the types (64k). As the data structure
that holds the RR types for the database node is just a linked list, and
there are places where we just walk through the whole list (again and
again), adding a large number of RR types for a single owner named with
would slow down processing of such name (database node).
Add a configurable limit to cap the number of the RR types for a single
owner. This is enforced at the database (rbtdb, qpzone, qpcache) level
and configured with new max-types-per-name configuration option that
can be configured globally, per-view and per-zone.
Previously, the number of RRs in the RRSets were internally unlimited.
As the data structure that holds the RRs is just a linked list, and
there are places where we just walk through all of the RRs, adding an
RRSet with huge number of RRs inside would slow down processing of said
RRSets.
Add a configurable limit to cap the number of the RRs in a single RRSet.
This is enforced at the database (rbtdb, qpzone, qpcache) level and
configured with new max-records-per-type configuration option that can
be configured globally, per-view and per-zone.
Replace the ISC_LIST based deadnodes implementation with isc_queue which
is wait-free and we don't have to acquire neither the tree nor node lock
to append nodes to the queue and the cleaning process can also
copy (splice) the list into a local copy without acquiring the list.
Currently, there's little benefit to this as we need to hold those
locks anyway, but in the future as we move to RCU based implementation,
this will be ready.
To align the cleaning with our event loop based model, remove the
hardcoded count for the node locks and use the number of the event loops
instead. This way, each event loop can have its own cleaning as part of
the process. Use uniform random numbers to spread the nodes evenly
between the buckets (instead of hashing the domain name).
The expireheader() call in the expire_ttl_headers() function
is erroneous as it passes the 'nlocktypep' and 'tlocktypep'
arguments in a wrong order, which then causes an assertion
failure.
Fix the order of the arguments so it corresponds to the function's
prototype.
there were some structure names used in qpcache.c and qpzone.c that
were too similar to each other and could be confusing when debugging.
they have been changed as follows:
in qcache.c:
- changed_t was unused, and has been removed
- search_t -> qpc_search_t
- qpdb_rdatasetiter_t -> qpc_rditer_t
- qpdb_dbiterator_t -> qpc_dbiter_t
in qpzone.c:
- qpdb_changed_t -> qpz_changed_t
- qpdb_changedlist_t -> qpz_changedlist_t
- qpdb_version_t -> qpz_version_t
- qpdb_versionlist_t -> qpz_versionlist_t
- qpdb_search_t -> qpz_search_t
- qpdb_load_t -> qpz_search_t
when calling dns_qp_lookup() from qpcache, instead of passing
'foundname' so that a name would be constructed from the QP key,
we now just use the name field in the node data. this makes
dns_qp_lookup() run faster.
the same optimization has also been added to qpzone.
the documentation for dns_qp_lookup() has been updated to
discuss this performance consideration.
when the cache is over memory, we purge from the LRU list until
we've freed the approximate amount of memory to be added. this
approximation could fail because the memory allocated for nodenames
wasn't being counted.
add a dns_name_size() function so we can look up the size of nodenames,
then add that to the purgesize calculation.
in a cache database, unlike zones, NSEC3 records are stored in
the main tree. it is not necessary to maintain a separate 'nsec3'
tree, nor to have code in the dbiterator implementation to traverse
from one tree to another.
(if we ever implement synth-from-dnssec using NSEC3 records, we'll
need to revert this change. in the meantime, simpler code is better.)
the QP database doesn't support relative names as the RBTDB did, so
there's no need for a 'new_origin' flag or to handle `DNS_R_NEWORIGIN`
result codes.
- remove unneeded struct members and misleading comments.
- remove unused parameters for static functions.
- rename 'find_callback' to 'delegating' for consistency with qpzone;
the find callback mechanism is not used in QP databases.
- change dns_qpdata_t to qpcnode_t (QP cache node), and dns_qpdb_t to
qpcache_t, as these types are only accessed locally.
- also change qpdata_t in qpzone.c to qpznode_t (QP zone node), for
consistency.
- make the refcount declarations for qpcnode_t and qpznode_t static,
using the new ISC_REFCOUNT_STATIC macros.
In qpcache (and rbtdb), there are some functions that acquire
neither the tree lock nor the node lock when calling newref().
In theory, this could lead to a race in which a new reference
is added to a node that was about to be deleted.
We now detect this condition by passing the current tree and node
lock status to newref(). If the node was previously unreferenced
and we don't hold at least one read lock, we will assert.
an assertion could be triggered in the QPDB cache if a DNAME
was found above a queried NS, because the 'foundname' value was
not correctly updated to point to the zone cut.
the same mistake existed in qpzone and has been fixed there as well.