In previous_closest_nsec(), a new qpreader was opened to search the NSEC
tree. It was possible for that to be used to update a QP iterator object
owned by the caller, and then be destroyed when the function returned.
This has been addressed by having the caller open the NSEC qpreader
instead.
This commit implements a batch update function for qpzone. The main
reason for this is speed: using addrdataset would cause a qp transaction
per rrdataset added, leading to a substantial slowdown compared to
RBTDB. The new API results in a qp transaction per applied diff.
(cherry picked from commit da53708dcb)
This commit adds a layer of indirection to the apply_diff logic used by
IXFR and resigning by having the database updates go through a vtable.
We do this in three steps:
- We extend dns_rdatacallbacks_t vtable to allow subtraction and
resigning.
- We add a new set of api (begin|commit|abort)update to the dbmethods
vtable, that model an incremental update that can be aborted.
- We extract the core logic of diff_apply into a function that
satisfies the new interface.
- We make diff_apply use this new function, and log the results.
The intent of this commit is to allow databases to expose a batch
incremental update implementation, just like they expose a custom
batch creation implementation through (begin|end)load.
(cherry picked from commit e36dc0ca76)
This is a new seek function for dbiterator that is meant to find an
NSEC3 node in a zone database. The difference with dns_dbiterator_seek
is that if the node does not exist, this seek function will point the
iterator to the next NSEC3 name.
(cherry picked from commit 41159e9062)
dns_qpiter_{prev,next} requires the current iterator node to still be
valid which might not always the case after dereference_iter_node was
called. Currently, this is ensured via closeversion() mechanism, but it
is not guaranteed to be true in the future.
Move the call to dereference_iter_node to after the dns_qpiter_prev()
and dns_qpiter_next() to prevent a possible use-after-free of the
current iterator node.
(cherry picked from commit 9914bd383e)
previously, there were over 40 separate definitions of CHECK macros, of
which most used "goto cleanup", and the rest "goto failure" or "goto
out". there were another 10 definitions of RETERR, of which most were
identical to CHECK, but some simply returned a result code instead of
jumping to a cleanup label.
this has now been standardized throughout the code base: RETERR is for
returning an error code in the case of an error, and CHECK is for jumping
to a cleanup tag, which is now always called "cleanup". both macros are
defined in isc/util.h.
(cherry picked from commit 52bba5cc34)
when a requested name is found in the QP trie during a lookup, but its
records have been marked as nonexistent by a previous deletion, then
it's treated as a partial match, but the foundname could be left
pointing to the original qname rather than the parent. this could
lead to an assertion failure in query_findclosestnsec3().
the step() function (used for stepping to the prececessor or
successor of a database node) could overlook a node because
there was an rdataset marked IGNORE because it had been rolled
back, covering an active rdataset under it.
(cherry picked from commit 24eaff7adc)
Acquire the database refernce in the detachnode() to prevent the last
reference to be release while the NODE_LOCK being locked. The NODE_LOCK
is locked/unlocked inside the RCU critical section, thus it is most
probably this should not pose a problem as the database uses call_rcu
memory reclamation, but this it is still safer to acquire the reference
before releasing the node.
(cherry picked from commit d1ef6a93c1)
in some places there were checks for failures of dns_qp_insert()
after dns_qp_getname(). such failures could only happen if another
thread inserted a node between the two calls, and that can't happen
because the calls are serialized with dns_qpmulti_write(). we can
simplify the code and just add an INSIST.
(cherry picked from commit fffa150df3)
prio_type was being used in the wrong place to optimize cname_and_other.
We have to first exclude and accepted types and we also have to
determine that the record exists before we can check if we are at
a point where a later CNAME cannot appear.
(cherry picked from commit 5e49a9e4ae)
Instead of using on hash of the name modulo number of the buckets,
assign the locknum randomly with isc_random_uniform(). This makes
the locknum assignment aligned with qpcache and allows the bucket
number to be non-prime in the future.
(cherry picked from commit 732fc338a9)
Reduce the number of qpzone_ref() and qpzone_unref() calls in
qpzone_detachnode() by relying on the call_rcu to delay
the destruction of the lock buckets.
(cherry picked from commit 1fa5219fdf)
Instead of having many node_lock_count * sizeof(<member>) arrays, pack
all the members into a qpzone_bucket_t that is cacheline aligned and have
a single array of those.
(cherry picked from commit 6dcc398726)
Change the names of the node reference counting functions
and add comments to make the mechanism easier to understand:
- newref() and decref() are now called qpcnode_acquire()/
qpznode_acquire() and qpcnode_release()/qpznode_release()
respectively; this reflects the fact that they modify both
the internal and external reference counters for a node.
- qpcnode_newref() and qpznode_newref() are now called
qpcnode_erefs_increment() and qpznode_erefs_increment(), and
qpcnode_decref() and qpznode_decref() are now called
qpcnode_erefs_decrement() and qpznode_erefs_decrement(),
to reflect that they only increase and decrease the node's
external reference counters, not internal.
(cherry picked from commit d4f791793e)
This removes the db_nodelock_t structure and changes the node_locks
array to be composed only of isc_rwlock_t pointers. The .reference
member has been moved to qpdb->references in addition to
common.references that's external to dns_db API users. The .exiting
members has been completely removed as it has no use when the reference
counting is used correctly.
(cherry picked from commit 431513d8b3)
Cleanup the pattern in the decref() functions in both qpcache.c and
qpzone.c, so it follows the similar patter as we already have in
newref() function.
(cherry picked from commit 814b87da64)
Previously, this function always acquires a node write lock if it
might need node cleanup in case the reference decrements to 0. In
fact, the lock is unnecessary if the reference is larger than 1 and it
can be optimized as an "easy" case. This optimization could even be
"necessary". In some extreme cases, many worker threads could repeat
acquring and releasing the reference on the same node, resulting in
severe lock contention for nothing (as the ref wouldn't decrement to 0
in most cases). This change would prevent noticeable performance
drop like query timeout for such cases.
Co-authored-by: JINMEI Tatuya <jtatuya@infoblox.com>
Co-authored-by: Ondřej Surý <ondrej@isc.org>
(cherry picked from commit 7f4471594d)
This is a second attempt to rewrite the GLUE cache to not use per
database version hash table. Instead of keeping a hash table indexed by
the node, use a directly linked list of GLUE records for each
slabheader. This was attempted before, but there was a data race caused
by the fact that the thread cleaning the GLUE records could be slower
than accessing the slab headers again and reinitializing the wait-free
stack.
The improved design builds on the previous design, but adds a new
dns_gluelist structure that has a pointer to the database version.
If a dns_gluelist belonging to a different (old) version is detected, it
is just detached from the slabheader and left for the closeversion() to
clean it up later.
(cherry picked from commit 29bde687b5)
The new log message is emitted when adding or updating an RRset
fails due to exceeding the max-records-per-type limit. The log includes
the owner name and type, corresponding zone name, and the limit value.
It will be emitted on loading a zone file, inbound zone transfer
(both AXFR and IXFR), handling a DDNS update, or updating a cache DB.
It's especially helpful in the case of zone transfer, since the
secondary side doesn't have direct access to the offending zone data.
It could also be used for max-types-per-name, but this change
doesn't implement it yet as it's much less likely to happen
in practice.
(cherry picked from commit 4156995431)
Return partial match from dns_db_find/dns_db_find when requested
to short circuit the closest encloser discover process. Most of the
time this will be the actual closest encloser but may not be when
there yet to be committed / cleaned up versions of the zone with
names below the actual closest encloser.
(cherry picked from commit d42ea08f16)
When adding glue to the header, we add header to the wait-free stack to
be cleaned up later which sets wfc_node->next to non-NULL value. When
the actual cleaning happens we would only cleanup the .glue_list, but
since the database isn't locked for the time being, the headers could be
reused while cleaning the existing glue entries, which creates a data
race between database versions.
Revert the code back to use per-database-version hashtable where keys
are the node pointers. This allows each database version to have
independent glue cache table that doesn't affect nodes or headers that
could already "belong" to the future database version.
(cherry picked from commit 5beae5faf9)
Add HTTPS, SVCB, SRV, PTR, NAPTR, DNSKEY and TXT records to the list of
the priority types that are put at the beginning of the slabheader list
for faster access and to avoid eviction when there are more types than
the max-types-per-name limit.
Previously, the number of RR types for a single owner name was limited
only by the maximum number of the types (64k). As the data structure
that holds the RR types for the database node is just a linked list, and
there are places where we just walk through the whole list (again and
again), adding a large number of RR types for a single owner named with
would slow down processing of such name (database node).
Add a configurable limit to cap the number of the RR types for a single
owner. This is enforced at the database (rbtdb, qpzone, qpcache) level
and configured with new max-types-per-name configuration option that
can be configured globally, per-view and per-zone.
Previously, the number of RRs in the RRSets were internally unlimited.
As the data structure that holds the RRs is just a linked list, and
there are places where we just walk through all of the RRs, adding an
RRSet with huge number of RRs inside would slow down processing of said
RRSets.
Add a configurable limit to cap the number of the RRs in a single RRSet.
This is enforced at the database (rbtdb, qpzone, qpcache) level and
configured with new max-records-per-type configuration option that can
be configured globally, per-view and per-zone.
there were TSAN error reports because of conflicting uses of
node->dirty and node->nsec, which were in the same qword.
this could be resolved by separating them, but we could also
make them into atomic values and remove some node locking.
there were some structure names used in qpcache.c and qpzone.c that
were too similar to each other and could be confusing when debugging.
they have been changed as follows:
in qcache.c:
- changed_t was unused, and has been removed
- search_t -> qpc_search_t
- qpdb_rdatasetiter_t -> qpc_rditer_t
- qpdb_dbiterator_t -> qpc_dbiter_t
in qpzone.c:
- qpdb_changed_t -> qpz_changed_t
- qpdb_changedlist_t -> qpz_changedlist_t
- qpdb_version_t -> qpz_version_t
- qpdb_versionlist_t -> qpz_versionlist_t
- qpdb_search_t -> qpz_search_t
- qpdb_load_t -> qpz_search_t
when calling dns_qp_lookup() from qpcache, instead of passing
'foundname' so that a name would be constructed from the QP key,
we now just use the name field in the node data. this makes
dns_qp_lookup() run faster.
the same optimization has also been added to qpzone.
the documentation for dns_qp_lookup() has been updated to
discuss this performance consideration.
- remove unneeded struct members and misleading comments.
- remove unused parameters for static functions.
- rename 'find_callback' to 'delegating' for consistency with qpzone;
the find callback mechanism is not used in QP databases.
- change dns_qpdata_t to qpcnode_t (QP cache node), and dns_qpdb_t to
qpcache_t, as these types are only accessed locally.
- also change qpdata_t in qpzone.c to qpznode_t (QP zone node), for
consistency.
- make the refcount declarations for qpcnode_t and qpznode_t static,
using the new ISC_REFCOUNT_STATIC macros.
the previous commit introduced a possible race in getsigningtime()
where the rdataset header could change between being found on the
heap and being bound.
getsigningtime() now looks at the first element of the heap, gathers the
locknum, locks the respective lock, and retrieves the header from the
heap again. If the locknum has changed, it will rinse and repeat.
Theoretically, this could spin forever, but practically, it almost never
will as the heap changes on the zone are very rare.
we simplify matters further by changing the dns_db_getsigningtime()
API call. instead of passing back a bound rdataset, we pass back the
information the caller actually needed: the resigning time, owner name
and type of the rdataset that was first on the heap.
in RBTDB, the heap was used by zone databases for resigning, and
by the cache for TTL-based cache cleaning. the cache use case required
very frequent updates, so there was a separate heap for each of the
node lock buckets.
qpzone is for zones only, so it doesn't need to support the cache
use case; the heap will only be touched when the zone is updated or
incrementally signed. we can simplify the code by using only a single
heap.
an assertion could be triggered in the QPDB cache if a DNAME
was found above a queried NS, because the 'foundname' value was
not correctly updated to point to the zone cut.
the same mistake existed in qpzone and has been fixed there as well.
every node of a QP database contains a copy of the nodename,
which is used as the key for the QP-trie. previously, the name
was stored as a dns_fixedname object, which has room for up to
255 characters. we can reduce the space consumed by dynamically
allocating a dns_name object that's just long enough for the name
to be stored.
The dns_qpiter_next() was called without checking the return value. If
we cannot move the iterator forward, there is no use in calling the
step() function.
/lib/dns/qpzone.c: 2804 in activeempty()
2798 * of the name we were searching for. Step the iterator
2799 * forward, then step() will continue forward until it
2800 * finds a node with active data. If that node is a
2801 * subdomain of the one we were looking for, then we're
2802 * at an active empty nonterminal node.
2803 */
>>> CID 487882: Error handling issues (CHECKED_RETURN)
>>> Calling "dns_qpiter_next" without checking return value (as is done elsewhere 26 out of 27 times).
2804 dns_qpiter_next(it, NULL, NULL, NULL);
2805 return (step(search, it, FORWARD, next) &&
2806 dns_name_issubdomain(next, current));
2807 }
qpzone does not support cache semantics, so dns_db_addrdataset(),
_deleterdataset() and _subtractrdataset() can't be run with
version == NULL; there's no need to check for it.
we can also clean up free_qpdb() a bit since current_version
is always non-NULL.
because dns_qpmulti_commit() can be time consuming, it's inefficient
to open and commit a qpmulti transaction for each rdataset being loaded
into a database. we can improve load time by opening a qpmulti
transaction before adding a group of rdatasets and then committing it
afterward.
this commit adds 'setup' and 'commit' functions to dns_rdatacallbacks_t,
which can be called before and after the loops in which 'add' is
called in dns_master_load() and axfr_apply().
QP database node data is not reference counted the same way RBT nodes
were: in the RBT, node->references could be zero if the node was in the
tree but was not in use by any caller, whereas in the QP trie, the
database itself uses reference counting of nodes internally.
this caused some subtle errors. in RBTDB, when the newref() function is
called and the node reference count was zero, the node lock reference
counter would also be incremented. in the QP trie, this can never
happen - because as long as the node is in the database its reference
count cannot be zero - and so the node lock reference counter was never
incremented.
this has been addressed by maintaining a separate "erefs" counter for
external references to the node. this is the same approach used in the
"qpdb-lite" database in commit e91fbd8dea.
while troubleshooting this issue, some compile errors were discovered
when building with DNS_DB_NODETRACE; those have also been fixed.