This commit implements a batch update function for qpzone. The main
reason for this is speed: using addrdataset would cause a qp transaction
per rrdataset added, leading to a substantial slowdown compared to
RBTDB. The new API results in a qp transaction per applied diff.
This commit adds a layer of indirection to the apply_diff logic used by
IXFR and resigning by having the database updates go through a vtable.
We do this in three steps:
- We extend dns_rdatacallbacks_t vtable to allow subtraction and
resigning.
- We add a new set of api (begin|commit|abort)update to the dbmethods
vtable, that model an incremental update that can be aborted.
- We extract the core logic of diff_apply into a function that
satisfies the new interface.
- We make diff_apply use this new function, and log the results.
The intent of this commit is to allow databases to expose a batch
incremental update implementation, just like they expose a custom
batch creation implementation through (begin|end)load.
As we removed the ability to count nodes in the auxiliary trees (because
there are no auxiliary trees), we can also cleanup the API and
associated enum type (dns_dbtree_t).
remove the dns_db_locknode() and _unlocknode() calls, so that callers no
longer have the ability to directly manipulate the internal locking of
cache and zone databases.
if the dns_db_getoriginnode() call is not implemented, we can
fall back to running dns_db_findnode() on the database origin.
we now only implement getoriginnode directly in databases where
it's clearly faster than the fallback implementation would be.
the dns_db_findext and _findnodeext calls are extended versions
of dns_db_find and _findnode, which take additional arguments for
client information in order to support ECS. previously, database
implementations could support either API call, with cross-compatibility
so that, for example, dns_db_findext() could call a find implementation
if findext was not implemented, and dns_db_find() could call findext
if find was not implemented.
this has now been simplified. the find and findnodeext implementations
now support client info. all database implementations will now provide
these calls. implementations which do not support ECS will simply
ignore the clientinfo and clientinfomethods parameters.
this only affects the underlying implementation; callers will still
use the same interface. dns_db_find() and dns_db_findnode() are now
macros which pass NULL to the clientinfo parameters, so that callers
don't have to do so explicitly. dns_db_findext() and dns_db_findnodeext()
are still available for callers that do wish to pass clientinfo pointers.
this function's purpose was to populate the "CacheBuckets" statistic,
but there are no databases left that implemented it, so the return
value was always 0. "CacheBuckets" has now been removed from the
statistics, and the dns_db_hashsize() API call has been removed.
> Put a space before opening parentheses only after control statement
> keywords (for/if/while...) except this option doesn’t apply to ForEach
> and If macros. This is useful in projects where ForEach/If macros are
> treated as function calls instead of control statements.
Use dns_rdatatype_none instead of plain '0' for dns_rdatatype_t and
dns_typepair_t manipulation. While plain '0' is technically ok, it
doesn't carry the required semantic meaning, and using the named
dns_rdatatype_none constant makes the code more readable.
All databases in the codebase follow the same structure: a database is
an associative container from DNS names to nodes, and each node is an
associative container from RR types to RR data.
Each database implementation (qpzone, qpcache, sdlz, builtin, dyndb) has
its own corresponding node type (qpznode, qpcnode, etc). However, some
code needs to work with nodes generically regardless of their specific
type - for example, to acquire locks, manage references, or
register/unregister slabs from the heap.
Currently, these generic node operations are implemented as methods in
the database vtable, which creates problematic coupling between database
and node lifetimes. If a node outlives its parent database, the node
destructor will destroy all RR data, and each RR data destructor will
try to unregister from heaps by calling a virtual function from the
database vtable. Since the database was already freed, this causes a
crash.
This commit breaks the coupling by standardizing the layout of all
database nodes, adding a dedicated vtable for node operations, and
moving node-specific methods from the database vtable to the node
vtable.
the pattern `for (x = ISC_LIST_HEAD(...); x != NULL; ISC_LIST_NEXT(...)`
has been changed to `ISC_LIST_FOREACH` throughout BIND, except in a few
cases where the change would be excessively complex.
in most cases this was a straightforward change. in some places,
however, the list element variable was referenced after the loop
ended, and the code was refactored to avoid this necessity.
also, because `ISC_LIST_FOREACH` uses typeof(list.head) to declare
the list elements, compilation failures can occur if the list object
has a `const` qualifier. some `const` qualifiers have been removed
from function parameters to avoid this problem, and where that was not
possible, `UNCONST` was used.
Instead of relying on unreliable order of execution of the library
constructors and destructors, move them to individual binaries. The
advantage is that the execution time and order will remain constant and
will not depend on the dynamic load dependency solver.
This requires more work, but that was mitigated by a simple requirement,
any executable using libisc and libdns, must include <isc/lib.h> and
<dns/lib.h> respectively (in this particular order). In turn, these two
headers must not be included from within any library as they contain
inlined functions marked with constructor/destructor attributes.
DNS_LOGMODULE_RBTDB was simply inappropriate, and this
log message is actually dependent on db implementation
details, so DNS_LOGMODULE_DB would be the best choice.
The new log message is emitted when adding or updating an RRset
fails due to exceeding the max-records-per-type limit. The log includes
the owner name and type, corresponding zone name, and the limit value.
It will be emitted on loading a zone file, inbound zone transfer
(both AXFR and IXFR), handling a DDNS update, or updating a cache DB.
It's especially helpful in the case of zone transfer, since the
secondary side doesn't have direct access to the offending zone data.
It could also be used for max-types-per-name, but this change
doesn't implement it yet as it's much less likely to happen
in practice.
QPDB is now a default implementation for both cache and zone. Remove
the venerable RBTDB database implementation, so we can fast-track the
changes to the database without having to implement the design changes
to both QPDB and RBTDB and this allows us to be more aggressive when
refactoring the database design.
Remove the complicated mechanism that could be (in theory) used by
external libraries to register new categories and modules with
statically defined lists in <isc/log.h>. This is similar to what we
have done for <isc/result.h> result codes. All the libraries are now
internal to BIND 9, so we don't need to provide a mechanism to register
extra categories and modules.
When adding glue to the header, we add header to the wait-free stack to
be cleaned up later which sets wfc_node->next to non-NULL value. When
the actual cleaning happens we would only cleanup the .glue_list, but
since the database isn't locked for the time being, the headers could be
reused while cleaning the existing glue entries, which creates a data
race between database versions.
Revert the code back to use per-database-version hashtable where keys
are the node pointers. This allows each database version to have
independent glue cache table that doesn't affect nodes or headers that
could already "belong" to the future database version.
Previously, the number of RR types for a single owner name was limited
only by the maximum number of the types (64k). As the data structure
that holds the RR types for the database node is just a linked list, and
there are places where we just walk through the whole list (again and
again), adding a large number of RR types for a single owner named with
would slow down processing of such name (database node).
Add a configurable limit to cap the number of the RR types for a single
owner. This is enforced at the database (rbtdb, qpzone, qpcache) level
and configured with new max-types-per-name configuration option that
can be configured globally, per-view and per-zone.
Previously, the number of RRs in the RRSets were internally unlimited.
As the data structure that holds the RRs is just a linked list, and
there are places where we just walk through all of the RRs, adding an
RRSet with huge number of RRs inside would slow down processing of said
RRSets.
Add a configurable limit to cap the number of the RRs in a single RRSet.
This is enforced at the database (rbtdb, qpzone, qpcache) level and
configured with new max-records-per-type configuration option that can
be configured globally, per-view and per-zone.
the previous commit introduced a possible race in getsigningtime()
where the rdataset header could change between being found on the
heap and being bound.
getsigningtime() now looks at the first element of the heap, gathers the
locknum, locks the respective lock, and retrieves the header from the
heap again. If the locknum has changed, it will rinse and repeat.
Theoretically, this could spin forever, but practically, it almost never
will as the heap changes on the zone are very rare.
we simplify matters further by changing the dns_db_getsigningtime()
API call. instead of passing back a bound rdataset, we pass back the
information the caller actually needed: the resigning time, owner name
and type of the rdataset that was first on the heap.
the dyndb test requires a mechanism to retrieve the name associated
with a database node, and since the database no longer uses RBT for
its underlying storage, dns_rbt_fullnamefromnode() doesn't work.
addressed this by adding dns_db_nodefullname() to the database API.
replace the string "rbt" throughout BIND with "qp" so that
qpdb databases will be used by default instead of rbtdb.
rbtdb databases can still be used by specifying "database rbt;"
in a zone statement.
- Copy rbtdb.c, rbt-zonedb.c and rbt-cachedb.c to qp-*.
- Added qpmethods.
- Added a new structure dns_qpdata that will replace dns_rbtnode.
- Replaced normal, nsec, and nsec3 dns_rbt trees with dns_qp tries.
- Replaced dns_rbt_create() calls with dns_qp_create().
- Replaced the dns_rbt_destroy() call with dns_qp_destroy().
- Create a dns_qpdata struct and create/destroy methods.
This commit will not build.
- the DNS_DB_NSEC3ONLY and DNS_DB_NONSEC3 flags are mutually
exclusive; it never made sense to set both at the same time.
to enforce this, it is now a fatal error to do so. the
dbiterator implementation has been cleaned up to remove
code that treated the two as independent: if nonsec3 is
true, we can be certain nsec3only is false, and vice versa.
- previously, iterating a database backwards omitted
NSEC3 records even if DNS_DB_NONSEC3 had not been set. this
has been corrected.
- when an iterator reaches the origin node of the NSEC3 tree, we
need to skip over it and go to the next node in the sequence.
the NSEC3 origin node is there for housekeeping purposes and
never contains data.
- the dbiterator_test unit test has been expanded, several
incorrect expectations have been fixed. (for example, the
expected number of iterations has been reduced by one; we were
previously counting the NSEC3 origin node and we should not
have been doing so.)
when the QPDB is implemented, we will need to have both qpdb_p.h and
rbtdb_p.h. in order to prevent name collisions or code duplication,
this commit adds a generic private header file, db_p.h, containing
structures and macros that will be used by both databases.
some functions and structs have been renamed to more specifically refer
to the RBT database, in order to avoid namespace collision with similar
things that will be needed by the QPDB later.
The updatenotify mechanism in dns_db relied on unlocked ISC_LIST for
adding and removing the "listeners". The mechanism relied on the
exclusive mode - it should have been updated only during reconfiguration
of the server. This turned not to be true anymore in the dns_catz - the
updatenotify list could have been updated during offloaded work as the
offloaded threads are not subject to the exclusive mode.
Change the update_listeners to be cds_lfht (lock-free hash-table), and
slightly refactor how register and unregister the callbacks - the calls
are now idempotent (the register call already was and the return value
of the unregister function was mostly ignored by the callers).
ultimately we want the slab implementation of dns_rdataset to
be usable by more database implementaions than just rbtdb. this
commit moves rdataset_methods to rdataslab.c, renamed
dns_rdataslab_rdatasetmethods.
new database methods have been added: locknode, unlocknode,
addglue, expiredata, and deletedata, allowing external functions to
perform functions that previously required internal access to the
database implementation.
database and heap pointers are now stored in the dns_slabheader object
so that header is the only thing that needs to be passed to some
functions; this will simplify moving functions that process slabheaders
out of rbtdb.c so they can be used by other database implementations.
to reduce the amount of common code that will need to be shared
between the separated cache and zone database implementations,
clean up unused portions of dns_db.
the methods dns_db_dump(), dns_db_isdnssec(), dns_db_printnode(),
dns_db_resigned(), dns_db_expirenode() and dns_db_overmem() were
either never called or were only implemented as nonoperational stub
functions: they have now been removed.
dns_db_nodefullname() was only used in one place, which turned out
to be unnecessary, so it has also been removed.
dns_db_ispersistent() and dns_db_transfernode() are used, but only
the default implementation in db.c was ever actually called. since
they were never overridden by database methods, there's no need to
retain methods for them.
in rbtdb.c, beginload() and endload() methods are no longer defined for
the cache database, because that was never used (except in a few unit
tests which can easily be modified to use the zone implementation
instead). issecure() is also no longer defined for the cache database,
as the cache is always insecure and the default implementation of
dns_db_issecure() returns false.
for similar reasons, hashsize() is no longer defined for zone databases.
implementation functions that are shared between zone and cache are now
prepended with 'dns__rbtdb_' so they can become nonstatic.
serve_stale_ttl is now a common member of dns_db.
This implements node reference tracing that passes all the internal
layers from dns_db API (and friends) to increment_reference() and
decrement_reference().
It can be enabled by #defining DNS_DB_NODETRACE in <dns/trace.h> header.
The output then looks like this:
incr:node:check_address_records:rootns.c:409:0x7f67f5a55a40->references = 1
decr:node:check_address_records:rootns.c:449:0x7f67f5a55a40->references = 0
incr:nodelock:check_address_records:rootns.c:409:0x7f67f5a55a40:0x7f68304d7040->references = 1
decr:nodelock:check_address_records:rootns.c:449:0x7f67f5a55a40:0x7f68304d7040->references = 0
There's associated python script to find the missing detach located at:
https://gitlab.isc.org/isc-projects/bind9/-/snippets/1038
move database attach/detach functions to db.c, instead of
requiring them to be implemented for every database type.
instead, they must implement a 'destroy' function that is
called when references go to zero.
this enables us to use ISC_REFCOUNT_IMPL for databases,
with detailed tracing enabled by setting DNS_DB_TRACE to 1.
SDB is currently (and foreseeably) only used by the named
builtin databases, so it only needs as much of its API as
those databases use.
- removed three flags defined for the SDB API that were always
set the same by builtin databases.
- there were two different types of lookup functions defined for
SDB, using slightly different function signatures. since backward
compatibility is no longer a concern, we can eliminate the 'lookup'
entry point and rename 'lookup2' to 'lookup'.
- removed the 'allnodes' entry point and all database iterator
implementation code
- removed dns_sdb_putnamedrr() and dns_sdb_putnamedrdata() since
they were never used.
some dns_db functions would have crashed if the DB implementation failed
to implement them, requiring the implementations to add functions that
did nothing but return ISC_R_NOTIMPLEMENTED or some obvious default
value. we can just have the dns_db wrapper functions themselves return
those values, and clean up the implementations accordingly.
This changes the internal isc_rwlock implementation to:
Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra
J. Marathe, and Nir Shavit. 2013. NUMA-aware reader-writer locks.
SIGPLAN Not. 48, 8 (August 2013), 157–166.
DOI:https://doi.org/10.1145/2517327.24425
(The full article available from:
http://mcg.cs.tau.ac.il/papers/ppopp2013-rwlocks.pdf)
The implementation is based on the The Writer-Preference Lock (C-RW-WP)
variant (see the 3.4 section of the paper for the rationale).
The implemented algorithm has been modified for simplicity and for usage
patterns in rbtdb.c.
The changes compared to the original algorithm:
* We haven't implemented the cohort locks because that would require a
knowledge of NUMA nodes, instead a simple atomic_bool is used as
synchronization point for writer lock.
* The per-thread reader counters are not being used - this would
require the internal thread id (isc_tid_v) to be always initialized,
even in the utilities; the change has a slight performance penalty,
so we might revisit this change in the future. However, this change
also saves a lot of memory, because cache-line aligned counters were
used, so on 32-core machine, the rwlock would be 4096+ bytes big.
* The readers use a writer_barrier that will raise after a while when
readers lock can't be acquired to prevent readers starvation.
* Separate ingress and egress readers counters queues to reduce both
inter and intra-thread contention.
The mechanism for associating a worker task to a database now
uses loops rather than tasks.
For this reason, the parameters to dns_cache_create() have been
updated to take a loop manager rather than a task manager.