Previously, rbtdb->task had quantum of 1 because it was originally used
just for freeing RBTDB contents, which can happen on a "best effort"
basis (does not need to be prioritized). However, when tree pruning was
implemented, it also started sending events to that task, enabling the
latter to become clogged up with a significant event backlog because it
only pruned a single RBTDB node per event.
To prioritize tree pruning (as it is necessary for enforcing the
configured memory use limit for the cache memory context), create a
second task with a virtually unlimited quantum (UINT_MAX) and send the
tree-pruning events to this new task, to ensure that all nodes scheduled
for pruning will be processed before further nodes are queued in a
similar fashion.
This change enables dropping the prunenodes list and restoring the
originally-used logic that allocates and sends a separate event for each
node to prune.
(cherry picked from commit 540a5b5a2c)
We were missing a test where a single owner name would have multiple
types with a different case. The generated RRSIGs and NSEC records will
then have different case than the signed records and message parser have
to cope with that and treat everything as the same owner.
(cherry picked from commit a114042059)
The "masterformat" system test attempts to check named-checkzone
behavior when it is fed corrupt map-format zone files. However, despite
the RBTDB and RBT structures having evolved over the years, the offsets
at which a valid map-format zone file is malformed by the "masterformat"
test have not been updated accordingly, causing the relevant checks to
introduce a different type of corruption than they were originally meant
to cause:
- the "bad node header" check originally mangled the 'type' member of
the rdatasetheader_t structure for cname.example.nil,
- the "bad node data" check originally mangled the 'serial' and
'rdh_ttl' members of the rdatasetheader_t structure for
aaaa.example.nil.
Update the offsets at which the map-format zone file is malformed at by
the "masterformat" system test so that the relevant checks fulfill their
original purpose again.
To prevent allocating large hashtable in dns_message, we need to
backport the improvements to isc_ht API from BIND 9.18+ that includes
support for case insensitive keys and incremental rehashing of the
hashtables.
In Net::DNS 1.42 $ns->main_loop no longer loops. Use current methods
for starting the server, wait for SIGTERM then cleanup child processes
using $ns->stop_server(), then remove the pid file.
(cherry picked from commit c2c59dea60)
When transitioning from NSEC3 to NSEC the added records where not
being signed because the wrong time was being used to determine if
a key should be used or not. Check that these records are actually
signed.
(cherry picked from commit bdb42d3838)
The check_loaded() function compares the zone's loadtime value and
an expected loadtime value, which is based on the zone file's mtime
extracted from the filesystem.
For the secondary zones there may be cases, when the zone file isn't
ready yet before the zone transfer is complete and the zone file is
dumped to the disk, so a so zero value mtime is retrieved.
In such cases wait one second and retry until timeout. Also modify
the affected check to allow a possible difference of the same amount
of seconds as the chosen timeout value.
(cherry picked from commit 4e94ff2541)
This enables the "logfileconfig" and "rpzextra" system tests to pass
when named is started under the supervision of rr (USE_RR=1).
(cherry picked from commit 422286e9c2)
when transferring in a non-inline-signing secondary for the first time,
we previously never set the value of zone->loadtime, so it remained
zero. this caused a test failure in the statschannel system test,
and that test case was temporarily disabled. the value is now set
correctly and the test case has been reinstated.
(cherry picked from commit 9643281453)
This covers both root hints and the default primaries for the root
zone mirror. The official change date is Nov 27, 2023.
(cherry picked from commit 2ca2f7e985)
Add a test case where serve-stale is enabled on a server that also
servers a local authoritative zone.
The particular case tests a lame delegation and checks if falling
back to serving stale data does not attempt to retrieve the query
by recursing from the root down.
(cherry picked from commit e196ba6168)
All changes in this commit were automated using the command:
shfmt -w -i 2 -ci -bn bin/tests/system/ util/ $(find bin/tests/system/ -name "*.sh.in")
By default, only *.sh and files without extension are checked, so
*.sh.in files have to be added additionally. (See mvdan/sh#944)
(manually replayed commit 4cb8b13987)
'rndc thaw' initiates asynchrous loading of all the zones
similar to 'rndc load'. Wait for the test zone's load to
complete before testing that it is updatable again.
(cherry picked from commit 5b3238aa85)
The covered type previously displayed as TYPE0 when it should
have reflected the records that was actually covered.
(cherry picked from commit 8ce359652a)
In some cases, BIND is not fast enough to fill the send buffer and
manages to answer all queries, contrary to what the test expects.
Repeat the check up to 3 times to limit this test instability.
(cherry picked from commit 681b23c398)
If the flaky plugin for pytest is available, use its decorator to
support re-running unstable tests. In case the package is missing,
execute the test as usual without attempts to re-run it in case of
failure.
This is mostly intended to increase the test stability in CI. Using a
custom decorator enables us to keep the flaky package as an optional
dependency.
(cherry picked from commit 5b703de733)
The mkeys system test could fail because root zone was resigned
within the same second as it was previously signed causing reloads
to fail. Add delays to the test to prevent this.
(cherry picked from commit 40e3529379)
Errors getting transfer statistics from named.run where not detected
as ret was not set to one if there hadn't been a success after looping
for a while.
(cherry picked from commit 287a1ac09b)
The setup.pl script has been replaced with static BIND configurations,
and in the course of this change, the unused ns1 server was removed.
This enhancement has greatly improved the overall test's readability.
(cherry picked from commit 08a8906cfc)
The shell version of the test was completed only after all DNS zone
updates were sent, even if the BIND server crashed while processing
them, leading to prolonged execution and potential hang in the CI
environment. The Python rewrite of the test ensures that DNS update
tasks finish within five minutes of starting, irrespective of a BIND
crash possibility or DNS zone updates not finishing in time.
(cherry picked from commit ecd7b30d0a)
The fstrm_capture utility is started in the background during the
"dnstap" system test. Consequently, "rndc dnstap-reopen" and similar
commands may be executed before fstrm_capture starts listening on the
Unix domain socket it is configured to receive dnstap data on. This
results in the dnstap data sent to that socket in the meantime to be
lost; while the fstrm writer thread is able to recover from such a
scenario within a couple of seconds (by reopening the configured dnstap
destination itself), only one write attempt is made for data
successfully queued to the writer thread, so dnstap frames can still be
lost in the process. This may happen during the "dnstap" system test,
leading to the dnstap output file being empty, which in turn causes the
test to fail.
Fix by waiting until fstrm_capture starts listening on the Unix domain
socket it is configured to use before asking named to reopen the
configured dnstap destination. Since various fstrm_capture versions log
different messages when the listening socket is set up, wait for a
common string that works for all fstrm_capture versions released to
date. Add a few extra debug messages indicating test progress and make
the test fail if the expected fstrm_capture log message is not generated
within 10 seconds.
(cherry picked from commit 26d3d97f12)
The fstrm_capture.out file is overwritten when the fstrm_capture utility
is restarted during the "dnstap" system test. Use a separate output
file for each fstrm_capture instance to ensure all output produced by
that tool during the "dnstap" system test is preserved for forensic
purposes.
(cherry picked from commit bd2941fc72)
'HOME=value command' should only change HOME for command but on
some platforms this occasionally sets HOME for the rest of the
test. Explicitly isolate the enviroment change using a sub shell.
(cherry picked from commit 96f75bba18)
The resolve binary is affected by GL#4119 which occassionally makes it
hand during system tests when running with TSAN. This is a workaround to
avoid wasting resources caused by a CI timeout for the system test tsan
jobs.
(cherry picked from commit 774b9bc629)
The conditions that trigger the crash:
- a stale record is in cache
- stale-answer-client-timeout is 0
- multiple clients query for the stale record, enough of them to exceed
the recursive-clients quota
- the response from the authoritative is sufficiently delayed so that
recursive-clients quota is exceeded first
The reproducer attempts to simulate this situation. However, it hasn't
proven to be 100 % reproducible, especially in CI. When reproducing
locally, the priming query also seems to sometimes interfere and prevent
the crash. When the reproducer is ran twice, it appears to be more
reliable in reproducing the issue.
(cherry picked from commit f617512d37)
In pytest 7.4.0, there were some changes to how the configuration file
for pytest is located. In our case, this resulted in a failure to find
the conftest.py with the needed fixtures which then prevented our python
tests from being executed successfully.
Configure the --confcutdir to ensure it points to the system test
directory, where our conftest.py is located.
Related https://github.com/pytest-dev/pytest/pull/11043
Because of a typo, the fetch.pl script tries to extract the server
address from the input parameter 'a' instead of 's'. Fix the typo.
(cherry picked from commit aa7538fd38)
Add this test scenario for a bug fixed a while ago. When a third key is
introduced while the previous rollover hasn't finished yet, the keymgr
could decide to remove the first two keys, because it was not checking
for an indirect dependency on the keys.
In other words, the previous bug behavior was that the first two keys
were removed from the zone too soon.
This test case checks that all three keys stay in the zone, and no keys
are removed premature after another new key has been introduced.
(cherry picked from commit 9c40cf0566)
In the kasp script, if one expected key is not found, continue checking
the other key ids, even if there is no match for the first one. This
provides a bit more information which keys mismatch and makes for
easier debugging test failures.
(cherry picked from commit 674249f66a)
Make the cds/setup.sh compatible with the workaround which relies on
testing the TSAN_OPTIONS variable which may not be set.
(cherry picked from commit 76d9873ef6)
Surround the variables which are checked whether they're executable in
double quotes. Without them, empty paths won't be properly interpreted
as not executable.
(manually picked from commit 06056c44a7)
Since delv can occasionally hang in system tests when running with TSAN
(see GL#4119), disable these tests as a workaround. Otherwise, the hung
delv process will just waste CI resources and prevent any meaningful
output from the rest of the test suite.
(cherry picked from commit fbcf37f914)
Previously, the first check silently failed, as 450 is apparently (in
the CI) the minimum output size for the dnstap output, rather than
470 which the test was expecting. Effectively, the check served as a 5
second sleep rather than waiting for the proper file size.
Additionally, check the expected file sizes and fail if expectations
aren't met.
(manually picked from commit 5f809e50b6)
On main, the minimum file size seems to 454 bytes, while on some
platforms in our CI setup for the 9.16 branch, it appears to be 450
instead.
The log message is supposed to contain the zone name which was
erroneously omitted, but didn't pop up during tests, since return code
was silently ignored.
Now it actually waits for the proper log message rather than being an
equivalent of 3 second sleep (which was also sufficient to make the test
pass, thus we detected no failure).
(cherry picked from commit 1dd4c2b9e2)
The purpose of the check is to verify the server has survived the
previous barrage of queries. This is done by sending a query and
checking we get a NOERROR response back.
Previously, that query could've been affected by a servfail cache - the
server would return a SERVFAIL answer, thus failing the check, despite
being up and running. Use version.bind txt ch query to avoid the
interference of servfail cache.
(cherry picked from commit dd7bcd2855)
The $t1 value equals $t2 due to the time elapsed between "rndc
managed-keys status" calls being equal to the normal active refresh
period (as calculated per rules listed in RFC 5011 section 2.3) minus an
"hour" (as set using -T mkeytimers). This value equality is expected to
happen on really slow machines. On our Windows CI runner, it happens
very often.