With the resolver now legitimately escalating to TCP after repeated
UDP timeouts to the same authoritative, each lame-server lookup
takes ~50% longer to fail. The recursive-client backlog therefore
peaks a little higher before the fetches-per-server auto-tune drops
the quota below 200.
Bump the upper bound for the burst-against-lame-server and recovery
steps from 200 to 250 to absorb that extra latency. The lower bound
and the final post-recovery target (clients <= 20) are unchanged.
Assisted-by: Claude:claude-opus-4-7
The removal has been done with the following command:
find bin/tests/system/ -type f -name "*.db*" -exec sed -i '1,10d; 11{/^$/d}' {} +
The following files have been handled manually, since they already
didn't have the license info, or had it in a slightly different format:
bin/tests/system/ssutoctou/ns1/example.db.in
bin/tests/system/checkzone/zones/crashzone.db
bin/tests/system/checkzone/zones/warn.deprecated.cds-sha1.db
bin/tests/system/checkzone/zones/warn.deprecated.digest-sha1.db
bin/tests/system/checkzone/zones/warn.deprecated.ds-alg.db
bin/tests/system/legacy/ns6/edns512.db.signed
The removal was done with the following commands:
find bin/tests/system/ -type f -name "*.conf" -exec sed -i '1,12d; 13{/^$/d}' {} +
find bin/tests/system/ -type f -name "*.conf.*" -exec sed -i '1,12d; 13{/^$/d}' {} +
When a referral lookup is triggered by a QMIN query, it should be
exempt from the fetches-per-zone limit just as the QMIN query itself
is.
Also restart the test server between the fetches-per-server and
fetches-per-zone tests so that leftover statistics from the former
do not pollute the latter.
Another fix is because zone spills and general query drops are no longer
in a strict >= relation (on a parent-centric resolver), so check that
both counters are non-zero instead.
Previously, the server relied on the modules being imported by the
isctest.asyncserver module. This is fragile and confuses tooling.
Clean up stray imports in the process.
When this class was introduced, the constructor of its base class had no
parameters. This was changed in the meantime and these parameters were
not accessible by users of the subclass.
Don't override the constructor.
Move command setup to methods.
Move subclass-specific storage to cached properties.
Take instances of Command instead of the classes themselves for
symmetry with install_response_handler.
The following tests use multiple named configs. Previously, these have
been rendered with copy_setports in tests.sh when needed. Transform
these into jinja2 templates and render them during setup. In the tests,
the copy_setports invocations can be then replaced with a simple cp.
Maintaining compatibility with pre-2.0.0 dnspython became cumbersome
leading to failure in nightly CI jobs which are the only ones that run
with dnspython this old.
Abort all AsyncServer instances when running with old dnspython. Add an
importor skip for all system tests using isctest.asyncserver.
It's possible to use pytest.mark.flaky, which achieves the exact same
thing as our custom-defined isctest.mark.flaky -- attempts to rerun the
test on failure, but only is flaky package is available.
The fetchlimit test has failed 8 times in the nightly CI over the past
three weeks. That makes the overall failure rate somewhere around 1 %,
which isn't a lot, but is still annoying when lots of testing is going
on.
There are many system tests where we set `dnssec-validation yes;` only
to also set `trust-anchors { };` which effectively disables the
validation.
This commit replaces this convoluted setup with just
`dnssec-validation no;`.
The ditch.pl script is used to generate burst traffic without waiting
for the responses. When running other tests in parallel, this can result
in a ephemeral port clash, since the ditch.pl process closes the socket
immediately. In rare occasions when the message ID also clashes with
other tests' queries, it might result in an UnexpectedSource error from
dnspython.
Use a dedicated port EXTRAPORT8 which is reserved for each test as a
source port for the burst traffic.
Explicitly use an empty 'trust-anchors' statement in the system
tests where it was used implicitly before.
In resolver/ns5/named.conf.in use the trust anchor in 'trusted.conf',
which was supposed to be used there.
The lock-file configuration (both from configuration file and -X
argument to named) has better alternatives nowadays. Modern process
supervisor should be used to ensure that a single named process is
running on a given configuration.
Alternatively, it's possible to wrap the named with flock(1).
All changes in this commit were automated using the command:
shfmt -w -i 2 -ci -bn . $(find . -name "*.sh.in")
By default, only *.sh and files without extension are checked, so
*.sh.in files have to be added additionally. (See mvdan/sh#944)
The old name "common" clashes with the convention of system test
directory naming. It appears as a system test directory, but it only
contains helper files.
To reduce confusion and to allow automatic detection of issues with
possibly missing test files, rename the helper directory to "_common".
The leading underscore indicates the directory is different and the its
name can no longer be confused with regular system test directories.
The changes were mostly done with sed:
find . -name '*.sh' | xargs sed -i 's/`\([^`]*\)`/$(\1)/g'
There have been a few manual changes where the regex wasn't sufficient
(e.g. backslashes inside the `...`) or wrong (`...` referring to docs or
in comments).
Ensure all shell system tests are executed with the errexit option set.
This prevents unchecked return codes from commands in the test from
interfering with the tests, since any failures need to be handled
explicitly.
Prepare the fetchlimit system test for adding a clients-per-query
check. Change some functions and commands to accept a destination
NS IP address instead of using the hardcoded 10.53.0.3.
In order to run the shell system tests, the pytest runner has to pick
them up somehow. Adding an extra python file with a single function
for the shell tests for each system test proved to be the most
compatible way of running the shell tests across older pytest/xdist
versions.
Modify the legacy run.sh script to ignore these pytest-runner specific
glue files when executing tests written in pytest.
The test was setting a minimum count for recursive clients which
was not always being met (e.g. 91 instead of 100) producing a false
positive. Lower the lower bound on recursive clients for this
test to 1.
"rndc fetchlimit" now also prints a list of domain names that are
currently rate-limited by "fetches-per-zone".
The "fetchlimit" system test has been updated to use this feature
to check that domain limits are applied correctly.
this command runs dns_adb_dumpquota() to display all servers
in the ADB that are being actively fetchlimited by the
fetches-per-server controls (i.e, servers with a nonzero average
timeout ratio or with the quota having been reduced from the
default value).
the "fetchlimit" system test has been updated to use the
new command to check quota values instead of "rndc dumpdb".
Check that the recursing client count is above a reasonable
minimum, as well as below a maximum, so that we can detect
bugs that cause recursion to fail too early or too often.
The fetchlimit test depends on a resolver continuing to try UDP
and timing out while the client waits for resolution to succeed.
but since commit bb990030 (flag day 2020), a fetch will always
switch to TCP after two timeouts, unless EDNS was disabled for
the query.
This commit adds "edns no;" to server statements in the fetchlimit
resolver, to restore the behavior expected by the test.
This commit converts the license handling to adhere to the REUSE
specification. It specifically:
1. Adds used licnses to LICENSES/ directory
2. Add "isc" template for adding the copyright boilerplate
3. Changes all source files to include copyright and SPDX license
header, this includes all the C sources, documentation, zone files,
configuration files. There are notes in the doc/dev/copyrights file
on how to add correct headers to the new files.
4. Handle the rest that can't be modified via .reuse/dep5 file. The
binary (or otherwise unmodifiable) files could have license places
next to them in <foo>.license file, but this would lead to cluttered
repository and most of the files handled in the .reuse/dep5 file are
system test files.
when "checking lame server clients are dropped below the hard limit",
periodically a query is sent for a name for which the server is
authoritative, to verify that legitimate queries can still be
processed while the server is dealing with a flood of lame delegation
queries. those queries used the same dig options as elsewhere in the
fetchlimit test, including "+tries=1 +timeout=1". on slow systems, a
1-second timeout may be insufficient to get an answer even if the server
is behaving well. this commit increases the timeout for the check
queries to 2 seconds in hopes that will be enough to eliminate test
failures in CI.
The ISC_MEM_DEBUGSIZE and ISC_MEM_DEBUGCTX did sanity checks on matching
size and memory context on the memory returned to the allocator. Those
will no longer needed when most of the allocator will be replaced with
jemalloc.
The system test framework starts all named instances with the "-d 99"
command line option (unless it is overridden by a named.args file in a
given instance's working directory). This causes a lot of log messages
to be written to named.run files - currently over 5 million lines for a
single test suite run. While debugging information preserved in the log
files is essential for troubleshooting intermittent test failures, some
system tests involve sending hundreds or even thousands of queries,
which causes the relevant log files to explode in size. When multiple
tests (or even multiple test suites) are run in parallel, excessive
logging contributes considerably to the I/O load on the test host,
increasing the odds of intermittent test failures getting triggered.
Decrease the debug level for the seven most verbose named instances:
- use "-d 3" for ns2 in the "cacheclean" system test (it is the lowest
logging level at which the test still passes without the need to
apply any changes to tests.sh),
- use "-d 1" for the other six named instances.
This roughly halves the number of lines logged by each test suite run
while still leaving enough information in the logs to allow at least
basic troubleshooting in case of test failures.
This approach was chosen as it results in a greater decrease in the
number of lines logged than running all named instances with "-d 3",
without causing any test failures.
The $SYSTEMTESTTOP shell variable if often set to .. in various shell
scripts inside bin/tests/system/, but most of the time it is only
used one line later, while sourcing conf.sh. This hardly improves
code readability.
$SYSTEMTESTTOP is also used for the purpose of referencing
scripts/files living in bin/tests/system/, but given that the
variable is always set to a short, relative path, we can drop it and
replace all of its occurrences with the relative path without adversely
affecting code readability.
this changes most visble uses of master/slave terminology in tests.sh
and most uses of 'type master' or 'type slave' in named.conf files.
files in the checkconf test were not updated in order to confirm that
the old syntax still works. rpzrecurse was also left mostly unchanged
to avoid interference with DNSRPS.