Change the way arithmetic operations are performed in system test shell
scripts from using `expr` to `$(())`. This ensures that updating the
variable won't end up with a non-zero exit code, which would cause the
script to exit prematurely when `set -e` is in effect.
The following replacements were performed using sed in all text files
(git grep -Il '' | xargs sed -i):
s/status=`expr $status + $ret`/status=$((status + ret))/g
s/n=`expr $n + 1`/n=$((n + 1))/g
s/t=`expr $t + 1`/t=$((t + 1))/g
s/status=`expr $status + 1`/status=$((status + 1))/g
s/try=`expr $try + 1`/try=$((try + 1))/g
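The failure mode can be reproduced outside the test suite. The sketch below is illustrative only and not part of the change: it uses Python's `subprocess` to run two one-line `sh` scripts, and assumes a POSIX `sh` and `expr` are available on `PATH`. `expr` exits with status 1 when its result is 0, so under `set -e` the old form aborts before `echo` runs.

```python
import subprocess


def run_sh(script):
    """Run a script with the system `sh` and capture its output."""
    return subprocess.run(["sh", "-c", script], capture_output=True, text=True)


# Old form: `expr 0 + 0` prints 0 but exits with status 1, so `set -e`
# aborts the script at the assignment and `echo` never runs.
OLD = 'set -e; status=0; ret=0; status=`expr $status + $ret`; echo "status=$status"'

# New form: arithmetic expansion is not a command and has no exit status.
NEW = 'set -e; status=0; ret=0; status=$((status + ret)); echo "status=$status"'

old_result = run_sh(OLD)
new_result = run_sh(NEW)
print("expr :", old_result.returncode, repr(old_result.stdout))
print("$(()):", new_result.returncode, repr(new_result.stdout))
```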
Ensure all shell system tests are executed with the errexit option set.
This prevents unchecked return codes from commands in the test from
interfering with the tests, since any failures need to be handled
explicitly.
Add this test scenario for a bug fixed a while ago. When a third key is
introduced while the previous rollover hasn't finished yet, the keymgr
could decide to remove the first two keys, because it was not checking
for an indirect dependency on the keys.
In other words, the previous bug behavior was that the first two keys
were removed from the zone too soon.
This test case checks that all three keys stay in the zone, and no keys
are removed prematurely after another new key has been introduced.
In the kasp script, if one expected key is not found, continue checking
the other key ids, even if there is no match for the first one. This
provides a bit more information about which keys mismatch and makes
debugging test failures easier.
Pass 5 second timeout to the rndc status command(s) to avoid hitting the
hard 10 second timeout from subprocess.call, which would result in an
unwanted exception that would only mask the real issue: if the rndc
status times out in this test, it is likely due to the server not
stopping as it should.
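A minimal sketch of why the hard limit matters: `subprocess.call()` raises `subprocess.TimeoutExpired` when its `timeout` elapses, so a command that gives up on its own earlier never triggers that exception. The `launch()` helper and the stand-in child commands below are hypothetical; the real test runs rndc rather than a Python child.

```python
import subprocess
import sys

HARD_TIMEOUT = 10  # seconds; the limit passed to subprocess.call()


def launch(command, timeout=HARD_TIMEOUT):
    """Run a command; return its exit code, or None if the hard timeout hit."""
    try:
        return subprocess.call(command, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Without this handler the exception would propagate and hide
        # the reason the command was slow in the first place.
        return None


# Stand-ins for a command that exits on its own and one that hangs.
quick_child = [sys.executable, "-c", "raise SystemExit(3)"]
slow_child = [sys.executable, "-c", "import time; time.sleep(5)"]

print(launch(quick_child))              # the child's own exit code
print(launch(slow_child, timeout=0.5))  # hard timeout reached
```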
The shutdown test attempts to shut down the server using two different
methods - rndc and sigterm. Use pytest.mark.parametrize to run these as
separate test cases for easier identification of failures.
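As a rough illustration of the parametrization (not the actual test code), the sketch below assumes pytest is installed and uses two hypothetical stand-in functions in place of the real shutdown logic. pytest reports each parameter as its own test case, e.g. `test_named_shutdown[rndc]` and `test_named_shutdown[sigterm]`.

```python
import pytest


def stop_with_rndc():
    """Hypothetical stand-in for stopping the server via rndc."""
    return "stopped via rndc"


def stop_with_sigterm():
    """Hypothetical stand-in for stopping the server via SIGTERM."""
    return "stopped via sigterm"


STOPPERS = {"rndc": stop_with_rndc, "sigterm": stop_with_sigterm}


@pytest.mark.parametrize("kill_method", ["rndc", "sigterm"])
def test_named_shutdown(kill_method):
    # Each kill_method value becomes a separately reported test case.
    result = STOPPERS[kill_method]()
    assert result == "stopped via " + kill_method
```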
Surround the variables that are tested for being executable with
double quotes. Without the quotes, an empty path is not correctly
detected as not executable.
Since delv can occasionally hang in system tests when running with TSAN
(see GL#4119), disable these tests as a workaround. Otherwise, the hung
delv process will just waste CI resources and prevent any meaningful
output from the rest of the test suite.
tsig-keygen is now used to generate key files for TSIG. These have
a different format to those that were generated by dnssec-keygen.
Test that dig can still read these files.
tsig-keygen generates key files that are different to those that
were generated by dnssec-keygen. Check that nsupdate can still
read those old format files.
The ability to read legacy HMAC-MD5 K* keyfile pairs using algorithm
number 157 was accidentally lost when the algorithm numbers were
consolidated into a single block, in commit
09f7e0607a.
The assumption was that these algorithm numbers were only known
internally, but they were also used in key files. But since HMAC-MD5
got renumbered from 157 to 160, legacy HMAC-MD5 key files no longer
work.
Move HMAC-MD5 back to 157 and GSSAPI back to 160. Add exception for
GSSAPI to list_hmac_algorithms.
the default value of dnssec-validation is 'auto', which causes
a server to send a key refresh query to the root zone when starting
up. this is undesirable behavior in system tests, so this commit
sets dnssec-validation to either 'yes' or 'no' in all tests where
it had not previously been set.
this change had the mostly-harmless side effect of changing the cached
trust level of unvalidated answer data from 'answer' to 'authanswer',
which caused a few test cases in which dumped cache data was examined in
the serve-stale system test to fail. those test cases have now been
updated to expect 'authanswer'.
Previously, the first check silently failed, as 454 is apparently (in my
local setup) the minimum output size for the dnstap output, rather than
470 which the test was expecting. Effectively, the check served as a 5
second sleep rather than waiting for the proper file size.
Additionally, check the expected file sizes and fail if expectations
aren't met.
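The difference between "wait, then carry on regardless" and "wait, then fail" can be sketched as follows. The `wait_for_size()` helper is hypothetical (the test itself is a shell script); the sizes 454 and 470 are the ones mentioned above.

```python
import os
import tempfile
import time


def wait_for_size(path, min_size, timeout=5.0, interval=0.1):
    """Poll until the file is at least min_size bytes; report the outcome."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path) and os.path.getsize(path) >= min_size:
            return True
        if time.monotonic() >= deadline:
            return False  # the caller must treat this as a test failure
        time.sleep(interval)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "dnstap.out")
    with open(path, "wb") as f:
        f.write(b"x" * 454)
    reached_454 = wait_for_size(path, 454, timeout=1.0)
    reached_470 = wait_for_size(path, 470, timeout=0.3)

print(reached_454, reached_470)
```

If the return value is ignored, the second call degrades into a plain sleep, which is the behavior described above.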
The log message is supposed to contain the zone name which was
erroneously omitted, but didn't pop up during tests, since return code
was silently ignored.
Now it actually waits for the proper log message rather than being an
equivalent of 3 second sleep (which was also sufficient to make the test
pass, thus we detected no failure).
In HTTP/1.0 and HTTP/1.1, RFC 9112 section 9.6 says the last response
in a connection should include a `Connection: close` header, but the
statschannel server omitted it.
In an HTTP/1.0 response, the statschannel server can sometimes send a
`Connection: keep-alive` header when it is about to close the
connection. There are two ways:
If the first request on a connection is keep-alive and the second
request is not, then _both_ responses have `Connection: keep-alive`
but the connection is (correctly) closed after the second response.
If a single request contains
Connection: close
Connection: keep-alive
then RFC 9112 section 9.3 says the keep-alive header is ignored, but
the statschannel sends a spurious keep-alive in its response, though
it correctly closes the connection.
To fix these bugs, make it more clear that the `httpd->flags` are part
of the per-request-response state. The Connection: flags are now
described in terms of the effect they have instead of what causes them
to be set.
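The persistence rules from RFC 9112 section 9.3 that both bugs violate can be written down as a small per-request decision. This is a sketch in Python with hypothetical function names, not the statschannel code (which is C): `close` always wins, HTTP/1.1 persists by default, and HTTP/1.0 persists only when `keep-alive` is requested.

```python
def connection_persists(http_version, connection_tokens):
    """Decide persistence for one request, per RFC 9112 section 9.3."""
    tokens = {token.strip().lower() for token in connection_tokens}
    if "close" in tokens:
        return False  # "close" overrides any keep-alive in the same request
    if http_version == "1.1":
        return True   # HTTP/1.1 connections persist by default
    return "keep-alive" in tokens  # HTTP/1.0 persists only on request


def response_connection_header(http_version, connection_tokens):
    """Value of the Connection: header to send with this response."""
    if connection_persists(http_version, connection_tokens):
        return "keep-alive"
    return "close"


# Both headers in a single HTTP/1.0 request: the keep-alive is ignored.
print(response_connection_header("1.0", ["close", "keep-alive"]))
```

Because the decision takes only the current request as input, a keep-alive flag from an earlier request on the same connection cannot leak into a later response.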
The "dns_dnssec_findzonekeys2" log message is a leftover from when that
was the name of the function. Rename to match the current name of the
function.
When we add DNSKEY records via dynamic update, this should no longer
trigger signing the zone with these keys. This currently happens when
'find_zone_keys()' looks up the keys by inspecting the DNSKEY RRset,
then attempting to read the corresponding key files.
Add checks that inspect the logs to determine whether an attempt to
read the key files for the newly added keys was made (and failed
because these files are not available).
The purpose of the check is to verify the server has survived the
previous barrage of queries. This is done by sending a query and
checking we get a NOERROR response back.
Previously, that query could've been affected by a servfail cache - the
server would return a SERVFAIL answer, thus failing the check, despite
being up and running. Use version.bind txt ch query to avoid the
interference of servfail cache.
Prepare the fetchlimit system test for adding a clients-per-query
check. Change some functions and commands to accept a destination
NS IP address instead of using the hardcoded 10.53.0.3.
The get_core_dumps.sh script couldn't find and process core files of
out-of-tree configurations because it looked for them in the source
instead of the build directory.
Add a test case where, when priming the cache with a slow authoritative
resolver, the stale-answer-client-timeout option should not cause
a delegation to be returned to the client (it should wait until an
applicable answer is found, if no entry is found in the cache).
The check for 'would limit' log message is triggered by sending at least
three messages within one second. However, in extremely slow conditions
(currently when running with clang+TSAN in CI), the individual queries
might take too much time to send enough of them within one second.
Since this is a pretty rare condition, let's just silently skip this
test in environments where a single query takes more than 500 ms, since
there's no way to perform the check under such conditions.
Closes #4082
The 'update-nsec3.example' zone needs to be DNSSEC maintained via
dynamic update. Commit 03b22983cd20cec51ad8b9f25f2e7d0e472dc79c adds
checks to make sure the raw zone is not signed. So the test case needs
to be updated to allow for DNSSEC maintenance.
It should also be possible for a zone in multisigner model 2 to remove
previously added DNSKEY, CDS and CDNSKEY records from the zone operated
by the other provider.
Add a test case where updates are being made against a hidden primary
and two bump in the wire signers (the providers in the multisigner
model) serve the zone.
The test covers the same cases as for two primary providers that is:
- Add DNSKEY
- Remove (previously added) DNSKEY
- Add CDNSKEY
- Remove (previously added) CDNSKEY
- Add CDS
- Remove (previously added) CDS
It should also be possible for a zone in multisigner model 2 to publish
the CDS and CDNSKEY records from its KSK into the zone operated by the
other provider.
Add a new system test to test multisigner model use cases. This
initial test just tests a small part of the model 2, and uses two
providers for the same zone, ns3 and ns4, each with their own unique
key set. This commit tests that each provider can import the ZSK
of the other provider into its DNSKEY RRset, using dynamic update.
Both providers use dnssec-policy, ns3 applies the DNSSEC records
directly, while ns4 uses inline-signing.
The check which attempts to forward dynamic update to a dead primary may
trigger a timing issue #4080. For some reason, this has manifested under
the pytest runner, while the test still passes with the legacy runner.
Move the dead primary check closer to the end of the test to avoid
hitting this issue before we have a proper fix.
The module-level logger has a handler that writes into a temporary
directory. Ensure the logging output is flushed and the handler is
closed before attempting to remove this temporary directory.
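The ordering can be sketched with the standard `logging` module. The logger name, file name and helper functions below are hypothetical; the point is only that the handler is flushed, closed and detached before the directory holding its file is removed.

```python
import logging
import os
import shutil
import tempfile


def make_file_logger(name, directory):
    """Create a logger whose handler writes into the given directory."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(os.path.join(directory, "test.log.txt"))
    logger.addHandler(handler)
    return logger, handler


def close_and_remove(logger, handler, directory):
    """Flush and close the handler first, then delete the directory."""
    handler.flush()
    handler.close()
    logger.removeHandler(handler)
    shutil.rmtree(directory)


tmpdir = tempfile.mkdtemp()
logger, handler = make_file_logger("example.module", tmpdir)
logger.info("message written into the temporary directory")
close_and_remove(logger, handler, tmpdir)
```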
Previously, run.sh tried to use pytest's -k option for test selection.
The downside was that this filter expression matched any test case with
the given substring, rather than executing a system test suite with the
given name.
The run.sh has been rewritten to invoke pytest from a system test
directory instead. This behaves more consistently with the run.sh from
legacy system test framework.
run.sh is now also a shell script to avoid confusion regarding its
file extension.
EL8+ systems declare "which" function using environment variables in the
/etc/profile.d/which2.sh file. Because of our suboptimal environment
variable detection, which is required in order to support the legacy
runner, these variables are picked up by the pytest runner.
If subprocesses are spawned with these environment variables set, it
will cause the following issue when they spawn yet another subprocess:
/bin/sh: which: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `which'
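One way to keep such variables away from subprocesses is to drop them when the environment is assembled. The sketch below is an assumption about the shape of a workaround, not the actual code: the variable names `which_declare` and `BASH_FUNC_which%%` are assumed to be the ones exported on EL8+, and `filter_env()` is a hypothetical helper.

```python
# Assumed names of the variables exported by /etc/profile.d/which2.sh.
SKIPPED_VARIABLES = {"which_declare", "BASH_FUNC_which%%"}


def filter_env(env):
    """Return a copy of env without the exported `which` function."""
    return {
        name: value
        for name, value in env.items()
        if name not in SKIPPED_VARIABLES
    }


detected = {
    "PATH": "/usr/bin:/bin",
    "which_declare": "declare -f",
    "BASH_FUNC_which%%": "() {  ( alias;",  # truncated, hence the syntax error
}
clean = filter_env(detected)
print(sorted(clean))
```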
Instantiate a new logger that is used during pytest initialization /
configuration. This logging isn't handled by pytest itself, since it
happens outside of any tests or fixtures.
Root logger can't be reused for this purpose, because that would
duplicate the logs. Instead, create a conftest-specific logger for this
purpose.
Unfortunately, this introduces another log file,
pytest.conftest.log.txt, which contains only the logging from pytest
initialization. However, unless one is debugging the runner /
environment, there should be no need to investigate this file.
In order to take the most advantage of parallel execution of tests,
ensure certain long running tests are scheduled first.
The list of tests considered long-running was created empirically. In
addition to the test run time, its position in the default
(alphabetical) ordering was also taken into account.
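The reordering amounts to a stable sort with a priority list. In the sketch below the test names in `PRIORITY` are placeholders rather than the real empirically derived list, and `schedule_long_running_first()` is a hypothetical helper.

```python
# Placeholder priority list; the real list was derived empirically.
PRIORITY = ["serve-stale", "dnssec"]


def schedule_long_running_first(names, priority=PRIORITY):
    """Move prioritized tests to the front, keep the rest in place."""
    rank = {name: index for index, name in enumerate(priority)}
    # sorted() is stable, so tests that are not in the priority list keep
    # their original (alphabetical) relative order.
    return sorted(names, key=lambda name: rank.get(name, len(priority)))


collected = ["acl", "dnssec", "kasp", "serve-stale"]
print(schedule_long_running_first(collected))
```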
The logger fixture is provided as a test-level logging facility which
can be easily passed to tests to enable capturing and/or displaying
messages from tests written in Python.
While this works optimally with the pytest runner, messages on INFO
level or above will also be visible when using the legacy runner.
The test_zone_timers_secondary_json() and
test_zone_timers_secondary_xml() tests are affected by issue #3983. Due
to the way tests are run, they are only affected when executing them
with the pytest runner.
Strict mode is set for pytest runner, as it always fails there. The
strict mode ensures we'll catch the change when it starts passing
once the underlying issue is fixed. It can't be set for the legacy
runner, since the test (incorrectly) passes there.
Related #3983
If a test fails with an assertion failure or exception, its content
along with traceback is displayed in pytest output. This information
should be preserved in the test-specific logger for a given system test
to make it easier to debug test failures.
In order to avoid issues with decoding/encoding env variables due to
different encodings on different systems, deal with the environment
variables directly as bytes.
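A minimal sketch of what handling the environment as bytes can look like, assuming the variables arrive as `NAME=value` lines (for example from the output of `env`). The `parse_env()` helper and the sample data are hypothetical. No decoding takes place, so a value that is not valid in the local encoding passes through unchanged.

```python
import re

# Bytes pattern: variable name up to the first "=", then the raw value.
ENV_RE = re.compile(rb"([^=]+)=(.*)")


def parse_env(env_bytes):
    """Parse NAME=value lines into a dict of bytes keys and bytes values."""
    out = {}
    for line in env_bytes.splitlines():
        match = ENV_RE.match(line)
        if match:
            out[match.group(1)] = match.group(2)
    return out


# The second value is Latin-1 encoded and is not valid UTF-8.
raw = b"PORT=5300\nLABEL=caf\xe9\n"
parsed = parse_env(raw)
print(parsed)
```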
The loadscope setting is required for parallel execution of our system
tests using pytest. The option ensures that all tests within a single
(module) scope will be assigned to the same worker.
This is necessary because the worker sets up the nameservers for all
the tests within a module scope. If tests from the same module were
assigned to different workers, then the setup could happen multiple
times, causing a race condition. This happens because each module uses
deterministic port numbers for the nameservers.
Utilize developers' muscle memory to incentivize using the pytest runner
instead of the legacy one. The script also serves as a basic example of
how to run the pytest command to achieve the same results as the legacy
runner.
Invoking pytest directly should be the end goal, since it offers many
potentially useful options (refer to pytest --help).
In order to run the shell system tests, the pytest runner has to pick
them up somehow. Adding an extra python file with a single function
for the shell tests for each system test proved to be the most
compatible way of running the shell tests across older pytest/xdist
versions.
Modify the legacy run.sh script to ignore these pytest-runner specific
glue files when executing tests written in pytest.
Special care needs to be taken to support older pytest / xdist versions.
The target versions are what is available in EL8, since that seems to
have the oldest versions that can be reasonably supported.
When an issue occurs inside a fixture (e.g. servers fail to start/stop),
the test result won't be detected as failed, but rather an error will be
thrown.
To ensure the tempdir is kept even if the test itself passes but the
system_test() fixture throws an error, a different mechanism is needed.
At the start of the critical test setup section, note that the fixture
hasn't finished yet. When this is detected in the system_test_dir()
fixture, it is recognized as error in test setup/teardown and the temp
directory is kept.
This may seem cumbersome, because it is. It's basically a workaround for
the way pytest handles fixtures and test errors in general.
The temporary directory contains artifacts for the pytest module. That
module may contain multiple individual tests which were executed
sequentially. The artifacts should be kept if even one of these tests
failed.
Since pytest doesn't have any facility to expose test results to
fixtures, customize the pytest_runtest_makereport() hook to enable that.
It stores the test results into a session scope variable which is
available in all fixtures.
When deciding whether to remove the temporary directory, find the
relevant test results for this module and don't remove the tmpdir if any
one of the tests failed.
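The decision itself is a small lookup once the results are recorded. The sketch below assumes a hypothetical layout for the stored results (a dict mapping pytest node ids such as `module.py::test_name` to outcome strings); the real session-scoped variable may be shaped differently.

```python
def should_keep_tmpdir(module_path, results):
    """Keep the directory if any test from this module failed."""
    prefix = module_path + "::"
    return any(
        nodeid.startswith(prefix) and outcome == "failed"
        for nodeid, outcome in results.items()
    )


# Hypothetical recorded outcomes for two modules.
results = {
    "rpz/tests_rpz.py::test_one": "passed",
    "rpz/tests_rpz.py::test_two": "failed",
    "kasp/tests_kasp.py::test_one": "passed",
}
print(should_keep_tmpdir("rpz/tests_rpz.py", results))
print(should_keep_tmpdir("kasp/tests_kasp.py", results))
```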
For every pytest module, create a copy of its system test directory and
run the test from that directory. This makes it easier to clean up
afterwards and makes it less error-prone when re-running (failed) tests.
Configure the logger to capture the module's log output into a separate
file available from the temporary directory. This is quite convenient
for exploring failures.
Cases where the temporary directory should be kept are handled in
follow-up commits.
This is basically the pytest re-implementation of the run.sh script.
The fixture is applied to every module and ensures complete test
setup/teardown, such as starting/stopping servers, detecting coredumps
etc.
Note that the fixture system_test_dir is not defined yet. It is omitted
now for review readability and it's added in follow-up commits.
Add fixtures for deriving system test name from the directory name and
for a module-specific logger.
Add fixtures to execute shell and perl scripts. Similar to how run.sh
calls commands, this functionality is also needed in pytest. The
fixtures take care of switching to a proper directory, logging
everything and handling errors.
Note that the fixture system_test_dir is not defined yet. It is omitted
now for review readability and it's added in follow-up commits.
This is basically a pytest re-implementation of the get_ports.sh script.
The main difference is that ports are assigned on a module basis, rather
than a directory basis. Module is the new atomic unit for parallel
execution, therefore it needs to have unique ports to avoid collisions.
Each module gets its ports through the env fixture which is updated with
ports and other module-specific variables.
Some system tests require extra programs and/or dependencies to be
compiled first. This is done via `make check` with the automake
framework when using check_* variables such as check_PROGRAMS.
To avoid running any tests via the automake framework, set the TESTS
env variable to empty string and utilize `make -e check` to override
default Makefile variables with environment ones. This ensures automake
will only compile the needed dependencies without running any tests.
Additional consideration needs to be taken for xdist. The compilation
command should be called just once before any tests are executed. To
achieve that, use the pytest_configure() hook and check that the
PYTEST_XDIST_WORKER env variable isn't set -- if it is, it indicates
we're in the spawned xdist worker and the compilation was already done
by the main pytest process that spawned the workers.
This is mostly done to have on-par functionality with legacy test
framework. In the future, we should get rid of the need to run "empty"
make -e check and perhaps compile test-stuff by default.
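The "compile once, in the main process only" logic can be sketched as below. The helper names are hypothetical and the command is passed to a caller-supplied `run` callable rather than executed, so the sketch has no side effects. `PYTEST_XDIST_WORKER` is the variable pytest-xdist sets in its worker processes.

```python
import os


def is_xdist_worker(environ=os.environ):
    """True when running inside a worker spawned by pytest-xdist."""
    return "PYTEST_XDIST_WORKER" in environ


def build_only_invocation(environ=os.environ):
    """Command and environment that compile dependencies but run no tests."""
    env = dict(environ)
    env["TESTS"] = ""  # with `make -e`, this overrides the Makefile's TESTS
    return ["make", "-e", "check"], env


def maybe_compile(run, environ=os.environ):
    """Call run(command, env) once, from the main pytest process only."""
    if is_xdist_worker(environ):
        return False  # the main process has already compiled everything
    command, env = build_only_invocation(environ)
    run(command, env)
    return True


calls = []
maybe_compile(lambda cmd, env: calls.append((cmd, env)), {"PATH": "/bin"})
maybe_compile(lambda cmd, env: calls.append((cmd, env)), {"PYTEST_XDIST_WORKER": "gw0"})
print(len(calls))
```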
The commands executed by pytest during a system test need to have the
same environment variables set as if they were executed by the run.sh
shell script.
It was decided that for the moment, legacy way of executing system tests
with run.sh should be kept, which complicates things a bit. In order to
avoid duplicating the required variables in both conf.sh and pytest, it
was decided to use the existing conf.sh as the only authoritative
place for the variables.
It is necessary to process the environment variables from conf.sh right
when conftest.py is loaded, since they might be needed right away (e.g.
to test for feature support during test collection).
This solution is a bit hacky and is only meant to be used during the
transitory phase when pytest and the legacy run.sh are both
supported. In the future, a superior pytest-only solution should be
used.
For discussion of other options, refer to
https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/6809#note_318889
Ensure pytest picks up our python test modules, since we're using
tests_*.py convention, which is different from the default.
Configure pytest logging to display the output from all tests (even the
ones that passed). This ensures we have sufficient amount of information
to debug test post-mortem just from the artifacts.
The legacy system test framework uses pytest to execute some tests.
Since it'd be quite difficult to convince pytest to decide whether to
include conftest.py (or which ones to include when launching from
subdir), it makes more sense to have a shared conftest.py which is used
by both the legacy test runner invocations of pytest and the new pytest
system test runner. It is ugly, but once we drop support for the legacy
runner, we'll get rid of it.
Properly scope the *port fixtures in order to ensure they'll work as
expected with the new pytest runner. Instead of using "session" (which
means the fixture is only evaluated once for the entire execution of
pytest), use "module" scope, which is evaluated separately for each
module. The legacy runner invoked pytest for each system test
separately, while the new pytest runner is invoked once for all system
tests -- therefore it requires the more fine-grained "module" scope
for the fixtures to work properly.
Remove python shebang, as conftest.py isn't supposed to be an executable
script.
The line summarising TSAN reports was misplaced in the ASAN territory
and thus never used.
I also made core dumps, assertion failures, and TSAN reports detection
independent of each other.
When the FIPS provider is available, RSASHA1 signing keys for zone
"example.com." are ignored if the zone is attempted to be signed with
the dnssec-signzone "-F" (FIPS mode) option:
"fatal: No signing keys specified or found"
The upforwd test for forwarding updates to a dead primary can continue
running a little bit past its end, causing update replies to be
recorded during a subsequent test case. Correct this by only looking
for update requests and replies for the specific domain name being
tested at any given time.
After the RCU changes were merged, the `upforwd` test started
consistently failing when run under thread sanitizer. After some
investigation, it turned out that retry attempts were continuing after
the "update forwarding to dead primary" test. This caused mismatches
in the DNSTAP message counts for the subsequent tests, because they
were also counting retries.
Fix this problem by `wait`ing for the `nsupdate` processes to exit.
While investigating the bug, I replaced several fixed 15 second delays
with `wait_for_log`, so the test runs faster.
All the places the qp-trie code was using `call_rcu()` needed
`__tsan_release()` and `__tsan_acquire()` annotations, so
add a couple of wrappers to encapsulate this pattern.
With these wrappers, the tests run almost clean under thread
sanitizer. The remaining problems are due to `rcu_barrier()`
which can be suppressed using `.tsan-suppress`. It does not
suppress the whole of `liburcu`, because we would like thread
sanitizer to detect problems in `call_rcu()` callbacks, which
are called from `liburcu`.
The CI jobs have been updated to use `.tsan-suppress` by
default, except for a special-case job that needs the
additional suppressions in `.tsan-suppress-extra`.
We might be able to get rid of some of this after liburcu gains
support for thread sanitizer.
Note: the `rcu_barrier()` suppression is not entirely effective:
tsan sometimes reports races that originate inside `rcu_barrier()`
but tsan has discarded the stack so it does not have the
information required to suppress the report. These "races" can
be made much easier to reproduce by adding `atexit_sleep_ms=1000`
to `TSAN_OPTIONS`. The problem with tsan's short memory can be
addressed by increasing `history_size`: when it is large enough
(6 or 7) the `rcu_barrier()` stack usually survives long enough
for suppression to work.
Previously, if an exception would happen inside the `with` block, the
error handler would wait indefinitely for the process to end. That would
never happen, since the termination signal was never sent to named and
the test would get stuck.
Using the try-finally block ensures that the named process is always
killed and any exception or errors will be handled gracefully.
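The pattern can be sketched as follows, with a sleeping Python child standing in for named and a hypothetical `run_with_child()` helper. The `finally` clause runs whether or not the work raises, so the child is always signalled before anything waits for it.

```python
import subprocess
import sys

# Stand-in for a long-running server process.
CHILD = [sys.executable, "-c", "import time; time.sleep(60)"]


def run_with_child(work):
    """Start the child, run work(proc), and always terminate the child."""
    proc = subprocess.Popen(CHILD)
    try:
        return work(proc)
    finally:
        # Send the termination signal first; only then wait for the exit.
        proc.terminate()
        proc.wait(timeout=10)


seen = []


def failing_work(proc):
    seen.append(proc)
    raise RuntimeError("simulated test failure")


try:
    run_with_child(failing_work)
except RuntimeError as exc:
    print("propagated:", exc)
print("child exited:", seen[0].poll() is not None)
```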
Improve code readability by splitting the test into more functions. Some
could be re-used later on for more general-purpose subprocess handling
or named checks.
Add more tests to the dnstap system test to roll with different values.
Touch some files to make sure the number of existing files exceed the
number that we want to keep.
Add a test to the logfileconfig system test for the increment suffix.
When dns_request_create() failed in notify_send_toaddr(), sending the
notify would silently fail. When notify_done() failed, the error would
be logged on the DEBUG(2) level.
This commit remedies the situation by:
* Promoting several messages related to notifies to INFO level and adding
a "success" log message at the INFO level
* Adding a TCP fallback - when sending the notify over UDP fails, named
will retry sending notify over TCP and log the information on the
NOTICE level
* When sending the notify over TCP fails, it will be logged on the
WARNING level
Closes: #4001, #4002
There is no 'ret' in this test, and it is obvious that 'ret=1'
should be 'tmp=1' for the check to work correctly, if the string
is not found in the log file.
Add a test case to cover #3679 where a user migrates from a KSK/ZSK
split using auto-dnssec maintain, to the default dnssec-policy (CSK).
The test actually does not use the default dnssec-policy, but it does
use one that has the same keys clause. For testing convenience, we use
the same propagation time values as other test cases that migrate to
dnssec-policy with mismatching existing key set.
At the time of test number (19), there were 10 "sending packet to
10.53.0.7" lines in the "legacy/ns1/named.run" file; usually, only seven
are present:
I:legacy:checking recursive lookup to edns 512 + no tcp server does not cause query loops (19)
I:legacy:ns1 sent 10 queries to ns7, expected less than 10
I:legacy:failed
Those three can be attributed to tests "8", "10", and "18", where the
dig of "resolution_fails()" retried after a timeout to succeed with
"status: SERVFAIL" subsequently, as seen in each of
dig.out.test{8,10,18} files.
;; communications error to 10.53.0.1#13093: timed out
; <<>> DiG 9.19.12-dev <<>> -p 13093 +tcp @10.53.0.1 edns512-notcp. TXT
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 5368
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
This retry is unnecessary because "resolution_fails()" considers timeout
a positive result.
This change makes the zone table lock-free for reads. Previously, the
zone table used a red-black tree, which is not thread safe, so the hot
read path acquired both the per-view mutex and the per-zonetable
rwlock. (The double locking was to fix cleanup races on shutdown.)
One visible difference is that zones are not necessarily shut down
promptly: it depends on when the qp-trie garbage collector cleans up
the zone table. The `catz` system test checks several times that zones
have been deleted; the test now checks for zones to be removed from
the server configuration, instead of being fully shut down. The catz
test does not churn through enough zones to trigger a gc, so the zones
are not fully detached until the server exits.
After this change, it is still possible to improve the way we handle
changes to the zone table, for instance, batching changes, or better
compaction heuristics.
The dnspython.Resolve.resolve() call requires dnspython >= 2.0.0.
This wasn't enforced in the shutdown system test, leading to an infinite
loop waiting for the server to start due to the failing resolve() call.
We don't need a separate module/file for every test. Both the rpz tests
could live in the same file.
The setup/teardown of servers is performed separately for each module --
unless there is a need to do that, it's better to avoid it.
This adds a rudimentary test for response-policy zones in multiple
views. Different combinations are tested:
- two views with response-policy inherited from options {};
- two views with explicit response-policy using same RPZ zone name
- two views with explicit response-policy using secondary RPZ zone
* nsupdate should take 12 seconds (one try and three retries with
3 second timeout for each), UDP mode
* nsupdate -u 4 -r 1 should take 8 seconds (one try and one retry with
4 second timeout for each), UDP mode
* nsupdate -u 0 -t 8 -r 1 should also take 8 seconds, UDP mode
* nsupdate -u 4 -t 30 -r 1 should also take 8 seconds, as -u takes
precedence over -t, UDP mode
* nsupdate -t 8 -v should also take 8 seconds, TCP mode
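The expected durations follow from simple arithmetic, sketched below. The model is an assumption reconstructed from the figures in the list: the defaults are taken to be `-u 3`, `-r 3` and `-t 300`, a UDP timeout of 0 is taken to mean "derive the per-try timeout from `-t` and the number of tries", and TCP mode is bounded by `-t` alone. The real computation in nsupdate may differ in details such as rounding.

```python
def expected_duration(udp_timeout=3, udp_retries=3, timeout=300, tcp=False):
    """Seconds until nsupdate gives up on an unresponsive server (assumed model)."""
    if tcp:
        return timeout  # -v: a single TCP attempt bounded by -t
    tries = udp_retries + 1  # one try plus the retries
    if udp_timeout == 0:
        udp_timeout = timeout / tries  # -u 0: derive the interval from -t
    return tries * udp_timeout  # otherwise -u takes precedence over -t


print(expected_duration())                                         # defaults
print(expected_duration(udp_timeout=4, udp_retries=1))             # -u 4 -r 1
print(expected_duration(udp_timeout=0, udp_retries=1, timeout=8))  # -u 0 -t 8 -r 1
print(expected_duration(udp_timeout=4, udp_retries=1, timeout=30)) # -u 4 -t 30 -r 1
print(expected_duration(timeout=8, tcp=True))                      # -t 8 -v
```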
The checkds system test could fail if some parent secondary servers had
not yet loaded all the zones before ns9 started sending DS queries. This
leads to SERVFAIL responses, while the test case expects good DS
responses. In order to mitigate this issue, call 'rndc loadkeys'
to quickly restart the checkds procedure.
Also refactor the checkds system test, to get rid of the many zone
name duplications. Update the functions 'zone_check' and
'keystate_check' to make the zone name an FQDN so we can just pass
the 'zone' variable into the function.
If the 'checkds' option is not explicitly set, check if there are
'parental-agents' for the zone configured. If so, default to "explicit",
otherwise default to "yes".
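The defaulting rule reads as follows when written out. This is an illustrative Python sketch with a hypothetical function name, not the named configuration code.

```python
def effective_checkds(configured, has_parental_agents):
    """Resolve the checkds mode for a zone.

    configured is 'yes', 'no', 'explicit', or None when the option is unset.
    """
    if configured is not None:
        return configured  # an explicit setting always wins
    # Unset: fall back according to whether parental-agents are configured.
    return "explicit" if has_parental_agents else "yes"


print(effective_checkds(None, has_parental_agents=True))
print(effective_checkds(None, has_parental_agents=False))
print(effective_checkds("no", has_parental_agents=True))
```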
Add two new checkds test servers, that are hidden secondaries (hidden
as in not published in the NS RRset), that can be used specifically
for testing explicitly configured parental-agents.
Implement the new feature, automatic parental-agents. This is enabled
with 'checkds yes'.
When set to 'yes', instead of querying the explicit configured
parental agents, look up the parental agents by resolving the parent
NS records. The found parent NS RRset is considered to be the list
of parental agents that should be queried during a KSK rollover,
looking up the DS RRset corresponding to the key signing keys.
For each NS record, look up the addresses in the ADB. These addresses
will be used to send the DS requests. Count the number of servers and
keep track of how many good DS responses were seen.
The previous test cases already test the more complex case where there
are empty non-terminals between the child apex and the parent domain.
Add a test case where this is not the case, to execute the other code
path.
Add test cases for when checkds is disabled. Copy the test cases that
would have resulted in a DSPublish or DSRemoved and make sure that
with 'checkds no' the metadata is not set.
Add the test cases for automatic parental-agents, i.e. when 'checkds'
is set to 'yes'. Split out the special cases that use a reference
or a resolver as parental-agent so that the common use cases can be
tested with the same function.
Make the checkds system test more structured with the many more test
cases to come. Add a README for clarity.
Update the 'has_signed_apex_nsec' helper function so it can take any
domain name regardless of the number of labels.
Change the DNS tree structure such that we have different TLD names
for the various test scenarios, because we need servers that respond
differently to DS queries. Note that this isn't applicable to the
existing "checkds explicit" test cases, but is preparation work for
testing "checkds yes" (automatic parental agents).
Add a trust-anchor to the server that will be querying for parent
NS records.
Add a new configuration option to set how the checkds method should
work. Acceptable values are 'yes', 'no', and 'explicit'.
When set to 'yes', the checkds method is to look up the parental agents
by querying the NS records of the parent zone.
When set to 'no', no checkds method is enabled. Users should run
the 'rndc checkds' command to signal that DS records are published and
withdrawn.
When set to 'explicit', the parental agents are explicitly configured
with the 'parental-agents' configuration option.
Clean up the remnants of MS Compiler bits from <isc/refcount.h>, printing
the information in named/main.c, and clean up some comments about Windows
that no longer apply.
The bits in picohttpparser.{h,c} were left out, because it's not our
code.
hypothesis prior to 4.41.2 uses hashlib.md5, which is not FIPS
compliant, causing the wildcard system test to fail. Check if
we are running in FIPS mode and, if so, make the minimum version
of hypothesis we will accept 4.41.2.
The existing set of kerberos credential used deprecated algorithms
which are not supported by some implementations in FIPS mode.
Regenerate the saved credentials using more modern algorithms.
Added tsiggss/krb/setup.sh which sets up a test KDC with the required
principals for the system test to work. The tsiggss system test
needs to be run once with this active and KRB5_CONFIG appropriately
set. See tsiggss/tests.sh for an example of how to do this.
OPENSSL_CONF="" is treated differently to no OPENSSL_CONF in
the environment by OpenSSL. OPENSSL_CONF="" leads to crypto
failure being reported in FIPS mode.
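One way a test harness could avoid passing an empty value is to drop
the variable before starting a child process. This is only a sketch;
the helper name is hypothetical and it is not necessarily how the fix
was made.

```python
import os


def child_environment(base=None):
    """Copy an environment, dropping OPENSSL_CONF when it is empty.

    An unset variable and an empty one are not equivalent for
    OpenSSL, so the empty value is removed instead of passed on.
    """
    env = dict(os.environ if base is None else base)
    if env.get("OPENSSL_CONF") == "":
        del env["OPENSSL_CONF"]
    return env
```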
Call dst_lib_init to set FIPS mode if it was turned on at configure
time.
Check that named-checkconf reports dnssec policies that won't work
in FIPS mode if named would be running in FIPS mode.
Diffie-Hellman key exchange doesn't appear to work in FIPS mode for
OpenSSL 1.x.x. Add feature test (--have-fips-dh) to identify builds
where DH key exchanges work (non FIPS builds and OpenSSL 3.0.0+) and
exclude tests that would otherwise fail.
- RSASHA1 (5) and NSEC3RSASHA1 (7) are not accepted in FIPS mode
- minimum RSA key size is set to 2048 bit
adjust kasp and checkconf system tests to ensure non FIPS
compliant configurations are not used in FIPS mode
when testing the DNSRPS API, instead of linking to an installed
librpz.so from fastrpz, we now link to the test library. code that
ran dnsrpzd and checked the fastrpz license is now unnecessary and
has been removed.
two dnsrps-specific test cases in rpz (qname_as_ns and ip_as_ns) have
been removed, because they were only supported by fastrpz and do not
work in the test library. in rpzrecurse, nsip-wait-recurse and
nsdname-wait-recurse are now only tested in native mode, due to those
tests being specific to the native implementation.
These options and zone type were created to address the
SiteFinder controversy, in which certain TLDs redirected queries
rather than returning NXDOMAIN. since TLDs are now DNSSEC-signed,
this is no longer likely to be a problem.
The deprecation message for 'type delegation-only' is issued from
the configuration checker rather than the parser. therefore,
isccfg_check_namedconf() has been modified to take a 'nodeprecate'
parameter to suppress the warning when named-checkconf is used with
the command-line option to ignore warnings on deprecated options (-i).
Previously, an AXFR request would be issued every second while waiting
for the zone to be signed. This might've been the cause of issues in CI
where many tests are running in parallel and any extra load may increase
test instability.
Instead, check for the last NSEC record to have a signature before
commencing the AXFR request to check the zone has been fully signed.
Also increase the time for the zone signing to a total of 60+10 seconds
up from the previous 30.
Ensure messages from dupsigs system test end up in its log rather than
stdout. Previously, the output was hard to debug when running the tests
in parallel and messages wouldn't end up in the dupsigs.log.
stop and restart the server in the 'tsiggss' test, in order
to confirm that GSS negotiated TSIG keys are saved and restored
when named loads.
added logging to dns_tsigkey_createfromkey() to indicate whether
a key has been statically configured, generated via GSS negotiation,
or restored from a file.
If the zone already has existing NSEC/NSEC3 chains then zone_sign
needs to continue to use them. If there are no chains, use the
kasp setting if there is one; otherwise generate an NSEC chain.
The dnstap system test fails intermittently, and it appears to be
a timing issue - adding a short delay after running 'fstrm_capture',
and before running 'dnstap -reopen' improves the situation from
50% failures (5 out of 10 times) to 0% failures (0 out of 20 times),
tested locally.
The reason is that 'fstrm_capture' is executed in the background,
and due to OS scheduling and other factors, the listener socket
may not be ready when the following command runs and tells 'named'
to (re)open it.
Dumping of the freshly transferred zone file can take some time.
Retry 5 times before failing.
The log excerpt below shows such a case, when dumping lasted more than
two seconds.
06-Mar-2023 09:32:09.973 zone example6/IN: Transfer started.
06-Mar-2023 09:32:10.301 zone example6/IN: zone transfer finished: success
06-Mar-2023 09:32:10.301 zone_dump: zone example6/IN: enter
06-Mar-2023 09:32:11.789 client @0x7fe9ab435d68 10.53.0.10#44113 (example6): AXFR request
06-Mar-2023 09:32:11.801 client @0x7fe9ab435d68 10.53.0.10#44113 (example6): transfer of 'example6/IN': AXFR ended: 5 messages, 2676 records, 55815 bytes, 0.011 secs (5074090 bytes/sec) (serial 1397051952)
06-Mar-2023 09:32:12.409 zone_gotwritehandle: zone example6/IN: enter
06-Mar-2023 09:32:12.421 dump_done: zone example6/IN: enter
06-Mar-2023 09:32:12.421 zone_journal_compact: zone example6/IN: target journal size 53044
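The retry logic amounts to the following sketch. The helper name and
the one second delay between attempts are assumptions; only the count
of five attempts comes from the text above.

```python
import time


def retry(attempts, check, delay=1.0, sleep=time.sleep):
    """Call check() up to 'attempts' times; True on first success."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt + 1 < attempts:
            sleep(delay)  # give the zone dump time to finish
    return False
```

A caller would pass a check that looks for the dumped zone file, for
example retry(5, zone_file_is_complete), where zone_file_is_complete
is a hypothetical predicate.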
There can be comments in dig output for a zone transfer only in case
of an error, so we should print those errors not when wait_for_tls_xfer
succeeds, but when it fails.
Also, there is no point in printing those comments when a failure was
indeed expected.
The serve-stale system test was intermittently failing due to a timing
issue:
I:serve-stale:check stale data.example TXT was refreshed...
I:serve-stale:failed
The RRset is refreshed; however, the test first checks for an expected
log line prior to checking that the stale data.example TXT was refreshed
(using dig). This log line is there to ensure the record is actually
refreshed before we start querying again. Alternatively we could just
retry_quiet 10 <wait for dig output matches expectations>. It would
lower the chances for intermittent test failures, since there is no
longer a "check for log line, sleep one second if check fails, check
for log line, ...", prior to the check.
Completely remove the TKEY Mode 2 (Diffie-Hellman Exchanged Keying) from
BIND 9 (from named, named.conf and all the tools). The TKEY usage is
fringe at best and in all known cases, GSSAPI is being used as it should.
The draft-eastlake-dnsop-rfc2930bis-tkey specifies that:
4.2 Diffie-Hellman Exchanged Keying (Deprecated)
The use of this mode (#2) is NOT RECOMMENDED for the following two
reasons but the specification is still included in Appendix A in case
an implementation is needed for compatibility with old TKEY
implementations. See Section 4.6 on ECDH Exchanged Keying.
The mixing function used does not meet current cryptographic
standards because it uses MD5 [RFC6151].
RSA keys must be excessively long to achieve levels of security
required by current standards.
We might optionally implement Elliptic Curve Diffie-Hellman (ECDH) key
exchange mode 6 if the draft ever reaches the RFC status. Meanwhile the
insecure DH mode needs to be removed.
The trick is to configure a duplicate zone, which comes after the
catalog zone, where the duplicate zone is an existing member zone.
In that scenario, all the zones which come before the "faulty" zone
in the configuration file will fail to be reverted to the previous
version of the view after a reconfiguration error, and in this
particular case that will result in an assertion failure when the
catalog zone update is initiated, because it will be still tied to
the new version of the view, which was dismissed.
Building the bin/tests/system/rpz/dnsrps helper binary is currently not
possible at all as the necessary compiler and linker flag definitions
are missing from bin/tests/system/Makefile.am. Add these as a basis for
addressing the problem.
Unfortunately, this is where the "mostly" bit mentioned in this commit's
subject line comes into play. The dlopen() parts of DNSRPS code have
not yet been reworked to use libuv's dlopen() API (uv_dlopen() etc.)
(See commit 37b9511ce1 for prior work in
this area.) While it is certainly possible to do that, implementing
such a change without testing it in practice against a usable librpz.so
(i.e. a DNSRPS provider library) is bound to cause more trouble and
confusion than keeping the code the way it is right now. However,
making that code buildable as-is requires linking against a C standard
library that exports the dlopen(), dlsym(), and dlclose() symbols used
by the DNSRPS dynamic loading code. glibc 2.34+ satisfies that
requirement, but older glibc versions do not (these come with a separate
libdl shared library that would need to be linked in as well). (Other
C standard library implementations have not been examined.) Since the
long-term plan is to rely on libuv's dlopen() API exclusively and
detecting the shared object containing dlopen() & friends would only
pull in build system complexity for no good reason, assume for now that
the target system provides the dlopen() API in its C standard library.
This change enables the system test suite to be run for a BIND 9 build
prepared using --enable-dnsrps --enable-dnsrps-dl (on systems satisfying
the requirement explained above). However, it is important to note that
this change by itself does NOT enable actual testing of the DNSRPS
feature as doing that requires a DNSRPS provider library to be present
on the test host.
This implements node reference tracing that passes all the internal
layers from dns_db API (and friends) to increment_reference() and
decrement_reference().
It can be enabled by #defining DNS_DB_NODETRACE in <dns/trace.h> header.
The output then looks like this:
incr:node:check_address_records:rootns.c:409:0x7f67f5a55a40->references = 1
decr:node:check_address_records:rootns.c:449:0x7f67f5a55a40->references = 0
incr:nodelock:check_address_records:rootns.c:409:0x7f67f5a55a40:0x7f68304d7040->references = 1
decr:nodelock:check_address_records:rootns.c:449:0x7f67f5a55a40:0x7f68304d7040->references = 0
There's an associated Python script to find the missing detach, located at:
https://gitlab.isc.org/isc-projects/bind9/-/snippets/1038
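As a rough illustration of how such output can be processed, here is
a minimal sketch that is NOT the script linked above. It assumes the
line format shown in the sample output and that the last logged count
for a pointer reflects its outstanding references; reuse of a freed
address would confuse it.

```python
def leftover_references(lines):
    """Map (kind, pointer) to (count, last location) if count != 0."""
    last = {}
    for line in lines:
        head, sep, count = line.strip().partition("->references = ")
        if not sep:
            continue  # not a reference trace line
        fields = head.split(":")
        if len(fields) < 6 or fields[0] not in ("incr", "decr"):
            continue
        kind, func, filename, lineno = fields[1:5]
        # 'nodelock' lines carry two pointers; keep them together.
        pointer = ":".join(fields[5:])
        last[(kind, pointer)] = (int(count), f"{func}:{filename}:{lineno}")
    return {key: value for key, value in last.items() if value[0] != 0}
```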
Change the commandline option -G to take a string that determines what
sync records should be published. It is a comma-separated string with
each element being either "cdnskey", or "cds:<algorithm>", where
<algorithm> is a valid digest type. Duplicates are suppressed.
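The argument format can be illustrated with a short parser sketch.
The function name and the tuple representation are hypothetical, and
validation of the digest type itself is not modelled here.

```python
def parse_sync_records(value):
    """Split a -G style argument into unique sync record specs."""
    seen = []
    for element in value.split(","):
        if element == "cdnskey":
            spec = ("cdnskey", None)
        elif element.startswith("cds:") and element[4:]:
            # The digest type is kept as an opaque string.
            spec = ("cds", element[4:])
        else:
            raise ValueError(f"invalid sync record: {element!r}")
        if spec not in seen:  # duplicates are suppressed
            seen.append(spec)
    return seen
```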
Change one of the test cases to use a different digest type (4). The
system tests and kasp script need to be updated to take into account
the new algorithm (instead of the hard coded 2).
The test was setting a minimum count for recursive clients which
was not always being met (e.g. 91 instead of 100) producing a false
positive. Lower the lower bound on recursive clients for this
test to 1.
Add the 'ixfr-from-differences yes;' option to trigger a failed
zone postload operation when a zone is updated but the serial
number is not updated, then issue two successive 'rndc reload'
commands to trigger the bug, which causes an assertion failure.
the dns_xfrin module was still using the network manager directly to
manage TCP connections and send and receive messages. this commit
changes it to use the dispatch manager instead.
the 'dispatchmgr' member of the resolver object is used by both
the dns_resolver and dns_request modules, and may in the future
be used by others such as dns_xfrin. it doesn't make sense for it
to live in the resolver object; this commit moves it into dns_view.
the parser could crash when "include" specified an empty string in place
of the filename. this has been fixed by returning ISC_R_FILENOTFOUND
when the string length is 0.
Reproduce the assertion by configuring a 'named' resolver with
'recursive-clients 10;' configuration option and running 20
queries in parallel.
Also tweak the 'ans2/ans.pl' to simulate a 50ms network latency
when qname starts with "latency". This makes sure that queries
running in parallel don't get served immediately, thus allowing
the configured recursive clients quota limitation to be activated.
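Why the added latency matters can be shown with a toy model: a quota
only fills if earlier queries are still in flight when later ones
arrive. The sketch below is an assumption-laden simulation using
threads and a semaphore; it does not mirror how named handles its
recursive clients quota.

```python
import threading
import time


def run_queries(quota, clients, latency=0.05):
    """Return (accepted, rejected) for 'clients' simultaneous queries."""
    slots = threading.Semaphore(quota)
    attempted = threading.Barrier(clients)
    lock = threading.Lock()
    counts = {"accepted": 0, "rejected": 0}

    def query():
        got_slot = slots.acquire(blocking=False)
        with lock:
            counts["accepted" if got_slot else "rejected"] += 1
        # No slot is freed until every client has tried; this stands
        # in for answers that do not come back immediately.
        attempted.wait()
        if got_slot:
            time.sleep(latency)  # simulated 50ms upstream latency
            slots.release()

    threads = [threading.Thread(target=query) for _ in range(clients)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counts["accepted"], counts["rejected"]
```

With a quota of 10 and 20 clients, half of the simulated queries find
no free slot; with instant answers the quota might never be reached.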
move database attach/detach functions to db.c, instead of
requiring them to be implemented for every database type.
instead, they must implement a 'destroy' function that is
called when references go to zero.
this enables us to use ISC_REFCOUNT_IMPL for databases,
with detailed tracing enabled by setting DNS_DB_TRACE to 1.
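The idea of generic attach/detach with a per-implementation 'destroy'
hook can be sketched as follows. This is a toy model in Python with
hypothetical names, not the ISC_REFCOUNT_IMPL macro or the dns_db API.

```python
class Database:
    """Toy database whose generic attach/detach call destroy()."""

    def __init__(self, destroy):
        self.references = 1      # the creator holds the first reference
        self._destroy = destroy  # implementation-specific cleanup

    def attach(self):
        assert self.references > 0, "attach after destroy"
        self.references += 1
        return self

    def detach(self):
        assert self.references > 0, "detach without a reference"
        self.references -= 1
        if self.references == 0:
            # Only the last detach reaches the implementation.
            self._destroy(self)
```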
initialize dns_dbmethods, dns_sdbmethods and dns_rdatasetmethods
using explicit struct member names, so we don't have to keep track
of NULLs for unimplemented functions any longer.
some dns_db functions would have crashed if the DB implementation failed
to implement them, requiring the implementations to add functions that
did nothing but return ISC_R_NOTIMPLEMENTED or some obvious default
value. we can just have the dns_db wrapper functions themselves return
those values, and clean up the implementations accordingly.
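A sketch of the wrapper-side default, using a dictionary in place of
the method table. The names are illustrative only; the string stands
in for the ISC_R_NOTIMPLEMENTED result code.

```python
NOTIMPLEMENTED = "ISC_R_NOTIMPLEMENTED"  # stand-in for the result code


def call_method(methods, name, *args):
    """Dispatch to an optional method, defaulting to NOTIMPLEMENTED."""
    method = methods.get(name)
    if method is None:
        # The implementation did not provide this function, so the
        # wrapper answers instead of crashing.
        return NOTIMPLEMENTED
    return method(*args)
```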
as there is no further use of isc_task in BIND, this commit removes
it, along with isc_taskmgr, isc_event, and all other related types.
functions that accepted taskmgr as a parameter have been cleaned up.
as a result of this change, some functions can no longer fail, so
they've been changed to type void, and their callers have been
updated accordingly.
the tasks table has been removed from the statistics channel and
the stats version has been updated. dns_dyndbctx has been changed
to reference the loopmgr instead of taskmgr, and DNS_DYNDB_VERSION
has been updated as well.
change functions using isc_taskmgr_beginexclusive() to use
isc_loopmgr_pause() instead.
also, removed an unnecessary use of exclusive mode in
named_server_tcptimeouts().
most functions that were implemented as task events because they needed
to be running in a task to use exclusive mode have now been changed
into loop callbacks instead. (the exception is catz, which is being
changed in a separate commit because it's a particularly complex change.)
dns_request_create() and _createraw() now take a 'loop' parameter
and run the callback event on the specified loop.
as the task manager is no longer used, it has been removed from
the dns_requestmgr structure. the dns_resolver_taskmgr() function
is also no longer used and has been removed.
Include MD5 feature detection in featuretest tool and use it in some
places. When RHEL distribution or Fedora ELN is in FIPS mode, then MD5
algorithm is unavailable completely and even hmac-md5 algorithm usage
will always fail. Work that around by checking MD5 works and if not,
skipping its usage.
Those changes were dragged as downstream patch bind-9.11-fips-tests.patch
in Fedora and RHEL.
Tests using diff to compare outputs of dig +short shall ignore lines
starting with ";". In dig +short output, such lines should only be
present for errors such as network issues. Since we utilize dig's
default timeout/retry mechanisms, these transitory issues should be
ignored and only the final output should be considered during the diff
comparison.
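The filtering can be sketched like this. The helper names are
hypothetical, and the tests themselves may do this in shell rather
than Python.

```python
def significant_lines(output):
    """Drop lines starting with ';' from dig +short output."""
    return [line for line in output.splitlines()
            if not line.startswith(";")]


def same_answers(first, second):
    """Compare two outputs while ignoring ';' comment lines."""
    return significant_lines(first) == significant_lines(second)
```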
This adds an island of trust that is reachable from the root
where the trust anchors are added to island.conf.
This adds an island of trust that is not reachable from the root
where the trust anchors are added to private.conf.
Occasionally, the allotted 10 seconds for the "running" line to appear
in log after named is started proved insufficient in CI, especially
during increased load. Give named up to 60 seconds to start up to
mitigate this issue.
isc_bind9 was a global bool used to indicate whether the library
was being used internally by BIND or by an external caller. external
use is no longer supported, but the variable was retained for use
by dyndb, which needed it only when being built without libtool.
building without libtool is *also* no longer supported, so the variable
can go away.
Send the test message from ns3 to ns2 instead of ns2 to ns3 as ns2
is started first and therefore the test doesn't have to wait on the
resend of the NOTIFY message to be successful.
the nsupdate system test was intermittently failing due to the update
quota not being exceeded when it should have been. this is most likely
a timing issue: the client is sending updates too slowly, or the server
is processing them too quickly, for the quota to fill. this commit
attempts to make that failure less likely by increasing the number
of update transactions from 10 to 20.
After deleting the root trust anchor and reconfiguring the server,
it takes some time for the trust anchor to appear in 'rndc
managed-keys status' output. Retry several times.
check in the log files of receiving servers that the originating
ports for notify and SOA query messages were set correctly from
configured notify-source and transfer-source options.
* rbt node chains were sized to allow for bitstring labels, so they
had 256 levels; but in the absence of bitstrings, 128 is enough.
* dns_byaddr_createptrname() had a redundant options argument,
and a very outdated doc comment.
* A number of comments referred to bitstring labels in a way that is
no longer helpful. (A few informative comments remain.)
Set the DS state after issuing 'rndc dnssec -checkds'. If the DS
was published, it should go in RUMOURED state, regardless of whether it
is already safe to do so according to the state machine.
Leaving it in HIDDEN (or if it was magically already in OMNIPRESENT or
UNRETENTIVE) would allow for easy shoot in the foot situations.
Similarly, if the DS was withdrawn, the state should be set to
UNRETENTIVE. Leaving it in OMNIPRESENT (or RUMOURED/HIDDEN)
would also allow for easy shoot in the foot situations.
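The rule described above can be written down as a tiny sketch. The
function name and the 'published'/'withdrawn' strings are
hypothetical; the state names are the ones used in the text.

```python
DS_STATES = ("HIDDEN", "RUMOURED", "OMNIPRESENT", "UNRETENTIVE")


def ds_state_after_checkds(current, action):
    """Return the DS state after the operator signals a DS change.

    The current state is validated but otherwise ignored: the new
    state depends only on what the operator reported.
    """
    if current not in DS_STATES:
        raise ValueError(f"unknown DS state: {current!r}")
    if action == "published":
        return "RUMOURED"
    if action == "withdrawn":
        return "UNRETENTIVE"
    raise ValueError(f"unknown action: {action!r}")
```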
The malloced and maxmalloced memory counters were mostly useless since
we removed the internal allocator blocks - it would only differ from
inuse by the memory context size itself.
The validity default days value of 1 was used for debugging and
left as such accidentally.
Use 10950 days, as used elsewhere (for example, in doth test CA).
This does not affect anything, the value will be effective when
generating new test certificates in the future.
Change the 'forward' system test to enable DoT on ns2 server,
and test that forwarding from ns4 to the DoT-enabled ns2 works.
In order to test different scenarios, create a test CA (based on
similar CAs for 'doth' and 'nsupdate' system tests), and test
both insecure (no certificate validation) and secure (also with
mutual TLS) TLS configurations, as well as a configuration with an
expired certificate.
A 'tls' statement can be specified both for individual addresses
and for the whole list (as a default value when an individual
address doesn't have its own 'tls' set), just as it was done
before for the 'port' value.
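The defaulting rule can be sketched as follows. The data shapes and
names are assumptions made for illustration; they do not reflect the
configuration parser's internal structures.

```python
def effective_tls(addresses, default_tls=None):
    """Pair each address with its own 'tls' or the list default.

    'addresses' is a list of (address, tls_or_None) tuples.
    """
    return [(address, tls if tls is not None else default_tls)
            for address, tls in addresses]
```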
Create a new function 'print_rawqstring()' to print a string residing
in a 'isc_textregion_t' type parameter.
Create a new function 'copy_string()' to copy a string from a
'cfg_obj_t' object into a 'isc_textregion_t'.
Add a test case for a server that uses a resolver as a parental-agent.
We need two root servers, ns1 and ns10, one that delegates to the
'checkds' tld with the DS published (ns2), and one that delegates to
the 'checkds' tld with the DS removed (ns5). Both root zones are
being setup in the 'ns1/setup.sh' script.
We also need two resolvers, ns3 and ns8, that use different root hints
(one uses ns1 address as a hint, the other uses ns10).
Then add the checks to test_checkds.py, similar to the existing tests.
Update 'types' because for zones that have the DS withdrawn (or to be
withdrawn), the CDS and CDNSKEY records should not be published and
thus should not be in the NSEC bitmap.
Add 'port' token to deprecated.conf. Also add options
'use-v4-udp-ports', 'use-v6-udp-ports', 'avoid-v4-udp-ports',
and 'avoid-v6-udp-ports'.
All of these should trigger warnings (except when deprecation warnings
are being ignored).
Deprecate the use of "port" when configuring query-source(-v6),
transfer-source(-v6), notify-source(-v6), parental-source(-v6),
etc. Also deprecate use-{v4,v6}-udp-ports and avoid-{v4,v6}-udp-ports.