For Linux >= 6.8:
In 2023, Linux introduced a change to the IP_LOCAL_PORT_RANGE
socket option that eliminates the need for random window
shifting (implemented as a fallback in the next commit).
By setting the IP_LOCAL_PORT_RANGE option, we tell the kernel to use a
better approach to source port selection.
For Linux < 6.8:
This implements source port selection by randomly shifting a port
range, leveraging the IP_LOCAL_PORT_RANGE socket option. The network
manager is initialized with the ephemeral port range (on startup and on
reconfig), and then for every outgoing TCP connection we define a custom
port range (1000 ports) and randomly shift it within the system range.
This lets the kernel reduce the search space to the custom window
between <random_offset, random_offset + 1000>.
Reference:
https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance/#kernel
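The window selection described above can be sketched as follows. This
is only an illustration, not the network manager code: port_window() is
a hypothetical helper, the window is assumed to be inclusive and exactly
`width` ports wide, and the caller is assumed to supply the random
number. The packing (lower bound in the low 16 bits, upper bound in the
high 16 bits) follows the ip(7) description of IP_LOCAL_PORT_RANGE.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a window of `width` ports inside the inclusive range [lo, hi]
 * and pack it the way IP_LOCAL_PORT_RANGE expects: lower bound in the
 * low 16 bits, upper bound in the high 16 bits. `rnd` is any random
 * number supplied by the caller.
 */
static uint32_t
port_window(uint16_t lo, uint16_t hi, uint16_t width, uint32_t rnd) {
	assert(lo <= hi && width > 0);

	uint32_t span = (uint32_t)hi - lo + 1;
	if (span <= width) {
		/* The range is too small to shift within; use all of it. */
		return ((uint32_t)hi << 16) | lo;
	}

	/* Shift the window by a random offset that keeps it in range. */
	uint32_t offset = rnd % (span - width + 1);
	uint32_t wlo = lo + offset;
	uint32_t whi = wlo + width - 1;

	return (whi << 16) | wlo;
}
```

The packed value would then be passed to setsockopt() at the IPPROTO_IP
level with the IP_LOCAL_PORT_RANGE option before connect() is called.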
In 2015, Linux introduced a new socket option to overcome a TCP
limitation: when an application needs to force a source IP on an active
TCP socket, it has to use bind(IP, port=x). As most applications do not
want to deal with already used ports, x is often set to 0, meaning the
kernel is in charge of finding an available port. But the kernel does
not know yet whether this socket is going to be a listener or be
connected. The IP_BIND_ADDRESS_NO_PORT socket option asks the kernel to
ignore the 0 port provided by the application in bind(IP, port=0) and
only remember the given IP address. The port will be automatically
chosen at connect() time, in a way that allows sharing a source port as
long as the 4-tuples are unique.
Enable IP_BIND_ADDRESS_NO_PORT on the outgoing TCP sockets to overcome
this TCP limitation.
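A minimal sketch of the bind(IP, port=0) pattern with the option
enabled, assuming an IPv4 source address. bind_source_no_port() is a
hypothetical helper written for illustration, not the function used in
the network manager; the option is only set when the platform headers
define it.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind `fd` to the IPv4 source address `ip` without reserving a port.
 * Returns 0 on success and -1 on failure.
 */
static int
bind_source_no_port(int fd, const char *ip) {
	struct sockaddr_in src;

	assert(ip != NULL);

	memset(&src, 0, sizeof(src));
	src.sin_family = AF_INET;
	src.sin_port = htons(0); /* port is chosen later, at connect() */
	if (inet_pton(AF_INET, ip, &src.sin_addr) != 1) {
		return -1;
	}

#ifdef IP_BIND_ADDRESS_NO_PORT
	/* Ask the kernel to remember only the address, not a port. */
	int on = 1;
	if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &on,
		       sizeof(on)) == -1)
	{
		return -1;
	}
#endif

	return bind(fd, (struct sockaddr *)&src, sizeof(src));
}
```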
The function was already marked as never failing, always returning
ISC_R_SUCCESS, so there was a lot of dead code around checking whether
the result would be ISC_R_SUCCESS. This has been cleaned up.
When an attacker controlling a malicious DNS server returns a DNAME
record, the code stores a pointer to resp->foundname, frees the
response structure, then uses the dangling pointer in
dns_name_fullcompare(), possibly causing an invalid match. Only `delv`
is affected. This has been fixed.
Prevent retrying the notify over TCP when the source address is not
available, when the source and destination address families do not
match, or when the destination address has been blackholed. Properly
log hard notify failures.
When dns_request_create() fails in notify_send_toaddr(), the TSIG key
was not cleared when retrying over TCP, causing an assertion failure.
Set the TSIG key to NULL in the dns_message to prevent the assertion
failure.
In rdataset_getheader(), a cast of the raw buffer to dns_slabheader_t
and pointer arithmetic were used to get the start of the slabheader
structure. Use the more correct offsetof(dns_slabheader_t, raw) to
calculate the start of the dns_slabheader_t from the flexible member
raw[].
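The offsetof() technique can be illustrated with a simplified stand-in
structure. slabheader_t and header_from_raw() below are hypothetical
and only mirror the shape of the real code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a header ending in a flexible member. */
typedef struct slabheader {
	unsigned int count;
	unsigned char raw[];
} slabheader_t;

/* Recover the enclosing header from a pointer to its raw[] member. */
static slabheader_t *
header_from_raw(unsigned char *raw) {
	assert(raw != NULL);
	return (slabheader_t *)(raw - offsetof(slabheader_t, raw));
}
```

Subtracting the member offset stays correct if members are added or
reordered before raw[], which a hand-written offset would not.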
The count of items was stored in the raw data as the first two bytes.
Instead of reading this from the raw header, move the number of
items into the structure itself.
This needs the flexible member raw[] to be aligned on the size of the
pointer to prevent unaligned access to the start of the header from
rdataset_getheader() function that casts the raw[] to dns_slabheader_t.
After the rdataslab -> rdataslab,rdatavec split, there were a couple of
unused struct members. Remove all the unused members and reorder the
remaining members to eliminate the padding holes, thus reducing the
dns_slabheader_t and dns_slabtop_t structure sizes.
After the split into dns_rdataslab and dns_rdatavec, the
dns_rdataslab_merge() function was unused and it suffered from the same
data race as fixed in the previous commit. Instead of fixing it, just
remove the function and a bunch of other unused functions from the
dns_rdataslab unit.
The fetch loop detection occurred in two places: when
`dns_resolver_createfetch()` is invoked (walking up the chain of parent
fetches and stopping the fetch if a parent fetch has the same qname and
qtype) and right after calling `dns_adb_findname()` in the resolver
(stopping the fetch if the current fetch is for the same name as the ADB
lookup, and the ADB lookup needs to fetch it).
Regarding fetch loop detection at the `dns_resolver_createfetch()`
entry, there are cases where both qname and qtype are the same but the
zonecut is different. Such a fetch will then query different name
servers and get different responses. For instance, consider the
following parent-side delegation (both for `foo.example.` and
`dnshost.example.`):
foo.example. 3600 NS ns.dnshost.example.
dnshost.example. 3600 NS ns.dnshost.example.
ns.dnshost.example. 3600 A 1.2.3.4
Then the child-side of `dnshost.example.`:
dnshost.example. 300 NS ns.dnshost.example.
ns.dnshost.example. 300 A 1.2.3.4
Then the child-side of `foo.example.`:
foo.example. 3600 NS ns.dnshost.example.
a.foo.example. 300 A 5.6.7.8
Obviously, there is a misconfiguration between the parent side and the
child side of `dnshost.example.` (the TTL mismatch), but this
happens...
Because the resolver is currently child-centric, the parent-side
delegation's glue for `dnshost.example.` will be overridden by the
child side of the delegation. Once both A records expire, the
resolver will attempt to find the A RRs but will start from the
`foo.example.` zonecut, as the delegation itself is still valid.
Then the resolver will attempt to resolve `ns.dnshost.example.`, still
using the `foo.example.` zonecut, which will immediately trigger another
attempt to resolve `ns.foo.example.` (because the A RR is expired). This
is, however, _not_ a loop, because the second attempt will have the
`dnshost.example.` zonecut. And this changes everything, because the
resolver detects that the A name is in-domain and passes a flag to ADB
so `dns_view_find()` won't use the cache. As a result, the zonecut will
be `.`, and the hints (root servers) will be queried instead.
From that point, they'll return the parent-side delegation, which
includes the glue for `ns.dnshost.example/A`, and the resolution can
continue. Previously, this wouldn't be possible because a loop would be
detected from the second attempt to look up `ns.foo.example/A` and would
result in a SERVFAIL.
Now the loop detection is relaxed: a loop is detected only if the qname,
qtype _and_ zonecut are all equal.
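The relaxed check can be sketched as a walk up the chain of parent
fetches. This is a simplified model and not the resolver code:
fetch_t and fetch_is_loop() are hypothetical, and names are compared as
plain strings with strcmp(), whereas real DNS name comparison is
case-insensitive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of a fetch; names are plain strings. */
typedef struct fetch {
	const char *qname;
	int qtype;
	const char *zonecut;
	const struct fetch *parent;
} fetch_t;

/*
 * Walk the chain of parent fetches and report a loop only when a
 * parent matches the new fetch on qname, qtype and zonecut.
 */
static bool
fetch_is_loop(const fetch_t *parent, const char *qname, int qtype,
	      const char *zonecut) {
	assert(qname != NULL && zonecut != NULL);

	for (const fetch_t *f = parent; f != NULL; f = f->parent) {
		if (f->qtype == qtype && strcmp(f->qname, qname) == 0 &&
		    strcmp(f->zonecut, zonecut) == 0)
		{
			return true;
		}
	}
	return false;
}
```

With this check, a second fetch for the same name and type is allowed
as long as it starts from a different zonecut.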
This commit also changes the way the loop detection after
`dns_adb_createfind()` works. In the same example above, there would
be two ADB fetches with the same name, but with two different ADB flags
(the first one without DNS_ADB_STARTATZONE, the second one with that
flag). This means that there will be two fetches out of those two ADB
lookups, both legitimate, and not a loop (i.e. it won't be stuck). To
differentiate a find which merely has a pending fetch (which could be
from another find the current find has been attached to) from one that
started the fetch itself, a new find option `DNS_ADBFIND_STARTEDFETCH`
is introduced, which indicates that the current find has started a
fetch.
That way, if a find doesn't have the `DNS_ADBFIND_STARTEDFETCH` option
but has pending fetches, we know this is a find attached to a similar
find, so this is a loop. Otherwise, with `DNS_ADBFIND_STARTEDFETCH`, we
know that even if there is a pending fetch, this is not a loop, as the
fetch has just been started.
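The decision reduces to a small predicate. In the sketch below,
FIND_STARTEDFETCH is a hypothetical flag value standing in for
`DNS_ADBFIND_STARTEDFETCH`, and find_is_loop() is an illustrative
helper rather than the actual resolver function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value; the real constant lives in the ADB headers. */
#define FIND_STARTEDFETCH 0x01u

/*
 * A find with pending fetches is treated as a loop only if this find
 * did not start a fetch itself.
 */
static bool
find_is_loop(unsigned int options, bool has_pending_fetch) {
	return has_pending_fetch && (options & FIND_STARTEDFETCH) == 0;
}
```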
The ADB entry window and ADB minimum cache time can be tweaked using
`named -T adbentrywindow=<unsigned int>` and `named -T
adbmincache=<unsigned int>`.
While those values don't need to be exposed to the operator, this is
needed to system test ADB behavior without having to wait as long as
the default values require.
It's potentially confusing to use "resp_rdataset" for QNAME
minimization, but we can make it a union and have resp.rdataset
and qmin.rdataset using the same memory.
We can save even more space by using the same union to combine
qminname and resp_foundname and access them as qmin.name and
resp.foundname.
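The shared layout can be pictured with a C11 anonymous union. The types
below are hypothetical stand-ins (the real fetch context has many more
members); the point is only that resp.* and qmin.* occupy the same
storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real rdataset and name types. */
typedef struct rdataset {
	int type;
} rdataset_t;

typedef struct name {
	const char *text;
} name_t;

/*
 * The response view and the QNAME minimization view are not needed at
 * the same time, so they can share storage.
 */
typedef struct fetchctx {
	union {
		struct {
			rdataset_t *rdataset;
			name_t foundname;
		} resp;
		struct {
			rdataset_t *rdataset;
			name_t name;
		} qmin;
	};
} fetchctx_t;
```

Because both inner structs have the same member types in the same
order, the union adds no size beyond one of them.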
Two rdataset properties, `qminrrset` and `qminsigrrset`, are removed
from the fetch context. They are only used as temporary storage for the
result of the qmin query, and are immediately detached in
`resume_qmin` once the query is over.
As an alternative, use `resp_rdataset` and `resp_sigrdataset`
instead; those are not needed for storing the response data until
after qmin_resume() is over.
Instead of first copying query response data into each fetch response
and then iterating again to send the response to the caller, perform
both operations in one go.
Also removed some duplicate code.
There is no longer a need to decide whether a fetch response should be
prepended or appended to the fetch response list. As query response data
is stored directly in the fetch context object, responses containing a
sigrdataset no longer need to be ordered first. Remove the code
implementing this logic.
Additionally, the distinction between `fetchstate_done` and
`fetchstate_sendevents` is no longer needed. New client
`dns_fetchresponse_t` objects can be attached to the fetch context at
any time until `fctx__done()` is called, since there is no dependency on
the first fetch response in the list. This simplifies the code and
slightly reduces locking.
Query answers are now stored in dedicated fetch context properties,
instead of using `ISC_LIST_HEAD(fctx->resps)`.
This reduces lock critical section usage in some places, and enables
further simplifications. (In particular, it removes the need for special
logic to prepend a fetch response to the list when it contains a
sigrdataset.)
Instead of cloning fetch responses immediately after writing to the
head of the fetch response list, defer cloning until the events are
actually sent.
This removes the need for the `fctx->cloned` state. However, a new
fetch state value, fetchstate_sentevents, is introduced and occurs
after fetchstate_done. To prevent new fetch responses from being
prepended after the head is written but before cloning occurs,
fetchstate_done is now set at all call sites that previously invoked
`clone_results()`.
When RRSIG(rdtype) was independently cached before the RDATA for the
rdtype itself, named would crash on the subsequent query for the RDATA
itself. This has been fixed.
ISC would like to thank Vitaly Simonovich for bringing this
vulnerability to our attention.
In dst_gssapi_acceptctx(), the gnamebuf could leak a little bit of
memory if dns_name_fromtext() were to fail. This would require a
Kerberos principal with an invalid DNS name.
The description in the protobuf specification is not a list of request
types to process but rather a list of examples to qualify the
description of whether the time indicates when the message is received
or sent.
A lingering `sizeof` from the prototype era of !11094 caused the
key-wipe in `isc_hmac_key_destroy` to use `sizeof(key->len)` instead of
`key->len` for the length argument of `isc_safe_memwipe`.
This results in zero bytes being written past the end of HMAC keys that
are shorter than 4 bytes. As such, the overflow can only be visible in
keys that are shorter than 32 bits, which is beyond broken, and creating
such keys is only possible in testing.
Therefore, this change is *not* a security fix since the conditions are
never reachable in any imaginable deployment scenario.
Builds that use OpenSSL >=3.0 are unaffected, as the `sizeof` only
remained in pre-3.0 builds.
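The difference between the two length arguments can be seen in a small
model. hmac_key_t, wipe() and key_wipe() below are hypothetical and do
not reproduce the layout of the real key structure or the
`isc_safe_memwipe` implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical key layout: a length and a pointer to `len` bytes. */
typedef struct hmac_key {
	unsigned int len;
	unsigned char *secret;
} hmac_key_t;

/* Overwrite `len` bytes through a volatile pointer to keep the stores. */
static void
wipe(void *ptr, size_t len) {
	volatile unsigned char *p = ptr;
	while (len-- > 0) {
		*p++ = 0;
	}
}

static void
key_wipe(hmac_key_t *key) {
	assert(key != NULL && key->secret != NULL);
	/*
	 * The length is the value stored in key->len. Passing
	 * sizeof(key->len) instead would give the size of the length
	 * field itself, regardless of how long the key is.
	 */
	wipe(key->secret, key->len);
	key->len = 0;
}
```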
This is a bit of a namespace convention violation but it fits the spirit of
this header since it is exposing OpenSSL-isms to others.
Further work is needed to make sure the exposed EVP_MD isn't needed
anymore.
Support for using `EVP_SIGNATURE` explicit algorithms for signatures
was added in OpenSSL 3.4, and so it is skipped for the initial OpenSSL
version-specific code splitting.
While it was the best place at the time, tlserr2result doesn't belong
inside the TLS code since it is generic to OpenSSL and mostly used in
the dst interface. The newly created ossl_wrap interface is the ideal
place for flushing the OpenSSL thread error queue.
Instead of the `EVP_MD_CTX` based functions, use either the new
`EVP_MAC` or the old `HMAC_CTX` based functions.
`EVP_MAC` is the recommended way of using MAC functions post-3.0,
while `HMAC_CTX` is used internally by `EVP_MD_CTX`, making the latter
redundant.
Get rid of the OpenSSL-isms that plague the codebase where the hash type
is `EVP_MD *`.
By using a proper enum, alongside the cleanup, we also get the ability
to use constants for known hash sizes instead of having a function call
every time.
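As an illustration of the idea, an enum-based hash type lets digest
sizes be compile-time constants. All identifiers below are hypothetical
and chosen for the example; they are not the names introduced by this
change. The sizes themselves are the standard digest lengths in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum; the identifiers are illustrative only. */
typedef enum md_type {
	MD_SHA1,
	MD_SHA256,
	MD_SHA384,
	MD_SHA512,
} md_type_t;

/* Digest sizes in bytes are fixed per algorithm. */
enum {
	MD_SHA1_SIZE = 20,
	MD_SHA256_SIZE = 32,
	MD_SHA384_SIZE = 48,
	MD_SHA512_SIZE = 64,
};

/* Map a hash type to its digest size without calling into OpenSSL. */
static size_t
md_size(md_type_t type) {
	switch (type) {
	case MD_SHA1:
		return MD_SHA1_SIZE;
	case MD_SHA256:
		return MD_SHA256_SIZE;
	case MD_SHA384:
		return MD_SHA384_SIZE;
	case MD_SHA512:
		return MD_SHA512_SIZE;
	}
	assert(0 && "unknown hash type");
	return 0;
}
```

Constants such as MD_SHA256_SIZE can also size stack buffers directly,
which a function call returning the size cannot do portably.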
`EVP_MD_CTX_get0_md` has been removed instead of being adapted since it
wasn't used anymore.
Dealing with OpenSSL has rapidly become unwieldy as post-3.0 changes
turn the library into a different beast.
Start treating pre-3.0 and post-3.0 versions differently for easier
maintenance.