This commit adds an initial implementation of the isc_nm_streamdnssocket
transport: a unified transport for DNS messages carried over stream
protocols, which is capable of replacing both the TCP DNS and TLS DNS
transports. Currently, the interface it provides is a unified set of the
interfaces provided by both of the transports it attempts to replace.
The transport is built around "isc_dnsbuffer_t" and
"isc_dnsstream_assembler_t" objects and attempts to minimise both the
number of memory allocations during network transfers as well as
memory usage.
The added function provides the interface for getting an ALPN tag
negotiated during TLS connection establishment.
The new function can be used by higher level transports.
This commit adds manual read timer control mode, similarly to TCP.
This way the read timer can be controlled manually using:
* isc__nmsocket_timer_start();
* isc__nmsocket_timer_stop();
* isc__nmsocket_timer_restart().
The change is required to make it possible to implement more
sophisticated read timer control policies in DNS transports, built on
top of TLS.
This commit adds a manual read timer control mode to the TCP
code (adding isc__nmhandle_set_manual_timer() as the interface to it).
Manual read timer control mode suppresses restarting the read timer
when receiving any amount of data. This way the read timer
can be controlled manually using:
* isc__nmsocket_timer_start();
* isc__nmsocket_timer_stop();
* isc__nmsocket_timer_restart().
The change is required to make it possible to implement more
sophisticated read timer control policies in DNS transports, built on
top of TCP.
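The manual mode described above can be pictured with a minimal sketch. It is not the netmgr code: toy_sock_t, toy_timer_restart() and toy_on_data() are hypothetical stand-ins, and the timer is reduced to a restart counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a netmgr socket; not the real isc_nmsocket_t. */
typedef struct {
	bool manual_read_timer;      /* true once manual mode is requested */
	unsigned int timer_restarts; /* how often the read timer was re-armed */
} toy_sock_t;

/* Stand-in for a timer restart call; it only counts the restarts. */
static void
toy_timer_restart(toy_sock_t *sock) {
	sock->timer_restarts++;
}

/* Called whenever any amount of data arrives on the socket. */
static void
toy_on_data(toy_sock_t *sock, size_t nbytes) {
	if (nbytes == 0) {
		return;
	}
	if (!sock->manual_read_timer) {
		/* Default mode: any received data re-arms the read timer. */
		toy_timer_restart(sock);
	}
	/* Manual mode: the upper layer decides when to restart the timer. */
}
```

In manual mode a DNS transport could, for example, restart the timer only once a complete message has been assembled.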
This commit adds implementation of isc__nmsocket_timer_restart() and
isc__nmsocket_timer_stop() for generic TLS code in order to make its
interface more compatible with that of TCP.
This commit adds implementations of isc_nm_bad_request() and
isc__nmsocket_reset() to the generic TLS stream code in order to make
it more compatible with TCP code.
When isc_buffer_t buffer is created with isc_buffer_allocate() assume
that we want it to always auto-reallocate instead of having an extra
call to enable auto-reallocation.
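As an illustration of what auto-reallocation means for the caller, here is a minimal sketch of a buffer that grows on demand. The toy_buffer_* names and the doubling strategy are assumptions for the example and do not reflect the isc_buffer_t internals.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; not the real isc_buffer_t. */
typedef struct {
	unsigned char *base;
	size_t used;
	size_t length;
} toy_buffer_t;

/* Allocate a buffer that is always allowed to grow. */
static toy_buffer_t *
toy_buffer_allocate(size_t length) {
	toy_buffer_t *b = malloc(sizeof(*b));
	if (b == NULL) {
		return NULL;
	}
	b->length = (length > 0) ? length : 1;
	b->used = 0;
	b->base = malloc(b->length);
	if (b->base == NULL) {
		free(b);
		return NULL;
	}
	return b;
}

/* Append n bytes, doubling the capacity until they fit. */
static int
toy_buffer_putmem(toy_buffer_t *b, const void *src, size_t n) {
	if (n > b->length - b->used) {
		size_t newlen = b->length;
		while (n > newlen - b->used) {
			newlen *= 2;
		}
		unsigned char *p = realloc(b->base, newlen);
		if (p == NULL) {
			return -1;
		}
		b->base = p;
		b->length = newlen;
	}
	memcpy(b->base + b->used, src, n);
	b->used += n;
	return 0;
}

static void
toy_buffer_free(toy_buffer_t *b) {
	free(b->base);
	free(b);
}
```

The caller never has to opt in to growth: writing 11 bytes into a 4-byte buffer simply succeeds.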
Add internal logging functions isc__netmgr_log(), isc__nmsocket_log(), and
isc__nmhandle_log() that can be used to add logging messages to the
netmgr, and change all direct use of isc_log_write() to use those
logging functions to properly prefix them with netmgr, nmsocket and
nmsocket+nmhandle.
This commit adds a check for whether 'sock->recv_cb' has been nullified
during the call to 'sock->recv_cb'. That could happen, e.g. by an
indirect call to 'isc_nmhandle_close()' from within the callback when
wrapping up.
In this case, let's close the TLS connection.
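The pattern can be sketched as follows. All toy_* names are hypothetical, and closing the connection is reduced to setting a flag; the point is only the re-check of the callback pointer after the callback returns.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct toy_sock toy_sock_t;
typedef void (*toy_recv_cb_t)(toy_sock_t *sock, const char *data);

/* Hypothetical stand-in for a TLS socket object. */
struct toy_sock {
	toy_recv_cb_t recv_cb;
	bool closed;
};

/* A callback that keeps reading. */
static void
toy_keeping_cb(toy_sock_t *sock, const char *data) {
	(void)sock;
	(void)data;
}

/* A callback that wraps up and clears the read callback as a side effect. */
static void
toy_closing_cb(toy_sock_t *sock, const char *data) {
	(void)data;
	sock->recv_cb = NULL;
}

static void
toy_deliver(toy_sock_t *sock, const char *data) {
	if (sock->recv_cb == NULL) {
		return;
	}
	sock->recv_cb(sock, data);
	if (sock->recv_cb == NULL) {
		/* The callback nullified itself: close instead of reading on. */
		sock->closed = true;
	}
}
```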
This commit ensures that the non-atomic flags inside a DoH listener
socket object (and associated worker) are accessed when doing accept
for a connection only from within the context of the dedicated thread,
but not other worker threads.
The purpose of this commit is to avoid TSAN errors during
isc__nmsocket_closing() calls. It is a continuation of
4b5559cd8f.
This commit ensures that the non-atomic flags inside a TLS listener
socket object (and associated worker) are accessed when doing
handshake for a connection only from within the context of the
dedicated thread, but not other worker threads.
The purpose of this commit is to avoid TSAN errors during
isc__nmsocket_closing() calls. It is a continuation of
4b5559cd8f.
This commit ensures that the flags inside a TLS listener socket
object (and associated worker) are accessed when accepting a
connection only from within the context of the dedicated thread, but
not other worker threads.
The TLSDNS transport was not honouring the single read callback for
TLSDNS client. It would call the read callbacks repeatedly in case the
single TLS read would result in multiple DNS messages in the decoded
buffer.
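A minimal sketch of the intended behaviour, assuming the usual DNS-over-stream framing of a two-byte big-endian length prefix (RFC 1035, section 4.2.2). The toy_* names are hypothetical, and the read callback is reduced to a counter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view over the decoded (decrypted) stream data. */
typedef struct {
	const uint8_t *data;
	size_t len;
	size_t pos;
} toy_stream_t;

/* Take one length-prefixed message, if a complete one is available. */
static bool
toy_next_message(toy_stream_t *s, const uint8_t **msg, size_t *msglen) {
	size_t avail = s->len - s->pos;
	if (avail < 2) {
		return false;
	}
	size_t n = ((size_t)s->data[s->pos] << 8) | s->data[s->pos + 1];
	if (avail - 2 < n) {
		return false;
	}
	*msg = s->data + s->pos + 2;
	*msglen = n;
	s->pos += 2 + n;
	return true;
}

/* Returns how many times the read callback would have been invoked. */
static unsigned int
toy_process(toy_stream_t *s, bool single_read) {
	unsigned int delivered = 0;
	const uint8_t *msg;
	size_t msglen;
	while (toy_next_message(s, &msg, &msglen)) {
		delivered++; /* the read callback would be called here */
		if (single_read) {
			break; /* client: one message per read request */
		}
	}
	return delivered;
}
```

With single_read set, a second message already present in the buffer stays there until the next read is requested.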
This commit ensures that send callbacks are always called from within
the context of the socket's worker thread, even in the case of a
shutting down or inactive socket, just like the TCP transport does, with
which TLS attempts to be as compatible as possible.
This commit changes ISC_R_NOTCONNECTED error code to ISC_R_CANCELLED
when attempting to start reading data on the shutting down socket in
order to make its behaviour compatible with that of TCP and not break
the common code in the unit tests.
It turned out that after the latest Network Manager refactoring
'sock->reading' flag was not processed correctly. Due to this
isc_nm_read_stop() might not work as expected because reading from the
underlying TCP socket could have been resumed in 'tls_do_bio()'
regardless of the 'sock->reading' value.
This bug did not seem to cause problems with DoH, so it was not
noticed, but Stream DNS has more strict expectations regarding the
underlying transport.
In addition to the above, the 'sock->recv_read' flag was completely
ignored and the corresponding logic was unimplemented. That made it
impossible to implement one fine detail of the TCP behaviour: once
reading is started, the read can be satisfied by reading a single datum.
This commit fixes the issues above.
This commit makes the TCP code use uv_try_write() on a best-effort basis,
just like the TCP DNS and TLS DNS code does.
This optimisation was added in
'caa5b6548a11da6ca772d6f7e10db3a164a18f8d' but a similar change was
mistakenly omitted for the generic TCP code. This commit fixes that.
Previously, the send callback would be synchronous only on success. Add
an option (similar to what other callbacks have) to decide whether we
need the asynchronous send callback on a higher level.
On a general level, we need the asynchronous callbacks to happen only
when we are invoking the callback from the public API. If the path to
the callback went through the libuv callback or netmgr callback, we are
already on asynchronous path, and there's no need to make the call to
the callback asynchronous again.
For the send callback, this means we need the asynchronous path for
failure paths inside the isc_nm_send() (which calls isc__nm_udp_send(),
isc__nm_tcp_send(), etc...) - all other invocations of the send callback
could be synchronous, because those are called from the respective libuv
send callbacks.
Previously, the read callback would be synchronous only on success or
timeout. Add an option (similar to what other callbacks have) to decide
whether we need the asynchronous read callback on a higher level.
On a general level, we need the asynchronous callbacks to happen only
when we are invoking the callback from the public API. If the path to
the callback went through the libuv callback or netmgr callback, we are
already on asynchronous path, and there's no need to make the call to
the callback asynchronous again.
For the read callback, this means we need the asynchronous path for
failure paths inside the isc_nm_read() (which calls isc__nm_udp_read(),
isc__nm_tcp_read(), etc...) - all other invocations of the read callback
could be synchronous, because those are called from the respective libuv
or netmgr read callbacks.
The netievent handler for isc_nmsocket_set_tlsctx() was inadvertently
ifdef'd out when BIND was built with --disable-doh, resulting in an
assertion failure on startup when DoT was configured.
It was possible for the accept callback to be called after listener
shutdown. In such a case the callback pointer equals NULL, leading to
a segmentation fault. This commit fixes that.
This commit introduces a primitive isc__nmsocket_stop() which performs
shutting down on a multilayered socket ensuring the proper order of
the operations.
The shared data within the socket object can be destroyed after the
call completed, as it is guaranteed to not be used from within the
context of other worker threads.
During loop manager refactoring isc_nmsocket_set_tlsctx() was not
properly adapted. The function is expected to broadcast the new TLS
context for every worker, but this behaviour was accidentally broken.
The isc_nm_udpconnect() erroneously set the reuse port with
load-balancing on the outgoing connected UDP sockets. This socket
option only makes sense for the listening sockets. Don't set the
load-balancing reuse port option on the outgoing UDP sockets.
This commit fixes TLS DNS verification error message reporting which
we probably broke during one of the recent networking code
refactorings.
This prevented e.g. dig from producing useful error messages related to
TLS certificate verification.
Ensure that TLS error is empty before calling SSL_get_error() or doing
SSL I/O so that the result will not get affected by prior error
statuses.
In particular, the improper error handling led to intermittent unit
test failure and, thus, could be responsible for some of the system
test failures and other intermittent TLS-related issues.
See here for more details:
https://www.openssl.org/docs/man3.0/man3/SSL_get_error.html
In particular, it mentions the following:
> The current thread's error queue must be empty before the TLS/SSL
> I/O operation is attempted, or SSL_get_error() will not work
> reliably.
As we use the result of SSL_get_error() to decide on I/O operations,
we need to ensure that it works reliably by cleaning the error queue.
TLS DNS: empty error queue before attempting I/O
The check is left from when tcp_connect_direct() called isc__nm_socket()
and it was uncertain whether it had succeeded, but now isc__nm_socket()
is called before tcp_connect_direct(), so sock->fd cannot be -1.
*** CID 357292: (REVERSE_NEGATIVE)
/lib/isc/netmgr/tcp.c: 309 in isc_nm_tcpconnect()
303
304 atomic_store(&sock->active, true);
305
306 result = tcp_connect_direct(sock, req);
307 if (result != ISC_R_SUCCESS) {
308 atomic_store(&sock->active, false);
>>> CID 357292: (REVERSE_NEGATIVE)
>>> You might be using variable "sock->fd" before verifying that it is >= 0.
309 if (sock->fd != (uv_os_sock_t)(-1)) {
310 isc__nm_tcp_close(sock);
311 }
312 isc__nm_connectcb(sock, req, result, true);
313 }
314
Add new semantic patch to replace the straightforward uses of:
ptr = isc_mem_{get,allocate}(..., size);
memset(ptr, 0, size);
with the new API call:
ptr = isc_mem_{get,allocate}x(..., size, ISC_MEM_ZERO);
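The effect of the flag can be sketched with plain malloc(). toy_mem_getx() and TOY_MEM_ZERO are hypothetical names modelled on the calls above; the real isc_mem functions also take a memory context, which is left out here.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_MEM_ZERO 0x1 /* hypothetical flag, modelled on ISC_MEM_ZERO */

/* Allocate 'size' bytes; zero them when TOY_MEM_ZERO is requested. */
static void *
toy_mem_getx(size_t size, int flags) {
	void *ptr = malloc(size);
	if (ptr != NULL && (flags & TOY_MEM_ZERO) != 0) {
		memset(ptr, 0, size);
	}
	return ptr;
}
```

The caller no longer needs to repeat the size in a separate memset() call, which removes one place where the two sizes could drift apart.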
The isc__nm_udp_send() callback would be called synchronously when
shutting down or when the socket has been closed. This could lead to
double locking in the calling code and thus those callbacks need to be
called asynchronously.
Bumping the minimum libuv version to 1.34.0 allows us to remove
all libuv shims we ever had and makes the code much cleaner. The
up-to-date libuv is available in all distributions supported by BIND
9.19+ either natively or as a backport.
After the loopmgr work has been merged, we can now clean up the TCP and
TLS protocols a little bit, because there are stronger guarantees that
the sockets will be kept on the respective loops/threads. We only need
asynchronous calls for listening sockets (start, stop) and reading from
the TCP (because isc_nm_read() might be called from the read callback
again).
This commit does the following changes (they are intertwined together):
1. Clean up most of the asynchronous events in the TCP code, and add
comments for the events that need to be kept asynchronous.
2. Remove isc_nm_resumeread() from the netmgr API, and replace
isc_nm_resumeread() calls with existing isc_nm_read() calls.
3. Remove isc_nm_pauseread() from the netmgr API, and replace
isc_nm_pauseread() calls with a new isc_nm_read_stop() call.
4. Disable the isc_nm_cancelread() for the streaming protocols, only the
datagram-like protocols can use isc_nm_cancelread().
5. Add isc_nmhandle_close() that can be used to shutdown the socket
earlier than after the last detach. Formerly, the socket would be
closed only after all reading and sending would be finished and the
last reference would be detached. The new isc_nmhandle_close() can
be used to close the underlying socket earlier, so all the other
asynchronous calls would call their respective callbacks immediately.
Co-authored-by: Ondřej Surý <ondrej@isc.org>
Co-authored-by: Artem Boldariev <artem@isc.org>
The destructor for the isc__nmsocket_t was missing a call to
isc_refcount_destroy() on the reference counter, which might lead to
spurious ThreadSanitizer data race warnings if we ever change the
acquire-release memory order in the isc_refcount_decrement().
Simplify the closing code - during the loopmgr implementation, it was
discovered that the various lists used by the uv_loop_t aren't FIFO, but
LIFO. See doc/dev/libuv.md for more details.
With this knowledge, we can close the protocol handles (uv_udp_t and
uv_tcp_t) and uv_timer_t at the same time by reordering the uv_close()
calls, and thus making sure that after calling the
isc__nm_stoplistening(), the code will not issue any additional callback
calls (accept, read) on the socket that stopped listening.
This might help with the TLS and DoH shutting down sequence as described
in the [GL #3509] as we now stop the reading, stop the timer and call
the uv_close() as early as possible.
The network manager UDP code was misinterpreting when the libuv called
the udp_recv_cb with nrecv == 0 and addr == NULL -> this doesn't really
mean that the "stream" has ended, but the libuv indicates that the
receive buffer can be freed. This could lead to assertion failure in
the code that calls isc_nm_read() from the network manager read callback
due to the extra spurious callbacks.
Properly handle the extra callback calls from the libuv in the client
read callback, and refactor the UDP isc_nm_read() implementation to be
synchronous, so no datagram is lost between the time that we stop the
reading from the UDP socket and we restart it again in the asynchronous
udpread event.
Add a unit test that tests the isc_nm_read() call from the read
callback to receive two datagrams.
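The distinction can be sketched as a small classifier over the arguments libuv passes to the UDP receive callback. The enum and function names are hypothetical; the rule itself follows the libuv documentation, where a zero byte count with a NULL address means there is nothing more to read, while a zero byte count with a non-NULL address is an empty datagram.

```c
#include <stddef.h>

typedef enum {
	TOY_UDP_ERROR,	 /* transport error reported by libuv */
	TOY_UDP_NOTHING, /* nothing to read; only release the buffer */
	TOY_UDP_DATAGRAM /* a datagram arrived (it may be empty) */
} toy_udp_event_t;

/* Classify one invocation of the receive callback. */
static toy_udp_event_t
toy_classify_recv(long nrecv, const void *addr) {
	if (nrecv < 0) {
		return TOY_UDP_ERROR;
	}
	if (nrecv == 0 && addr == NULL) {
		/* Not an end of "stream": no read callback should fire. */
		return TOY_UDP_NOTHING;
	}
	return TOY_UDP_DATAGRAM;
}
```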
Commit b69e783164 inadvertently caused
builds using the --disable-doh switch to fail, by putting the
declaration of the isc__nm_async_settlsctx() function inside an #ifdef
block that is only evaluated when DNS-over-HTTPS support is enabled.
This results in the following compilation errors being triggered:
netmgr/netmgr.c:2657:1: error: no previous prototype for 'isc__nm_async_settlsctx' [-Werror=missing-prototypes]
2657 | isc__nm_async_settlsctx(isc__networker_t *worker, isc__netievent_t *ev0) {
| ^~~~~~~~~~~~~~~~~~~~~~~
Fix by making the declaration of the isc__nm_async_settlsctx() function
in lib/isc/netmgr/netmgr-int.h visible regardless of whether
DNS-over-HTTPS support is enabled or not.
The isc_nm_listentlsdns() function erroneously calls
isc__nm_tcpdns_stoplistening() instead of isc__nm_tlsdns_stoplistening()
when something goes wrong, which can cause an assertion failure.
When we are closing the listening sockets, there's a time window in
which the TCP connection could be accepted although the respective
stoplistening function has already returned control to the caller.
Clear the accept callback function early, so it doesn't get called when
we are not interested in the incoming connections anymore.
Previously:
* applications were using isc_app as the base unit for running the
application and signal handling.
* networking was handled in the netmgr layer, which would start a
number of threads, each with a uv_loop event loop.
* task/event handling was done in the isc_task unit, which used
netmgr event loops to run the isc_event calls.
In this refactoring:
* the network manager now uses isc_loop instead of maintaining its
own worker threads and event loops.
* the taskmgr that manages isc_task instances now also uses isc_loopmgr,
and every isc_task runs on a specific isc_loop bound to the specific
thread.
* applications have been updated as necessary to use the new API.
* new ISC_LOOP_TEST macros have been added to enable unit tests to
run isc_loop event loops. unit tests have been updated to use this
where needed.
In some circumstances generic TLS code could have resumed data reading
unexpectedly on the TCP layer code. Due to this, the behaviour of
isc_nm_pauseread() and isc_nm_resumeread() might have been
unexpected. This commit fixes that.
The bug does not seem to have real consequences in the existing code
due to the way the code is used. However, the bug could have led to
unexpected behaviour and, at any rate, makes the TLS code behave
differently from the TCP code, with which it attempts to be as
compatible as possible.
Sometimes tls_do_bio() might be called when there is no new data to
process (most notably, when resuming reads). In such a case the internal
TLS session state will remain untouched and the old value in 'errno' will
alter the result of the SSL_get_error() call, possibly making it return
SSL_ERROR_SYSCALL. This value will be treated as an error and will
lead to closing the connection, which is not what is expected.
The STATID_CONNECT and STATID_CONNECTFAIL statistics were used
incorrectly. The STATID_CONNECT was incremented twice (once in
the *_connect_direct() and once in the callback) and STATID_CONNECTFAIL
would not be incremented at all if the failure happened in the callback.
Closes: #3452
On FreeBSD (and perhaps other *BSD) systems, the TCP connect() call (via
uv_tcp_connect()) can fail with transient UV_EADDRINUSE error. The UDP
code already handles this by trying three times (is a charm) before
giving up. Add code for the TCP, TCPDNS and TLSDNS layers to also try
three times before giving up by calling uv_tcp_connect() from the
callback two more times on UV_EADDRINUSE error.
Additionally, stop the timer only if we succeed or on hard error via
isc__nm_failed_connect_cb().
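The retry policy can be sketched as a bounded loop. This is a simplification: the commit retries from the asynchronous connect callback, while the sketch below is synchronous. TOY_EADDRINUSE, toy_connect_with_retry() and the flaky connect function are hypothetical stand-ins.

```c
#define TOY_CONNECT_TRIES 3
#define TOY_EADDRINUSE	  (-98) /* hypothetical transient error code */

typedef int (*toy_connect_fn)(void *arg);

/* Try up to three times; only the transient error triggers a retry. */
static int
toy_connect_with_retry(toy_connect_fn try_connect, void *arg) {
	int r = 0;
	for (int i = 0; i < TOY_CONNECT_TRIES; i++) {
		r = try_connect(arg);
		if (r != TOY_EADDRINUSE) {
			break; /* success or hard error: stop retrying */
		}
	}
	return r;
}

/* Demo stand-in: fails transiently until the counter reaches zero. */
static int
toy_flaky_connect(void *arg) {
	int *remaining = arg;
	if (*remaining > 0) {
		(*remaining)--;
		return TOY_EADDRINUSE;
	}
	return 0;
}
```

Two transient failures are absorbed; a third one is reported to the caller.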
Before this change the TLS code would ignore the accept callback result,
and would not try to gracefully close the connection. This had not been
noticed, as it is not really required for DoH. Now the code tries to
shut down the TLS connection gracefully when accepting it is not
successful.
Otherwise the code path will lead to a call to SSL_get_error()
returning SSL_ERROR_SSL, which in turn might lead to closing the
connection too early in an unexpected way, which is clearly not what is
intended.
The issue was found when working on the loopmgr branch and appears to
be timing related as well. Might be responsible for some unexpected
transmission failures e.g. on zone transfers.
In some operations - most prominently when establishing connection -
it might be beneficial to bail out earlier when the network manager
is stopping.
The issue is backported from loopmgr branch, where such a change is
not only beneficial, but required.
In some cases - in particular, in case of errors, NULL might be passed
to a connection callback instead of a handle that could have led to
an abort. This commit ensures that such a situation will not occur.
The issue was found when working on the loopmgr branch.
This commit ensures that the underlying TCP socket of a TLS connection
gets closed earlier whenever there are no pending operations on it.
In the loop-manager branch, in some circumstances the connection
could have remained opened for far too long for no reason. This
commit ensures that will not happen.
it's a style violation to have REQUIRE or INSIST contain code that
must run for the server to work. this was being done with some
atomic_compare_exchange calls. these have been cleaned up. uses
of atomic_compare_exchange in assertions have been replaced with
a new macro atomic_compare_exchange_enforced, which uses RUNTIME_CHECK
to ensure that the exchange was successful.
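A minimal sketch of such a macro, using C11 atomics. TOY_RUNTIME_CHECK and toy_compare_exchange_enforced are stand-ins written for this example; the actual definitions in the tree may differ.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for RUNTIME_CHECK: always evaluated, aborts on failure. */
#define TOY_RUNTIME_CHECK(cond)                                       \
	do {                                                          \
		if (!(cond)) {                                        \
			fprintf(stderr, "check failed: %s\n", #cond); \
			abort();                                      \
		}                                                     \
	} while (0)

/* The exchange is real work, so it must not live inside an assertion. */
#define toy_compare_exchange_enforced(obj, expected, desired) \
	TOY_RUNTIME_CHECK(                                    \
		atomic_compare_exchange_strong((obj), (expected), (desired)))
```

Unlike an assertion macro, this form keeps performing the exchange even in builds where assertions are compiled out.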
Before the changes from this commit were introduced, the accept
callback function would get called twice when accepting a connection,
once at each of these stages:
* when accepting the TCP connection;
* when handshake has completed.
That is clearly an error, as it should have been called only once. As
far as I understand it the mistake is a result of TLS DNS transport
being essentially a fork of TCP transport, where calling the accept
callback immediately after accepting TCP connection makes sense.
This commit fixes this mistake. It did not have any very serious
consequences because in BIND the accept callback only checks an ACL
and updates stats.
Under specific rare timing circumstances the uv_read_start() could
fail with UV_EINVAL when the connection is reset between the connect (or
accept) and the uv_read_start() call on the nmworker loop. Handle such
a situation gracefully by propagating the errors from uv_read_start() to
the upper layers, so the socket can be closed internally.
The commit fixes a corner case in client-side DoH code, when a write
attempt is done on a closing socket (session).
The change ensures that the write call-back will be called with a
proper error code (see failed_send_cb() call in client_httpsend()).
Setting the sock->write_timeout from the TCP, TCPDNS, and TLSDNS send
functions could lead to a (harmless) data race when setting the value for
the first time when the isc_nm_send() function would be called from a
thread not matching the socket we are sending to. Move the setting of
sock->write_timeout to the matching async function, which is always
called from the matching thread.
Since the fctx hash table is now self-resizing, and resolver tasks are
selected to match the thread that created the fetch context, there
shouldn't be any significant advantage to having multiple tasks per CPU;
a single task per thread should be sufficient.
Additionally, the fetch context is always pinned to the calling netmgr
thread to minimize the contention just to coalesced fetches - if two
threads start the same fetch, it will be pinned to the first one to get
the bucket.
This commit fixes a crash in generic TLS stream code, which could be
reproduced during some runs of the 'sslyze' tool.
The intention of this commit is twofold.
Firstly, it ensures that the TLS socket object cannot be destroyed too
early. Now it is being deleted alongside the underlying TCP socket
object.
Secondly, it ensures that the TLS socket object cannot be destroyed as
a result of calling 'tls_do_bio()' (the primary function which
performs encryption/decryption during the IO) as the code did not
expect that. This code path is fixed now.
As we are going to use libuv outside of the netmgr, we need the shims to
be readily available for the rest of the codebase.
Move the "netmgr/uv-compat.h" to <isc/uv.h> and netmgr/uv-compat.c to
uv.c, and as a rule of thumb, the users of libuv should include
<isc/uv.h> instead of <uv.h> directly.
Additionally, merge netmgr/uverr2result.c into uv.c and rename the
single function from isc__nm_uverr2result() to isc_uverr2result().
Move the netmgr socket related functions from netmgr/netmgr.c and
netmgr/uv-compat.c to netmgr/socket.c, so they are all present in
the same place. Adjust the names of a couple of internal functions
accordingly.
This commit ensures that write callbacks are getting called only after
the data has been sent via the network.
Without this fix, a situation could appear when a write callback could
get called before the actual encrypted data would have been sent to
the network. Instead, it would get called right after it would have
been passed to the OpenSSL (i.e. encrypted).
Most likely, the issue does not reveal itself often because the
callback call was asynchronous, so in most cases it should have been
called after the data has been sent, but that was not guaranteed by
the code logic.
Also, this commit removes one memory allocation (netievent) from a hot
path, as there is no need to call this callback asynchronously
anymore.
The connect()ed UDP socket provides feedback on a variety of ICMP
errors (eg port unreachable) which bind can then use to decide what to
do with errors (report them to the client, try again with a different
nameserver etc). However, Linux's implementation does not report what
it considers "transient" conditions, which is defined as Destination
host Unreachable, Destination network unreachable, Source Route Failed
and Message Too Big.
Explicitly enable IP_RECVERR / IPV6_RECVERR (via libuv uv_udp_bind()
flag) to learn about ICMP destination network/host unreachable.
When we compile with libuv that has some capabilities via flags passed
to f.e. uv_udp_listen() or uv_udp_bind(), the call with such flags would
fail with invalid arguments when older libuv version is linked at the
runtime that doesn't understand the flag that was available at the
compile time.
Enforce minimal libuv version when flags have been available at the
compile time, but are not available at the runtime. This check is less
strict than enforcing the runtime libuv version to be same or higher
than compile time libuv version.
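The relaxed check can be sketched as a pure function over packed version numbers. The packing mirrors the (major << 16 | minor << 8 | patch) layout libuv uses for UV_VERSION_HEX and uv_version(); the function name and the example thresholds in the test are assumptions.

```c
#include <stdbool.h>

#define TOY_VERSION(major, minor, patch) \
	(((major) << 16) | ((minor) << 8) | (patch))

/*
 * compiled: version of the headers we were built against
 * runtime:  version of the library we are linked to at runtime
 * needed:   first version that understands the flag we pass
 */
static bool
toy_runtime_ok(unsigned int compiled, unsigned int runtime,
	       unsigned int needed) {
	if (compiled < needed) {
		/* The flag did not exist at build time, so it is never used. */
		return true;
	}
	/* The flag is compiled in: the runtime must understand it too. */
	return runtime >= needed;
}
```

A runtime library older than the build-time headers is still accepted as long as it is new enough for every flag that was compiled in.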
For some applications, it's useful to not listen on full battery of
threads. Add workers argument to all isc_nm_listen*() functions and
convenience ISC_NM_LISTEN_ONE and ISC_NM_LISTEN_ALL macros.
This commit adds isc_nmsocket_set_tlsctx() - an asynchronous function
that replaces the TLS context within a given TLS-enabled listener
socket object. It is based on the newly added reference counting
functionality.
The intention of adding this function is to add functionality to
replace a TLS context without recreating the whole socket object,
including the underlying TCP listener socket, as a BIND process might
not have enough permissions to re-create it fully on reconfiguration.
Previously, HAVE_SO_REUSEPORT_LB has been defined only in the private
netmgr-int.h header file, making the configuration of load balanced
sockets inoperable.
Move the missing HAVE_SO_REUSEPORT_LB define to isc/netmgr.h and add the
missing isc_nm_getloadbalancesockets() implementation.
Previously, the option to enable kernel load balancing of the sockets
was always enabled when supported by the operating system (SO_REUSEPORT
on Linux and SO_REUSEPORT_LB on FreeBSD).
It was reported that in scenarios where the networking threads are also
responsible for processing long-running tasks (like RPZ processing, CATZ
processing or large zone transfers), this could lead to intermittent
brownouts for some clients, because the thread assigned by the operating
system might be busy. In such scenarios, the overall performance would
be better served by threads competing over the sockets because the idle
threads can pick up the incoming traffic.
Add new configuration option (`load-balance-sockets`) to allow enabling
or disabling the load balancing of the sockets.
Previously, the task privileged mode has been used only when the named
was starting up and loading the zones from the disk as the "first" thing
to do. The privileged task was setup with quantum == 2, which made the
taskmgr/netmgr spin around the privileged queue processing two events at
the time.
The same effect can be achieved by setting the quantum to UINT_MAX (i.e.
practically unlimited) for the loadzone task, hence the privileged task
mode was removed in favor of just processing all the events on the
loadzone task in a single task_run().
In a couple of places, we have missed INSIST(0) or ISC_UNREACHABLE()
replacement on some branches with UNREACHABLE(). Replace all
ISC_UNREACHABLE() or INSIST(0) calls with UNREACHABLE().
This commit adds support for ISC_R_TLSBADPEERCERT error code, which is
supposed to be used to signal for TLS peer certificates verification
in dig and other code.
The support for this error code is added to our TLS and TLS DNS
implementations.
This commit also adds isc_nm_verify_tls_peer_result_string() function
which is supposed to be used to get a textual description of the
reason for getting a ISC_R_TLSBADPEERCERT error.
Previously, it was possible to assign a bit of memory space in the
nmhandle to store the client data. This was complicated and prevents
further refactoring of isc_nmhandle_t caching (future work).
Instead of caching the data in the nmhandle, allocate the hot-path
ns_client_t objects from per-thread clientmgr memory context and just
assign it to the isc_nmhandle_t via isc_nmhandle_set().
Previously, the unreachable code paths would have to be tagged with:
INSIST(0);
ISC_UNREACHABLE();
There were also older parts of the code that used a comment annotation:
/* NOTREACHED */
Unify the handling of unreachable code paths to just use:
UNREACHABLE();
The UNREACHABLE() macro now asserts when reached and also uses
__builtin_unreachable(); when such builtin is available in the compiler.
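One possible shape of such a macro is sketched below. TOY_UNREACHABLE() is a stand-in written for this example, with the assertion reduced to a message plus abort(); the real macro is defined in the ISC headers and may differ in detail.

```c
#include <stdio.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#define TOY_BUILTIN_UNREACHABLE() __builtin_unreachable()
#else
#define TOY_BUILTIN_UNREACHABLE() ((void)0)
#endif

/* Fail loudly when reached, then tell the compiler control stops here. */
#define TOY_UNREACHABLE()                                        \
	do {                                                     \
		fprintf(stderr, "%s:%d: unreachable code\n",     \
			__FILE__, __LINE__);                     \
		abort();                                         \
		TOY_BUILTIN_UNREACHABLE();                       \
	} while (0)

typedef enum { TOY_UDP, TOY_TCP } toy_proto_t;

static const char *
toy_proto_name(toy_proto_t proto) {
	switch (proto) {
	case TOY_UDP:
		return "udp";
	case TOY_TCP:
		return "tcp";
	}
	TOY_UNREACHABLE();
}
```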
Gcc 7+ and Clang 10+ have implemented __attribute__((fallthrough)) which
is explicit version of the /* FALLTHROUGH */ comment we are currently
using.
Add and apply FALLTHROUGH macro that uses the attribute if available,
but does nothing on older compilers.
In one case (lib/dns/zone.c), using the macro revealed that we were
using the /* FALLTHROUGH */ comment in wrong place, remove that comment.
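A minimal sketch of how such a macro can be defined and used. TOY_FALLTHROUGH is a stand-in name, and the feature test via __has_attribute is one possible detection method, not necessarily the one used in the tree.

```c
/* Use the attribute when the compiler advertises it. */
#if defined(__has_attribute)
#if __has_attribute(fallthrough)
#define TOY_FALLTHROUGH __attribute__((fallthrough))
#endif
#endif

/* Older compilers: expand to an empty statement. */
#ifndef TOY_FALLTHROUGH
#define TOY_FALLTHROUGH \
	do {            \
	} while (0)
#endif

static int
toy_level_score(int level) {
	int score = 0;
	switch (level) {
	case 2:
		score += 10;
		TOY_FALLTHROUGH;
	case 1:
		score += 1;
		break;
	default:
		break;
	}
	return score;
}
```

Because the attribute is only valid directly before a case label, a misplaced use is rejected by the compiler, which is how a stray comment can be discovered.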
Instead of passing the "workers" variable back and forth along with
passing the single isc_nm_t instance, add isc_nm_getnworkers() function
that returns the number of netmgr threads that are running.
Change the ns_interfacemgr and ns_taskmgr to utilize the newly acquired
knowledge.
When sock->closehandle_cb is set, we need to run nmhandle_detach_cb()
asynchronously to ensure correct order of multiple packets processing in
the isc__nm_process_sock_buffer(). When not run asynchronously, it
would cause:
a) out-of-order processing of the return codes from processbuffer();
b) stack growth because the next TCP DNS message read callback will
be called from within the current TCP DNS message read callback.
The sock->closehandle_cb is set to isc__nm_resume_processing() for TCP
sockets which calls isc__nm_process_sock_buffer(). If the read callback
(called from isc__nm_process_sock_buffer()->processbuffer()) doesn't
attach to the nmhandle (f.e. because it wants to drop the processing or
we send the response directly via uv_try_write()), the
isc__nm_resume_processing() (via .closehandle_cb) would call
isc__nm_process_sock_buffer() recursively.
The below shortened code path shows how the stack can grow:
1: ns__client_request(handle, ...);
2: isc_nm_tcpdns_sequential(handle);
3: ns_query_start(client, handle);
4: query_lookup(qctx);
5: query_send(qctx->client);
6: isc__nmhandle_detach(&client->reqhandle);
7: nmhandle_detach_cb(&handle);
8: sock->closehandle_cb(sock); // isc__nm_resume_processing
9: isc__nm_process_sock_buffer(sock);
10: processbuffer(sock); // isc__nm_tcpdns_processbuffer
11: isc_nmhandle_attach(req->handle, &handle);
12: isc__nm_readcb(sock, req, ISC_R_SUCCESS);
13: isc__nm_async_readcb(NULL, ...);
14: uvreq->cb.recv(...); // ns__client_request
Instead, if 'sock->closehandle_cb' is set, we need to detach the
handle asynchronously in 'isc__nmhandle_detach', so that line 8 in
the code flow above does not start this recursion. This ensures the
correct order when processing multiple packets in the function
'isc__nm_process_sock_buffer()' and prevents the stack growth.
When not run asynchronously, the out-of-order processing leaves the
first TCP socket open until all requests on the stream have been
processed.
If the pipelining is disabled on the TCP via `keep-response-order`
configuration option, named would keep the first socket in lingering
CLOSE_WAIT state when the client sends an incomplete packet and then
closes the connection from the client side.
Previously, the established TCP connections (both client and server)
would be gracefully closed waiting for the write timeout.
Don't wait for TCP connections to gracefully shutdown, but directly
reset them for faster shutdown.
Previously, there was a single per-socket write timer that would get
restarted for every new write. This turned out to be insufficient
because the other side could keep resetting the timer while never reading
back the responses.
Change the single write timer to per-send timer which would in turn
reset the TCP connection on the first send timeout.
The C17 standard deprecated ATOMIC_VAR_INIT() macro (see [1]). Follow
suit and remove the ATOMIC_VAR_INIT() usage in favor of simple
assignment of the value as this is what all supported stdatomic.h
implementations do anyway:
* MacOSX.platform: #define ATOMIC_VAR_INIT(__v) {__v}
* Gcc stdatomic.h: #define ATOMIC_VAR_INIT(VALUE) (VALUE)
1. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1138r0.pdf
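A minimal sketch of what the replacement looks like in practice, assuming a
hypothetical counter named next_id_counter (not a name taken from the BIND
sources): a C11 atomic object with static storage duration is initialized
with a plain constant instead of an ATOMIC_VAR_INIT() wrapper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical counter used only for illustration. Before the change
 * this would have been written as "= ATOMIC_VAR_INIT(0)".
 */
static atomic_uint_fast32_t next_id_counter = 0;

/* Atomically increment the counter and return the new value. */
uint_fast32_t
next_id(void) {
	return atomic_fetch_add(&next_id_counter, 1) + 1;
}
```

Only the initializer changes; every later access still goes through the
atomic_*() functions.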
Previously the socket code would set the TCPv6 maximum segment size to
minimum value to prevent IP fragmentation for TCP. This was not yet
implemented for the network manager.
Implement network manager functions to set and use minimum MTU socket
option and set the TCP_MAXSEG socket option for both IPv4 and IPv6 and
use those to clamp the TCP maximum segment size for TCP, TCPDNS and
TLSDNS layers in the network manager to 1220 bytes, that is 1280 (IPv6
minimum link MTU) minus 40 (IPv6 fixed header) minus 20 (TCP fixed
header).
We already rely on a similar value for UDP to prevent IP fragmentation
and it makes sense to use the same value for IPv4 and IPv6 because the
modern networks are required to support IPv6 packet sizes. If there's
need for small TCP segment values, the MTU on the interfaces needs to be
properly configured.
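The arithmetic behind the 1220 byte value can be written down as a short
sketch. The EX_* constants and both function names are illustrative
assumptions, not netmgr identifiers; only the TCP_MAXSEG socket option and
the numbers come from the description above.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

enum {
	EX_IPV6_MIN_LINK_MTU = 1280, /* IPv6 minimum link MTU */
	EX_IPV6_FIXED_HEADER = 40,   /* IPv6 fixed header */
	EX_TCP_FIXED_HEADER = 20,    /* TCP fixed header */
};

/* 1280 - 40 - 20 = 1220 bytes of TCP payload per segment. */
int
ex_clamped_tcp_mss(void) {
	return EX_IPV6_MIN_LINK_MTU - EX_IPV6_FIXED_HEADER -
	       EX_TCP_FIXED_HEADER;
}

/*
 * Apply the clamp to a TCP socket. Returns 0 on success and -1 (with
 * errno set) on failure, exactly as setsockopt() does.
 */
int
ex_clamp_tcp_mss(int fd) {
	int mss = ex_clamped_tcp_mss();
	return setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss));
}
```

The same value is applied to IPv4 and IPv6 sockets in this sketch,
following the reasoning given above.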
The IPV6_USE_MIN_MTU socket option directs the IP layer to limit the
IPv6 packet size to the minimum required supported MTU from the base
IPv6 specification, i.e. 1280 bytes. Many implementations of TCP
running over IPv6 neglect to check the IPV6_USE_MIN_MTU value when
performing MSS negotiation and when constructing a TCP segment despite
MSS being defined to be the MTU less the IP and TCP header sizes (60
bytes for IPv6). This leads to oversized IPv6 packets being sent
resulting in unintended Path Maximum Transmission Unit Discovery (PMTUD)
being performed and to fragmented IPv6 packets being sent.
Add and use a function to set socket option to limit the MTU on IPv6
sockets to the minimum MTU (1280) both for UDP and TCP.
The current implementation of isc_queue uses Michael-Scott lock-free
queue that in turn uses hazard pointers. It was discovered that the way
we use the isc_queue, such complicated mechanism isn't really needed,
because most of the time, we either execute the work directly when on
nmthread (in case of UDP) or schedule the work from the matching
nmthreads.
Replace the current implementation of the isc_queue with a simple locked
ISC_LIST. There's a slight improvement - since copying the whole list
is very lightweight - we move the queue into a new list before we start
the processing and locking just for moving the queue and not for every
single item on the list.
NOTE: There's room for future improvements - since we don't guarantee
the order in which the netievents are processed, we could have two lists
- one unlocked that would be used when scheduling the work from the
matching thread and one locked that would be used from non-matching
thread.
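The "move the queue into a new list, then process it without the lock"
idea can be sketched as follows. This is a simplified illustration using
a pthread mutex and a hand-rolled singly linked list; the ex_* types and
functions are hypothetical and do not mirror the ISC_LIST macros or the
real isc_queue API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct ex_event {
	int value;
	struct ex_event *next;
} ex_event_t;

typedef struct ex_queue {
	pthread_mutex_t lock;
	ex_event_t *head;
	ex_event_t *tail;
} ex_queue_t;

void
ex_queue_init(ex_queue_t *q) {
	pthread_mutex_init(&q->lock, NULL);
	q->head = q->tail = NULL;
}

/* Append one event; the lock is held only for the pointer updates. */
void
ex_queue_enqueue(ex_queue_t *q, ex_event_t *ev) {
	ev->next = NULL;
	pthread_mutex_lock(&q->lock);
	if (q->tail != NULL) {
		q->tail->next = ev;
	} else {
		q->head = ev;
	}
	q->tail = ev;
	pthread_mutex_unlock(&q->lock);
}

/*
 * Detach the whole list under the lock, then run the callback on each
 * event with the lock released. Returns the number of events processed.
 */
int
ex_queue_drain(ex_queue_t *q, void (*cb)(ex_event_t *, void *), void *arg) {
	pthread_mutex_lock(&q->lock);
	ex_event_t *list = q->head;
	q->head = q->tail = NULL;
	pthread_mutex_unlock(&q->lock);

	int count = 0;
	while (list != NULL) {
		ex_event_t *next = list->next; /* cb may reuse the event */
		cb(list, arg);
		list = next;
		count++;
	}
	return count;
}

/* Example callback: add the event value to the int pointed to by arg. */
void
ex_sum_event(ex_event_t *ev, void *arg) {
	*(int *)arg += ev->value;
}
```

Producers contend on the lock only for the duration of a pointer swap, no
matter how long the individual events take to process.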
The isc__nmsocket_reset() was missing a case for raw TCP sockets (used
by RNDC and DoH) which would cause an assertion failure when write timeout
would be triggered.
TCP sockets are now also properly handled in isc__nmsocket_reset().
When isc__nm_uvreq_t gets deactivated, it could be just put onto array
stack to be reused later to save some initialization time.
Unfortunately, this might hide some use-after-free errors.
Disable the inactive uvreqs caching when compiled with Address or
Thread Sanitizer.
When isc_nmhandle_t gets deactivated, it could be just put onto array
stack to be reused later to save some initialization time.
Unfortunately, this might hide some use-after-free errors.
Disable the inactive handles caching when compiled with Address or
Thread Sanitizer.
The isc__nmsocket_t has a locked array of isc_nmhandle_t that's not used
for anything. The isc__nmhandle_get() adds the isc_nmhandle_t to the
locked array (resizing it if necessary) and isc_nmhandle_put() removes
it when it finally destroys the handle. That's all it does, so
it serves no useful purpose.
Remove the .ah_handles, .ah_size, and .ah_frees members of the
isc__nmsocket_t and .ah_pos member of the isc_nmhandle_t struct.
When the TCP, TCPDNS or TLSDNS connection times out, the isc__nm_uvreq_t
would be pushed into sock->inactivereqs before the uv_tcp_connect()
callback finishes. Because the isc__nmsocket_t keeps the list of
inactive isc__nm_uvreq_t, this would cause use-after-free only when the
sock->inactivereqs is full (which could never happen because the failure
happens in connection timeout callback) or when the sock->inactivereqs
mechanism is completely removed (f.e. when running under Address or
Thread Sanitizer).
Delay isc__nm_uvreq_t deallocation to the connection callback and only
signal the connection callback should be called by shutting down the
libuv socket from the connection timeout callback.
The keep-response-order option has been obsoleted. This commit removes
the keep-response-order ACL map (rendering the option a no-op), the
call to isc_nm_sequential(), and the now unused isc_nm_sequential()
function itself.
There was an artificial limit of 23 on the number of simultaneous
pipelined queries in the single TCP connection. The new network
manager is capable of handling "unlimited" (limited only by the TCP
read buffer size) queries similar to "unlimited" handling of the DNS
queries received over UDP.
Don't limit the number of TCP queries that we can process within a
single TCP read callback.
When an invalid DNS message is received, there was a handling mechanism
for DoH that would be called to return a proper HTTP response.
Reuse this mechanism and reset the TCP connection when the client is
blackholed, the DNS message is completely bogus or the ns_client receives
a response instead of a query.
In some situations (unit test and forthcoming XFR timeouts MR), we need
to modify the write timeout independently of the read timeout. Add a
isc_nmhandle_setwritetimeout() function that could be called before
isc_nm_send() to specify a custom write timeout interval.
When the outgoing TCP write buffers are full because the other party is
not reading the data, the uv_write() could wait indefinitely on the
uv_loop and never call the callback. Add a new write timer that uses
the `tcp-idle-timeout` value to interrupt the TCP connection when we are
not able to send data for a defined period of time.
The uv_tcp_close_reset() function was added in libuv 1.32.0 and since we
support older libuv releases, we have to add a shim uv_tcp_close_reset()
implementation loosely based on libuv.
Before adding the write timer, we have to rename the generic sock->timer
to sock->read_timer. We don't touch the function names to limit the
impact of the refactoring.
When libuv functions fail, they return an error code that could
be useful for more detailed debugging. Currently, we usually just check
whether the return value is 0 and invoke an assertion error if it isn't,
throwing away the details of why the call has failed. Unfortunately, this
often happens on more exotic platforms.
Add a UV_RUNTIME_CHECK() macro that can be used to print a more detailed
error message (via uv_strerror()) before ending the execution of the
program abruptly with the assertion.
When isc_quota_attach_cb() API returns ISC_R_QUOTA (meaning hard quota
was reached) the accept_connection() would return without logging a
message about quota reached.
Change the connection callback to log the quota reached message.
Some operating systems (OpenBSD and DragonFly BSD) don't restrict the
IPv6 sockets to sending and receiving IPv6 packets only. Explicitly
enable the IPV6_V6ONLY socket option on the IPv6 sockets to prevent
failures from using the IPv4-mapped IPv6 address.
The server_send_error_response() function is supposed to be used only
in case of failures and never in case of legitimate requests. Ensure
that ISC_HTTP_ERROR_SUCCESS is never passed there by mistake.
Previously, the netmgr/udp.c tried to detect recvmmsg support in
libuv with #ifdef UV_UDP_<foo> preprocessor macros. However, because
the UV_UDP_<foo> are not preprocessor macros, but enum members, the
detection didn't work. Because the detection didn't work, the code
didn't have access to the information when we received the final chunk
of the recvmmsg and tried to free the uvbuf every time. Fortunately,
the isc__nm_free_uvbuf() had a kludge that detected attempt to free in
the middle of the receive buffer, so the code worked.
However, libuv 1.37.0 changed the way the recvmmsg was enabled from
implicit to explicit, and we checked for yet another enum member
presence with preprocessor macro, so in fact libuv recvmmsg support was
never enabled with libuv >= 1.37.0.
This commit changes the preprocessor macro checks to autoconf checks for
declaration, so the detection now works again. On top of that, it's now
possible to clean up the alloc_cb and free_uvbuf functions because now,
the information whether we can or cannot free the buffer is available to
us.
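Why the original #ifdef detection could never succeed can be shown with a
small self-contained sketch. The EXAMPLE_* names are hypothetical
stand-ins for identifiers in a library header; they are not the real
libuv names.

```c
#include <assert.h>

/* One flag is declared as an enum member, the other as a macro. */
enum example_flags {
	EXAMPLE_FLAG_ENUM = 8
};
#define EXAMPLE_FLAG_MACRO 16

/*
 * The preprocessor runs before enum declarations are parsed, so it
 * has no knowledge of EXAMPLE_FLAG_ENUM and the #else branch is
 * always compiled.
 */
int
enum_flag_detected(void) {
#ifdef EXAMPLE_FLAG_ENUM
	return 1;
#else
	return 0;
#endif
}

/* A real macro is visible to #ifdef, so this returns 1. */
int
macro_flag_detected(void) {
#ifdef EXAMPLE_FLAG_MACRO
	return 1;
#else
	return 0;
#endif
}
```

A configure-time declaration check works around this by compiling a test
program that references the identifier and then defining a macro of its
own that the source can test with the preprocessor.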
This commit converts the license handling to adhere to the REUSE
specification. It specifically:
1. Adds used licenses to LICENSES/ directory
2. Adds "isc" template for adding the copyright boilerplate
3. Changes all source files to include copyright and SPDX license
header, this includes all the C sources, documentation, zone files,
configuration files. There are notes in the doc/dev/copyrights file
on how to add correct headers to the new files.
4. Handles the rest that can't be modified via .reuse/dep5 file. The
binary (or otherwise unmodifiable) files could have licenses placed
next to them in <foo>.license file, but this would lead to cluttered
repository and most of the files handled in the .reuse/dep5 file are
system test files.
The isc__nm_tcp_resumeread() was using maybe_enqueue function to enqueue
netmgr event which could cause the read callback to be executed
immediately if there was enough data waiting in the TCP queue.
If that happened, the read callback would be called before the
previous read callback was finished and the worker receive buffer would
be still marked "in use" causing an assertion failure.
This would affect only raw TCP channels, e.g. rndc and http statistics.
The isc_queue_new() was using dirty tricks to allocate the head and tail
members of the struct aligned to the cacheline. We can now use
isc_mem_get_aligned() to allocate the structure to the cacheline
directly.
Use ISC_OS_CACHELINE_SIZE (64) instead of arbitrary ALIGNMENT (128), one
cacheline size is enough to prevent false sharing.
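A sketch of cacheline-aligned allocation using only standard C11
facilities. It uses aligned_alloc() as a stand-in for
isc_mem_get_aligned(), whose signature is not shown here; the ex_* names
and the 64 byte constant are assumptions made for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_CACHELINE_SIZE 64

/*
 * Placing head and tail on separate cachelines keeps a thread that
 * updates one of them from invalidating the line holding the other.
 */
typedef struct ex_queue_hdr {
	alignas(EX_CACHELINE_SIZE) void *head;
	alignas(EX_CACHELINE_SIZE) void *tail;
} ex_queue_hdr_t;

/*
 * Allocate the structure on a cacheline boundary. sizeof(*q) is a
 * multiple of the alignment, as aligned_alloc() requires. Returns
 * NULL on allocation failure; release with free().
 */
ex_queue_hdr_t *
ex_queue_hdr_new(void) {
	ex_queue_hdr_t *q = aligned_alloc(EX_CACHELINE_SIZE, sizeof(*q));
	if (q != NULL) {
		q->head = NULL;
		q->tail = NULL;
	}
	return q;
}
```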
Cleanup the unused max_threads variable - there was actually no limit on
the maximum number of threads. This was changed a while ago.
Using the TLS context cache for server-side contexts could reduce the
number of contexts to initialise in the configurations when e.g. the
same 'tls' entry is used in multiple 'listen-on' statements for the
same DNS transport, binding to multiple IP addresses.
In such a case, only one TLS context will be created, instead of a
context per IP address, which could reduce the initialisation time, as
initialising even a non-ephemeral TLS context introduces some delay,
which can be *visually* noticeable by log activity.
Also, this change lays down a foundation for Mutual TLS (when the
server validates a client certificate, additionally to a client
validating the server), as the TLS context cache can be extended to
store additional data required for validation (like intermediates CA
chain).
Additionally to the above, the change ensures that the contexts are
not being changed after initialisation, as such a practice is frowned
upon. Previously we would set the supported ALPN tags within
isc_nm_listenhttp() and isc_nm_listentlsdns(). We do not do that for
client-side contexts, so that appears to be an oversight. Now we set
the supported ALPN tags right after server-side contexts creation,
similarly to how we do for client-side ones.
Commit 9ee60e7a17 erroneously introduced
duplicate conditions to several existing conditional statements
responsible for determining error codes passed to connection callbacks
upon failure. Fix the affected expressions to ensure connection
callbacks are invoked with:
- the ISC_R_SHUTTINGDOWN error code when a global netmgr shutdown is
in progress,
- the ISC_R_CANCELED error code when a specific operation has been
canceled.
This does not fix any known bugs, it only adjusts the changes introduced
by commit 9ee60e7a17 so that they match
its original intent.
On FreeBSD, the pthread primitives are not solely allocated on stack,
but part of the object lives on the heap. Missing pthread_*_destroy
causes the heap memory to grow and in the case of short-lived objects
it's possible to run out of memory.
Properly destroy the leaking mutex (worker->lock) and
the leaking condition (sock->cond).
Previously, when TCP accept failed, we logged a message with
ISC_LOG_ERROR level. One common case of how this could happen is that the
client hits TCP client quota and is put on hold and when resumed, the
client has already given up and closed the TCP connection. In such
case, the named would log:
TCP connection failed: socket is not connected
This message was quite confusing because it doesn't actually say that
it's related to accepting the TCP connection and also it logs
everything on the ISC_LOG_ERROR level.
Change the log message to "Accepting TCP connection failed" and for
specific error states lower the severity of the log message to
ISC_LOG_INFO.
This commit adds an isc_nm_socket_type() function which can be used to
obtain a handle's socket type.
This change obsoletes isc_nm_is_tlsdns_handle() and
isc_nm_is_http_handle(). However, it was decided to keep the latter as
we eventually might end up supporting multiple HTTP versions.
This commit makes the TLS stream code not issue a mostly useless
debug log message on error during TLS I/O. This message was cluttering
logs a lot, as it can be generated on (almost) any non-clean TLS
connection termination, even in the cases when the actual query
completed successfully. Nor does it provide much value for end-users,
yet it can occasionally be seen when using dig and quite often when
running BIND over a publicly available network interface.
This commit removes unneeded isc__nmsocket_prep_destroy() call on ALPN
negotiation failure, which was eventually causing the TLS handle to
leak.
This call is not needed, as not attaching to the transport (TLS)
handle should be enough. At this point it seems like a kludge from
earlier days of the TLS code.
This commit fixes a peculiar corner case in the client-side DoT code
because of which a crash could occur during a zone transfer. A junk
DNS message should be sent at the end of a zone transfer via TLS to
trigger the crash (abort).
This commit, hopefully, fixes that.
Also, this commit adds similar changes to the TCP DNS code, as it
shares the same origin and most of the logic.
Change 5756 (GL #2854) introduced build errors when using
'configure --disable-doh'. To fix this, isc_nm_is_http_handle() is
now defined in all builds, not just builds that have DoH enabled.
Missing code comments were added both for that function and for
isc_nm_is_tlsdns_handle().
This commit adds an isc_nm_set_min_answer_ttl() function which is
intended to be used to give a hint to the underlying transport
regarding the answer TTL.
The interface is intentionally kept generic because over time more
transports might benefit from this functionality, but currently it is
intended for DoH to set "max-age" value within "Cache-Control" HTTP
header (as recommended in the RFC8484, section 5.1 "Cache
Interaction").
It is no-op for other DNS transports for the time being.
it is possible for udp_recv_cb() to fire after the socket
is already shutting down and statichandle is NULL; we need to
create a temporary handle in this case.
route/netlink sockets don't have stats counters associated with them,
so it's now necessary to check whether socket stats exist before
incrementing or decrementing them. rather than relying on the caller
for this, we now just pass the socket and an index, and the correct
stats counter will be updated if it exists.
isc_nm_routeconnect() opens a route/netlink socket, then calls a
connect callback, much like isc_nm_udpconnect(), with a handle that
can then be monitored for network changes.
Internally the socket is treated as a UDP socket, since route/netlink
sockets follow the datagram contract.
After support for route/netlink sockets is merged, not all sockets
will have stats counters associated with them, so it's now necessary
to check whether socket stats exist before incrementing or decrementing
them. rather than relying on the caller for this, we now just pass the
socket and an index, and the correct stats counter will be updated if
it exists.
The __builtin_expect() can be used to provide the compiler with branch
prediction information. The Gcc manual says[1] on the subject:
In general, you should prefer to use actual profile feedback for
this (-fprofile-arcs), as programmers are notoriously bad at
predicting how their programs actually perform.
Stop using __builtin_expect() and ISC_LIKELY() and ISC_UNLIKELY() macros
to provide the branch prediction information as the performance testing
shows that named performs better when the __builtin_expect() is not
being used.
1. https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005fexpect
This commit fixes a crash in DoT code when it was attempting to call a
read callback on the later stages of the connection when it is not
available.
It also fixes [GL #2884] (back-trace provided in the bug report is
exactly the same as was seen when fixing this problem).
This commit makes BIND verify that zone transfers are allowed to be
done over the underlying connection. Currently, it makes sense only
for DoT, but the code is deliberately made to be protocol-agnostic.
The intention of having this function is to have a predicate to check
if a zone transfer could be performed over the given handle. In most
cases we can assume that we can do zone transfers over any stream
transport except DoH, but this assumption will not work for zone
transfers over DoT (XoT), as the RFC9103 requires ALPN to happen,
which might not be the case for all deployments of DoT.
- The `timeout_action` parameter to dns_dispatch_addresponse() has been
replaced with a netmgr callback that is called when a dispatch read
times out. this callback may optionally reset the read timer and
resume reading.
- Added a function to convert isc_interval to milliseconds; this is used
to translate fctx->interval into a value that can be passed to
dns_dispatch_addresponse() as the timeout.
- Note that netmgr timeouts are accurate to the millisecond, so code to
check whether a timeout has been reached cannot rely on microsecond
accuracy.
- If serve-stale is configured, then a timeout received by the resolver
may trigger it to return stale data, and then resume waiting for the
read timeout. this is no longer based on a separate stale timer.
- The code for canceling requests in request.c has been altered so that
it can run asynchronously.
- TCP timeout events apply to the dispatch, which may be shared by
multiple queries. since in the event of a timeout we have no query ID
to use to identify the resp we wanted, we now just send the timeout to
the oldest query that was pending.
- There was some additional refactoring in the resolver: combining
fctx_join() and fctx_try_events() into one function to reduce code
duplication, and using fixednames in fetchctx and fetchevent.
- Incidental fix: new_adbaddrinfo() can't return NULL anymore, so the
code can be simplified.
- The read timer must always be stopped when reading stops.
- Read callbacks can now call isc_nm_read() again in TCP, TCPDNS and
TLSDNS; previously this caused an assertion.
- The wrong failure code could be sent after a UDP recv failure because
the if statements were in the wrong order. the check for a NULL
address needs to be after the check for an error code, otherwise the
result will always be set to ISC_R_EOF.
- When aborting a read or connect because the netmgr is shutting down,
use ISC_R_SHUTTINGDOWN. (ISC_R_CANCELED is now reserved for when the
read has been canceled by the caller.)
- A new function isc_nmhandle_timer_running() has been added enabling a
callback to check whether the timer has been reset after processing a
timeout.
- Incidental netmgr fix: always use isc__nm_closing() instead of
referencing sock->mgr->closing directly
- Corrected a few comments that used outdated function names.
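The UDP recv ordering fix mentioned in the list above can be illustrated
with a small sketch. The ex_* result codes and the function are
hypothetical; they only mirror the described order of checks (error code
first, NULL address second).

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
	EX_R_SUCCESS,
	EX_R_EOF,
	EX_R_FAILURE
} ex_result_t;

/*
 * Classify the arguments of a datagram receive callback. A negative
 * nrecv is an error code. If the NULL address check came first, an
 * error delivered together with a NULL address would be reported as
 * EOF and the real failure would be lost.
 */
ex_result_t
ex_classify_udp_recv(long nrecv, const void *addr) {
	if (nrecv < 0) {
		return EX_R_FAILURE;
	}
	if (addr == NULL) {
		return EX_R_EOF;
	}
	return EX_R_SUCCESS;
}
```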
Previously isc_nm_read() required references on the handle to be at
least 2, under the assumption that it would only ever be called from a
connect or accept callback. however, it can also be called from a read
callback, in which case the reference count might be only 1.
On TCPDNS/TLSDNS read callback, the socket buffer could be reallocated
if the received contents would be larger than the buffer. The existing
code would not preserve the contents of the existing buffer, which led
to the loss of the already received data.
This commit changes the isc_mem_put()+isc_mem_get() with isc_mem_reget()
to preserve the existing contents of the socket buffer.
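The difference between "free and allocate again" and a reallocating call
can be sketched with the standard realloc(), used here as a stand-in for
isc_mem_reget(). The ex_sockbuf_t type and its functions are hypothetical
and much simpler than the real socket buffer handling.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct ex_sockbuf {
	unsigned char *data;
	size_t size; /* allocated bytes */
	size_t used; /* bytes already received */
} ex_sockbuf_t;

/*
 * Grow the buffer to hold at least 'needed' bytes. realloc() keeps
 * the bytes already stored, which a free()+malloc() pair would not.
 * Returns 0 on success, -1 on failure (buffer left untouched).
 */
int
ex_sockbuf_reserve(ex_sockbuf_t *b, size_t needed) {
	if (needed <= b->size) {
		return 0;
	}
	unsigned char *p = realloc(b->data, needed);
	if (p == NULL) {
		return -1;
	}
	b->data = p;
	b->size = needed;
	return 0;
}

/* Append newly received bytes after the ones already in the buffer. */
int
ex_sockbuf_append(ex_sockbuf_t *b, const void *src, size_t len) {
	if (ex_sockbuf_reserve(b, b->used + len) != 0) {
		return -1;
	}
	memcpy(b->data + b->used, src, len);
	b->used += len;
	return 0;
}
```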
The netmgr has an internal cache for freed active handles. This cache
was allocated using isc_mem_allocate()/isc_mem_free() API because it was
simpler to reallocate the cache when we needed to grow it. The new
isc_mem_reget() function could be used here, reducing the need to use
isc_mem_allocate() API which is a tad bit slower than isc_mem_get() API.
This commit adds new function isc_nm_http_makeuri() which is supposed
to unify DoH URI construction throughout the codebase.
It handles IPv6 addresses, hostnames, and IPv6 addresses given as
hostnames properly, and replaces similar ad-hoc code in the codebase.
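A sketch of the kind of logic such a helper needs, mainly the bracketing
of literal IPv6 addresses. The function name, parameter list and return
convention below are assumptions for illustration and are not the
signature of isc_nm_http_makeuri().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<scheme>://<host>:<port><path>" into 'out'. A host that
 * contains ':' is treated as a literal IPv6 address and wrapped in
 * brackets unless it is already bracketed. Returns 0 on success and
 * -1 if the result does not fit into 'outlen' bytes.
 */
int
ex_make_doh_uri(char *out, size_t outlen, bool https, const char *host,
		unsigned int port, const char *path) {
	const char *scheme = https ? "https" : "http";
	bool bracket = (strchr(host, ':') != NULL && host[0] != '[');

	int n = snprintf(out, outlen, "%s://%s%s%s:%u%s", scheme,
			 bracket ? "[" : "", host, bracket ? "]" : "", port,
			 path);
	if (n < 0 || (size_t)n >= outlen) {
		return -1;
	}
	return 0;
}
```

With these assumptions, the host "::1" on port 443 with the path
"/dns-query" becomes "https://[::1]:443/dns-query", while a plain
hostname is left unbracketed.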
this commit removes isc__nm_tcpdns_keepalive() and
isc__nm_tlsdns_keepalive(); keepalive for these protocols and
for TCP will now be set directly from isc_nmhandle_keepalive().
protocols that have an underlying TCP socket (i.e., TLS stream
and HTTP), now have protocol-specific routines, called by
isc_nmhandle_keepalive(), to set the keepalive value on the
underlying socket.
previously, receiving a keepalive option had no effect on how
long named would keep the connection open; there was a place to
configure the keepalive timeout but it was never used. this commit
corrects that.
this also fixes an error in isc__nm_{tcp,tls}dns_keepalive()
in which the sense of a REQUIRE test was reversed; previously this
error had not been noticed because the functions were not being
used.
- fix some duplicated and out-of-order prototypes declared in
netmgr-int.h
- rename isc_nm_tcpdns_keepalive to isc__nm_tcpdns_keepalive as
it's for internal use
This commit changes the DoH code in such a way that it makes no
assumptions regarding which headers are expected to be processed
first. In particular, the code expected the :method: pseudo-header to
be processed early, which might not be true.
Instead of disabling the fragmentation on the UDP sockets, we now
disable the Path MTU Discovery by setting IP(V6)_MTU_DISCOVER socket
option to IP_PMTUDISC_OMIT on Linux and disabling IP(V6)_DONTFRAG socket
option on FreeBSD. This option sets DF=0 in the IP header and also
ignores the Path MTU Discovery.
As additional mitigation on Linux, we recommend setting
net.ipv4.ip_no_pmtu_disc to Mode 3:
Mode 3 is a hardened pmtu discover mode. The kernel will only accept
fragmentation-needed errors if the underlying protocol can verify
them besides a plain socket lookup. Current protocols for which pmtu
events will be honored are TCP, SCTP and DCCP as they verify
e.g. the sequence number or the association. This mode should not be
enabled globally but is only intended to secure e.g. name servers in
namespaces where TCP path mtu must still work but path MTU
information of other protocols should be discarded. If enabled
globally this mode could break other protocols.
This commit changes TLS stream behaviour in such a way, that it is now
optimised for small writes. When there is a need to write 512 bytes
or less, we can avoid calling the memory allocator at
the expense of a possibly slight increase in memory usage. In case of
larger writes, the behaviour remains unchanged.
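One common way to get this trade-off is to embed a fixed-size buffer in
the send request and fall back to the allocator only for larger payloads.
The sketch below shows that pattern with hypothetical ex_* names; it is
not claimed to be how the TLS stream code is actually structured.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define EX_SMALL_WRITE_MAX 512

typedef struct ex_send_req {
	unsigned char smallbuf[EX_SMALL_WRITE_MAX]; /* inline storage */
	unsigned char *data; /* points at smallbuf or at heap memory */
	size_t len;
} ex_send_req_t;

/*
 * Copy the payload into the request. Payloads of up to 512 bytes use
 * the inline buffer and never touch the allocator; larger ones are
 * heap allocated. Returns 0 on success, -1 on allocation failure.
 */
int
ex_send_req_fill(ex_send_req_t *req, const void *src, size_t len) {
	if (len <= EX_SMALL_WRITE_MAX) {
		req->data = req->smallbuf;
	} else {
		req->data = malloc(len);
		if (req->data == NULL) {
			return -1;
		}
	}
	memcpy(req->data, src, len);
	req->len = len;
	return 0;
}

/* Release heap memory, if any was used, and reset the request. */
void
ex_send_req_clear(ex_send_req_t *req) {
	if (req->data != req->smallbuf) {
		free(req->data);
	}
	req->data = NULL;
	req->len = 0;
}
```

Every request carries the 512 byte inline buffer whether or not it is
used, which is where the slight increase in memory usage comes from.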
At least at this point doing memory copying is not required. Probably
it was a workaround for some problem in the earlier days of DoH, at
this point it appears to be a waste of CPU cycles.
This commit significantly simplifies the code in http_send_outgoing()
as it was unnecessarily complicated, because it was dealing with
multiple statically and dynamically allocated buffers, making it
extremely hard to follow, as well as making it do unnecessary
memory copying in some situations. This commit fixes these issues,
while retaining the high level buffering logic.
When an HTTP/2 client terminates a session it means that it is about
to close the underlying connection. However, we were not doing that.
As a result, with the latest changes to the test suite, which made it
limit the number of requests per transport connection, the tests
using quota would hang for quite a while. This commit fixes that.
The function should not be called here because it is, in general,
supposed to be called at the end of the transport level callbacks to
perform I/O, and thus, calling it here is clearly a mistake because it
breaks other code expectations. As a result of the call to
http_do_bio() from within isc__nm_http_request() the unit tests were
running slower than expected in some situations.
In this particular situation http_do_bio() is going to be called at
the end of the transport_connect_cb() (initially), or http_readcb(),
sending all of the scheduled requests at once.
This change affects only the test suite because it is the only place
in the codebase where isc__nm_http_request() is used in order to
ensure that the server is able to handle multiple HTTP/2 streams at
once.
This commit fixes a crash in DoH caused by the transport handle being
detached too early when sending outgoing data.
We need to attach to the session->handle earlier because as an
indirect result of the nghttp2_session_mem_send() the session might
get closed and the handle detached. However, there still might be
some outgoing data to handle. Besides, even when the underlying socket
was closed via the handle, we still should attempt to send
outgoing data via isc_nm_send() to let it call the write callback passed
to the http_send_outgoing().
This commit gets rid of custom code taking care of response buffering
by replacing the custom code with isc_buffer_t. Also, it gets rid of
an unnecessary memory copying when sending a response.
This commit replaces the ad-hoc 64K buffer for incoming POST data with
isc_buffer_t backed by a dynamically allocated buffer sized according
to the value in the "Content-Length" header.
The commit replaces an ad-hoc incoming DNS-message buffer in the
client-side DoH code with isc_buffer_t.
The commit also fixes a timing issue in the unit tests revealed by the
change.
This commit replaces a static ad-hoc HTTP/2 session's temporary buffer
with a realloc-able isc_buffer_t object, which is allocated on an
as-needed basis, lowering the memory consumption somewhat. The buffer
is needed in very rare cases, so allocating it prematurely is not
wise.
Also, it fixes a bug in http_readcb() where the ad-hoc buffer appeared
to be improperly used, leading to a situation when the processed data
from the receiving regions can be processed twice, while unprocessed
data will never be processed.
This commit gets rid of RW locks in a hot path of the DoH code. In the
original design, it was implied that we add new endpoints after the
HTTP listener was created. Such a design implies some locking. We do
not need such flexibility, though. Instead, we could build a set of
endpoints before the HTTP listener gets created. Such a design does
not need RW locks at all.
This commit makes the number of concurrent HTTP/2 streams per connection
configurable as a means to fight DDoS attacks. As soon as the limit is
reached, BIND terminates the whole session.
The commit adds a global configuration
option (http-streams-per-connection) which can be overridden in an
http <name> {...} statement as follows:
http local-http-server {
...
streams-per-connection 100;
...
};
For now the default value is 100, which should be enough (e.g. NGINX
uses 128, but it is a full-featured WEB-server). When using lower
numbers (e.g. ~70), it is possible to hit the limit with
e.g. flamethrower.
This commit adds the code (and some tests) which allows verifying
validity of HTTP paths both in incoming HTTP requests and in BIND's
configuration file.
An unhandled code path left GET query string data uninitialised (equal
to NULL) and led to a crash during the requests' base64 data
decoding. This commit fixes that.
It was discovered that setting the thread affinity on both the netmgr
and netthread threads led to inconsistent recursive performance because
sometimes the netmgr and netthread threads would compete over single
resource and sometimes not.
Removing setting the affinity causes a slight dip in the authoritative
performance around 5% (the measured range was from 3.8% to 7.8%), but
the recursive performance is now consistently good.
In the jemalloc merge request, we missed the fact that ah_frees and ah_handles
are reallocated which is not compatible with using isc_mem_get() for allocation
and isc_mem_put() for deallocation. This commit reverts that part and restores
use of isc_mem_allocate() and isc_mem_free().
Current mempools are kind of hybrid structures - they serve two
purposes:
1. mempool with a lock is basically static sized allocator with
pre-allocated free items
2. mempool without a lock is a doubly-linked list of preallocated items
The first kind of usage could be easily replaced with jemalloc small
sized arena objects and thread-local caches.
The second usage not-so-much and we need to keep this (in
libdns:message.c) for performance reasons.
The isc_mem_allocate() comes with additional cost because of the memory
tracking. In this commit, we replace the usage with isc_mem_get()
because we track the allocated sizes anyway, so it's possible to also
replace isc_mem_free() with isc_mem_put().
This commit makes BIND return HTTP status codes for malformed or too
small requests.
DNS request processing code would ignore such requests. Such an
approach works well for other DNS transport but does not make much
sense for HTTP, not allowing it to complete the request/response
sequence.
Suppose execution has reached the point where DNS message handling
code has been called. In that case, it means that the HTTP request has
been successfully processed, and, thus, we are expected to respond to
it either with a message containing some DNS payload or at least to
return an error status code. This commit ensures that BIND behaves
this way.
This error code fits better than the more generic "Internal Server
Error" (500) which implies that the problem is on the server.
Also, do not end the whole HTTP/2 session on a bad request.
We were too strict regarding the value and presence of "Accept" HTTP
header, slightly breaking compatibility with the specification.
According to RFC8484 client SHOULD add "Accept" header to the requests
but MUST be able to handle "application/dns-message" media type
regardless of the value of the header. That basically suggests we
ignore its value.
Besides, verifying the value of the "Accept" header is a bit tricky
because it could contain multiple media types, thus requiring proper
parsing. That is doable but does not provide us with any benefits.
Among other things, not verifying the value also fixes compatibility
with clients, which could advertise multiple media types as supported,
which we should accept. For example, it is possible for a perfectly
valid request to contain "application/dns-message", "application/*",
and "*/*" in the "Accept" header value. Still, we would treat such a
request as invalid.
The commit fixes BIND hanging when browsers end HTTP/2 streams
prematurely (for example, by sending RST_STREAM). It ensures that
isc__nmsocket_prep_destroy() will be called for an HTTP/2 stream,
allowing it to be properly disposed.
The problem was impossible to reproduce using dig or DoH benchmarking
software (e.g. flamethrower) because these do not tend to end HTTP/2
streams prematurely.
This commit adds two new autoconf options `--enable-doh` (enabled by
default) and `--with-libnghttp2` (mandatory when DoH is enabled).
When DoH support is disabled, the library is not linked in and support
for the http(s) protocol is disabled in the netmgr, named and dig.
In DNS Flag Day 2020, we started setting the DF (Don't Fragment) socket
option on the UDP sockets. It turned out that this code was incomplete,
leading to outgoing UDP packets being dropped.
This has now been remedied, so it is possible to disable fragmentation
on the UDP sockets again, as the sending error is now handled by
sending back an empty response with the TC (truncated) bit set.
This reverts commit 66eefac78c.
When the fragmentation is disabled on UDP sockets, the uv_udp_send()
call can fail with UV_EMSGSIZE for messages larger than path MTU.
Previously, such an error simply caused the response to be discarded.
This commit adds proper handling for this case: on such an error, a new
DNS response with the truncated bit set is generated and sent to the
client.
This change allows us to disable the fragmentation on the UDP
sockets again.
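The idea behind the truncated reply can be sketched as follows. This is
not the code from the commit: the function name make_truncated() and the
buffer handling are assumptions made for illustration. The sketch keeps
the header and the first question, sets TC, and drops all records.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DNS_HEADER_LEN 12
#define DNS_FLAG_TC    0x02 /* TC bit in the first flags octet (byte 2) */

/*
 * Hypothetical helper: build a minimal truncated reply from a full
 * response. Returns the length written to 'out', or 0 if the input
 * cannot be parsed or 'out' is too small.
 */
static size_t
make_truncated(const uint8_t *msg, size_t len, uint8_t *out, size_t outlen) {
    assert(msg != NULL && out != NULL);

    if (len < DNS_HEADER_LEN) {
        return 0;
    }

    size_t end = DNS_HEADER_LEN;
    unsigned int qdcount = ((unsigned int)msg[4] << 8) | msg[5];

    if (qdcount > 0) {
        /* Walk the QNAME labels up to the terminating root label. */
        while (end < len && msg[end] != 0) {
            if ((msg[end] & 0xC0) != 0) {
                return 0; /* compression pointer: not handled here */
            }
            end += (size_t)msg[end] + 1;
        }
        end += 1 + 4; /* root label, QTYPE, QCLASS */
        if (end > len) {
            return 0;
        }
    }
    if (end > outlen) {
        return 0;
    }

    memcpy(out, msg, end);
    out[2] |= DNS_FLAG_TC;        /* mark the reply as truncated */
    out[4] = 0;                   /* QDCOUNT: at most one question kept */
    out[5] = (qdcount > 0) ? 1 : 0;
    memset(out + 6, 0, 6);        /* ANCOUNT, NSCOUNT, ARCOUNT = 0 */
    return end;
}
```

A caller in this sketch would invoke make_truncated() from the send
completion path when the send fails with UV_EMSGSIZE, and then send the
shorter buffer instead of the original response.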
Previously, each protocol (TCPDNS, TLSDNS) specified its own function
to disable pipelining on the connection. An oversight would lead to an
assertion failure when the opcode is not query over a non-TCPDNS
protocol, because the isc_nm_tcpdns_sequential() function would be
called on a non-TCPDNS socket. This commit removes the per-protocol
functions and refactors the code to use a common isc_nm_sequential()
function that either disables pipelining on the socket or handles the
request in a protocol-specific manner. Currently it ignores the call
for HTTP sockets and causes an assertion failure for protocols where it
doesn't make sense to call the function at all.
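The dispatch described above could look roughly like this. The types
and names below (nm_socktype_t, nm_sock_t, nm_sequential()) are
simplified, hypothetical stand-ins and not the actual netmgr
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the netmgr socket types. */
typedef enum { NM_UDP, NM_TCPDNS, NM_TLSDNS, NM_HTTP } nm_socktype_t;

typedef struct {
    nm_socktype_t type;
    bool sequential; /* true once pipelining has been disabled */
} nm_sock_t;

/* One entry point for every protocol instead of one per protocol. */
static void
nm_sequential(nm_sock_t *sock) {
    assert(sock != NULL);

    switch (sock->type) {
    case NM_TCPDNS:
    case NM_TLSDNS:
        /* Stream DNS transports: disable pipelining. */
        sock->sequential = true;
        break;
    case NM_HTTP:
        /* The call is ignored for HTTP sockets. */
        break;
    default:
        /* Calling this on any other socket type is a programming error. */
        assert(0 && "sequential mode is not applicable to this socket");
    }
}
```

With a single function, the caller no longer needs to know which
transport a handle belongs to, which removes the class of bug described
in this commit.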
The warning was produced by an ASAN build:
runtime error: null pointer passed as argument 2, which is declared to
never be null
This commit fixes it by checking if nghttp2_session_mem_send() has
actually returned anything.
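The class of fix can be illustrated with a guarded copy. The sketch
does not call nghttp2; outbuf_t and outbuf_append() are hypothetical
names, and 'src'/'len' stand for the pointer and length that a call
such as nghttp2_session_mem_send() hands back to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size output buffer used only for this sketch. */
typedef struct {
    uint8_t data[256];
    size_t used;
} outbuf_t;

/* Append 'len' bytes from 'src'; returns the number of bytes copied. */
static size_t
outbuf_append(outbuf_t *buf, const uint8_t *src, size_t len) {
    assert(buf != NULL);

    /*
     * Nothing was produced: return before memcpy(), whose source
     * argument must never be NULL, even for a zero length.
     */
    if (src == NULL || len == 0) {
        return 0;
    }
    if (len > sizeof(buf->data) - buf->used) {
        return 0; /* does not fit; a real buffer would grow instead */
    }

    memcpy(buf->data + buf->used, src, len);
    buf->used += len;
    return len;
}
```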
This change sets the mentioned fields properly and gets rid of kludges
added at the time when we were keeping pointers to isc_sockaddr_t
instead of copies. Among other things, it helps to avoid a situation
where garbage instead of an address appears in dig output.
We cannot use DoH for zone transfers. According to RFC8484 a DoH
request contains exactly one DNS message (see Section 6: Definition of
the "application/dns-message" Media Type,
https://datatracker.ietf.org/doc/html/rfc8484#section-6). This makes
DoH unsuitable for zone transfers, as these usually need more than
one DNS message, especially for larger zones.
As zone transfers over DoH are not (yet) standardised, nor discussed
in RFC8484, the best thing we can do is to return "not implemented."
Technically DoH can be used to transfer small zones which fit in one
message, but that is not enough for the generic case.
Also, this commit makes the server-side DoH code ensure that no more
than one response can be sent over a single HTTP/2 stream. In HTTP/2,
one stream maps to one request/response transaction. Now the write
callback will be called with a failure error code if a second response
is attempted.
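The one-response-per-stream rule can be sketched as a flag on the
stream object. All names here (h2_stream_t, stream_send_response(),
record_result()) are hypothetical and only illustrate the behaviour
described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SEND_OK = 0, SEND_FAILURE = -1 } send_result_t;
typedef void (*write_cb_t)(send_result_t result, void *arg);

/* Hypothetical server-side stream state. */
typedef struct {
    int stream_id;
    bool response_submitted; /* set once the single response is queued */
} h2_stream_t;

static void
stream_send_response(h2_stream_t *stream, write_cb_t cb, void *cbarg) {
    assert(stream != NULL && cb != NULL);

    if (stream->response_submitted) {
        /* A second response on the same stream is refused. */
        cb(SEND_FAILURE, cbarg);
        return;
    }

    stream->response_submitted = true;
    /* Real code would hand the response to the HTTP/2 layer here. */
    cb(SEND_OK, cbarg);
}

/* Example callback: store the result where the caller can inspect it. */
static void
record_result(send_result_t result, void *arg) {
    *(send_result_t *)arg = result;
}
```

A zone transfer would try to send several messages through the same
handle; in this sketch, every message after the first reaches the
callback as SEND_FAILURE.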
Handle the situation in the header processing callback where the
client-side code receives a belated response or a part of it. That
could happen when the HTTP/2 session was already closed but some
response data from the server was still in flight. The other
client-side nghttp2 callbacks already handled this case.
The bug became apparent after HTTP/2 write buffering was supported,
leading to rare unit test failures.
This commit ensures that sock->h2.connect.cstream gets nullified when
the object in question is deleted. This fixes a nasty crash in dig,
exposed when receiving large responses, that was caused by a double
free().
Also, it refactors how the client-side code keeps track of client
streams, (hopefully) preventing similar errors from appearing in the
future.
This commit makes the NM code report HTTP as a stream protocol. This
makes it possible to handle large responses properly, for example:
dig +https @127.0.0.1 A cmts1-dhcp.longlines.com
The Windows support has been completely removed from the source tree
and BIND 9 now no longer supports native compilation on Windows.
We might consider reviewing a mingw-w64 port if contributed by an
external party, but no development effort will be put into making
BIND 9 compile and run on Windows again.
While cleaning up the usage of HAVE_UV_<func> macros, we forgot to
clean up HAVE_UV_UDP_CONNECT and HAVE_UV_TRANSLATE_SYS_ERROR in the
actual code. This caused the Windows build to fail on uv_udp_send(),
because the socket was already connected and we were falsely assuming
that it was not.
The platforms with autoconf support were not affected, because we were
still checking for the functions in configure.
This commit adds the ability to consolidate HTTP/2 write requests if
there is already one in flight. In that case, the code consolidates
multiple subsequent write requests into a larger one, which uses the
network more efficiently by creating larger TCP packets and by
reducing TLS record overhead (large TLS records are created instead of
multiple small ones).
This optimisation is especially efficient for clients that create many
concurrent HTTP/2 streams over a transport connection at once. This
way, the code might create a small number of multi-kilobyte requests
instead of many 50-120 byte ones.
In fact, it turned out to work so well that I had to add a work-around
to the code to ensure compatibility with flamethrower, which, at the
time of writing, does not support TLS records larger than two
kilobytes. Now the code tries to flush the write buffer once it reaches
1.5 kilobytes, which is still pretty adequate for our use case.
Essentially, this commit implements a recommendation given by the
nghttp2 library:
https://nghttp2.org/documentation/nghttp2_session_mem_send.html
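The consolidation policy described above can be sketched as a small
state machine. This is an illustration under assumptions, not the code
from the commit: h2_writer_t and the writer_*() functions are
hypothetical, and the actual network write is replaced by counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FLUSH_THRESHOLD 1536 /* 1.5 kilobytes, as in the commit message */
#define PENDING_MAX     4096 /* arbitrary capacity for this sketch */

typedef struct {
    uint8_t pending[PENDING_MAX]; /* consolidated, not yet sent data */
    size_t pending_len;
    bool write_in_flight;
    unsigned int flushes;  /* number of writes issued */
    size_t last_flush_len; /* size of the most recent write */
} h2_writer_t;

/* Issue one write for everything accumulated so far. */
static void
writer_flush(h2_writer_t *w) {
    if (w->pending_len == 0) {
        return;
    }
    w->flushes++;
    w->last_flush_len = w->pending_len;
    w->pending_len = 0;
    w->write_in_flight = true;
}

/* Queue data; send at once only if idle or the threshold is reached. */
static bool
writer_submit(h2_writer_t *w, const uint8_t *data, size_t len) {
    assert(w != NULL && data != NULL);

    if (len > sizeof(w->pending) - w->pending_len) {
        return false;
    }
    memcpy(w->pending + w->pending_len, data, len);
    w->pending_len += len;

    if (!w->write_in_flight || w->pending_len >= FLUSH_THRESHOLD) {
        writer_flush(w);
    }
    return true;
}

/* Completion of the in-flight write: send whatever piled up meanwhile. */
static void
writer_write_done(h2_writer_t *w) {
    w->write_in_flight = false;
    writer_flush(w);
}
```

In this sketch, ten 100-byte requests submitted while a write is in
flight leave as a single 1000-byte write once that write completes.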
libuv has support for running long-running tasks in dedicated
threadpools, so that they don't affect networking IO.
This commit adds the isc_nm_work_enqueue() wrapper, which wraps around
the libuv API and runs it on top of the associated worker loop.
The only limitation is that the function must be called from inside a
network manager thread, so the call to the function should be wrapped
inside a (bound) task.
Instead of having a configure check for every missing function that has
been added in a later version of libuv, we now use UV_VERSION_HEX to
decide whether we need the shim or not.
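A version-based check of this kind might look like the sketch below.
It compiles without libuv: the UV_VERSION() helper, the stand-in value
for UV_VERSION_HEX and the NEED_UV_UDP_CONNECT_SHIM macro are
assumptions for illustration. uv_udp_connect() is used as the example
because it was added in libuv 1.27.0.

```c
#include <assert.h>

/* Pack a version the same way libuv packs UV_VERSION_HEX. */
#define UV_VERSION(major, minor, patch) \
    (((major) << 16) | ((minor) << 8) | (patch))

/*
 * Stand-in so that the sketch builds without <uv.h>; with the real
 * header included first, libuv's own UV_VERSION_HEX is used instead.
 */
#ifndef UV_VERSION_HEX
#define UV_VERSION_HEX UV_VERSION(1, 26, 0)
#endif

/* One compile-time comparison replaces a per-function configure check. */
#if UV_VERSION_HEX < UV_VERSION(1, 27, 0)
#define NEED_UV_UDP_CONNECT_SHIM 1
#else
#define NEED_UV_UDP_CONNECT_SHIM 0
#endif

static_assert(UV_VERSION(1, 27, 0) == 0x011B00,
              "major, minor and patch occupy one byte each");

/* Runtime view of the compile-time decision, handy for logging. */
static int
need_udp_connect_shim(void) {
    return NEED_UV_UDP_CONNECT_SHIM;
}
```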