Commit graph

624 commits

Author SHA1 Message Date
Ondřej Surý
8e240bbb5f Fix isc_buffer_init capacity mismatch in DoH data chunk callback
isc_buffer_init() is given MAX_DNS_MESSAGE_SIZE (65535) as capacity but
only h2->content_length bytes are allocated.  This makes the buffer
believe it has more space than actually allocated.  A secondary bounds
check (new_bufsize <= h2->content_length) prevents actual overflow, but
the buffer invariant is violated.

Pass h2->content_length as the capacity to match the allocation.
2026-03-18 11:39:01 +01:00
Ondřej Surý
2ab3d7c075
Fix missing server socket detach in TLS accept error path
When TLS creation fails in tlslisten_acceptcb(), tlssock->server
was not detached before detaching tlssock itself.
2026-03-14 13:58:32 +01:00
Ondřej Surý
295139f8ca
Rename isc_net_getudpportrange() to isc_net_getportrange()
This better reflects the true nature of the function as we are reading
the ephemeral port range which is not related to UDP at all.
2026-02-20 14:06:23 +01:00
Ondřej Surý
04c81b55d2
Implement IP_LOCAL_PORT_RANGE socket option for Linux
For Linux >= 6.8:

Since 2023, Linux has introduced a change to the IP_LOCAL_PORT_RANGE
socket option that eliminates the need for the random window
shifting (implemented as a fallback in the next commit).

By setting IP_LOCAL_PORT_RANGE option, we tell the kernel to use better
approach to the source port selection.

For Linux << 6.8:

This implement selecting port by random shifting range leveraging the
IP_LOCAL_PORT_RANGE socket option.  The network manager is initialized
with the ephemeral port range (on startup and on reconfig) and then for
every outgoing TCP connection, we define a custom port range (1000
ports) and then randomly shift the custom range within the system range.

This helps the kernel to reduce the search space to the custom window
between <random_offset, random_offset + 1000>.

Reference:
https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance/#kernel
2026-02-20 14:06:23 +01:00
Ondřej Surý
2c48fcaeed
Improve the source port selection on Linux
Since 2015, Linux has introduced a new socket option to overcome TCP
limitations: When an application needs to force a source IP on an active
TCP socket it has to use bind(IP, port=x).  As most applications do not
want to deal with already used ports, x is often set to 0, meaning the
kernel is in charge to find an available port.  But kernel does not know
yet if this socket is going to be a listener or be connected. This
IP_BIND_ADDRESS_NO_PORT socket option ask the kernel to ignore the 0
port provided by application in bind(IP, port=0) and only remember the
given IP address. The port will be automatically chosen at connect()
time, in a way that allows sharing a source port as long as the 4-tuples
are unique.

Enable IP_BIND_ADDRESS_NO_PORT on the outgoing TCP sockets to overcome
this TCP limitation.
2026-02-20 14:06:23 +01:00
Evan Hunt
d4ebea1037 use a standard CLEANUP macro
CLEANUP is a macro similar to CHECK but unconditional, jumping
to cleanup even if the result is ISC_R_SUCCESS. It is now used
in place of DST_RET, CLEANUP_WITH, and CHECK(<non-success constant>).
2025-12-03 13:45:43 -08:00
Evan Hunt
6b33b7fc77 switch to RETERR where it wasn't being used
replace all instances of the pattern:

        result = <statement>
        if (result != ISC_R_SUCCESS) {
                return result;
        }

with:

        RETERR(<statement>);
2025-12-03 13:45:43 -08:00
Evan Hunt
38e94cc7da switch to CHECK where it wasn't being used
replace all instances of the pattern:

        result = <statement>
        if (result != ISC_R_SUCCESS) {
                goto cleanup;
        }

with:

        CHECK(<statement>);
2025-12-03 13:45:42 -08:00
Colin Vidal
7c8b517d56 attach socket before async streamdns_resume_processing
Call to `streamdns_resume_processing` is asynchronous but the socket
passed as argument is not attached when scheduling the call.

While there is no reproducible way (so far) to make the socket reference
number down to 0 before `streamdns_resume_processing` is called, attach
the socket before scheduling the call. This guard against an hypothetic
case where, for some reasons, the socket refcount would reach 0, and be
freed from memory when `streamdns_resume_processing` is called.
2025-11-20 18:08:52 +01:00
Michal Nowak
d91e8ed575 Use SET_IF_NOT_NULL in isc__nm_base64* 2025-10-22 12:50:55 +02:00
Ondřej Surý
94b4d105e8
Apply the changes from updated set_if_not_null semantic patch 2025-10-08 17:44:50 +02:00
Alessio Podda
6e7aec2cb7
Use unique names for probes.d files
Enabling LTO in the subsequent commit requires the file names to be
unique and having same probes.d in each of the libraries breaks this
requirement.  Rename probes.d to probes-{isc,dns,ns}.d files and adjust
the includes.
2025-09-24 13:18:13 +02:00
Colin Vidal
2cbe958df6 simplify nchildren count in isc_nm_listenudp
Slight simplification of the logic to define .nchildren listening UDP
socket.
2025-09-16 14:22:15 +02:00
Ondřej Surý
42496f3f4a
Use ControlStatementsExceptControlMacros for SpaceBeforeParens
> Put a space before opening parentheses only after control statement
> keywords (for/if/while...) except this option doesn’t apply to ForEach
> and If macros. This is useful in projects where ForEach/If macros are
> treated as function calls instead of control statements.
2025-08-19 07:58:33 +02:00
Ondřej Surý
f6aed602f0
Refactor the network manager to be a singleton
There is only a single network manager running on top of the loop
manager (except for tests).  Refactor the network manager to be a
singleton (a single instance) and change the unit tests, so that the
shorter read timeouts apply only to a specific handle, not the whole
extra 'connect_nm' network manager instance.
2025-07-23 22:45:38 +02:00
Ondřej Surý
b8d00e2e18
Change the loopmgr to be singleton
All the applications built on top of the loop manager were required to
create just a single instance of the loop manager.  Refactor the loop
manager to not expose this instance to the callers and keep the loop
manager object internal to the isc_loop compilation unit.

This significantly simplifies a number of data structures and calls to
the isc_loop API.
2025-07-23 22:44:16 +02:00
Ondřej Surý
cca4b26d31
Use regular reference counting macro for isc_nm_t structure
Instead of having hand crafted attach/detach/destroy functions, replace
them with the standard ISC_REFCOUNT macro.  This also have advantage
that delayed netmgr detach (from dns_dispatch) now doesn't cause
assertion failure.  This can happen with delayed (call_rcu) shutdown of
dns_adb.
2025-07-09 21:22:48 +02:00
Ondřej Surý
1032681af0 Convert the isc/tid.h to use own signed integer isc_tid_t type
Change the internal type used for isc_tid unit to isc_tid_t to hide the
specific integer type being used for the 'tid'.  Internally, the signed
integer type is being used.  This allows us to have negatively indexed
arrays that works both for threads with assigned tid and the threads
with unassigned tid.  This should be used only in specific situations.
2025-06-28 13:32:12 +02:00
Mark Andrews
422b9118e8 Use clang-format-20 to update formatting 2025-06-25 12:44:22 +10:00
Aydın Mercan
5cd6c173ff
replace the build system with meson
Meson is a modern build system that has seen a rise in adoption and some
version of it is available in almost every platform supported.

Compared to automake, meson has the following advantages:

* Meson provides a significant boost to the build and configuration time
  by better exploiting parallelism.

* Meson is subjectively considered to be better in readability.

These merits alone justify experimenting with meson as a way of
improving development time and ergonomics. However, there are some
compromises to ensure the transition goes relatively smooth:

* The system tests currently rely on various files within the source
  directory. Changing this requirement is a non-trivial task that can't
  be currently justified. Currently the last compiled build directory
  writes into the source tree which is in turn used by pytest.

* The minimum version supported has been fixed at 0.61. Increasing this
  value will require choosing a baseline of distributions that can
  package with meson. On the contrary, there will likely be an attempt
  to decrease this value to ensure almost universal support for building
  BIND 9 with meson.
2025-06-11 10:30:12 +03:00
Ondřej Surý
7f498cc60d
Give every memory pool a name
Instead of giving the memory pools names with an explicit call to
isc_mempool_setname(), add the name to isc_mempool_create() call to have
all the memory pools an unconditional name.
2025-05-29 05:46:46 +02:00
Evan Hunt
dd9a685f4a simplify code around isc_mem_put() and isc_mem_free()
it isn't necessary to set a pointer to NULL after calling
isc_mem_put() or isc_mem_free(), because those macros take
care of it automatically.
2025-05-28 17:22:32 -07:00
Evan Hunt
8487e43ad9 make all ISC_LIST_FOREACH calls safe
previously, ISC_LIST_FOREACH and ISC_LIST_FOREACH_SAFE were
two separate macros, with the _SAFE version allowing entries
to be unlinked during the loop. ISC_LIST_FOREACH is now also
safe, and the separate _SAFE macro has been removed.

similarly, the ISC_LIST_FOREACH_REV macro is now safe, and
ISC_LIST_FOREACH_REV_SAFE has also been removed.
2025-05-23 13:09:10 -07:00
Aram Sargsyan
74a8acdc8d Separate the single setter/getter functions for TCP timeouts
Previously all kinds of TCP timeouts had a single getter and setter
functions. Separate each timeout to its own getter/setter functions,
because in majority of cases only one is required at a time, and it's
not optimal expanding those functions every time a new timeout value
is implemented.
2025-04-23 17:03:05 +00:00
Aram Sargsyan
70ad94257d Implement tcp-primaries-timeout
The new 'tcp-primaries-timeout' configuration option works the same way
as the existing 'tcp-initial-timeout' option, but applies only to the
TCP connections made to the primary servers, so that the timeout value
can be set separately for them. The default is 15 seconds.

Also, while accommodating zone.c's code to support the new option, make
a light refactoring with the way UDP timeouts are calculated by using
definitions instead of hardcoded values.
2025-04-23 17:03:05 +00:00
Evan Hunt
ad7f744115 use ISC_LIST_FOREACH in more places
use the ISC_LIST_FOREACH pattern in places where lists had
been iterated using a different pattern from the typical
`for` loop: for example, `while (!ISC_LIST_EMPTY(...))` or
`while ((e = ISC_LIST_HEAD(...)) != NULL)`.
2025-03-31 13:45:14 -07:00
Evan Hunt
522ca7bb54 switch to ISC_LIST_FOREACH everywhere
the pattern `for (x = ISC_LIST_HEAD(...); x != NULL; ISC_LIST_NEXT(...)`
has been changed to `ISC_LIST_FOREACH` throughout BIND, except in a few
cases where the change would be excessively complex.

in most cases this was a straightforward change. in some places,
however, the list element variable was referenced after the loop
ended, and the code was refactored to avoid this necessity.

also, because `ISC_LIST_FOREACH` uses typeof(list.head) to declare
the list elements, compilation failures can occur if the list object
has a `const` qualifier.  some `const` qualifiers have been removed
from function parameters to avoid this problem, and where that was not
possible, `UNCONST` was used.
2025-03-31 13:45:10 -07:00
Artem Boldariev
eaad0aefe6 DoH: Bump the active streams processing limit
This commit bumps the total number of active streams (= the opened
streams for which a request is received, but response is not ready) to
60% of the total streams limit.

The previous limit turned out to be too tight as revealed by
longer (≥1h) runs of "stress:long:rpz:doh+udp:linux:*" tests.
2025-03-03 11:32:29 +02:00
Artem Boldariev
217a1ebd79 DoH: remove obsolete INSIST() check
The check, while not active by default, is not valid since the commit
8b8f4d500d.

See 'if (total == 0) { ...' below branch to understand why.
2025-03-03 11:32:11 +02:00
Artem Boldariev
c5f7968856 DoH: Flush HTTP write buffer on an outgoing DNS message
Previously, the code would try to avoid sending any data regardless of
what it is unless:

a) The flush limit is reached;
b) There are no sends in flight.

This strategy is used to avoid too numerous send requests with little
amount of data. However, it has been proven to be too aggressive and,
in fact, harms performance in some cases (e.g., on longer (≥1h) runs
of "stress:long:rpz:doh+udp:linux:*").

Now, additionally to the listed cases, we also:

c) Flush the buffer and perform a send operation when there is an
outgoing DNS message passed to the code (which is indicated by the
presence of a send callback).

That helps improve performance for "stress:long:rpz:doh+udp:linux:*"
tests.
2025-03-03 11:32:11 +02:00
Artem Boldariev
0e1b02868a DoH: Limit the number of delayed IO processing requests
Previously, a function for continuing IO processing on the next UV
tick was introduced (http_do_bio_async()). The intention behind this
function was to ensure that http_do_bio() is eventually called at
least once in the future. However, the current implementation allows
queueing multiple such delayed requests needlessly. There is currently
no need for these excessive requests as http_do_bio() can requeue them
if needed. At the same time, each such request can lead to a memory
allocation, particularly in BIND 9.18.

This commit ensures that the number of enqueued delayed IO processing
requests never exceeds one in order to avoid potentially bombarding IO
threads with the delayed requests needlessly.
2025-03-03 11:32:11 +02:00
Artem Boldariev
0956fb9b9e DoH: Simplify http_do_bio()
This commit significantly simplifies the code flow in the
http_do_bio() function, which is responsible for processing incoming
and outgoing HTTP/2 data. It seems that the way it was structured
before was indirectly caused by the presence of the missing callback
calls bug, fixed in 8b8f4d500d.

The change introduced by this commit is known to remove a bottleneck
and allows reproducible and measurable performance improvement for
long runs (>= 1h) of "stress:long:rpz:doh+udp:linux:*" tests.

Additionally, it fixes a similar issue with potentially missing send
callback calls processing and hardens the code against use-after-free
errors related to the session object (they can potentially occur).
2025-03-03 11:32:11 +02:00
Ondřej Surý
c5075a9a61
Remove convenience list macros from isc/util.h
The short convenience list macros were used very sparingly and
inconsistenly in the code base.  As the consistency is prefered over
the convenience, all shortened list macro were removed in favor of
their ISC_LIST API targets.
2025-03-01 07:33:40 +01:00
Ondřej Surý
2aa70fff76
Remove unused isc_mutexblock and isc_condition units
The isc_mutexblock and isc_condition units were no longer in use and
were removed.
2025-03-01 07:33:09 +01:00
Artem Boldariev
2adabe835a DoH: http_send_outgoing() return value is not used
The value returned by http_send_outgoing() is not used anywhere, so we
make it not return anything (void). Probably it is an omission from
older times.
2025-02-19 17:52:36 +02:00
Artem Boldariev
8b8f4d500d DoH: Fix missing send callback calls
When handling outgoing data, there were a couple of rarely executed
code paths that would not take into account that the callback MUST be
called.

It could lead to potential memory leaks and consequent shutdown hangs.
2025-02-19 17:52:36 +02:00
Artem Boldariev
a22bc2d7d4 DoH: change how the active streams number is calculated
This commit changes the way how the number of active HTTP streams is
calculated and allows it to scale with the values of the maximum
amount of streams per connection, instead of effectively capping at
STREAM_CLIENTS_PER_CONN.

The original limit, which is intended to define the pipelining limit
for TCP/DoT. However, it appeared to be too restrictive for DoH, as it
works quite differently and implements pipelining at protocol level by
the means of multiplexing multiple streams. That renders each stream
to be effectively a separate connection from the point of view of the
rest of the codebase.
2025-02-19 17:52:36 +02:00
Artem Boldariev
05e8a50818 DoH: Track the amount of in flight outgoing data
Previously we would limit the amount of incoming data to process based
solely on the presence of not completed send requests. That worked,
however, it was found to severely degrade performance in certain
cases, as was revealed during extended testing.

Now we switch to keeping track of how much data is in flight (or ready
to be in flight) and limit the amount of processed incoming data when
the amount of in flight data surpasses the given threshold, similarly
to like we do in other transports.
2025-02-19 17:52:36 +02:00
Artem Boldariev
937b5f8349 DoH: reduce excessive bad request logging
We started using isc_nm_bad_request() more actively throughout
codebase. In the case of HTTP/2 it can lead to a large count of
useless "Bad Request" messages in the BIND log, as often we attempt to
send such request over effectively finished HTTP/2 sessions.

This commit fixes that.
2025-01-15 14:09:17 +00:00
Artem Boldariev
4ae4e255cf Do not stop timer in isc_nm_read_stop() in manual timer mode
A call to isc_nm_read_stop() would always stop reading timer even in
manual timer control mode which was added with StreamDNS in mind. That
looks like an omission that happened due to how timers are controlled
in StreamDNS where we always stop the timer before pausing reading
anyway (see streamdns_on_complete_dnsmessage()). That would not work
well for HTTP, though, where we might want pause reading without
stopping the timer in the case we want to split incoming data into
multiple chunks to be processed independently.

I suppose that it happened due to NM refactoring in the middle of
StreamDNS development (at the time isc_nm_cancelread() and
isc_nm_pauseread() were removed), as the StreamDNS code seems to be
written as if timers are not stoping during a call to
isc_nm_read_stop().
2025-01-15 14:09:17 +00:00
Artem Boldariev
609a41517b DoH: introduce manual read timer control
This commit introduces manual read timer control as used by StreamDNS
and its underlying transports. Before that, DoH code would rely on the
timer control provided by TCP, which would reset the timer any time
some data arrived. Now, the timer is restarted only when a full DNS
message is processed in line with other DNS transports.

That change is required because we should not stop the timer when
reading from the network is paused due to throttling. We need a way to
drop timed-out clients, particularly those who refuse to read the data
we send.
2025-01-15 14:09:17 +00:00
Artem Boldariev
3425e4b1d0 DoH: floodding clients detection
This commit adds logic to make code better protected against clients
that send valid HTTP/2 data that is useless from a DNS server
perspective.

Firstly, it adds logic that protects against clients who send too
little useful (=DNS) data. We achieve that by adding a check that
eventually detects such clients with a nonfavorable useful to
processed data ratio after the initial grace period. The grace period
is limited to processing 128 KiB of data, which should be enough for
sending the largest possible DNS message in a GET request and then
some. This is the main safety belt that would detect even flooding
clients that initially behave well in order to fool the checks server.

Secondly, in addition to the above, we introduce additional checks to
detect outright misbehaving clients earlier:

The code will treat clients that open too many streams (50) without
sending any data for processing as flooding ones; The clients that
managed to send 1.5 KiB of data without opening a single stream or
submitting at least some DNS data will be treated as flooding ones.
Of course, the behaviour described above is nothing else but
heuristical checks, so they can never be perfect. At the same time,
they should be reasonable enough not to drop any valid clients,
realatively easy to implement, and have negligible computational
overhead.
2025-01-15 14:09:17 +00:00
Artem Boldariev
9846f395ad DoH: process data chunk by chunk instead of all at once
Initially, our DNS-over-HTTP(S) implementation would try to process as
much incoming data from the network as possible. However, that might
be undesirable as we might create too many streams (each effectively
backed by a ns_client_t object). That is too forgiving as it might
overwhelm the server and trash its memory allocator, causing high CPU
and memory usage.

Instead of doing that, we resort to processing incoming data using a
chunk-by-chunk processing strategy. That is, we split data into small
chunks (currently 256 bytes) and process each of them
asynchronously. However, we can process more than one chunk at
once (up to 4 currently), given that the number of HTTP/2 streams has
not increased while processing a chunk.

That alone is not enough, though. In addition to the above, we should
limit the number of active streams: these streams for which we have
received a request and started processing it (the ones for which a
read callback was called), as it is perfectly fine to have more opened
streams than active ones. In the case we have reached or surpassed the
limit of active streams, we stop reading AND processing the data from
the remote peer. The number of active streams is effectively decreased
only when responses associated with the active streams are sent to the
remote peer.

Overall, this strategy is very similar to the one used for other
stream-based DNS transports like TCP and TLS.
2025-01-15 14:09:17 +00:00
Michał Kępień
d6f9785ac6
Enable extraction of exact local socket addresses
Extracting the exact address that each wildcard/TCP socket is bound to
locally requires issuing the getsockname() system call, which libuv
exposes via its uv_*_getsockname() functions.  This is only required for
detailed logging and comes at a noticeable performance cost, so it
should not happen by default.  However, it is useful for debugging
certain problems (e.g. cryptic system test failures), so a convenient
way of enabling that behavior should exist.

Update isc_nmhandle_localaddr() so that it calls uv_*_getsockname() when
the ISC_SOCKET_DETAILS preprocessor macro is set at compile time.
Ensure proper handling of sockets that wrap other sockets.

Set the new ISC_SOCKET_DETAILS macro by default when --enable-developer
is passed to ./configure.  This enables detailed logging in the system
tests run in GitLab CI without affecting performance in non-development
BIND 9 builds.

Note that setting the ISC_SOCKET_DETAILS preprocessor macro at compile
time enables all callers of isc_nmhandle_localaddr() to extract the
exact address of a given local socket, which results e.g. in dnstap
captures containing more accurate information.

Mention the new preprocessor macro in the section of the ARM that
discusses why exact socket addresses may not be logged by default.
2024-12-29 12:32:05 +01:00
Artem Boldariev
6691a1530d TLS SNI - add low level support for SNI to the networking code
This commit adds support for setting SNI hostnames in outgoing
connections over TLS.

Most of the changes are related to either adapting the code to accept
and extra argument in *connect() functions and a couple of changes to
the TLS Stream to actually make use of the new SNI hostname
information.
2024-12-26 17:23:12 +02:00
Matthijs Mekking
aa24b77d8b Fix nsupdate hang when processing a large update
The root cause is the fix for CVE-2024-0760 (part 3), which resets
the TCP connection on a failed send. Specifically commit
4b7c61381f stops reading on the socket
because the TCP connection is throttling.

When the tcpdns_send_cb callback thinks about restarting reading
on the socket, this fails because the socket is a client socket.
And nsupdate is a client and is using the same netmgr code.

This commit removes the requirement that the socket must be a server
socket, allowing reading on the socket again after being throttled.
2024-12-05 15:40:48 +01:00
Artem Boldariev
300f05110d Extended TCP accept()/close() logging
This commit adds extra log messages issued when accepting or closing a
TCP connection (provided that debugging logging level >=99 is
enabled).
2024-11-27 21:14:08 +02:00
Aydın Mercan
d987e2d745
add separate query counters for new protocols
Add query counters for DoT, DoH, unencrypted DoH and their proxied
counterparts. The protocols don't increment TCP/UDP counters anymore
since they aren't the same as plain DNS-over-53.
2024-11-25 13:07:29 +03:00
Ondřej Surý
1a19ce39db
Remove redundant semicolons after the closing braces of functions 2024-11-19 12:27:22 +01:00
Ondřej Surý
0258850f20
Remove redundant parentheses from the return statement 2024-11-19 12:27:22 +01:00