Commit graph

8742 commits

Author SHA1 Message Date
Aurelien DARRAGON
aa53887398 MINOR: counters: add shared counters helpers to get and drop shared pointers
create include/haproxy/counters.h and src/counters.c files to anticipate
for further helpers as some counters specific tasks needs to be carried
out and since counters are shared between multiple object types (ie:
listener, proxy, server..) we need generic helpers.

Add some shared counters helper which are not yet used but will be updated
in upcoming commits.
2025-06-05 09:59:04 +02:00
Aurelien DARRAGON
a0dcab5c45 MAJOR: counters: add shared counters base infrastructure
Shareable counters are not tagged as shared counters and are dynamically
allocated in separate memory area as a prerequisite for being stored
in shared memory area. For now, GUID and threads groups are not taken into
account, this is only a first step.

also we ensure all counters are now manipulated using atomic operations,
namely, "last_change" counter is now read from and written to using atomic
ops.

Despite the numerous changes caused by the counters being moved away from
counters struct, no change of behavior should be expected.
2025-06-05 09:58:58 +02:00
Christopher Faulet
8ee650a88b CLEANUP: applet: Update comment for applet_put* functions
These functions were copied from the channel API and modified to work with
applets using the new API or the legacy one. However, the comments were
updated accordingly. It is the purpose of this patch.
2025-06-03 15:03:30 +02:00
Aurelien DARRAGON
368d01361a MEDIUM: server: add and use srv_init() function
rename _srv_postparse() internal function to srv_init() function and group
srv_init_per_thr() plus idle conns list init inside it. This way we can
perform some simplifications as srv_init() performs multiple server
init steps after parsing.

SRV_F_CHECKED flag was added, it is automatically set when srv_init()
runs successfully. If the flag is already set and srv_init() is called
again, nothing is done. This permis to manually call srv_init() earlier
than the default POST_CHECK hook when needed without risking to do things
twice.
2025-06-02 17:51:33 +02:00
Aurelien DARRAGON
889ef6f67b MEDIUM: server: automatically add server to proxy list in new_server()
while new_server() takes the parent proxy as argument and even assigns
srv->proxy to the parent proxy, it didn't actually inserted the server
to the parent proxy server list on success.

The result is that sometimes we add the server to the list after
new_server() is called, and sometimes we don't.

This is really error-prone and because of that hooks such as
REGISTER_POST_SERVER_CHECK() which as run for all servers listed in
all proxies may not be relied upon for servers which are not actually
inserted in their parent proxy server list. Plus it feels very strange
to have a server that points to a proxy, but then the proxy doesn't know
about it because it cannot find it in its server list.

To prevent errors and make proxy->srv list reliable, we move the insertion
logic directly under new_server(). This requires to know if we are called
during parsing or during runtime to either insert or append the server to
the parent proxy list. For that we use PR_FL_CHECKED flag from the parent
proxy (if the flag is set, then the proxy was checked so we are past the
init phase, thus we assume we are called during runtime)

This implies that during startup if new_server() has to be cancelled on
error paths we need to call srv_detach() (which is now exposed in server.h)
before srv_drop().

The consequence of this commit is that REGISTER_POST_SERVER_CHECK() should
not run reliably on all servers created using new_server() (without having
to manually loop on global servers_list)
2025-06-02 17:51:30 +02:00
Aurelien DARRAGON
943958c3ff MINOR: proxy: add a true list containing all proxies
We have global proxies_list pointer which is announced as the list of
"all existing proxies", but in fact it only represents regular proxies
declared on the config file through "listen, frontend or backend" keywords

It is ambiguous, and we currently don't have a straightforwrd method to
iterate over all proxies (either public or internal ones) within haproxy

Instead we still have to manually iterate over multiple lists (main
proxies, log-forward proxies, peer proxies..) which is error-prone.

In this patch we add a struct list member (8 bytes) inside struct proxy
in order to store every proxy (except default ones) within a global
"proxies" list which is actually representative for all proxies existing
under haproxy process, like we already have for servers.
2025-06-02 17:51:21 +02:00
Aurelien DARRAGON
d04843167c MINOR: stats: add stat_col flags
Add stat_col flags member to store .generic bit and prepare for upcoming
flags. No functional change expected.
2025-06-02 17:51:08 +02:00
Willy Tarreau
9f4cd435d3 [RELEASE] Released version 3.3-dev0
Released version 3.3-dev0 with the following main changes :
    - MINOR: version: mention that it's development again
2025-05-28 16:46:34 +02:00
Willy Tarreau
8809251ee0 MINOR: version: mention that it's development again
This essentially reverts a6458fd426.
2025-05-28 16:46:15 +02:00
Willy Tarreau
a6458fd426 MINOR: version: mention that it's 3.2 LTS now.
The version will be maintained up to around Q2 2030. Let's
also update the INSTALL file to mention this.
2025-05-28 16:31:27 +02:00
Christopher Faulet
99e755d673 MINOR: listeners: Add support for a label on bind line
It is now possile to set a label on a bind line. All sockets attached to
this bind line inherits from this label. The idea is to be able to groud of
sockets. For now, there is no mechanism to create these groups, this must be
done by hand.
2025-05-26 19:00:00 +02:00
Willy Tarreau
3494775a1f MINOR: ssl: support strict-sni in ssl-default-bind-options
Several users already reported that it would be nice to support
strict-sni in ssl-default-bind-options. However, in order to support
it, we also need an option to disable it.

This patch moves the setting of the option from the strict_sni field
to a flag in the ssl_options field so that it can be inherited from
the default bind options, and adds a new "no-strict-sni" directive to
allow to disable it on a specific "bind" line.

The test file "del_ssl_crt-list.vtc" which already tests both options
was updated to make use of the default option and the no- variant to
confirm everything continues to work.
2025-05-22 15:31:54 +02:00
Willy Tarreau
a1577a89a0 MINOR: glitches: add global setting "tune.glitches.kill.cpu-usage"
It was mentioned during the development of glitches that it would be
nice to support not killing misbehaving connections below a certain
CPU usage so that poor implementations that routinely misbehave without
impact are not killed. This is now possible by setting a CPU usage
threshold under which we don't kill them via this parameter. It defaults
to zero so that we continue to kill them by default.
2025-05-21 15:47:42 +02:00
Amaury Denoyelle
00d90e8839 MINOR: quic: adjust quic_conn-t.h include list
Adjust include list in quic_conn-t.h. This file is included in many QUIC
source, so it is useful to keep as lightweight as possible. Note that
connection/QUIC MUX are transformed into forward declaration for better
layer separation.
2025-05-21 14:44:27 +02:00
Amaury Denoyelle
01e3b2119a MINOR: quic: add some missing includes
Insert some missing includes statement in QUIC source files. This was
detected after the next commit which adjust the include list used in
quic_conn-t.h file.
2025-05-21 14:44:27 +02:00
Amaury Denoyelle
f286288471 MINOR: quic: refactor handling of streams after MUX release
quic-conn layer has to handle itself STREAM frames after MUX release. If
the stream was already seen, it is probably only a retransmitted frame
which can be safely ignored. For other streams, an active closure may be
needed.

Thus it's necessary that quic-conn layer knows the highest stream ID
already handled by the MUX after its release. Previously, this was done
via <nb_streams> member array in quic-conn structure.

Refactor this by replacing <nb_streams> by two members called
<stream_max_uni>/<stream_max_bidi>. Indeed, it is unnecessary for
quic-conn layer to monitor locally opened uni streams, as the peer
cannot by definition emit a STREAM frame on it. Also, bidirectional
streams are always opened by the remote side.

Previously, <nb_streams> were set by quic-stream layer. Now,
<stream_max_uni>/<stream_max_bidi> members are only set one time, just
prior to QUIC MUX release. This is sufficient as quic-conn do not use
them if the MUX is available.

Note that previously, IDs were used relatively to their type, thus
incremented by 1, after shifting the original value. For simplification,
use the plain stream ID, which is incremented by 4.
2025-05-21 14:26:45 +02:00
Amaury Denoyelle
07d41a043c MINOR: quic: move function to check stream type in utils
Move general function to check if a stream is uni or bidirectional from
QUIC MUX to quic_utils module. This should prevent unnecessary include
of QUIC MUX header file in other sources.
2025-05-21 14:17:41 +02:00
Amaury Denoyelle
cf45bf1ad8 CLEANUP: quic: remove unused cbuf module
Cbuf are not used anymore. Remove the related source and header files,
as well as include statements in the rest of QUIC source files.
2025-05-21 14:16:37 +02:00
Frederic Lecaille
b3ac1a636c MINOR: quic: implement all remaining callbacks for OpenSSL 3.5 QUIC API
The quic_conn struct is modified for two reasons. The first one is to store
the encoded version of the local tranport parameter as this is done for
USE_QUIC_OPENSSL_COMPAT. Indeed, the local transport parameter "should remain
valid until after the parameters have been sent" as mentionned by
SSL_set_quic_tls_cbs(3) manual. In our case, the buffer is a static buffer
attached to the quic_conn object. qc_ssl_set_quic_transport_params() function
whose role is to call SSL_set_tls_quic_transport_params() (aliased by
SSL_set_quic_transport_params() to set these local tranport parameter into
the TLS stack from the buffer attached to the quic_conn struct.

The second quic_conn struct modification is the addition of the  new ->prot_level
(SSL protection level) member added to the quic_conn struct to store "the most
recent write encryption level set via the OSSL_FUNC_SSL_QUIC_TLS_yield_secret_fn
callback (if it has been called)" as mentionned by SSL_set_quic_tls_cbs(3) manual.

This patches finally implements the five remaining callacks to make the haproxy
QUIC implementation work.

OSSL_FUNC_SSL_QUIC_TLS_crypto_send_fn() (ha_quic_ossl_crypto_send) is easy to
implement. It calls ha_quic_add_handshake_data() after having converted
qc->prot_level TLS protection level value to the correct ssl_encryption_level_t
(boringSSL API/quictls) value.

OSSL_FUNC_SSL_QUIC_TLS_crypto_recv_rcd_fn() (ha_quic_ossl_crypto_recv_rcd())
provide the non-contiguous addresses to the TLS stack, without releasing
them.

OSSL_FUNC_SSL_QUIC_TLS_crypto_release_rcd_fn() (ha_quic_ossl_crypto_release_rcd())
release these non-contiguous buffer relying on the fact that the list of
encryption level (qc->qel_list) is correctly ordered by SSL protection level
secret establishements order (by the TLS stack).

OSSL_FUNC_SSL_QUIC_TLS_yield_secret_fn() (ha_quic_ossl_got_transport_params())
is a simple wrapping function over ha_quic_set_encryption_secrets() which is used
by boringSSL/quictls API.

OSSL_FUNC_SSL_QUIC_TLS_got_transport_params_fn() (ha_quic_ossl_got_transport_params())
role is to store the peer received transport parameters. It simply calls
quic_transport_params_store() and set them into the TLS stack calling
qc_ssl_set_quic_transport_params().

Also add some comments for all the OpenSSL 3.5 QUIC API callbacks.

This patch have no impact on the other use of QUIC API provided by the others TLS
stacks.
2025-05-20 15:00:06 +02:00
Frederic Lecaille
dc6a3c329a MINOR: quic: Allow the use of the new OpenSSL 3.5.0 QUIC TLS API (to be completed)
This patch allows the use of the new OpenSSL 3.5.0 QUIC TLS API when it is
available and detected at compilation time. The detection relies on the presence of the
OSSL_FUNC_SSL_QUIC_TLS_CRYPTO_SEND macro from openssl-compat.h. Indeed this
macro is defined by OpenSSL since 3.5.0 version. It is not defined by quictls.
This helps in distinguishing these two TLS stacks. When the detection succeeds,
HAVE_OPENSSL_QUIC is also defined by openssl-compat.h. Then, this is this new macro
which is used to detect the availability of the new OpenSSL 3.5.0 QUIC TLS API.

Note that this detection is done only if USE_QUIC_OPENSSL_COMPAT is not asked.
So, USE_QUIC_OPENSSL_COMPAT and HAVE_OPENSSL_QUIC are exclusive.

At the same location, from openssl-compat.h, ssl_encryption_level_t enum is
defined. This enum was defined by quictls and expansively used by the haproxy
QUIC implementation. SSL_set_quic_transport_params() is replaced by
SSL_set_quic_tls_transport_params. SSL_set_quic_early_data_enabled() (quictls) is also replaced
by SSL_set_quic_tls_early_data_enabled() (OpenSSL). SSL_quic_read_level() (quictls)
is not defined by OpenSSL. It is only used by the traces to log the current
TLS stack decryption level (read). A macro makes it return -1 which is an
usused values.

The most of the differences between quictls and OpenSSL QUI APIs are in quic_ssl.c
where some callbacks must be defined for these two APIs. This is why this
patch modifies quic_ssl.c to define an array of OSSL_DISPATCH structs: <ha_quic_dispatch>.
Each element of this arry defines a callback. So, this patch implements these
six callabcks:

  - ha_quic_ossl_crypto_send()
  - ha_quic_ossl_crypto_recv_rcd()
  - ha_quic_ossl_crypto_release_rcd()
  - ha_quic_ossl_yield_secret()
  - ha_quic_ossl_got_transport_params() and
  - ha_quic_ossl_alert().

But at this time, these implementations which must return an int return 0 interpreted
as a failure by the OpenSSL QUIC API, except for ha_quic_ossl_alert() which
is implemented the same was as for quictls. The five remaining functions above
will be implemented by the next patches to come.

ha_quic_set_encryption_secrets() and ha_quic_add_handshake_data() have been moved
to be defined for both quictls and OpenSSL QUIC API.

These callbacks are attached to the SSL objects (sessions) calling qc_ssl_set_cbs()
new function. This latter callback the correct function to attached the correct
callbacks to the SSL objects (defined by <ha_quic_method> for quictls, and
<ha_quic_dispatch> for OpenSSL).

The calls to SSL_provide_quic_data() and SSL_process_quic_post_handshake()
have been also disabled. These functions are not defined by OpenSSL QUIC API.
At this time, the functions which call them are still defined when HAVE_OPENSSL_QUIC
is defined.
2025-05-20 15:00:06 +02:00
Willy Tarreau
411b04c7d3 IMPORT: slz: use a better hash for machines with a fast multiply
The current hash involves 3 simple shifts and additions so that it can
be mapped to a multiply on architecures having a fast multiply. This is
indeed what the compiler does on x86_64. A large range of values was
scanned to try to find more optimal factors on machines supporting such
a fast multiply, and it turned out that new factor 0x1af42f resulted in
smoother hashes that provided on average 0.4% better compression on both
the Silesia corpus and an mbox file composed of very compressible emails
and uncompressible attachments. It's even slightly better than CRC32C
while being faster on Skylake. This patch enables this factor on archs
with a fast multiply.

This is slz upstream commit 82ad1e75c13245a835c1c09764c89f2f6e8e2a40.
2025-05-16 16:43:53 +02:00
Willy Tarreau
0a91c6dcae BUILD: debug: mark ha_crash_now() as attribute(noreturn)
Building on MIPS64 with clang16 incorrectly reports some uninitialized
value warnings in stats-proxy.c due to some calls to ABORT_NOW() where
the compiler didn't know the code wouldn't return. Let's properly mark
the function as noreturn, and take this opportunity for also marking it
unused to avoid possible warnings depending on the build options (if
ABORT_NOW is not used). No backport needed though it will not harm.
2025-05-16 16:43:53 +02:00
Christopher Faulet
f45a632bad BUG/MEDIUM: stconn: Disable 0-copy forwarding for filters altering the payload
It is especially a problem with Lua filters, but it is important to disable
the 0-copy forwarding if a filter alters the payload, or at least to be able
to disable it. While the filter is registered on the data filtering, it is
not an issue (and it is the common case) because, there is now way to
fast-forward data at all. But it may be an issue if a filter decides to
alter the payload and to unregister from data filtering. In that case, the
0-copy forwarding can be re-enabled in a hardly precdictable state.

To fix the issue, a SC flags was added to do so. The HTTP compression filter
set it and lua filters too if the body length is changed (via
HTTPMessage.set_body_len()).

Note that it is an issue because of a bad design about the HTX. Many info
about the message are stored in the HTX structure itself. It must be
refactored to move several info to the stream-endpoint descriptor. This
should ease modifications at the stream level, from filter or a TCP/HTTP
rules.

This should be backported as far as 3.0. If necessary, it may be backported
on lower versions, as far as 2.6. In that case, it must be reviewed and
adapted.
2025-05-16 15:11:37 +02:00
Christopher Faulet
a3940614c2 BUG/MEDIUM: mux-spop: Remove frame parsing states from the SPOP connection state
SPOP_CS_FRAME_H and SPOP_CS_FRAME_P states, that were used to handle frame
parsing, were removed. The demux process now relies on the demux stream ID
to know if it is waiting for the frame header or the frame
payload. Concretly, when the demux stream ID is not set (dsi == -1), the
demuxer is waiting for the next frame header. Otherwise (dsi >= 0), it is
waiting for the frame payload. It is especially important to be able to
properly handle DISCONNECT frames sent by the agents.

SPOP_CS_RUNNING state is introduced to know the hello handshake was finished
and the SPOP connection is able to open SPOP streams and exchange NOTIFY/ACK
frames with the agents.

It depends on the following fixes:

  * MINOR: mux-spop: Don't set SPOP connection state to FRAME_H after ACK parsing
  * BUG/MINOR: mux-spop: Make the demux stream ID a signed integer

This change will be mandatory for the next fix. It must be backported to 3.1
with the commits above.
2025-05-13 19:51:40 +02:00
Willy Tarreau
e049bd00ab MEDIUM: config: change default limits to 1024 threads and 32 groups
A test run on a dual-socket EPYC 9845 (2x160 cores) showed that we'll
be facing new limits during the lifetime of 3.2 with our current 16
groups and 256 threads max:

  $ cat test.cfg
  global
      cpu-policy perforamnce

  $ ./haproxy -dc -c -f test.cfg
  ...
  Thread CPU Bindings:
    Tgrp/Thr  Tid        CPU set
    1/1-32    1-32       32: 0-15,320-335
    2/1-32    33-64      32: 16-31,336-351
    3/1-32    65-96      32: 32-47,352-367
    4/1-32    97-128     32: 48-63,368-383
    5/1-32    129-160    32: 64-79,384-399
    6/1-32    161-192    32: 80-95,400-415
    7/1-32    193-224    32: 96-111,416-431
    8/1-32    225-256    32: 112-127,432-447

Raising the default limit to 1024 threads and 32 groups is sufficient
to buy us enough margin for a long time (hopefully, please don't laugh,
you, reader from the future):

  $ ./haproxy -dc -c -f test.cfg
  ...
  Thread CPU Bindings:
    Tgrp/Thr  Tid        CPU set
    1/1-32    1-32       32: 0-15,320-335
    2/1-32    33-64      32: 16-31,336-351
    3/1-32    65-96      32: 32-47,352-367
    4/1-32    97-128     32: 48-63,368-383
    5/1-32    129-160    32: 64-79,384-399
    6/1-32    161-192    32: 80-95,400-415
    7/1-32    193-224    32: 96-111,416-431
    8/1-32    225-256    32: 112-127,432-447
    9/1-32    257-288    32: 128-143,448-463
    10/1-32   289-320    32: 144-159,464-479
    11/1-32   321-352    32: 160-175,480-495
    12/1-32   353-384    32: 176-191,496-511
    13/1-32   385-416    32: 192-207,512-527
    14/1-32   417-448    32: 208-223,528-543
    15/1-32   449-480    32: 224-239,544-559
    16/1-32   481-512    32: 240-255,560-575
    17/1-32   513-544    32: 256-271,576-591
    18/1-32   545-576    32: 272-287,592-607
    19/1-32   577-608    32: 288-303,608-623
    20/1-32   609-640    32: 304-319,624-639

We can change this default now because it has no functional effect
without any configured cpu-policy, so this will only be an opt-in
and it's better to do it now than to have an effect during the
maintenance phase. A tiny effect is a doubling of the number of
pool buckets and stick-table shards internally, which means that
aside slightly reducing contention in these areas, a dump of tables
can enumerate keys in a different order (hence the adjustment in the
vtc).

The only really visible effect is a slightly higher static memory
consumption (29->35 MB on a small config), but that difference
remains even with 50k servers so that's pretty much acceptable.

Thanks to Erwan Velu for the quick tests and the insights!
2025-05-13 18:15:33 +02:00
Amaury Denoyelle
f3b9676416 MINOR: quic: display stream age
Add a field to save the creation date of qc_stream_desc instance. This
is useful to display QUIC stream age in "show quic stream" output.
2025-05-13 15:44:22 +02:00
Amaury Denoyelle
1ccede211c MINOR: mux-quic: account Rx data per stream
Add counters to measure Rx buffers usage per QCS. This reused the newly
defined bdata_ctr type already used for Tx accounting.

Note that for now, <tot> value of bdata_ctr is not used. This is because
it is not easy to account for data accross contiguous buffers.

These values are displayed both on log/traces and "show quic" output.
2025-05-13 15:41:51 +02:00
Amaury Denoyelle
a1dc9070e7 MINOR: quic: account Tx data per stream
Add accounting at qc_stream_desc level to be able to report the number
of allocated Tx buffers and the sum of their data. This represents data
ready for emission or already emitted and waiting on ACK.

To simplify this accounting, a new counter type bdata_ctr is defined in
quic_utils.h. This regroups both buffers and data counter, plus a
maximum on the buffer value.

These values are now displayed on QCS info used both on logline and
traces, and also on "show quic" output.
2025-05-13 15:41:41 +02:00
Willy Tarreau
ebab479cdf MINOR: http: add a function to validate characters of :authority
As discussed here:
  https://github.com/httpwg/http2-spec/pull/936
  https://github.com/haproxy/haproxy/issues/2941

It's important to take care of some special characters in the :authority
pseudo header before reassembling a complete URI, because after assembly
it's too late (e.g. the '/').

This patch adds a specific function which was checks all such characters
and their ranges on an ist, and benefits from modern compilers
optimizations that arrange the comparisons into an evaluation tree for
faster match. That's the version that gave the most consistent performance
across various compilers, though some hand-crafted versions using bitmaps
stored in register could be slightly faster but super sensitive to code
ordering, suggesting that the results might vary with future compilers.
This one takes on average 1.2ns per character at 3 GHz (3.6 cycles per
char on avg). The resulting impact on H2 request processing time (small
requests) was measured around 0.3%, from 6.60 to 6.618us per request,
which is a bit high but remains acceptable given that the test only
focused on req rate.

The code was made usable both for H2 and H3.
2025-05-12 18:02:47 +02:00
William Lallemand
96b1f1fd26 MINOR: tools: ha_freearray() frees an array of string
ha_freearray() is a new function which free() an array of strings
terminated by a NULL entry.

The pointer to the array will be free and set to NULL.
2025-05-09 19:12:05 +02:00
Willy Tarreau
8a96216847 MEDIUM: sock-inet: re-check IPv6 connectivity every 30s
IPv6 connectivity might start off (e.g. network not fully up when
haproxy starts), so for features like resolvers, it would be nice to
periodically recheck.

With this change, instead of having the resolvers code rely on a variable
indicating connectivity, it will now call a function that will check for
how long a connectivity check hasn't been run, and will perform a new one
if needed. The age was set to 30s which seems reasonable considering that
the DNS will cache results anyway. There's no saving in spacing it more
since the syscall is very check (just a connect() without any packet being
emitted).

The variables remain exported so that we could present them in show info
or anywhere else.

This way, "dns-accept-family auto" will now stay up to date. Warning
though, it does perform some caching so even with a refreshed IPv6
connectivity, an older record may be returned anyway.
2025-05-09 15:45:44 +02:00
Willy Tarreau
1404f6fb7b DEBUG: pools: add a new integrity mode "backup" to copy the released area
This way we can preserve the entire contents of the released area for
later inspection. This automatically enables comparison at reallocation
time as well (like "integrity" does). If used in combination with
integrity, the comparison is disabled but the check of non-corruption
of the area mangled by integrity is still operated.
2025-05-09 14:57:00 +02:00
William Lallemand
e7574cd5f0 MINOR: acme: add the global option 'acme.scheduler'
The automatic scheduler is useful but sometimes you don't want to use,
or schedule manually.

This patch adds an 'acme.scheduler' option in the global section, which
can be set to either 'auto' or 'off'. (auto is the default value)

This also change the ouput of the 'acme status' command so it does not
shows scheduled values. The state will be 'Stopped' instead of
'Scheduled'.
2025-05-09 14:00:39 +02:00
Willy Tarreau
0ae14beb2a DEBUG: pool: permit per-pool UAF configuration
The new MEM_F_UAF flag can be set just after a pool's creation to make
this pool UAF for debugging purposes. This allows to maintain a better
overall performance required to reproduce issues while still having a
chance to catch UAF. It will only be used by developers who will manually
add it to areas worth being inspected, though.
2025-05-09 13:59:02 +02:00
Amaury Denoyelle
294bf26c06 MINOR: quic: extend return value during TP parsing
Extend API used for QUIC transport parameter decoding. This is done via
the introduction of a dedicated enum to report the various error
condition detected. No functional change should occur with this patch,
as the only returned code is QUIC_TP_DEC_ERR_TRUNC, which results in the
connection closure via a TLS alert.

This patch will be necessary to properly reject transport parameters
with the proper CONNECTION_CLOSE error code. As such, it should be
backported up to 2.6 with the following series.
2025-05-07 15:19:52 +02:00
Willy Tarreau
feaac66b5e DEBUG: threads: merge successive idempotent lock operations in history
In order to make the lock history a bit more useful, let's try to merge
adjacent lock/unlock sequences that don't change anything for other
threads. For this we can replace the last unlock with the new operation
on the same label, and even just not store it if it was the same as the
one before the unlock, since in the end it's the same as if the unlock
had not been done.

Now loops that used to be filled with "R:LISTENER U:LISTENER" show more
useful info such as:

  S:IDLE_CONNS U:IDLE_CONNS S:PEER U:PEER S:IDLE_CONNS U:IDLE_CONNS R:LISTENER U:LISTENER
  U:STK_TABLE W:STK_SESS U:STK_SESS R:STK_TABLE U:STK_TABLE W:STK_SESS U:STK_SESS R:STK_TABLE
  R:STK_TABLE U:STK_TABLE W:STK_SESS U:STK_SESS W:STK_TABLE_UPDT U:STK_TABLE_UPDT S:PEER

It's worth noting that it can sometimes induce confusion when recursive
locks of the same label are used (a few exist on peers or stick-tables),
as in such a case the two operations would be needed. However these ones
are already undebuggable, so instead they will just have to be renamed
to make sure they use a distinct label.
2025-05-05 18:36:12 +02:00
Willy Tarreau
743dce95d2 DEBUG: threads: don't keep lock label "OTHER" in the per-thread history
Most threads are filled with "R:OTHER U:OTHER" in their history. Since
anything non-important can use other it's not observable but it pollutes
the history. Let's just drop OTHER entirely during the recording.
2025-05-05 18:10:57 +02:00
William Lallemand
878a3507df BUILD: acme: need HAVE_ASN1_TIME_TO_TM
Restrict the build of the ACME feature to libraries which provide
ASN1_TIME_to_tm() function.
2025-05-02 16:01:32 +02:00
William Lallemand
626de9538e MINOR: ssl: add function to extract X509 notBefore date in time_t
Add x509_get_notbefore_time_t() which returns the notBefore date in
time_t format.
2025-05-02 16:01:32 +02:00
Olivier Houchard
388539faa3 MEDIUM: stick-tables: defer adding updates to a tasklet
There is a lot of contention trying to add updates to the tree. So
instead of trying to add the updates to the tree right away, just add
them to a mt-list (with one mt-list per thread group, so that the
mt-list does not become the new point of contention that much), and
create a tasklet dedicated to adding updates to the tree, in batchs, to
avoid keeping the update lock for too long.
This helps getting stick tables perform better under heavy load.
2025-05-02 15:27:55 +02:00
Olivier Houchard
faa18c1ad8 BUG/MEDIUM: quic: Let it be known if the tasklet has been released.
quic_conn_release() may, or may not, free the tasklet associated with
the connection. So make it return 1 if it was, and 0 otherwise, so that
if it was called from the tasklet handler itself, the said handler can
act accordingly and return NULL if the tasklet was destroyed.
This should be backported if 9240cd4a27
is backported.
2025-05-02 11:09:28 +02:00
William Lallemand
18d2371e0d MINOR: acme: change the default max retries to 5
Change the default max retries constant to 5 instead of 3.
Some servers can be be a bit long to execute the challenge.
2025-05-02 09:40:12 +02:00
Olivier Houchard
b138eab302 BUG/MEDIUM: connections: Report connection closing in conn_create_mux()
Add an extra parametre to conn_create_mux(), "closed_connection".
If a pointer is provided, then let it know if the connection was closed.
Callers have no way to determine that otherwise, and we need to know
that, at least in ssl_sock_io_cb(), as if the connection was closed we
need to return NULL, as the tasklet was free'd, otherwise that can lead
to memory corruption and crashes.
This should be backported if 9240cd4a27
is backported too.
2025-04-30 17:17:36 +02:00
Olivier Houchard
4abfade371 MINOR: tasks: Remove unused tasklet_remove_from_tasklet_list
Remove tasklet_remove_from_tasklet_list, as the function hasn't been
used for a long time, and there is little reason to keep it.
2025-04-30 17:09:06 +02:00
Olivier Houchard
2bab043c8c MEDIUM: tasks: Remove TASK_IN_LIST and use TASK_QUEUED instead.
TASK_QUEUED was used to mean "the task has been scheduled to run",
TASK_IN_LIST was used to mean "the tasklet has been scheduled to run",
remove TASK_IN_LIST and just use TASK_QUEUED for tasklets instead.

This commit is just cosmetic, and should not have any impact.
2025-04-30 17:08:57 +02:00
William Lallemand
563ca94ab8 MINOR: ssl/cli: "acme ps" shows the acme tasks
Implement a way to display the running acme tasks over the CLI.

It currently only displays a "Running" status with the certificate name
and the acme section from the configuration.

The displayed running tasks are limited to the size of a buffer for now,
it will require a backref list later to be called multiple times to
resume the list.
2025-04-30 17:12:50 +02:00
Aurelien DARRAGON
97363015a5 MINOR: add hlua_yield_asap() helper
When called, this function will try to enforce a yield (if available) as
soon as possible. Indeed, automatic yield is already enforced every X
Lua instructions. However, there may be some cases where we know after
running heavy operation that we should yield already to avoid taking too
much CPU at once.

This is what this function offers, instead of asking the user to manually
yield using "core.yield()" from Lua itself after using an expensive
Lua method offered by haproxy, we can directly enforce the yield without
the need to do it in the Lua script.
2025-04-30 17:00:27 +02:00
Amaury Denoyelle
df50d3e39f MINOR: mux-quic: limit emitted MSD frames count per qcs
The previous commit has implemented a new calcul method for
MAX_STREAM_DATA frame emission. Now, a frame may be emitted as soon as a
buffer was consumed by a QCS instance.

This will probably increase the number of MAX_STREAM_DATA frame
emission. It may even cause a series of frame emitted for the same
stream with increasing values under high load, which is completely
unnecessary.

To improve this, limit the number of MAX_STREAM_DATA frames built to one
per QCS instance. This is implemented by storing a reference to this
frame in QCS structure via a new member <tx.msd_frm>.

Note that to properly reset QCS msd_frm member, emission of flow-control
frames have been changed. Now, each frame is emitted individually. On
one side, it is better as it prevent to emit frames related to different
streams in a single datagram, which is not desirable in case of packet
loss. However, this can also increase sendto() syscall invocation.
2025-04-30 16:08:47 +02:00
Amaury Denoyelle
14a3fb679f MEDIUM: mux-quic: increase flow-control on each bufsize
Recently, QCS Rx allocation buffer method has been improved. It is now
possible to allocate multiple buffers per QCS instances, which was
necessary to improve HTTP/3 POST throughput.

However, a limitation remained related to the emission of
MAX_STREAM_DATA. These frames are only emitted once at least half of the
receive capacity has been consumed by its QCS instance. This may be too
restrictive when a client need to upload a large payload.

Improve this by adjusting MAX_STREAM_DATA allocation. If QCS capacity is
still limited to 1 or 2 buffers max, the old calcul is still used. This
is necessary when user has limited upload throughput via their
configuration. If QCS capacity is more than 2 buffers, a new frame is
emitted if at least a buffer was consumed.

This patch has reduced number of STREAM_DATA_BLOCKED frames received in
POST tests with some specific clients.
2025-04-30 16:08:47 +02:00
Remi Tricot-Le Breton
047fb37b19 MINOR: Add 'conn' param to ssl_sock_chose_sni_ctx
This is only useful in the traces, the conn parameter won't be used
otherwise.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
6519cec2ed MINOR: ssl: Add traces about sigalg extension parsing in clientHello callback
We had to parse the sigAlg extension by hand in order to properly select
the certificate used by the SSL frontends. These traces allow to dump
the allowed sigAlg list sent by the client in its clientHello.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
105c1ca139 MINOR: ssl: Add traces to the switchctx callback
This callback allows to pick the used certificate on an SSL frontend.
The certificate selection is made according to the information sent by
the client in the clientHello. The traces that were added will allow to
better understand what certificate was chosen and why. It will also warn
us if the chosen certificate was the default one.
The actual certificate parsing happens in ssl_sock_chose_sni_ctx. It's
in this function that we actually get the filename of the certificate
used.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
dbdd0630e1 MINOR: ssl: Add ocsp stapling callback traces
If OCSP stapling fails because of a missing or invalid OCSP response we
used to silently disable stapling for the given session. We can now know
a bit more what happened regarding OCSP stapling.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
0fb05540b2 MINOR: ssl: Add traces to verify callback
Those traces allow to know which errors were met during certificate
chain validation as well as which ones were ignored.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
4a8fa28e36 MINOR: ssl: Add traces around SSL_do_handshake call
Those traces dump information about the multiple SSL_do_handshake calls
(renegotiation and regular call). Some errors coud also be dumped in
case of rejected early data.
Depending on the chosen verbosity, some information about the current
handshake can be dumped as well (servername, tls version, chosen cipher
for instance).
In case of failed handshake, the error codes and messages will also be
dumped in the log to ease debugging.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
9f146bdab3 MINOR: ssl: Add traces to ssl_sock_io_cb function
Add new SSL traces.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
475bb8d843 MINOR: ssl: Add traces to recv/send functions
Those traces will allow to identify sessions on which early data is used
as well as some forcefully closed connections.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
9bb8d6dcd1 MINOR: ssl: Add traces to ssl init/close functions
Add a dedicated trace for some unlikely allocation failures and async
errors. Those traces will ostly be used to identify the start and end of
a given SSL connection.
2025-04-30 11:11:26 +02:00
Remi Tricot-Le Breton
08e40f4589 MINOR: Add "sigalg" to "sigalg name" helper function
This function can be used to convert a TLSv1.3 sigAlg entry (2bytes)
from the signature_agorithms client hello extension into a string.

In order to ease debugging, some TLSv1.2 combinations can also be
dumped. In TLSv1.2 those signature algorithms pairs were built out of a
one byte signature identifier combined to a one byte hash identifier.
In TLSv1.3 those identifiers are two bytes blocs that must be treated as
such.
2025-04-30 11:11:26 +02:00
Willy Tarreau
566b384e4e MINOR: tools: make my_strndup() take a size_t len instead of and int
In relation to issue #2954, it appears that turning some size_t length
calculations to the int that uses my_strndup() upsets coverity a bit.
Instead of dealing with such warnings each time, better address it at
the root. An inspection of all call places show that the size passed
there is always positive so we can safely use an unsigned type, and
size_t will always suit it like for strndup() where it's available.
2025-04-30 05:17:43 +02:00
Aurelien DARRAGON
5288b39011 BUG/MINOR: dns: prevent ds accumulation within dss
when dns session callback (dns_session_release()) is called upon error
(ie: when some pending queries were not sent), we try our best to
re-create the applet in order to preserve the pending queries and give
them a chance to be retried. This is done at the end of
dns_session_release().

However, doing so exposes to an issue: if the error preventing queries
from being sent is still encountered over and over the dns session could
stay there indefinitely. Meanwhile, other dns sessions may be created on
the same dns_stream_server periodically. If previous failing dns sessions
don't terminate but we also keep creating new ones, we end up accumulating
failing sessions on a given dns_stream_server, which can eventually cause
ressource shortage.

This issue was found when trying to address ("BUG/MINOR: dns: add tempo
between 2 connection attempts for dns servers")

To fix it, we track the number of failed consecutive sessions for a given
dns server. When we reach the threshold (set to 100), we consider that the
link to the dns server is broken (at least temporarily) and we force
dns_session_new() to fail, so that we stop creating new sessions until one
of the existing one eventually succeeds.

A workaround for this fix consists in setting the "maxconn" parameter on
nameserver directive (under resolvers section) to a reasonnable value so
that no more than "maxconn" sessions may co-exist on the same server at
a given time.

This may be backported to all stable versions.
("CLEANUP: dns: remove unused dns_stream_server struct member") may be
backported to ease the backport.
2025-04-29 21:20:54 +02:00
Aurelien DARRAGON
14ebe95a10 CLEANUP: dns: remove unused dns_stream_server struct member
dns_stream_server "max_slots" is unused, let's get rid of it
2025-04-29 21:20:44 +02:00
Aurelien DARRAGON
1ced5ef2fd MINOR: applet: add appctx_schedule() macro
Just like task_schedule() but for applets to wakeup an applet at a
specific time, leverages _task_schedule() internally
2025-04-29 21:19:37 +02:00
William Lallemand
5555926fdd MEDIUM: acme: use a map to store tokens and thumbprints
The stateless mode which was documented previously in the ACME example
is not convenient for all use cases.

First, when HAProxy generates the account key itself, you wouldn't be
able to put the thumbprint in the configuration, so you will have to get
the thumbprint and then reload.
Second, in the case you are using multiple account key, there are
multiple thumbprint, and it's not easy to know which one you want to use
when responding to the challenger.

This patch allows to configure a map in the acme section, which will be
filled by the acme task with the token corresponding to the challenge,
as the key, and the thumbprint as the value. This way it's easy to reply
the right thumbprint.

Example:
    http-request return status 200 content-type text/plain lf-string "%[path,field(-1,/)].%[path,field(-1,/),map(virt@acme)]\n" if { path_beg '/.well-known/acme-challenge/' }
2025-04-29 16:15:55 +02:00
Amaury Denoyelle
0f9b3daf98 MEDIUM: quic: limit global Tx memory
Define a new settings tune.quic.frontend.max-tot-window. It contains a
size argument which can be used to set a limit on the sum of all QUIC
connections congestion window. This is applied both on
quic_cc_path_set() and quic_cc_path_inc().

Note that this limitation cannot reduce a congestion window more than
the minimal limit which is set to 2 datagrams.
2025-04-29 15:19:32 +02:00
Amaury Denoyelle
e841164a44 MINOR: quic: account for global congestion window
Use the newly defined cshared type to account for the sum of congestion
window of every QUIC connection. This value is stored in global counter
quic_mem_global defined in proto_quic module.
2025-04-29 15:19:32 +02:00
Amaury Denoyelle
3891456d20 MINOR: thread: define cshared type
Define a new type "struct cshared". This can be used as a tool to
manipulate a global counter with thread-safety ensured. Each thread
would declare its thread-local cshared type, which would point to a
global counter.

Each thread can then add/substract value to their owned thread-local
cshared instance via cshared_add(). If the difference exceed a
configured limit, either positively or negatively, the global counter is
updated and thread-local instance is reset to 0. Each thread can safely
read the global counter value using cshared_read().
2025-04-29 15:10:06 +02:00
Amaury Denoyelle
7bad88c35c BUG/MINOR: quic: ensure cwnd limits are always enforced
Congestion window is limit by a minimal and maximum values which can
never be exceeded. Min value is hardcoded to 2 datagrams as recommended
by the specification. Max value is specified via haproxy configuration.

These values must be respected each time the congestion window size is
adjusted. However, in some rare occasions, limit were not always
enforced. Fix this by implementing wrappers to set or increment the
congestion window. These functions ensure limits are always applied
after the operation.

Additionnally, wrappers also ensure that if window reached a new maximum
value, it is saved in <cwnd_last_max> field.

This should be backported up to 2.6, after a brief period of
observation.
2025-04-29 15:10:06 +02:00
Amaury Denoyelle
2eb1b0cd96 MINOR: quic: rename min/max fields for congestion window algo
There was some possible confusion between fields related to congestion
window size min and max limit which cannot be exceeded, and the maximum
value previously reached by the window.

Fix this by adopting a new naming scheme. Enforced limit are now renamed
<limit_max>/<limit_min>, while the previously reached max value is
renamed <cwnd_last_max>.

This should be backported up to 3.1.
2025-04-29 15:10:06 +02:00
Willy Tarreau
2cdb3cb91e MINOR: tcp: add support for setting TCP_NOTSENT_LOWAT on both sides
TCP_NOTSENT_LOWAT is very convenient as it indicates when to report
EAGAIN on the sending side. It takes a margin on top of the estimated
window, meaning that it's no longer needed to store too many data in
socket buffers. Instead there's just enough to fill the send window
and a little bit of margin to cover the scheduling time to restart
sending. Experiments on a 100ms network have shown a 10-fold reduction
in the memory used by socket buffers by just setting this value to
tune.bufsize, without noticing any performance degradation. Theoretically
the responsiveness on multiplexed protocols such as H2 should also be
improved.
2025-04-29 12:13:42 +02:00
Willy Tarreau
f25b4abc9b MINOR: cli: split APPCTX_CLI_ST1_PROMPT into two distinct flags
The CLI's "prompt" command toggles two distinct things:
  - displaying or hiding the prompt at the beginning of the line
  - single-command vs interactive mode

These are two independent concepts and the prompt mode doesn't
always cope well with tools that would like to upload data without
having to read the prompt on return. Also, the master command line
works in interactive mode by default with no prompt, which is not
consistent (and not convenient for tools). So let's start by splitting
the bit in two, and have a new APPCTX_CLI_ST1_INTER flag dedicated
to the interactive mode. For now the "prompt" command alone continues
to toggle the two at once.
2025-04-28 20:21:06 +02:00
Willy Tarreau
5ac280f2a7 MINOR: compiler: add more macros to detect macro definitions
We add __equals_0(NAME) which is only true if NAME is defined as zero,
and __def_as_empty(NAME) which is only true if NAME is defined as an
empty string.
2025-04-28 20:21:06 +02:00
Willy Tarreau
12c7189bc8 MEDIUM: thread: set DEBUG_THREAD to 1 by default
Setting DEBUG_THREAD to 1 allows recording the lock history for each
thread. Tests have shown that (as predicted) the cost of updating a
single thread-local variable is not perceptible in the noise, especially
when compared to the cost of obtaining a lock. Since this can provide
useful value when debugging deadlocks, let's enable it by default when
threads are enabled.
2025-04-28 16:50:34 +02:00
Willy Tarreau
d9a659ed96 MINOR: threads/cli: display the lock history on "show threads"
This will display the lock labels and modes for each non-empty step
at the end of "show threads" when these are defined. This allows to
emit up to the last 8 locking operation for each thread on 64 bit
machines.
2025-04-28 16:50:34 +02:00
Willy Tarreau
b8a1c2380b MEDIUM: threads: keep history of taken locks with DEBUG_THREAD > 0
by only storing a word in each thread context, we can keep the history
of all taken/dropped locks by label. This is expected to be very cheap
and to permit to store up to 8 consecutive lock operations in 64 bits.
That should significantly help detect recursive locks as well as figure
what thread was likely to hinder another one waiting for a lock.

For now we only store the final state of the lock, we don't store the
attempt to get it. It's just a matter of space since we already need
4 ops (rd,sk,wr,un) which take 2 bits, leaving max 64 labels. We're
already around 45. We could also multiply by 5 and still keep 8 bits
total per lock, that would limit us to 51 locks max. It seems that
most of the time if we get a watchdog panic, anyway the victim thread
will be perfectly located so that we don't need a specific value for
this. Another benefit is that we perform a single memory write per
lock.
2025-04-28 16:50:34 +02:00
Willy Tarreau
23371b3e7c MINOR: threads: turn the full lock debugging to DEBUG_THREAD=2
At level 1 it now does nothing. This is reserved for some subsequent
patches which will implement lighter debugging.
2025-04-28 16:50:34 +02:00
Willy Tarreau
903a6b14ef MINOR: threads: prepare DEBUG_THREAD to receive more values
We now default the value to zero and make sure all tests properly take
care of values above zero. This is in preparation for supporting several
degrees of debugging.
2025-04-28 16:50:34 +02:00
William Lallemand
bb768b3e26 MEDIUM: acme: use Retry-After value for retries
Parse the Retry-After header in response and store it in order to use
the value as the next delay for the next retry, fallback to 3s if the
value couldn't be parse or does not exist.
2025-04-24 20:14:47 +02:00
Willy Tarreau
69b051d1dc MINOR: resolvers: add "dns-accept-family auto" to rely on detected IPv6
Instead of always having to force IPv4 or IPv6, let's now also offer
"auto" which will only enable IPv6 if the system has a default gateway
for it. This means that properly configured dual-stack systems will
default to "ipv4,ipv6" while those lacking a gateway will only use
"ipv4". Note that no real connectivity test is performed, so firewalled
systems may still get it wrong and might prefer to rely on a manual
"ipv4" assignment.
2025-04-24 17:52:28 +02:00
Willy Tarreau
5d41d476f3 MINOR: sock-inet: detect apparent IPv6 connectivity
In order to ease dual-stack deployments, we could at least try to
check if ipv6 seems to be reachable. For this we're adding a test
based on a UDP connect (no traffic) on port 53 to the base of
public addresses (2001::) and see if the connect() is permitted,
indicating that the routing table knows how to reach it, or fails.
Based on this result we're setting a global variable that other
subsystems might use to preset their defaults.
2025-04-24 17:52:28 +02:00
Willy Tarreau
2c46c2c042 MINOR: resolvers: add command-line argument -4 to force IPv4-only DNS
In order to ease troubleshooting and testing, the new "-4" command line
argument enforces queries and processing of "A" DNS records only, i.e.
those representing IPv4 addresses. This can be useful when a host lack
end-to-end dual-stack connectivity. This overrides the global
"dns-accept-family" directive and is equivalent to value "ipv4".
2025-04-24 17:52:28 +02:00
Willy Tarreau
940fa19ad8 MEDIUM: resolvers: add global "dns-accept-family" directive
By default, DNS resolvers accept both IPv4 and IPv6 addresses. This can be
influenced by the "resolve-prefer" keywords on server lines as well as the
family argument to the "do-resolve" action, but that is only a preference,
which does not block the other family from being used when it's alone. In
some environments where dual-stack is not usable, stumbling on an unreachable
IPv6-only DNS record can cause significant trouble as it will replace a
previous IPv4 one which would possibly have continued to work till next
request. The "dns-accept-family" global option permits to enforce usage of
only one (or both) address families. The argument is a comma-delimited list
of the following words:
  - "ipv4": query and accept IPv4 addresses ("A" records)
  - "ipv6": query and accept IPv6 addresses ("AAAA" records)

When a single family is used, no request will be sent to resolvers for the
other family, and any response for the othe family will be ignored. The
default value is "ipv4,ipv6", which effectively enables both families.
2025-04-24 17:52:28 +02:00
Christopher Faulet
29632bcabf CLEANUP: applet: Remove unsued rule pointer in appctx structure
Thanks to previous commits, the "rule" field in the appctx structure is no
longer used. So we can safely remove it.
2025-04-24 16:22:31 +02:00
Christopher Faulet
b734d7c156 MINOR: cli/applet: Move appctx fields only used by the CLI in a private context
There are several fields in the appctx structure only used by the CLI. To
make things cleaner, all these fields are now placed in a dedicated context
inside the appctx structure. The final goal is to move it in the service
context and add an API for cli commands to get a command coontext inside the
cli context.
2025-04-24 15:09:37 +02:00
Christopher Faulet
742dc01537 CLEANUP: applet: Update st0/st1 comment in appctx structure
Today, these states are used by almost all applets. So update the comments
of these fields.
2025-04-24 15:09:37 +02:00
Christopher Faulet
44ace9a1b7 MINOR: cli: Rename some CLI applet states to reflect recent refactoring
CLI_ST_GETREQ state was renamed into CLI_ST_PARSE_CMDLINE and CLI_ST_PARSEREQ
into CLI_ST_PROCESS_CMDLINE to reflect the real action performed in these
states.
2025-04-24 15:09:37 +02:00
Christopher Faulet
20ec1de214 MAJOR: cli: Refacor parsing and execution of pipelined commands
Before this patch, when pipelined commands were received, each command was
parsed and then excuted before moving to the next command. Pending commands
were not copied in the input buffer of the applet. The major issue with this
way to handle commands is the impossibility to consume inputs from commands
with an I/O handler, like "show events" for instance. It was working thanks
to a "bug" if such commands were the last one on the command line. But it
was impossible to use them followed by another command. And this prevents us
to implement any streaming support for CLI commands.

So we decided to refactor the command line parsing to have something similar
to a basic shell. Now an entire line is parsed, including the payload,
before starting commands execution. The command line is copied in a
dedicated buffer. "appctx->chunk" buffer is used for this purpose. It was an
unsed field, so it is safe to use it here. Once the command line copied, the
commands found on this line are executed. Because the applet input buffer
was flushed, any input can be safely consumed by the CLI applet and is
available for the command I/O handler. Thanks to this change, "show event
-w" command can be followed by a command. And in theory, it should be
possible to implement commands supporting input data streaming. For
instance, the Tetris like lua applet can be used on the CLI now.

Note that the payload, if any, is part of the command line and must be fully
received before starting the commands processing. It means there is still
the limitation to a buffer, but not only for the payload but for the whole
command line. The payload is still necessarily at the end of the command
line and is passed as argument to the last command. Internally, the
"appctx->cli_payload" field was introduced to point on the payload in the
command line buffer.

This patch is quite huge but it cannot easily be splitted. It should not
introduced significant changes.
2025-04-24 15:09:37 +02:00
Willy Tarreau
1af592c511 MINOR: stick-table: use a separate lock label for updates
Too many locks were sharing STK_TABLE_LOCK making it hard to analyze.
Let's split the already heavily used update lock.
2025-04-24 14:02:22 +02:00
William Lallemand
af73f98a3e MEDIUM: acme: rename "uri" into "directory"
Rename the "uri" option of the acme section into "directory".
2025-04-24 10:52:46 +02:00
William Lallemand
d700a242b4 MINOR: httpclient: add an "https" log-format
Add an experimental "https" log-format for the httpclient, it is not
used by the httpclient by default, but could be define in a customized
proxy.

The string is basically a httpslog, with some of the fields replaced by
their backend equivalent or - when not available:

"%ci:%cp [%tr] %ft -/- %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %[bc_err]/%[ssl_bc_err,hex]/-/-/%[ssl_bc_is_resumed] -/-/-"
2025-04-23 15:32:46 +02:00
Christopher Faulet
a56feffc6f CLEANUP: h1: Remove now useless h1_parse_cont_len_header() function
Since the commit "MINOR: hlua/h1: Use http_parse_cont_len_header() to parse
content-length value", this function is no longer used. So it can be safely
removed.
2025-04-22 16:14:47 +02:00
Christopher Faulet
5200203677 MINOR: proxy: Add options to drop HTTP trailers during message forwarding
In RFC9110, it is stated that trailers could be merged with the
headers. While it should be performed with a speicial care, it may be a
problem for some applications. To avoid any trouble with such applications,
two new options were added to drop trailers during the message forwarding.

On the backend, "http-drop-request-trailers" option can be enabled to drop
trailers from the requests before sending them to the server. And on the
frontend, "http-drop-response-trailers" option can be enabled to drop
trailers from the responses before sending them to the client. The options
can be defined in defaults sections and disabled with "no" keyword.

This patch should fix the issue #2930.
2025-04-22 16:14:46 +02:00
Christopher Faulet
044ef9b3d6 CLEANUP: Slightly reorder some proxy option flags to free slots
PR_O_TCPCHK_SSL and PR_O_CONTSTATS was shifted to free a slot. The idea is
to have 2 contiguous slots to be able to insert two new options.
2025-04-22 16:14:46 +02:00
Amaury Denoyelle
4309a6fbf8 BUG/MINOR: quic: do not crash on CRYPTO ncbuf alloc failure
To handle out-of-order received CRYPTO frames, a ncbuf instance is
allocated. This is done via the helper quic_get_ncbuf().

Buffer allocation was improperly checked. In case b_alloc() fails, it
crashes due to a BUG_ON(). Fix this by removing it. The function now
returns NULL on allocation failure, which is already properly handled in
its caller qc_handle_crypto_frm().

This should fix the last reported crash from github issue #2935.

This must be backported up to 2.6.
2025-04-18 18:11:17 +02:00
Olivier Houchard
3758eab71c MEDIUM: lb_fwrr: Use one ebtree per thread group.
When using the round-robin load balancer, the major source of contention
is the lbprm lock, that has to be held every time we pick a server.
To mitigate that, make it so there are one tree per thread-group, and
one lock per thread-group. That means we now have a lb_fwrr_per_tgrp
structure that will contain the two lb_fwrr_groups (active and backup) as well
as the lock to protect them in the per-thread lbprm struct, and all
fields in the struct server are now moved to the per-thread structure
too.
Those changes are mostly mechanical, and brings good performances
improvment, on a 64-cores AMD CPU, with 64 servers configured, we could
process about 620000 requests par second, and we now can process around
1400000 requests per second.
2025-04-17 17:38:23 +02:00
Olivier Houchard
f36f6cfd26 MINOR: proxies: Add a per-thread group lbprm struct.
Add a new structure in the per-thread groups proxy structure, that will
contain whatever is per-thread group in lbprm.
It will be accessed as p->per_tgrp[tgid].lbprm.
2025-04-17 17:38:23 +02:00
Olivier Houchard
7ca1c94ff0 MINOR: lb_fwrr: Move the next weight out of fwrr_group.
Move the "next_weight" outside of fwrr_group, and inside struct lb_fwrr
directly, one for the active servers, one for the backup servers.
We will soon have one fwrr_group per thread group, but next_weight will
be global to all of them.
2025-04-17 17:38:23 +02:00
Olivier Houchard
444125a764 MINOR: servers: Provide a pointer to the server in srv_per_tgroup.
Add a pointer to the server into the struct srv_per_tgroup, so that if
we only have access to that srv_per_tgroup, we can come back to the
corresponding server.
2025-04-17 17:38:23 +02:00
Willy Tarreau
36ec70c526 MINOR: sched: add a new function is_sched_alive() to report scheduler's health
This verifies that the scheduler is still ticking without having to
access the activity[] array nor keeping local copies of the ctxsw
counter. It just tests and sets a flag that is reset after each
return from a ->process() function.
2025-04-17 16:25:47 +02:00
Willy Tarreau
874ba2afed CLEANUP: debug: no longer set nor use TH_FL_DUMPING_OTHERS
TH_FL_DUMPING_OTHERS was being used to try to perform exclusion between
threads running "show threads" and those producing warnings. Now that it
is much more cleanly handled, we don't need that type of protection
anymore, which was adding to the complexity of the solution. Let's just
get rid of it.
2025-04-17 16:25:47 +02:00
Willy Tarreau
c16d5415a8 MINOR: debug: make ha_stuck_warning() only work for the current thread
Since we no longer call it with a foreign thread, let's simplify its code
and get rid of the special cases that were relying on ha_thread_dump_fill()
and synchronization with a remote thread. We're not only dumping the
current thread so ha_thread_dump_one() is sufficient.
2025-04-17 16:25:47 +02:00
Willy Tarreau
b24d7f248e MINOR: pass a valid buffer pointer to ha_thread_dump_one()
The goal is to let the caller deal with the pointer so that the function
only has to fill that buffer without worrying about locking. This way,
synchronous dumps from "show threads" are produced and emitted directly
without causing undesired locking of the buffer nor risking causing
confusion about thread_dump_buffer containing bits from an interrupted
dump in progress.

It's only the caller that's responsible for notifying the requester of
the end of the dump by setting bit 0 of the pointer if needed (i.e. it's
only done in the debug handler).
2025-04-17 16:25:47 +02:00
Willy Tarreau
5ac739cd0c MINOR: debug: remove unused case of thr!=tid in ha_thread_dump_one()
This function was initially designed to dump any threadd into the presented
buffer, but the way it currently works is that it's always called for the
current thread, and uses the distinction between coming from a sighandler
or being called directly to detect which thread is the caller.

Let's simplify all this by replacing thr with tid everywhere, and using
the thread-local pointers where it makes sense (e.g. th_ctx, th_ctx etc).
The confusing "from_signal" argument is now replaced with "is_caller"
which clearly states whether or not the caller declares being the one
asking for the dump (the logic is inverted, but there are only two call
places with a constant).
2025-04-17 16:25:47 +02:00
Willy Tarreau
6d8a523d14 MINOR: tinfo: keep a copy of the pointer to the thread dump buffer
Instead of using the thread dump buffer for post-mortem analysis, we'll
keep a copy of the assigned pointer whenever it's used, even for warnings
or "show threads". This will offer more opportunities to figure from a
core what happened, and will give us more freedom regarding the value of
the thread_dump_buffer itself. For example, even at the end of the dump
when the pointer is reset, the last used buffer is now preserved.
2025-04-17 16:25:47 +02:00
Willy Tarreau
337017e2f9 BUG/MINOR: threads: set threads_idle and threads_harmless even with no threads
Some signal handlers rely on these to decide about the level of detail to
provide in dumps, so let's properly fill the info about entering/leaving
idle. Note that for consistency with other tests we're using bitops with
t->ltid_bit, while we could simply assign 0/1 to the fields. But it makes
the code more readable and the whole difference is only 88 bytes on a 3MB
executable.

This bug is not important, and while older versions are likely affected
as well, it's not worth taking the risk to backport this in case it would
wake up an obscure bug.
2025-04-17 16:25:47 +02:00
Amaury Denoyelle
52246249ab MEDIUM: listener/mux-h2: implement idle-ping on frontend side
This commit is the counterpart of the previous one, adapted on the
frontend side. "idle-ping" is added as keyword to bind lines, to be able
to refresh client timeout of idle frontend connections.

H2 MUX behavior remains similar as the previous patch. The only
significant change is in h2c_update_timeout(), as idle-ping is now taken
into account also for frontend connection. The calculated value is
compared with http-request/http-keep-alive timeout value. The shorter
delay is then used as expired date. As hr/ka timeout are based on
idle_start, this allows to run them in parallel with an idle-ping timer.
2025-04-17 14:49:36 +02:00
Amaury Denoyelle
a78a04cfae MEDIUM: server/mux-h2: implement idle-ping on backend side
This commit implements support for idle-ping on the backend side. First,
a new server keyword "idle-ping" is defined in configuration parsing. It
is used to set the corresponding new server member.

The second part of this commit implements idle-ping support on H2 MUX. A
new inlined function conn_idle_ping() is defined to access connection
idle-ping value. Two new connection flags are defined H2_CF_IDL_PING and
H2_CF_IDL_PING_SENT. The first one is set for idle connections via
h2c_update_timeout().

On h2_timeout_task() handler, if first flag is set, instead of releasing
the connection as before, the second flag is set and tasklet is
scheduled. As both flags are now set, h2_process_mux() will proceed to
PING emission. The timer has also been rearmed to the idle-ping value.
If a PING ACK is received before next timeout, connection timer is
refreshed. Else, the connection is released, as with timer expiration.

Also of importance, special care is needed when a backend connection is
going to idle. In this case, idle-ping timer must be rearmed. Thus a new
invokation of h2c_update_timeout() is performed on h2_detach().
2025-04-17 14:49:36 +02:00
William Lallemand
e778049ffc MINOR: acme: register the task in the ckch_store
This patch registers the task in the ckch_store so we don't run 2 tasks
at the same time for a given certificate.

Move the task creation under the lock and check if there was already a
task under the lock.
2025-04-16 17:12:43 +02:00
William Lallemand
c291a5c73c BUILD: incompatible pointer type suspected with -DDEBUG_UNIT
src/jws.c: In function '__jws_init':
src/jws.c:594:38: error: passing argument 2 of 'hap_register_unittest' from incompatible pointer type [-Wincompatible-pointer-types]
  594 |         hap_register_unittest("jwk", jwk_debug);
      |                                      ^~~~~~~~~
      |                                      |
      |                                      int (*)(int,  char **)
In file included from include/haproxy/api.h:36,
                 from include/import/ebtree.h:251,
                 from include/import/ebmbtree.h:25,
                 from include/haproxy/jwt-t.h:25,
                 from src/jws.c:5:
include/haproxy/init.h:37:52: note: expected 'int (*)(void)' but argument is of type 'int (*)(int,  char **)'
   37 | void hap_register_unittest(const char *name, int (*fct)());
      |                                              ~~~~~~^~~~~~

GCC 15 is warning because the function pointer does have its
arguments in the register function.

Should fix issue #2929.
2025-04-15 15:49:44 +02:00
Willy Tarreau
b708345c17 DEBUG: counters: add the ability to enable/disable updating the COUNT_IF counters
These counters can have a noticeable cost on large machines, though not
dramatic. There's no single good choice to keep them enabled or disabled.
This commit adds multiple choices:
  - DEBUG_COUNTERS set to 2 will automatically enable them by default, while
    1 will disable them by default
  - the global "debug.counters on/off" will allow to change the setting at
    boot, regardless of DEBUG_COUNTERS as long as it was at least 1.
  - the CLI "debug counters on/off" will also allow to change the value at
    run time, allowing to observe a phenomenon while it's happening, or to
    disable counters if it's suspected that their cost is too high

Finally, the "debug counters" command will append "(stopped)" at the end
of the CNT lines when these counters are stopped.

Not that the whole mechanism would easily support being extended to all
counter types by specifying the types to apply to, but it doesn't seem
useful at all and would require the user to also type "cnt" on debug
lines. This may easily be changed in the future if it's found relevant.
2025-04-14 19:02:13 +02:00
Willy Tarreau
a142adaba0 DEBUG: counters: make COUNT_IF() only appear at DEBUG_COUNTERS>=1
COUNT_IF() is convenient but can be heavy since some of them were found
to trigger often (roughly 1 counter per request on avg). This might even
have an impact on large setups due to the cost of a shared cache line
bouncing between multiple cores. For now there's no way to disable it,
so let's only enable it when DEBUG_COUNTERS is 1 or above. A future
change will make it configurable.
2025-04-14 19:02:13 +02:00
Willy Tarreau
61d633a3ac DEBUG: rename DEBUG_GLITCHES to DEBUG_COUNTERS and enable it by default
Till now the per-line glitches counters were only enabled with the
confusingly named DEBUG_GLITCHES (which would not turn glitches off
when disabled). Let's instead change it to DEBUG_COUNTERS and make sure
it's enabled by default (though it can still be disabled with
-DDEBUG_GLITCHES=0 just like for DEBUG_STRICT). It will later be
expanded to cover more counters.
2025-04-14 19:02:13 +02:00
William Lallemand
39c05cedff BUILD: acme: enable the ACME feature when JWS is present
The ACME feature depends on the JWS, which currently does not work with
every SSL libraries. This patch only enables ACME when JWS is enabled.
2025-04-12 01:39:03 +02:00
William Lallemand
5500bda9eb MINOR: acme: implement retrieval of the certificate
Once the Order status is "valid", the certificate URL is accessible,
this patch implements the retrieval of the certificate which is stocked
in ctx->store.
2025-04-12 01:39:03 +02:00
William Lallemand
27fff179fe MINOR: acme: verify the order status once finalized
This implements a call to the order status to check if the certificate
is ready.
2025-04-12 01:39:03 +02:00
William Lallemand
680222b382 MINOR: acme: finalize by sending the CSR
This patch does the finalize step of the ACME task.
This encodes the CSR into base64 format and send it to the finalize URL.

https://www.rfc-editor.org/rfc/rfc8555#section-7.4
2025-04-12 01:29:27 +02:00
William Lallemand
de5dc31a0d MINOR: acme: generate the CSR in a X509_REQ
Generate the X509_REQ using the generated private key and the SAN from
the configuration. This is only done once before the task is started.

It could probably be done at the beginning of the task with the private
key generation once we have a scheduler instead of a CLI command.
2025-04-12 01:29:27 +02:00
William Lallemand
00ba62df15 MINOR: acme: implement a check on the challenge status
This patch implements a check on the challenge URL, once haproxy asked
for the challenge to be verified, it must verify the status of the
challenge resolution and if there weren't any error.
2025-04-12 01:29:27 +02:00
William Lallemand
711a13a4b4 MINOR: acme: send the request for challenge ready
This patch sends the "{}" message to specify that a challenge is ready.
It iterates on every challenge URL in the authorization list from the
acme_ctx.

This allows the ACME server to procede to the challenge validation.
https://www.rfc-editor.org/rfc/rfc8555#section-7.5.1
2025-04-12 01:29:27 +02:00
William Lallemand
ae0bc88f91 MINOR: acme: get the challenges object from the Auth URL
This patch implements the retrieval of the challenges objects on the
authorizations URLs. The challenges object contains a token and a
challenge url that need to be called once the challenge is setup.

Each authorization URLs contain multiple challenge objects, usually one
per challenge type (HTTP-01, DNS-01, ALPN-01... We only need to keep the
one that is relevent to our configuration.
2025-04-12 01:29:27 +02:00
William Lallemand
4842c5ea8c MINOR: acme: newOrder request retrieve authorizations URLs
This patch implements the newOrder action in the ACME task, in order to
ask for a new certificate, a list of SAN is sent as a JWS payload.
the ACME server replies a list of Authorization URLs. One Authorization
is created per SAN on a Order.

The authorization URLs are stored in a linked list of 'struct acme_auth'
in acme_ctx, so we can get the challenge URLs from them later.

The location header is also store as it is the URL of the order object.

https://datatracker.ietf.org/doc/html/rfc8555#section-7.4
2025-04-12 01:29:27 +02:00
William Lallemand
04d393f661 MINOR: acme: generate new account
The new account action in the ACME task use the same function as the
chkaccount, but onlyReturnExisting is not sent in this case!
2025-04-12 01:29:27 +02:00
William Lallemand
7f9bf4d5f7 MINOR: acme: check if the account exist
This patch implements the retrival of the KID (account identifier) using
the pkey.

A request is sent to the newAccount URL using the onlyReturnExisting
option, which allow to get the kid of an existing account.

acme_jws_payload() implement a way to generate a JWS payload using the
nonce, pkey and provided URI.
2025-04-12 01:29:27 +02:00
William Lallemand
0aa6dedf72 MINOR: acme: handle the nonce
ACME requests are supposed to be sent with a Nonce, the first Nonce
should be retrieved using the newNonce URI provided by the directory.

This nonce is stored and must be replaced by the new one received in the
each response.
2025-04-12 01:29:27 +02:00
William Lallemand
471290458e MINOR: acme: get the ACME directory
The first request of the ACME protocol is getting the list of URLs for
the next steps.

This patch implements the first request and the parsing of the response.

The response is a JSON object so mjson is used to parse it.
2025-04-12 01:29:27 +02:00
William Lallemand
b8209cf697 MINOR: acme/cli: add the 'acme renew' command
The "acme renew" command launch the ACME task for a given certificate.

The CLI parser generates a new private key using the parameters from the
acme section..
2025-04-12 01:29:27 +02:00
William Lallemand
bf6a39c4d1 MINOR: acme: add private key configuration
This commit allows to configure the generated private keys, you can
configure the keytype (RSA/ECDSA), the number of bits or the curves.

Example:

    acme LE
        uri https://acme-staging-v02.api.letsencrypt.org/directory
        account account.key
        contact foobar@example.com
        challenge HTTP-01
        keytype ECDSA
        curves P-384
2025-04-12 01:29:27 +02:00
William Lallemand
2e8c350b95 MINOR: acme: add configuration for the crt-store
Add new acme keywords for the ckch_conf parsing, which will be used on a
crt-store, a crt line in a frontend, or even a crt-list.

The cfg_postparser_acme() is called in order to check if a section referenced
elsewhere really exists in the config file.
2025-04-12 01:29:27 +02:00
William Lallemand
077e2ce84c MINOR: acme: add the acme section in the configuration parser
Add a configuration parser for the new acme section, the section is
configured this way:

    acme letsencrypt
        uri https://acme-staging-v02.api.letsencrypt.org/directory
        account account.key
        contact foobar@example.com
        challenge HTTP-01

When unspecified, the challenge defaults to HTTP-01, and the account key
to "<section_name>.account.key".

Section are stored in a linked list containing acme_cfg structures, the
configuration parsing is mostly resolved in the postsection parser
cfg_postsection_acme() which is called after the parsing of an acme section.
2025-04-12 01:29:27 +02:00
William Lallemand
20718f40b6 MEDIUM: ssl/ckch: add filename and linenum argument to crt-store parsing
Add filename and linenum arguments to the crt-store / ckch_conf parsing.

It allows to use them in the parsing function so we could emits error.
2025-04-12 01:29:27 +02:00
Willy Tarreau
00c967fac4 MINOR: master/cli: support bidirectional communications with workers
Some rare commands in the worker require to keep their input open and
terminate when it's closed ("show events -w", "wait"). Others maintain
a per-session context ("set anon on"). But in its default operation
mode, the master CLI passes commands one at a time to the worker, and
closes the CLI's input channel so that the command can immediately
close upon response. This effectively prevents these two specific cases
from being used.

Here the approach that we take is to introduce a bidirectional mode to
connect to the worker, where everything sent to the master is immediately
forwarded to the worker (including the raw command), allowing to queue
multiple commands at once in the same session, and to continue to watch
the input to detect when the client closes. It must be a client's choice
however, since doing so means that the client cannot batch many commands
at once to the master process, but must wait for these commands to complete
before sending new ones. For this reason we use the prefix "@@<pid>" for
this. It works exactly like "@" except that it maintains the channel
open during the whole execution. Similarly to "@<pid>" with no command,
"@@<pid>" will simply open an interactive CLI session to the worker, that
will be ended by "quit" or by closing the connection. This can be convenient
for the user, and possibly for clients willing to dedicate a connection to
the worker.
2025-04-11 16:09:17 +02:00
Aurelien DARRAGON
fbfeb591f7 MINOR: proxy: add deinit_proxy() helper func
Same as free_proxy(), but does not free the base proxy pointer (ie: the
proxy itself may not be allocated)

Goal is to be able to cleanup statically allocated dummy proxies.
2025-04-10 22:10:31 +02:00
Aurelien DARRAGON
e1cec655ee MINOR: proxy: add setup_new_proxy() function
Split alloc_new_proxy() in two functions: the preparing part is now
handled by setup_new_proxy() which can be called individually, while
alloc_new_proxy() takes care of allocating a new proxy struct and then
calling setup_new_proxy() with the freshly allocated proxy.
2025-04-10 22:10:31 +02:00
Willy Tarreau
f4634e5a38 MINOR: ring/cli: support delimiting events with a trailing \0 on "show events"
At the moment it is not supported to produce multi-line events on the
"show events" output, simply because the LF character is used as the
default end-of-event mark. However it could be convenient to produce
well-formatted multi-line events, e.g. in JSON or other formats. UNIX
utilities have already faced similar needs in the past and added
"-print0" to "find" and "-0" to "xargs" to mention that the delimiter
is the NUL character. This makes perfect sense since it's never present
in contents, so let's do exactly the same here.

Thus from now on, "show events <ring> -0" will delimit messages using
a \0 instead of a \n, permitting a better and safer encapsulation.
2025-04-08 14:36:35 +02:00
Willy Tarreau
0be6d73e88 MINOR: ring: support arbitrary delimiters through ring_dispatch_messages()
In order to support delimiting output events with other characters than
just the LF, let's pass the delimiter through the API. The default remains
the LF, used by applet_append_line(), and ignored by the log forwarder.
2025-04-08 14:36:35 +02:00
Willy Tarreau
f01ff2478f BUILD: atomics: fix build issue on non-x86/non-arm systems
Commit f435a2e518 ("CLEANUP: atomics: also replace __sync_synchronize()
with __atomic_thread_fence()") replaced the builtins used for barriers,
but the different API required an argument while the macros didn't specify
any, resulting in double parenthesis that were causing obscure build errors
such as "called object type 'void' is not a function or function pointer".
Let's just specify the args for the macro. No backport is needed.
2025-04-07 09:38:22 +02:00
Aurelien DARRAGON
11d4d0957e MEDIUM: task: make notification_* API thread safe by default
Some notification_* functions were not thread safe by default as they
assumed only one producer would emit events for registered tasks.

While this suited well with the Lua sockets use-case, this proved to
be a limitation with some other event sources (ie: lua Queue class)

instead of having to deal with both the non thread safe and thread
safe variants (_mt suffix), which is error prone, let's make the
entire API thread safe regarding the event list.

Pruning functions still require that only one thread executes them,
with Lua this is always the case because there is one cleanup list
per context.
2025-04-03 17:52:50 +02:00
Aurelien DARRAGON
748dba4859 MINOR: hlua_fcn: register queue class using hlua_register_metatable()
Most lua classes are registered by leveraging the
hlua_register_metatable() helper. Let's use that for the Queue class as
well for consitency.
2025-04-03 17:52:17 +02:00
Aurelien DARRAGON
b77b1a2c3a MINOR: task: add thread safe notification_new and notification_wake variants
notification_new and notification_wake were historically meant to be
called by a single thread doing both the init and the wakeup for other
tasks waiting on the signals.

In this patch, we extend the API so that notification_new and
notification_wake have thread-safe variants that can safely be used with
multiple threads registering on the same list of events and multiple
threads pushing updates on the list.
2025-04-03 17:52:03 +02:00
Amaury Denoyelle
f0f1816f1a MINOR: check: implement check-pool-conn-name srv keyword
This commit is a direct follow-up of the previous one. It defines a new
server keyword check-pool-conn-name. It is used as the default value for
the name parameter of idle connection hash generation.

Its behavior is similar to server keyword pool-conn-name, but reserved
for checks reuse. If check-pool-conn-name is set, it is used in priority
to match a connection for reuse. If unset, a fallback is performed on
check-sni.
2025-04-03 17:19:07 +02:00
Amaury Denoyelle
43367f94f1 MINOR: check/backend: support conn reuse with SNI
Support for connection reuse during server checks was implemented
recently. This is activated with the server keyword check-reuse-pool.

Similarly to stream processing via connect_backend(), a connection hash
is calculated when trying to perform reuse for checks. This is necessary
to retrieve for a connection which shares the check connect parameters.
However, idle connections can additionnally be tagged using a
pool-conn-name or SNI under connect_backend(). Check reuse does not test
these values, which prevent to retrieve a matching connection.

Improve this by using "check-sni" value as idle connection hash input
for check reuse. be_calculate_conn_hash() API has been adjusted so that
name value can be passed as input, both when using streams or checks.

Even with the current patch, there is still some scenarii which could
not be covered for checks connection reuse. most notably, when using
dynamic pool-conn-name/SNI value. It is however at least sufficient to
cover simpler cases.
2025-04-03 17:19:07 +02:00
Willy Tarreau
f435a2e518 CLEANUP: atomics: also replace __sync_synchronize() with __atomic_thread_fence()
The drop of older compilers also allows us to focus on clearer
barriers, so let's use them.
2025-04-03 11:59:31 +02:00
Willy Tarreau
34e3b83f9c CLEANUP: atomics: remove support for gcc < 4.7
The old __sync_* API is no longer necessary since we do not support
gcc before 4.7 anymore. Let's just get rid of this code, the file is
still ugly enough without it.
2025-04-03 11:55:35 +02:00
Ilia Shipitsin
27a6353ceb CLEANUP: assorted typo fixes in the code, commits and doc 2025-04-03 11:37:25 +02:00
William Lallemand
b351f06ff1 REORG: ssl: move curves2nid and nid2nist to ssl_utils
curves2nid and nid2nist are generic functions that could be used outside
the JWS scope, this patch put them at the right place so they can be
reused.
2025-04-02 19:34:09 +02:00
Amaury Denoyelle
f1fb396d71 MEDIUM: check: implement check-reuse-pool
Implement the possibility to reuse idle connections when performing
server checks. This is done thanks to the recently introduced functions
be_calculate_conn_hash() and be_reuse_connection().

One side effect of this change is that be_calculate_conn_hash() can now
be called with a NULL stream instance. As such, part of the functions
are adjusted accordingly.

Note that to simplify configuration, connection reuse is not performed
if any specific check connection parameters are defined on the server
line or via the tcp-check connect rule. This is performed via newly
defined tcpcheck_use_nondefault_connect().
2025-04-02 14:57:40 +02:00
Amaury Denoyelle
e34f748e3a MINOR: check define check-reuse-pool server keyword
Define a new server keyword check-reuse-pool, and its counterpart with a
"no" prefix. For the moment, only parsing is implemented. The real
behavior adjustment will be implemented in the next patch.
2025-04-02 14:57:40 +02:00
Amaury Denoyelle
20eb57b486 MINOR: backend: remove stream usage on connection reuse
Adjust newly defined be_reuse_connection() API. The stream argument is
removed. This will allows checks to be able to invoke it without relying
on a stream instance.
2025-04-02 14:57:40 +02:00
Amaury Denoyelle
ee94a6cfc1 MINOR: backend: extract conn reuse from connect_server()
Following the previous patch, the part directly related to connection
reuse is extracted from connect_server(). It is now define in a new
function be_reuse_connection().
2025-04-02 14:57:40 +02:00
Amaury Denoyelle
c7cc6b6401 MINOR: backend: extract conn hash calculation from connect_server()
On connection reuse, a hash is first calculated. It is generated from
various connection parameters, to retrieve a matching connection.

Extract hash calculation from connect_server() into a new dedicated
function be_calculate_conn_hash(). The objective is to be able to
perform connection reuse for checks, without connect_server() invokation
which relies on a stream instance.
2025-04-02 14:57:40 +02:00
Willy Tarreau
4ec5509541 BUILD: compiler: undefine the CONCAT() macro if already defined
As Ilya reported in issue #2911, the CONCAT() macro breaks on NetBSD
which defines its own as __CONCAT() (which is exactly the same). Let's
just undefine it before ours to fix the issue instead of renaming, but
keep ours so that we don't have doubts about what we're running with.

Note that the patch introducing this breaking change was backported
to 3.0.
2025-04-02 11:36:43 +02:00
Ilia Shipitsin
78b849b839 CLEANUP: assorted typo fixes in the code and comments
code, comments and doc actually.
2025-04-02 11:12:20 +02:00
Olivier Houchard
9fe72bba3c MAJOR: leastconn; Revamp the way servers are ordered.
For leastconn, servers used to just be stored in an ebtree.
Each server would be one node.
Change that so that nodes contain multiple mt_lists. Each list
will contain servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another does no longer require the lbprm write
lock.
We use multiple mt_lists to reduce the contention when moving
a server from one tree element to another. A list in the new
element will be chosen randomly.
We no longer remove a tree element as soon as they no longer
contain any server. Instead, we keep a list of all elements,
and when we need a new element, we look at that list only if it
contains a number of elements already, otherwise we'll allocate
a new one. Keeping nodes in the tree ensures that we very
rarely have to take the lbrpm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).

The number of mt_lists used is defined as FWLC_NB_LISTS.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The value used were picked afrer experimentation, and
seems to be the best choice of performances vs memory
usage.

Doing that gives a good boost in performances when a lot of
servers are used.
With a configuration using 500 servers, before that patch,
about 830000 requests per second could be processed, with
that patch, about 1550000 requests per second are
processed, on an 64-cores AMD, using 1200 concurrent connections.
2025-04-01 18:05:30 +02:00
Olivier Houchard
ba521a1d88 MINOR: threads: Add HA_RWLOCK_TRYRDTOWR()
Add HA_RWLOCK_TRYRDTOWR(), that tries to upgrade a lock
from reader to writer, and fails if any seeker or writer already
holds it.
2025-04-01 18:05:30 +02:00
Olivier Houchard
2a9436f96b MINOR: lbprm: Add method to deinit server and proxy
Add two new methods to lbprm, server_deinit() and proxy_deinit(),
in case something should be done at the lbprm level when
removing servers and proxies.
2025-04-01 18:05:30 +02:00
Olivier Houchard
17059098e7 MINOR: mt_list: Implement mt_list_try_lock_prev().
Implement mt_list_try_lock_prev(), that does the same thing
as mt_list_lock_prev(), exceot if the list is locked, it
returns { NULL, NULL } instaed of waiting.
2025-04-01 18:05:30 +02:00
William Lallemand
fdcb97614c MINOR: ssl/ckch: add substring parser for ckch_conf
Add a substring parser for the ckch_conf keyword parser, this will split
a string into multiple substring, and strdup them in a array.
2025-04-01 15:38:32 +02:00
William Lallemand
f8fe84caca MINOR: jws: emit the JWK thumbprint
jwk_thumbprint() is a function which is a function which implements
RFC7368 and emits a JWK thumbprint using a EVP_PKEY.

EVP_PKEY_EC_to_pub_jwk() and EVP_PKEY_RSA_to_pub_jwk() were changed in
order to match what is required to emit a thumbprint (ie, no spaces or
lines and the lexicographic order of the fields)
2025-04-01 11:57:55 +02:00
Willy Tarreau
1e9a2529aa MINOR: cpu-topo: pass an extra argument to ha_cpu_policy
This extra argument will allow common functions to distinguish between
multiple policies. For now it's not used.
2025-03-31 16:21:37 +02:00
Willy Tarreau
571573874a MINOR: cpu-set: add a new function to print cpu-sets in human-friendly mode
The new function "print_cpu_set()" will print cpu sets in a human-friendly
way, with commas and dashes for intervals. The goal is to keep them compact
enough.
2025-03-31 16:21:37 +02:00
Willy Tarreau
3955f151b1 MINOR: cpu-set: compare two cpu sets with ha_cpuset_isequal()
This function returns true if two CPU sets are equal.
2025-03-31 16:21:37 +02:00
Valentine Krasnobaeva
b303861469 MINOR: compiler: add __nonstring macro
GCC 15 throws the following warning on fixed-size char arrays if they do not
contain terminated NUL:

src/tools.c:2041:25: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Werror=unterminated-string-initialization]
 2041 | const char hextab[16] = "0123456789ABCDEF";

We are using a couple of such definitions for some constants. Converting them
to flexible arrays, like: hextab[] = "0123456789ABCDEF" may have consequences,
as enlarged arrays won't fit anymore where they were possibly located due to
the memory alignement constraints.

GCC adds 'nonstring' variable attribute for such char arrays, but clang and
other compilers don't have it. Let's wrap 'nonstring' with our
__nonstring macro, which will test if the compiler supports this attribute.

This fixes the issue #2910.
2025-03-31 13:50:28 +02:00
Willy Tarreau
6b17310757 MEDIUM: pools: be a bit smarter when merging comparable size pools
By default, pools of comparable sizes are merged together. However, the
current algorithm is dumb: it rounds the requested size to the next
multiple of 16 and compares the sizes like this. This results in many
entries which are already multiples of 16 not being merged, for example
1024 and 1032 are separate, 65536 and 65540 are separate, 48 and 56 are
separate (though 56 merges with 64).

This commit changes this to consider not just the entry size but also the
average entry size, that is, it compares the average size of all objects
sharing the pool with the size of the object looking for a pool. If the
object is not more than 1% bigger nor smaller than the current average
size or if it neither 16 bytes smaller nor larger, then it can be merged.
Also, it always respects exact matches in order to avoid merging objects
into larger pools or worse, extending existing ones for no reason, and
when there's a tie, it always avoids extending an existing pool.

Also, we now visit all existing pools in order to spot the best one, we
do not stop anymore at the smallest one large enough. Theoretically this
could cost a bit of CPU but in practice it's O(N^2) with N quite small
(typically in the order of 100) and the cost at each step is very low
(compare a few integer values). But as a side effect, pools are no
longer sorted by size, "show pools bysize" is needed for this.

This causes the objects to be much better grouped together, accepting to
use a little bit more sometimes to avoid fragmentation, without causing
everyone to be merged into the same pool. Thanks to this we're now
seeing 36 pools instead of 48 by default, with some very nice examples
of compact grouping:

  - Pool qc_stream_r (80 bytes) : 13 users
      >  qc_stream_r : size=72 flags=0x1 align=0
      >  quic_cstrea : size=80 flags=0x1 align=0
      >  qc_stream_a : size=64 flags=0x1 align=0
      >  hlua_esub   : size=64 flags=0x1 align=0
      >  stconn      : size=80 flags=0x1 align=0
      >  dns_query   : size=64 flags=0x1 align=0
      >  vars        : size=80 flags=0x1 align=0
      >  filter      : size=64 flags=0x1 align=0
      >  session pri : size=64 flags=0x1 align=0
      >  fcgi_hdr_ru : size=72 flags=0x1 align=0
      >  fcgi_param_ : size=72 flags=0x1 align=0
      >  pendconn    : size=80 flags=0x1 align=0
      >  capture     : size=64 flags=0x1 align=0

  - Pool h3s (56 bytes) : 17 users
      >  h3s         : size=56 flags=0x1 align=0
      >  qf_crypto   : size=48 flags=0x1 align=0
      >  quic_tls_se : size=48 flags=0x1 align=0
      >  quic_arng   : size=56 flags=0x1 align=0
      >  hlua_flt_ct : size=56 flags=0x1 align=0
      >  promex_metr : size=48 flags=0x1 align=0
      >  conn_hash_n : size=56 flags=0x1 align=0
      >  resolv_requ : size=48 flags=0x1 align=0
      >  mux_pt      : size=40 flags=0x1 align=0
      >  comp_state  : size=40 flags=0x1 align=0
      >  notificatio : size=48 flags=0x1 align=0
      >  tasklet     : size=56 flags=0x1 align=0
      >  bwlim_state : size=48 flags=0x1 align=0
      >  xprt_handsh : size=48 flags=0x1 align=0
      >  email_alert : size=56 flags=0x1 align=0
      >  caphdr      : size=41 flags=0x1 align=0
      >  caphdr      : size=41 flags=0x1 align=0

  - Pool quic_cids (32 bytes) : 13 users
      >  quic_cids   : size=16 flags=0x1 align=0
      >  quic_tls_ke : size=32 flags=0x1 align=0
      >  quic_tls_iv : size=12 flags=0x1 align=0
      >  cbuf        : size=32 flags=0x1 align=0
      >  hlua_queuew : size=24 flags=0x1 align=0
      >  hlua_queue  : size=24 flags=0x1 align=0
      >  promex_modu : size=24 flags=0x1 align=0
      >  cache_st    : size=24 flags=0x1 align=0
      >  spoe_appctx : size=32 flags=0x1 align=0
      >  ehdl_sub_tc : size=32 flags=0x1 align=0
      >  fcgi_flt_ct : size=16 flags=0x1 align=0
      >  sig_handler : size=32 flags=0x1 align=0
      >  pipe        : size=24 flags=0x1 align=0

  - Pool quic_crypto (1032 bytes) : 2 users
      >  quic_crypto : size=1032 flags=0x1 align=0
      >  requri      : size=1024 flags=0x1 align=0

  - Pool quic_conn_r (65544 bytes) : 2 users
      >  quic_conn_r : size=65536 flags=0x1 align=0
      >  dns_msg_buf : size=65540 flags=0x1 align=0

On a very unscientific test consisting in sending 1 million H1 requests
and 1 million H2 requests to the stats page, we're seeing an ~6% lower
memory usage with the patch:

  before the patch:
    Total: 48 pools, 4120832 bytes allocated, 4120832 used (~3555680 by thread caches).

  after the patch:
    Total: 36 pools, 3880648 bytes allocated, 3880648 used (~3299064 by thread caches).

This should be taken with care however since pools allocate and release
in batches.
2025-03-25 18:01:01 +01:00
Pierre-Andre Savalle
8ed1e91efd MEDIUM: lb-chash: add directive hash-preserve-affinity
When using hash-based load balancing, requests are always assigned to
the server corresponding to the hash bucket for the balancing key,
without taking maxconn or maxqueue into account, unlike in other load
balancing methods like 'first'. This adds a new backend directive that
can be used to take maxconn and possibly maxqueue in that context. This
can be used when hashing is desired to achieve cache locality, but
sending requests to a different server is preferable to queuing for a
long time or failing requests when the initial server is saturated.

By default, affinity is preserved as was the case previously. When
'hash-preserve-affinity' is set to 'maxqueue', servers are considered
successively in the order of the hash ring until a server that does not
have a full queue is found.

When 'maxconn' is set on a server, queueing cannot be disabled, as
'maxqueue=0' means unlimited.  To support picking a different server
when a server is at 'maxconn' irrespective of the queue,
'hash-preserve-affinity' can be set to 'maxconn'.
2025-03-25 18:01:01 +01:00
Amaury Denoyelle
cf9e40bd8a MINOR: quic: define max-stream-data configuration as a ratio 2025-03-25 16:30:35 +01:00
Amaury Denoyelle
68c10d444d MINOR: mux-quic: define config for max-data
Define a new global configuration tune.quic.frontend.max-data. This
allows users to explicitely set the value for the corresponding QUIC TP
initial-max-data, with direct impact on haproxy memory consumption.
2025-03-25 16:30:09 +01:00
Amaury Denoyelle
a71007c088 MINOR: quic: move global tune options into quic_tune
A new structure quic_tune has recently been defined. Its purpose is to
store global options related to QUIC. Previously, only the tunable to
toggle pacing was stored in it.

This commit moves several QUIC related tunable from global to quic_tune
structure. This better centralizes QUIC configuration option and gives
room for future generic options.
2025-03-24 10:01:46 +01:00
Willy Tarreau
9091c5317f MINOR: cli/pools: record the list of pool registrations even when merging them
By default, create_pool() tries to merge similar pools into one. But when
dealing with certain bugs, it's hard to say which ones were merged together.
We do have the information at registration time, so let's just create a
list of registrations ("pool_registration") attached to each pool, that
will store that information. It can then be consulted on the CLI using
"show pools detailed", where the names, sizes, alignment and flags are
reported.
2025-03-21 17:09:30 +01:00
Aurelien DARRAGON
7ec6f4412c MINOR: stats: add alt_name field to stat_col struct
alt_name will be used by metric exporters to know how the metric should be
presented to the user. If the alt_name is NULL, the metric should be
ignored. For now only promex exporter will make use of this.
2025-03-21 17:04:54 +01:00
Olivier Houchard
98967aa09f MEDIUM: mt_list: Reduce the max number of loops with exponential backoff
Reduce the max number of loops in the mt_list code while waiting for
a lock to be available with exponential backoff. It's been observed that
the current value led to severe performances degradation at least on
some hardware, hopefully this value will be acceptable everywhere.
2025-03-21 11:30:59 +01:00
Aurelien DARRAGON
af68343a56 MINOR: stats: use stat_col storage stat_cols_info
Use stat_col storage for stat_cols_info[] array instead of name_desc.

As documented in 65624876f ("MINOR: stats: introduce a more expressive
stat definition method"), stat_col supersedes name_desc storage but
it remains backward compatible. Here we migrate to the new API to be
able to further extend stat_cols_info[] in following patches.
2025-03-20 11:38:32 +01:00
Aurelien DARRAGON
9c60fc9fe1 MINOR: stats: STATS_PX_CAP___B_ macro
STATS_PX_CAP___B_ points to STATS_PX_CAP_BE, it is just an alias
for consistency, like STATS_PX_CAP____S which points to
STATS_PX_CAP_SRV.
2025-03-20 11:37:47 +01:00
Aurelien DARRAGON
3c1b00b127 MINOR: stats: add .generic explicit field in stat_col struct
Further extend logic implemented in 65624876 ("MINOR: stats: introduce a
more expressive stat definition method") and 4e9e8418 ("MINOR: stats:
prepare stats-file support for values other than FN_COUNTER"): we don't
rely anymore on the presence of the capability to know if the metric is
generic or not. This is because it prevents us from setting a capability
on static statistics. Yet it could be useful to set the capability even
on static metrics, thus we add a dedicated .generic bit to tell haproxy
that the metric is generic and can be handled automatically by the API.

Also, ME_NEW_* helpers are not explicitly associated to generic metric
definition (as it was already the case before) to avoid ambiguities.
It may change in the future as we may need to use the new definition
method to define static metrics (without the generic bit set). But for
now it isn't the case as this need definition was implemented for generic
metrics support in the first place. If we want to define static metrics
using the API, we could add a new set of helpers for instance.
2025-03-20 11:37:21 +01:00
William Lallemand
2fb6270910 MEDIUM: ssl/ckch: make the ckch_conf more generic
The ckch_store_load_files() function makes specific processing for
PARSE_TYPE_STR as if it was a type only used for paths.

This patch changes a little bit the way it's done,
PARSE_TYPE_STR is only meant to strdup() a string and stores the
resulting pointer in the ckch_conf structure.

Any processing regarding the path is now done in the callback.

Since the callbacks were basically doing the same thing, they were
transformed into the DECLARE_CKCH_CONF_LOAD() macros which allows to
do some templating of these functions.

The resulting ckch_conf_load_* functions will do the same as before,
except they will also do the path processing instead of letting
ckch_store_load_files() do it, which means we don't need the "base"
member anymore in the struct ckch_conf_kws.
2025-03-19 18:08:40 +01:00
William Lallemand
b0ad777902 MINOR: tools: path_base() concatenates a path with a base path
With the SSL configuration, crt-base, key-base are often used, these
keywords concatenates the base path with the path when the path does not
start by  '/'.

This is done at several places in the code, so a function to do this
would be better to standardize the code.
2025-03-19 17:59:31 +01:00
William Lallemand
29b4b985c3 MINOR: jws: use jwt_alg type instead of a char
This patch implements the function EVP_PKEY_to_jws_algo() which returns
a jwt_alg compatible with the private key.

This value can then be passed to jws_b64_protected() and
jws_b64_signature() which modified to take an jwt_alg instead of a char.
2025-03-17 18:06:34 +01:00
William Lallemand
de67f25a7e MINOR: jws: add new functions in jws.h
Add signatures of jws_b64_payload(), jws_b64_protected(),
jws_b64_signature(), jws_flattened() which allows to create a complete
JWS flattened object.
2025-03-17 11:51:52 +01:00
Willy Tarreau
156430ceb6 MINOR: cpu-topo: add a CPU policy setting to the global section
We'll need to let the user decide what's best for their workload, and in
order to do this we'll have to provide tunable options. For that, we're
introducing struct ha_cpu_policy which contains a name, a description
and a function pointer. The purpose will be to use that function pointer
to choose the best CPUs to use and now to set the number of threads and
thread-groups, that will be called during the thread setup phase. The
only supported policy for now is "none" which doesn't set/touch anything
(i.e. all available CPUs are used).
2025-03-14 18:33:16 +01:00
Willy Tarreau
c93ee25054 MINOR: cpu-topo: add "only-node" and "drop-node" to cpu-set
These are processed after the topology is detected, and they allow to
restrict binding to or evict CPUs matching the indicated node(s).
2025-03-14 18:33:16 +01:00
Willy Tarreau
aa4776210b MINOR: cpu-topo: create an array of the clusters
The goal here is to keep an array of the known CPU clusters, because
we'll use that often to decide of the performance of a cluster and
its relevance compared to other ones. We'll store the number of CPUs
in it, the total capacity etc. For the capacity, we count one unit
per core, and 1/3 of it per extra SMT thread, since this is roughly
what has been measured on modern CPUs.

In order to ease debugging, they're also dumped with -dc.
2025-03-14 18:30:31 +01:00
Willy Tarreau
4a6eaf6c5e MINOR: cpu-topo: add a function to sort by cluster+capacity
The purpose here is to detect heterogenous clusters which are not
properly reported, based on the exposed information about the cores
capacity. The algorithm here consists in sorting CPUs by capacity
within a cluster, and considering as equal all those which have 5%
or less difference in capacity with the previous one. This allows
large clusters of more than 5% total between extremities, while
keeping apart those where the limit is more pronounced. This is
quite common in embedded environments with big.little systems, as
well as on some laptops.
2025-03-14 18:30:31 +01:00
Willy Tarreau
d169758fa9 MINOR: cpu-topo: make sure we don't leave unassigned IDs in the cpu_topo
It's important that we don't leave unassigned IDs in the topology,
because the selection mechanism is based on index-based masks, so an
unassigned ID will never be kept. This is particularly visible on
systems where we cannot access the CPU topology, the package id, node id
and even thread id are set to -1, and all CPUs are evicted due to -1 not
being set in the "only-cpu" sets.

Here in new function "cpu_fixup_topology()", we assign them with the
smallest unassigned value. This function will be used to assign IDs
where missing in general.
2025-03-14 18:30:31 +01:00
Willy Tarreau
af648c7b58 MINOR: cpu-topo: assign clusters to cores without and renumber them
Due to the previous commit we can end up with cores not assigned
any cluster ID. For this, at the end we sort the CPUs by topology
and assign cluster IDs to remaining CPUs based on pkg/node/llc.
For example an 14900 now shows 5 clusters, one for the 8 p-cores,
and 4 of 4 e-cores each.

The local cluster numbers are per (node,pkg) ID so that any rule could
easily be applied on them, but we also keep the global numbers that
will help with thread group assignment.

We still need to force to assign distinct cluster IDs to cores
running on a different L3. For example the EPYC 74F3 is reported
as having 8 different L3s (which is true) and only one cluster.

Here we introduce a new function "cpu_compose_clusters()" that is called
from the main init code just after cpu_detect_topology() so that it's
not OS-dependent. It deals with this renumbering of all clusters in
topology order, taking care of considering any distinct LLC as being
on a distinct cluster.
2025-03-14 18:30:31 +01:00
Willy Tarreau
a4471ea56d MINOR: cpu-topo: implement a CPU sorting mechanism by cluster ID
This will be used to detect and fix incorrect setups which report
the same cluster ID for multiple L3 instances.

The arrangement of functions in this file is becoming a real problem.
Maybe we should move all this to cpu_topo for example, and better
distinguish OS-specific and generic code.
2025-03-14 18:30:31 +01:00
Willy Tarreau
a8acdbd9fd MINOR: cpu-topo: implement a sorting mechanism by CPU locality
Once we've kept only the CPUs we want, the next step will be to form
groups and these ones are based on locality. Thus we'll have to sort by
locality. For now the locality is only inferred by the index. No grouping
is made at this point. For this we add the "cpu_reorder_by_locality"
function with a locality-based comparison function.
2025-03-14 18:30:31 +01:00
Willy Tarreau
18133a054d MINOR: cpu-topo: implement a sorting mechanism for CPU index
CPU selection will be performed by sorting CPUs according to
various criteria. For dumps however, that's really not convenient
and we'll need to reorder the CPUs according to their index only.
This is what the new function cpu_reorder_by_index() does. It's
called  in thread_detect_count() before dumping the CPU topology.
2025-03-14 18:30:31 +01:00
Willy Tarreau
1af4942c95 MEDIUM: thread: start to detect thread groups and threads min/max
By mutually refining the thread count and group count, we can try
to detect the most suitable setup for the current machine. Taskset
is implicitly handled correctly. tgroups automatically adapt to the
configured number of threads. cpu-map manages to limit tgroups to
the smallest supported value.

The thread-limit is enforced. Just like in cfgparse, if the thread
count was forced to a higher value, it's reduced and a warning is
emitted. But if it was not set, the thr_max value is bound to this
limit so that further calculations respect it.

We continue to default to the max number of available threads and 1
tgroup by default, with the limit. This normally allows to get rid
of that test in check_config_validity().
2025-03-14 18:30:30 +01:00
Willy Tarreau
f0661e79fe MINOR: global: add a command-line option to enable CPU binding debugging
During development, everything related to CPU binding and the CPU topology
is debugged using state dumps at various places, but it does make sense to
have a real command line option so that this remains usable in production
to help users figure why some CPUs are not used by default. Let's add
"-dc" for this. Since the list of global.tune.options values is almost
full and does not 100% match this option, let's add a new "tune.debug"
field for this.
2025-03-14 18:30:30 +01:00
Willy Tarreau
ac1db9db7d MINOR: thread: turn thread_cpu_mask_forced() into an init-time variable
The function is not convenient because it doesn't allow us to undo the
startup changes, and depending on where it's being used, we don't know
whether the values read have already been altered (this is not the case
right now but it's going to evolve).

Let's just compute the status during cpu_detect_usable() and set a
variable accordingly. This way we'll always read the init value, and
if needed we can even afford to reset it. Also, placing it in cpu_topo.c
limits cross-file dependencies (e.g. threads without affinity etc).
2025-03-14 18:30:30 +01:00
Willy Tarreau
7cb274439b MINOR: cpu-topo: add CPU topology detection for linux
This uses the publicly available information from /sys to figure the cache
and package arrangements between logical CPUs and fill ha_cpu_topo[], as
well as their SMT capabilities and relative capacity for those which expose
this. The functions clearly have to be OS-specific.
2025-03-14 18:30:30 +01:00
Willy Tarreau
8f72ce335a MINOR: cpu-topo: add detection of online CPUs on Linux
This adds a generic function ha_cpuset_detect_online() which for now
only supports linux via /sys. It fills a cpuset with the list of online
CPUs that were detected (or returns a failure).
2025-03-14 18:30:30 +01:00
Willy Tarreau
8c524c7c9d REORG: cpu-topo: move bound cpu detection from cpuset to cpu-topo
The cpuset files are normally used only for cpu manipulations. It happens
that the initial CPU binding detection was initially placed there since
there was no better place, but in practice, being OS-specific, it should
really be in cpu-topo. This simplifies cpuset which doesn't need to know
about the OS anymore.
2025-03-14 18:30:30 +01:00
Willy Tarreau
a6fdc3eaf0 MINOR: cpu-topo: update CPU topology from excluded CPUs at boot
Now before trying to resolve the thread assignment to groups, we detect
which CPUs are not bound at boot so that we can mark them with
HA_CPU_F_EXCLUDED. This will be useful to better know on which CPUs we
can count later. Note that we purposely ignore cpu-map here as we
don't know how threads and groups will map to cpu-map entries, hence
which CPUs will really be used.

It's important to proceed this way so that when we have no info we
assume they're all available.
2025-03-14 18:30:30 +01:00
Willy Tarreau
bdb731172c MINOR: cpu-topo: add a function to dump CPU topology
The new function cpu_dump_topology() will centralize most debugging
calls, and it can make efforts of not dumping some possibly irrelevant
fields (e.g. non-existing cache levels).
2025-03-14 18:30:30 +01:00
Willy Tarreau
041462c4af MINOR: cpu-topo: rely on _SC_NPROCESSORS_CONF to trim maxcpus
We don't want to constantly deal with as many CPUs as a cpuset can hold,
so let's first try to trim the value to what the system claims to support
via _SC_NPROCESSORS_CONF. It is obviously still subject to the limit of
the cpuset size though. The value is stored globally so that we can
reuse it elsewhere after initialization.
2025-03-14 18:30:30 +01:00
Willy Tarreau
656cedad42 MINOR: cpu-topo: allocate and initialize the ha_cpu_topo array.
This does the bare minimum to allocate and initialize a global
ha_cpu_topo array for the number of supported CPUs and release
it at deinit time.
2025-03-14 18:30:30 +01:00
Willy Tarreau
d165f5d3ab MINOR: cpu-topo: add ha_cpu_topo definition
This structure will be used to store information about each CPU's
topology (package ID, L3 cache ID, NUMA node ID etc). This will be used
in conjunction with CPU affinity setting to try to perform a mostly
optimal binding between threads and CPU numbers by default. Since it
was noticed during tests that absolutely none of the many machines
tested reports different die numbers, the die_id is not stored.
Also, it was found along experiments that the cluster ID will be used
a lot, half of the time as a node-local identifier, and half of the
time as a global identifier. So let's store the two versions at once
(cl_gid, cl_lid).

Some flags are added to indicate causes of exclusion (offline, excluded
at boot, excluded by rules, ignored by policy).
2025-03-14 18:30:30 +01:00
Willy Tarreau
69ac4cd315 MINOR: compiler: add a new __decl_thread_var() macro to declare local variables
__decl_thread() already exists but is more suited for struct members.
When using it in a variables block, it appends the final trailing
semi-colon which is a statement that ends the variable block. Better
clean this up and have one precisely for variable blocks. In this
case we can simply define an unused enum value that will consume the
semi-colon. That's what the new macro __decl_thread_var() does.
2025-03-12 18:08:12 +01:00
Willy Tarreau
bb4addabb7 MINOR: compiler: add a simple macro to concatenate resolved strings
It's often useful to be able to concatenate strings after resolving
them (e.g. __FILE__, __LINE__ etc). Let's just have a CONCAT() macro
to do that, which calls _CONCAT() with the same arguments to make
sure the contents are resolved before being concatenated.
2025-03-12 18:06:55 +01:00
Aurelien DARRAGON
003fe530ae MINOR: log: add "option host" log-forward option
add only the parsing part, options are currently unused
2025-03-12 10:51:35 +01:00
Aurelien DARRAGON
47f14be9f3 MINOR: tools: only print address in sa2str() when port == -1
Support special value for port in sa2str: if port is equal to -1, only
print the address without the port, also ignoring <map_ports> value.
2025-03-12 10:51:20 +01:00
Aurelien DARRAGON
bc76f6dde9 MINOR: log: migrate log-forward options from proxy->options2 to options3
Migrate recently added log-forward section options, currently stored under
proxy->options2 to proxy->options3 since proxy->options2 is running out of
space and we plan on adding more log-forward options.
2025-03-12 10:50:03 +01:00
Aurelien DARRAGON
cc5a66212d MINOR: proxy: add proxy->options3
proxy->options2 is almost full, yet we will add new log-forward options
in upcoming patches so we anticipate that by adding a new {no_}options3
and cfg_opts3[] to further extend proxy options
2025-03-12 10:49:36 +01:00
Amaury Denoyelle
dc7913d814 MAJOR: mux-quic: increase stream flow-control for multi-buffer alloc
Support for multiple Rx buffers per QCS instance has been introduced by
previous patches. However, due to flow-control initial values, client
were still unable to fully used this to increase their upload
throughput.

This patch increases max-stream-data-bidi-remote flow-control initial
values. A new define QMUX_STREAM_RX_BUF_FACTOR will fix the number of
concurrent buffers allocable per QCS. It is set to 90.

Note that connection flow-control initial value did not changed. It is
still configured to be equivalent to bufsize multiplied by the maximum
concurrent streams. This ensures that Rx buffers allocation is still
constrained per connection, so that it won't be possible to have all
active QCS instances using in parallel their maximum Rx buffers count.
2025-03-07 12:06:27 +01:00
Amaury Denoyelle
a4f31ffeeb MINOR: mux-quic: store QCS Rx buf in a single-entry tree
Convert QCS rx buffer pointer to a tree container. Additionnaly, offset
field of qc_stream_rxbuf is thus transformed into a node tree.

For now, only a single Rx buffer is stored at most in QCS tree. Multiple
Rx buffers will be implemented in a future patch to improve QUIC clients
upload throughput.
2025-03-07 12:06:26 +01:00
Amaury Denoyelle
cc3c2d1f12 MINOR: mux-quic: define rxbuf wrapper
Define a new type qc_stream_rxbuf. This is used as a wrapper around QCS
Rx buffer with encapsulation of the ncbuf storage. It is allocated via a
new pool. Several functions are adapted to be able to deal with
qc_stream_rxbuf as a wrapper instead of the previous plain ncbuf
instance.

No functional change should happen with this patch. For now, only a
single qc_stream_rxbuf can be instantiated per QCS. However, this new
type will be useful to implement multiple Rx buffer storage in a future
commit.
2025-03-07 12:06:26 +01:00
Amaury Denoyelle
4b1e63d191 MINOR: mux-quic: define globally stream rxbuf size
QCS uses ncbuf for STREAM data storage. This serves as a limit for
maximum STREAM buffering capacity, advertised via QUIC transport
parameters for initial flow-control values.

Define a new function qmux_stream_rx_bufsz() which can be used to
retrieve this Rx buffer size. This can be used both in MUX/H3 layers and
in QUIC transport parameters.
2025-03-07 12:06:26 +01:00
Amaury Denoyelle
861b11334c MINOR: h3/hq-interop: restore function for standalone FIN receive
Previously, a function qcs_http_handle_standalone_fin() was implemented
to handle a received standalone FIN, bypassing app_ops layer decoding.
However, this was removed as app_ops layer interaction is necessary. For
example, HTTP/3 checks that FIN is never sent on the control uni stream.

This patch reintroduces qcs_http_handle_standalone_fin(), albeit in a
slightly diminished version. Most importantly, it is now the
responsibility of the app_ops layer itself to use it, to avoid the
shortcoming described above.

The main objective of this patch is to be able to support standalone FIN
in HTTP/0.9 layer. This is easily done via the reintroduction of
qcs_http_handle_standalone_fin() usage. This will be useful to perform
testing, as standalone FIN is a corner case which can easily be broken.
2025-03-07 12:06:26 +01:00
Valentine Krasnobaeva
e900ef987e BUG/MEIDUM: startup: return to initial cwd only after check_config_validity()
In check_config_validity() we evaluate some sample fetch expressions
(log-format, server rules, etc). These expressions may use external files like
maps.

If some particular 'default-path' was set in the global section before, it's no
longer applied to resolve file pathes in check_config_validity(). parse_cfg()
at the end of config parsing switches back to the initial cwd.

This fixes the issue #2886.

This patch should be backported in all stable versions since 2.4.0, including
2.4.0.
2025-03-06 10:49:48 +01:00
Roberto Moreda
f98b5c4f59 MINOR: log: add dont-parse-log and assume-rfc6587-ntf options
This commit introduces the dont-parse-log option to disable log message
parsing, allowing raw log data to be forwarded without modification.

Also, it adds the assume-rfc6587-ntf option to frame log messages
using only non-transparent framing as per RFC 6587. This avoids
missparsing in certain cases (mainly with non RFC compliant messages).

The documentation is updated to include details on the new options and
their intended use cases.

This feature was discussed in GH #2856
2025-03-06 09:30:39 +01:00
Aurelien DARRAGON
0746f6bde0 MINOR: cfgparse-listen: add and use cfg_parse_listen_match_option() helper
cfg_parse_listen_match_option() takes cfg_opt array as parameter, as well
current args, expected mode and cap bitfields.

It is expected to be used under cfg_parse_listen() function or similar.
Its goal is to remove code duplication around proxy->options and
proxy->options2 handling, since the same checks are performed for the
two. Also, this function could help to evaluate proxy options for
mode-specific proxies such as log-forward section for instance:
by giving the expected mode and capatiblity as input, the function
would only match compatible options.
2025-03-06 09:30:18 +01:00
Aurelien DARRAGON
d9aa199100 MINOR: proxy: make pr_mode enum bitfield compatible
Current pr_mode enum is a regular enum because a proxy only supports one
mode at a time. However it can be handy for a function to be given a
list of compatible modes for a proxy, and we can't do that using a
bitfield because pr_mode is not bitfield compatible (values share
the same bits).

In this patch we manually define pr_mode values so that they are all
using separate bits and allows a function to take a bitfield of
compatible modes as parameter.
2025-03-06 09:30:11 +01:00
Olivier Houchard
335ef3264b DEBUG: init: Add a macro to register unit tests
Add a new macro, REGISTER_UNITTEST(), that will automatically make sure
we call hap_register_unittest(), instead of having to create a function
that will do so.
2025-03-04 18:18:10 +01:00
William Lallemand
a647839954 DEBUG: init: add a way to register functions for unit tests
Doing unit tests with haproxy was always a bit difficult, some of the
function you want to test would depend on the buffer or trash buffer
initialisation of HAProxy, so building a separate main() for them is
quite hard.

This patch adds a way to register a function that can be called with the
"-U" parameter on the command line, will be executed just after
step_init_1() and will exit the process with its return value as an exit
code.

When using the -U option, every keywords after this option is passed to
the callback and could be used as a parameter, letting the capability to
handle complex arguments if required by the test.

HAProxy need to be built with DEBUG_UNIT to activate this feature.
2025-03-03 12:43:32 +01:00
William Lallemand
4dc0ba233e MINOR: jws: implement a JWK public key converter
Implement a converter which takes an EVP_PKEY and converts it to a
public JWK key. This is the first step of the JWS implementation.

It supports both EC and RSA keys.

Know to work with:

- LibreSSL
- AWS-LC
- OpenSSL > 1.1.1
2025-03-03 12:43:32 +01:00
Willy Tarreau
730641f7ca BUG/MINOR: server: check for either proxy-protocol v1 or v2 to send hedaer
As reported in issue #2882, using "no-send-proxy-v2" on a server line does
not properly disable the use of proxy-protocol if it was enabled in a
default-server directive in combination with other PP options. The reason
for this is that the sending of a proxy header is determined by a test on
srv->pp_opts without any distinction, so disabling PPv2 while leaving other
options results in a PPv1 header to be sent.

Let's fix this by explicitly testing for the presence of either send-proxy
or send-proxy-v2 when deciding to send a proxy header.

This can be backported to all versions. Thanks to Andre Sencioles (@asenci)
for reporting the issue and testing the fix.
2025-03-03 04:05:47 +01:00
Olivier Houchard
706b008429 MEDIUM: servers: Add strict-maxconn.
Maxconn is a bit of a misnomer when it comes to servers, as it doesn't
control the maximum number of connections we establish to a server, but
the maximum number of simultaneous requests. So add "strict-maxconn",
that will make it so we will never establish more connections than
maxconn.
It extends the meaning of the "restricted" setting of
tune.takeover-other-tg-connections, as it will also attempt to get idle
connections from other thread groups if strict-maxconn is set.
2025-02-26 13:00:18 +01:00
Olivier Houchard
8de8ed4f48 MEDIUM: connections: Allow taking over connections from other tgroups.
Allow haproxy to take over idle connections from other thread groups
than our own. To control that, add a new tunable,
tune.takeover-other-tg-connections. It can have 3 values, "none", where
we won't attempt to get connections from the other thread group (the
default), "restricted", where we only will try to get idle connections
from other thread groups when we're using reverse HTTP, and "full",
where we always try to get connections from other thread groups.
Unless there is a special need, it is advised to use "none" (or
restricted if we're using reverse HTTP) as using connections from other
thread groups may have a performance impact.
2025-02-26 13:00:18 +01:00
Olivier Houchard
c36aae2af1 MINOR: pollers: Add a fixup_tgid_takeover() method.
Add a fixup_tgid_takeover() method to pollers for which it makes sense
(epoll, kqueue and evport). That method can be called after a takeover
of a fd from a different thread group, to make sure the poller's
internal structure reflects the new state.
2025-02-26 13:00:18 +01:00
Olivier Houchard
c5cc09c00d MINOR: fd: Add fd_lock_tgid_cur().
Add fd_lock_tgid_cur(), a function that will lock the tgid, without
modifying its value.
2025-02-26 13:00:18 +01:00
Olivier Houchard
52b97ff8dd MEDIUM: fd: Wait if locked in fd_grab_tgid() and fd_take_tgid().
Wait while the tgid is locked in fd_grab_tgid() and fd_take_tgid().
As that lock is barely used, it should have no impact.
2025-02-26 13:00:18 +01:00
Willy Tarreau
fb7874c286 MINOR: tinfo: split the signal handler report flags into 3
While signals are not recursive, one signal (e.g. wdt) may interrupt
another one (e.g. debug). The problem this causes is that when leaving
the inner handler, it removes the outer's flag, hence the protection
that comes with it. Let's just have 3 distinct flags for regular signals,
debug signal and watchdog signal. We add a 4th definition which is an
aggregate of the 3 to ease testing.
2025-02-24 13:37:52 +01:00
Vincent Dechenaux
9011b3621b MINOR: compression: Introduce minimum size
This is the introduction of "minsize-req" and "minsize-res".
These two options allow you to set the minimum payload size required for
compression to be applied.
This helps save CPU on both server and client sides when the payload does
not need to be compressed.
2025-02-22 11:32:40 +01:00
Willy Tarreau
29e246a84c MINOR: freq_ctr: provide non-blocking read functions
Some code called by the debug handlers in the context of a signal handler
accesses to some freq_ctr and occasionally ends up on a locked one from
the same thread that is dumping it. Let's introduce a non-blocking version
that at least allows to return even if the value is in the process of being
updated, it's less problematic than hanging.
2025-02-21 18:26:29 +01:00
Willy Tarreau
ddd173355c MINOR: tinfo: add a new thread flag to indicate a call from a sig handler
Signal handlers must absolutely not change anything, but some long and
complex call chains may look innocuous at first glance, yet result in
some subtle write accesses (e.g. pools) that can conflict with a running
thread being interrupted.

Let's add a new thread flag TH_FL_IN_SIG_HANDLER that is only set when
entering a signal handler and cleared when leaving them. Note, we're
speaking about real signal handlers (synchronous ones), not deferred
ones. This will allow some sensitive call places to act differently
when detecting such a condition, and possibly even to place a few new
BUG_ON().
2025-02-21 17:41:38 +01:00
Aurelien DARRAGON
9561b9fb69 BUG/MINOR: sink: add tempo between 2 connection attempts for sft servers
When the connection for sink_forward_{oc}_applet fails or a previous one
is destroyed, the sft->appctx is instantly released.

However process_sink_forward_task(), which may run at any time, iterates
over all known sfts and tries to create sessions for orphan ones.

It means that instantly after sft->appctx is destroyed, a new one will
be created, thus a new connection attempt will be made.

It can be an issue with tcp log-servers or sink servers, because if the
server is unavailable, process_sink_forward() will keep looping without
any temporisation until the applet survives (ie: connection succeeds),
which results in unexpected CPU usage on the threads responsible for
that task.

Instead, we add a tempo logic so that a delay of 1second is applied
between two retries. Of course the initial attempt is not delayed.

This could be backported to all stable versions.
2025-02-21 11:22:35 +01:00
Christopher Faulet
851e52b551 BUG/MEDIUM: spoe/mux-spop: Introduce an NOOP action to deal with empty ACK
In the SPOP protocol, ACK frame with empty payload are allowed. However, in
that case, because only the payload is transferred, there is no data to
return to the SPOE applet. Only the end of input is reported. Thus the
applet is never woken up. It means that the SPOE filter will be blocked
during the processing timeout and will finally return an error.

To workaournd this issue, a NOOP action is introduced with the value 0. It
is only an internal action for now. It does not exist in the SPOP
protocol. When an ACK frame with an empy payload is received, this noop
action is transferred to the SPOE applet, instead of nothing. Thanks to this
trick, the applet is properly notified. This works because unknown actions
are ignored by the SPOE filter.

This patch must be backported to 3.1.
2025-02-20 11:56:27 +01:00
Amaury Denoyelle
06e7674399 MINOR: mux-quic/h3: emit SETTINGS via MUX tasklet handler
Previously, QUIC MUX application layer was installed and initialized via
MUX init. However, the latter stage involve I/O operations, for example
when using HTTP/3 with the emission of a SETTINGS frame.

Change this to prevent any I/O operations during MUX init. As such,
finalize app_ops callback is now called during the first invokation of
qcc_io_send(), in the context of MUX tasklet. To implement this, a new
application state value is added, to detect the transition from NULL to
INIT stage.
2025-02-19 11:03:40 +01:00
Amaury Denoyelle
188fc45b95 MINOR: mux-quic: define a QCC application state member
Introduce a new QCC field to track the current application layer state.
For the moment, only INIT and SHUT state are defined. This allows to
replace the older flag QC_CF_APP_SHUT.

This commit does not bring major changes. It is only necessary to permit
future evolutions on QUIC MUX. The only noticeable change is that QMUX
traces can now display this new field.
2025-02-19 10:59:53 +01:00
William Lallemand
69163cd63e MINOR: ssl/crtlist: split the ckch_conf loading from the crtlist line parsing
ckch_conf loading is not that simple as it requires to check
- if the cert already exists in the ckchs_tree
- if the ckch_conf is compatible with an existing cert in ckchs_tree
- if the cert is a bundle which need to load multiple ckch_store

This logic could be reuse elsewhere, so this commit introduce the new
crtlist_load_crt() function which does that.
2025-02-17 18:26:37 +01:00
Amaury Denoyelle
32691e7c25 MINOR: quic: support frame type as a varint
QUIC frame type is encoded as a variable-length integer. Thus, 64-bit
integer should be used for them. Currently, this was not the case as
type was represented as a 1-byte char inside quic_frame structure. This
does not cause any issue with QUIC from RFC9000, as all frame types fit
in this range. Furthermore, a QUIC implementation is required to use the
smallest size varint when encoding a frame type.

However, the current code is unable to accept QUIC extension with bigger
frame types. This is notably the case for quic-on-streams draft. Thus,
this commit readjusts quic_frame architecture to be able to support
higher frame type values.

First, type field of quic_frame is changed to a 64-bits variable. Both
encoding and decoding frame functions uses variable-length integer
helpers to manipulate the frame type field.

Secondly, the quic_frame builders/parsers infrastructure is still
preserved. However, it could be impossible to define new large frame
type as an index into quic_frame_builders / quic_frame_parsers arrays.
Thus, wrapper functions are now provided to access the builders and
parsers. Both qf_builder() and qf_parser() wrappers can then be extended
to return custom builder/parser instances for larger frame type.

Finally, unknown frame type detection also uses the new wrapper
quic_frame_is_known(). As with builders/parsers, for large frame type,
this function must be manually completed to support a new type value.
2025-02-14 09:00:05 +01:00
William Lallemand
7034f2ca48 MINOR: ssl: store the filenames resulting from a lookup in ckch_conf
With this patch, files resulting from a lookup (*.key, *.ocsp,
*.issuer etc) are now stored in the ckch_conf.

It allows to see the original filename from where it was loaded in "show
ssl cert <filename>"
2025-02-13 17:44:00 +01:00
Amaury Denoyelle
f96af8e463 MINOR: quic: refactor STREAM encoding and splitting
CRYPTO and STREAM frames encoding is similar. If payload is too large,
frame will be splitted and only the first payload part will be written
in the output QUIC packet. This process is complexified by the presence
of a variable-length integer Length field prior to the payload.

This commit aims at refactor these operations. Define two functions to
simplify the code :
* quic_strm_frm_fillbuf() which is used to calculate the optimal frame
  length of a STREAM/CRYPTO frame with its payload in a buffer
* quic_strm_frm_split() which is used to split the frame payload if
  buffer is too small

With this patch, both functions are now implemented for STREAM encoding.
2025-02-12 15:10:03 +01:00
William Lallemand
4de86bbbfc MEDIUM: initcall: allow to register mutiple post_section_parser per section
Before this patch, REGISTER_CONFIG_SECTION() allowed to register one and only
one callback (<post>) called after the parsing of a section.

It was limitating because you couldn't register a post callback from anywhere
else in the code.

This patch introduces the new REGISTER_CONFIG_SECTION_POST() macros which allows
to register a new post callback for a section keyword from anywhere.

This patch introduces the feature by allowing `struct cfg_section` entries that
does not have a `section_parser`, and then iterating on all cfg_section with a
post_section_parser for a keyword.
2025-02-12 12:52:41 +01:00
Amaury Denoyelle
731340afbd MINOR: quic: simplify length calculation for STREAM/CRYPTO frames
STREAM and CRYPTO frames have a similar encoding format. In particular,
both of them have a variable-length integer Length field just before the
frame payload.

It is complex to determine the optimal Length value before copying the
payload data in the remaining buffer space. As such, helper functions
were implemented to calculate this. However, CRYPTO and STREAM frames
encoding implementation were not completely aligned, which renders the
code harder to follow.

The purpose of this commit is to simplify CRYPTO and STREAM frames
encoding. First, a new helper quic_int_cap_length() is defined which is
useful to determine the optimal buffer room available if prefixed by a
variable-length integer as Length field. Then, processing of both CRYPTO
and STREAM frames is now nearly identical, based on this new helper
function. Functions max_available_room() and max_stream_data_size() are
now unused and are removed.
2025-02-12 11:51:09 +01:00
Willy Tarreau
b6a8318cc2 MEDIUM: server: allocate a tasklet for asyncronous requeuing
This creates a tasklet that only expects to be called when the LB
algorithm is under contention when trying to reposition the server
in its tree. Indeed, that's one of the operations that usually
requires to take a write lock on a highly contended area, often
for very little benefits under contention; indeed, under load, if
a server keeps its previous position for a few extra microseconds,
usually there's no harm. Thus this new tasklet can be woken up by
the LB algo to ask the server to later call lbprm.server_requeue().
It does nothing else.
2025-02-11 17:24:09 +01:00
Willy Tarreau
20b8c4ddba MINOR: lbprm: add a new callback ->server_requeue to the lbprm
This callback will be used to reposition a server to its expected
position regardless of the fact that it was taken or dropped. It
will only be used by supporting LB algos. For now, only fwlc defines
it and assigns it to fwlc_srv_reposition(). At the moment it's not
used yet.
2025-02-11 17:16:14 +01:00
Willy Tarreau
eced1d6d8a DEBUG: thread: reduce the struct lock_stat to store only 30 buckets
Storing only 30 buckets means we only keep 256 bytes per label. This
further simplifies address calculation and reduces the memory used
without complicating the locking code. It means we won't measure wait
times larger than a second but we're not supposed to face this as it
would trigger the watchdog anyway. It may become a little bit just if
measuring using rdtsc() instead of now_mono_time() though (typically
the limit would be around 350ms for a 3 GHz CPU).
2025-02-10 18:34:43 +01:00
Willy Tarreau
c2f2d6fd3c DEBUG: thread: make lock_stat per operation instead of for all operations
It's more convenient (and more readable) to have the lock stats arranged
by operation type (read, seek, write). It will also allow to later simplify
the structure format and the bucket address calculation. Now lock_stat[]
got split into lock_stats_rd[], lock_stats_sk[], lock_stats_wr[].
2025-02-10 18:34:43 +01:00
Willy Tarreau
4168d1278c DEBUG: thread: don't keep the redundant _locked counter
Now that we have our sums by bucket, the _locked counter is redundant
since it's always equal to the sum of all entries. Let's just get rid
of it and replace its consumption with a loop over all buckets, this
will reduce the overhead of taking each lock at the expense of a tiny
extra effort when dumping all locks, which we don't care about.
2025-02-10 18:34:43 +01:00
Willy Tarreau
a22550fbd7 DEBUG: thread: report the wait time buckets for lock classes
In addition to the total/average wait time, we now also store the
wait time in 2^N buckets. There are 32 buckets for each type (read,
seek, write), allowing to store wait times from 1-2ns to 2.1-4.3s,
which is quite sufficient, even if we'd want to switch from NS to
CPU cycles in the future. The counters are only reported for non-
zero buckets so as not to visually pollute the output.

This significantly inflates the lock_stat struct, which is now
aligned to 256 bytes and rounded up to 1kB. But that's not really
a problem, given that there's only one per lock label.
2025-02-10 18:34:43 +01:00
Willy Tarreau
7ddcdff33f BUG/MEDIUM: debug: close a possible race between thread dump and panic()
The rework of the thread dumping mechanism in 2.8 with commit 9a6ecbd590
("MEDIUM: debug: simplify the thread dump mechanism") opened a small
race, which is that a thread in the process of dumping other ones may
block the other one from panicing while it's looping at the end of
ha_thread_dump_fill(), or any other sequence involving the currently
dumped one.

This was emphasized in 3.1 with commit 148eb5875f ("DEBUG: wdt: better
detect apparently locked up threads and warn about them") that allowed
to emit warnings about long-stuck threads, because in this case, what
happens is that sometimes a thread starts to emit a warning (or a set
of warnings), and while the warning is being awaited for, a panic
finally happens and interrupts either the dumping thread, which never
finishes and waits for the target's pointer to become NULL which will
never happen since it was supposed to do it itself, or the currently
dumped thread which could wait for the dumping thread to become ready
while this one has not released the former.

In order to address this, first we now make sure never to dump a thread
that is already in the process of dumping another one. We're adding a
new thread flag to know this situation, that is set in ha_thread_dump_fill()
and cleared in ha_thread_dump_done(). And similarly, we don't trigger
the watchdog on a thread waiting for another one to finish its dump,
as it's likely a case of warning (and maybe even a panic) that makes
them wait for each other and we don't want such cases to be reentrant.
Finally, we check in the main polling loop that the flag never accidentally
leaked (e.g. wrong flag manipulation) as this would be difficult to spot
with bad consequences.

This should be backported at least to 2.8, and should resolve github
issue #2860. Thanks to Chris Staite for the very informative backtrace
that exhibited the problem.
2025-02-10 18:34:26 +01:00
Willy Tarreau
ae540e3d9c Revert "IMPORT: plock: export the uninlined version of the lock wait function"
This reverts commit 5496d06b2b.

It breaks the build on Windows which apparently doesn't support the weak
attribute well on functions. It's not big deal anyway, playing with build
options while debugging still works though it's less easy to use.
2025-02-07 19:51:15 +01:00
Willy Tarreau
b957e2f3ef IMPORT: plock: use cpu_relax() for a shorter time in EBO
Tests have shown that on modern CPUs it's interesting to wait a bit less
in cpu_relax(). Till now we were looping down to 60 iterations and then
switching to just barriers. Increasing the threshold to 90 iterations
left before getting out of the loop improved the average and max time
to grab a write lock by a few percent (e.g. 10% at 1us, 20% at 256ns
or lower). Higher values tend to progressively lose that gain so let's
stick to this one. This was measured on an EPYC 74F3 like previous
measurements that initially led to this value, and the value might
possibly depend on the mask applied to the loop counter.

This is plock commit 74ca0a7307fa6aec3139f27d3b7e534e1bdb748e.
2025-02-07 18:04:29 +01:00
Willy Tarreau
253fba01a7 IMPORT: plock: lower the slope of the exponential back-off
Along many tests involving both haproxy's scheduler and forwarded
traffic, various exponents and algorithms were attempted for the EBO
and their effects were measured. It was found that a growth in 1.25^N
limited to 128k cycles consistently gives a better latency than 1.5^N
limited to 256k cycles, without degrading general performance. The
measures of the time to grab a write lock on a 48-thread EPYC show
that the number of occurrences of low times was roughly multiplied by
2-3 while the number of occurrences of times above 64us was reduced
by similar factors, to even reach 300 at 64us and limiting the maximum
time by a factor of 4.

The other variants that were experimented with are:

  m = ((m + (m >> 1)) + 2) & 0x3ffff;            // original
  m = ((m + (m >> 1) + (m >> 3)) + 2) & 0x3ffff;
  m = ((m + (m >> 1) + (m >> 4)) + 2) & 0x3ffff;
  m = ((m + (m >> 1) + (m >> 4)) + 2) & 0x1ffff;
  m = ((m + (m >> 1) + (m >> 4)) + 1) & 0x1ffff;
  m = ((m + (m >> 2) + (m >> 4)) + 1) & 0x1ffff; // lowest CPU on pl_wr test + good perf
  m = ((m + (m >> 2)) + 1) & 0x1ffff;            // even lower cpu usage, lowest max
  m = ((m + (m >> 1) + (m >> 2)) + 1) & 0x1ffff; // correct but slightly higher maxes
  m = ((m + (m >> 1) + (m >> 3)) + 1) & 0x1ffff; // less good than m+m>>2
  m = ((m + (m >> 2) + (m >> 3)) + 1) & 0x1ffff; // better but not as good as m+m>>2
  m = ((m + (m >> 3) + (m >> 4)) + 1) & 0x1ffff; // less good, lower rates on small coounts.
  m = ((m + (m >> 2) + (m >> 3) + (m >> 4)) + 1) & 0x1ffff; // less good as well
  m = ((m & 0x7fff) + (m >> 1) + (m >> 4)) + 2;
  m = ((m & 0xffff) + (m >> 1) + (m >> 4)) + 2;

This is plock commit dddd9ee01c522da33c353e2e4d4fd743d8336ec3.
2025-02-07 18:04:29 +01:00
Willy Tarreau
9dd56da730 IMPORT: plock: give higher precedence to W than S
It was noticed in haproxy that in certain extreme cases, a write lock
subject to EBO may fail for a very long time in front of a large set
of readers constantly trying to upgrade to the S state. The reason is
that among many readers, one will succeed in its upgrade, and this
situation can last for a very long time with many readers upgrading
in turn, while the writer waits longer and longer before trying again.

Here we're taking a reasonable approach which is that the write lock
should have a higher precedence in its attempt to grab the lock. What
is done is that instead of fully rolling back in case of conflict with
a pure S lock, the writer will only release its read part in order to
let the S upgrade to W if needed, and finish its operations. This
guarantees no other seek/read/write can enter. Once the conflict is
resolved, the writer grabs the read part again and waits for readers
to be gone (in practice it could even return without waiting since we
know that any possible wanderers would leave or even not be there at
all, but it avoids a complicated loop code that wouldn't improve the
practical situation but inflate the code).

Thanks to this change, the maximum write lock latency on a 48 threads
AMD with aheavily loaded scheduler went down from 256 to 64 ms, and the
number of occurrences of 32ms or more was divided by 300, while all
occurrences of 1ms or less were multiplied by up to 3 (3 for the 4-16ns
cases).

This is plock commit b6a28366d156812f59c91346edc2eab6374a5ebd.
2025-02-07 18:04:29 +01:00
Willy Tarreau
5496d06b2b IMPORT: plock: export the uninlined version of the lock wait function
The inlining of the lock waiting function was made more easily
configurable with commit 7505c2e ("plock: always expose the inline
version of the lock wait function"). However, the standard one remained
static, but in order to resolve the symbols in "perf top", it's much
better to export it, so let's move "static" with "inline" and leave it
exported when PLOCK_INLINE_EBO is not set.

This is plock commit 3bea7812ec705b9339bbb0ed482a2cd8aa6c185c.
2025-02-07 18:04:29 +01:00
Christopher Faulet
eb4e517489 CLEANUP: mux-spop: Remove useless comments
Just a small cleanup to remove some comments added during the development of
the mux.
2025-02-06 11:19:32 +01:00
Christopher Faulet
d16c534511 MINOR: mux-spop: Report EOI on the SE when a ACK is received for a stream
The spop stream now reports the end of input when the ACK is transferred to
the SPOE applet. To do so, the flag SPOP_SF_ACK_RCVD was added. It is set on
the SPOP stream when its ACK is received by the SPOP connection.

In addition when SPOP stream flags are propagated to the SE, the error is
now reported if end of input was not reached instead of testing the
connection error code. It is more accurate.

This patch should be backported to 3.1.
2025-02-06 11:19:32 +01:00
Frederic Lecaille
85cb1cc7f4 BUILD: ssl: remove a boringssl definition defined by recent boringssl libs
This is the case for AWS-LC which derives from boringssl, where
X509_OBJECT_get0_X509_CRL() is already defined. There is definitively
no more need to define this function to build haproxy against TLS libs derived
from boringssl.
2025-02-06 10:48:25 +01:00
Aurelien DARRAGON
0846638f7f MEDIUM: stream: interrupt costly rulesets after too many evaluations
It is not rare to see configurations with a large number of "tcp-request
content" or "http-request" rules for instance. A large number of rules
combined with cpu-demanding actions (e.g.: actions that work on content)
may create thread contention as all the rules from a given ruleset are
evaluated under the same polling loop if the evaluation is not interrupted

Thus, in this patch we add extra logic around "tcp-request content",
"tcp-response content", "http-request" and "http-response" rulesets, so
that when a certain number of rules are evaluated under the single polling
loop, we force the evaluating function to yield. As such, the rule which
was about to be evaluated is saved, and the function starts evaluating
rules from the save pointer when it returns (in the next polling loop).

We use task_wakeup(task, TASK_WOKEN_MSG) to explicitly wake the task so
that no time is wasted and the processing is resumed ASAP. TASK_WOKEN_MSG
is mandatory here because process_stream() expects TASK_WOKEN_MSG for
explicit analyzers re-evaluation.

rules_bcount stream's attribute was added to count how manu rules were
evaluated since last interruption (yield). Also, SF_RULE_FYIELD flag
was added to know that the s->current_rule was assigned due to forced
yield and not regular yield.

By default haproxy will enforce a yield every 50 rules, this behavior
can be configured using the "tune.max-rules-at-once" global keyword.

There is a limitation though: for now, if the ACT_OPT_FINAL flag is set
on act_opts, we consider it is not safe to yield (as it is already the
case for automatic yield). In this case instead of yielding an taking
the risk of not being called back, we skip the yield and hope it will
not create contention. This is something we should ideally try to
improve in order to yield in all conditions.
2025-02-03 17:09:48 +01:00
Christopher Faulet
5f927f603a BUG/MEDIUM: mux-fcgi: Properly handle read0 on partial records
A Read0 event could be ignored by the FCGI multiplexer if it is blocked on a
partial record. Instead of handling the event, it remained blocked, waiting
for the end of the record.

To fix the issue, the same solution than the H2 multiplexer is used. Two
flags are introduced. The first one, FCGI_CF_END_REACHED, is used to
acknowledge a read0. This flag is set when a read0 was received AND the FCGI
multiplexer must handle it. The second one, FCGI_CF_DEM_SHORT_READ, is set
when the demux is interrupted on a partial record. A short read and a read0
lead to set the FCGI_CF_END_REACHED flag.

With these changes, the FCGI mux should be able to properly handle read0 on
partial records.

This patch should be backported to all stable versions after a period of
observation.
2025-02-03 07:49:50 +01:00
Christopher Faulet
71320fc9c1 MINOR: tevt/connection: Add support for POLL_HUP/POLL_ERR events
Connection errors can be detected via connect/recv/send syscall, but also
because it was reported by the poller. So dedicated events, at the FD level,
are introduced to make the difference.

term_events tool was updated accordingly.
2025-01-31 10:41:50 +01:00
Christopher Faulet
990854ee0d REORG: tevt/connection: Move enums at the end of the header file
Enums used to report events were placed in the connection header for
conveniance. But it is not specifically related to connection. So, they are
moved at the end of the file to have a better isolation.
2025-01-31 10:41:50 +01:00
Christopher Faulet
487d6b09f1 MINOR: tevt: Improve function to convert a termination events log to string
The function is now responsible to handle empty log because no event was
reported. In that case, an empty string is returned. It is also responsible to
handle case where termination events log is not supported for an given entity
(for instance the quic mux for now). In that case, a dash ("-") is returned.
2025-01-31 10:41:50 +01:00
Christopher Faulet
cbd898c42b MINOR: tevt: Don't duplicate termination event during reporting
It is hard to never detect the same event several time without painful
tests. In other words, the same termination event can be reported several
time and this must be handled. To do so, "tevt_report_event" macro is
updated to ignore an event if the last reported one is of the same type, for
the same location. Of course, if the same event is reported several times at
different moment, it will not be detected.
2025-01-31 10:41:50 +01:00
Christopher Faulet
2dc02f75b1 MEDIUM: tevt/stconn/stream: Add dedicated termination events for stream location
If it is the last patch to introduce dedicated termination events for each
location. In this one, events for the stream location are introcued. The old
enum is also removed because it is now unused.

Here, more accurate evets are added. The "intercepted" event was splitted.
2025-01-31 10:41:50 +01:00
Christopher Faulet
a58e650ad1 MEDIUM: tevt/muxes: Add dedicated termination events for muxc/se locations
Termination events dedicated to mux connection and stream-endpoint
descriptors are added in this patch. Specific events to these locations are
thus added. Changes for the H1 and H2 multiplexers are reviewed to be more
accurate.
2025-01-31 10:41:50 +01:00
Christopher Faulet
f2778ccc7d MINOR: tevt/connection: Add dedicated termination events for lower locations
To be able to add more accurate termination events for each location, the
enum will be splitted by location. Indeed, there are at most 16 possbile
events. It will be pretty confusing to use same termination events for the
different locations. So the best is to split them.

In this patch, the termination events for the fd, hs and xprt locations are
introduced. For now some holes are added to keep similar events aligned
across enums. But this may change in future.
2025-01-31 10:41:50 +01:00
Christopher Faulet
a4c281a190 MINOR: tevt/muxes: Add CTL and SCTL command to get the termination event logs
MUX_CTL_TEVTS command is added to get the termination event logs of a mux
connection and MUX_SCTL_TEVTS command to get the termination event logs of a
mux stream.
2025-01-31 10:41:50 +01:00
Christopher Faulet
00a07c8b54 MINOR: tevt/stream/stconn: Report termination events for stream and sc
In this patch, events for the stream location are reported. These events are
first reported on the corresponding stream-connector. So front events on scf
and back event on scb. Then all events are both merged in the stream. But
only 4 events are saved on the stream.

Several internal events are for now grouped with the type
"tevt_type_intercepted". More events will be added to have a better
resolution. But at least the place to report these events are identified.

For now, when a event is reported on a SC, it is also reported on the stream
and vice versa.
2025-01-31 10:41:50 +01:00
Christopher Faulet
992b4b9726 MINOR: tevt/stconn: Add a termination events log in the SE descriptor
This termination events log will be used to report events from the mux
streams. The location will be "tevt_loc_se" and the muxes will be
responsible to report the corresponding events.
2025-01-31 10:41:50 +01:00
Christopher Faulet
e944944990 MINOR: tevt: Add the termination events log's fundations
Termination events logs will be used to report the events that led to close
a connection. Unlike flags, that reflect a state, the idea here is to store
a log to preserve the order of the events. Most of time, when debugging an
issue, the order of the events is crucial to be able to understand the root
cause of the issue. The traces are trully heplful to do so. But it is not
always possible to active them because it is pretty verbose. On heavily
loaded platforms, it is not acceptable. We hope that the termination events
logs will help us in that situations.

One termination events log will be be store at each layer (connection, mux
connection, mux stream...) as a 32-bits integer. Each event will be store on
8 bits, 4 bits for the location and 4 bits for the type. So the first four
events will be stored only for each layer. It should be enough why a
connection is closed.

In this patch, the enums defining the termination event locations and types
are added. The macro to report a new event is also added and a function to
convert a termination events log to a string that could be display in log
messages for instance.
2025-01-31 10:41:49 +01:00
Christopher Faulet
e56e718c82 MINOR: mux-h1: Add masks to group H1S DEMUX and MUX errors
It is just a small patch to clean up mux/demux functions. Instead of listing
the H1S errors that must be handled during demux of mux operations, masks of
flags are used. It is more readable.
2025-01-31 10:41:49 +01:00
Willy Tarreau
d155924efe MINOR: fd: add a generation number to file descriptors
This patch adds a counter of close() on file descriptors in the fdtab.
The goal is to better detect if reported events concern the current or
a previous file descriptor. For now the counter is only added, and is
showed in "show fd" as "gen". We're reusing unused space at the end of
the struct. If it's needed for something more important later, this
patch can be reverted.
2025-01-30 19:45:34 +01:00
Willy Tarreau
44ac7a7e73 DEBUG: fd: add a counter of takeovers of an FD since it was last opened
That's essentially in order to help with debugging strange cases like
the occasional epoll issues/races, by keeping a counter of how many
times an FD was taken over since last inserted. The room is available
so let's use it. If it's needed later, this patch can easily be reverted.
The counter is also reported in "show fd" as "tkov".
2025-01-30 19:45:34 +01:00
Amaury Denoyelle
b849ee5fa3 BUILD: quic: fix overflow in global tune
A new global option was recently introduced to disable pacing. However,
the value used (1<<31) caused issue with some compiler as options field
used for storage is declared as int. Move pacing deactivation flag
outside into the newly defined quic_tune to fix this.

This should be backported up to 3.1 after a period of observation. Note
that it relied on the previous patch which defined new quic_tune type.
2025-01-30 18:12:53 +01:00
Amaury Denoyelle
09e9c7d5b7 MINOR: quic: define quic_tune
Define a new structure quic_tune. It will be useful to regroup various
configuration settings and tunable related to QUIC, instead of defining
them into the global structure.
2025-01-30 18:12:40 +01:00
Amaury Denoyelle
0c8b54b2d1 MINOR: quic: transform pacing settings into a global option
Pacing support was previously activated on each bind line individually,
via an optional argument of quic-cc-algo keyword. Remove this optional
argument and introduce a global setting to enable/disable pacing. Pacing
activation is still flagged as experimental.

One important change is that previously BBR usage automatically
activated pacing support. This is not the case anymore, so users should
now always explicitely activate pacing if BBR is selected. A new warning
message will be displayed if this is not the case.

Another consequence of this change is that now pacing_inter callback is
always defined for every quic_cc_algo types. As such, QUIC MUX uses
global.tune.options to determine if pacing is required.

This should be backported up to 3.1, after a period of observation.
2025-01-30 17:19:38 +01:00
William Lallemand
b43e5d8c16 BUILD: ssl: more cleaner approach to WolfSSL without renegotiation
Patch discussed in https://github.com/wolfSSL/wolfssl/issues/6834

When building Wolfssl without renegotiation options, WolfSSL still
defines the macros about it, which warns during the build.

This patch completes the previous one by undefining the macros so
haproxy could build without any warning.
2025-01-28 20:55:20 +01:00
William Lallemand
c6a8279cdf BUILD: ssl: allow to build without the renegotiation API of WolfSSL
In ticket https://github.com/wolfSSL/wolfssl/issues/6834, it was
suggested to push --enable-haproxy within --enable-distro.

WolfSSL does not want to include the renegotiation support in
--enable-distro.

To achieve this, let haproxy build without SSL_renegotiate_pending()
when wolfssl does not define HAVE_SECURE_RENEGOCIATION or
HAVE_SERVER_RENEGOCIATION_INFO.
2025-01-28 18:31:32 +01:00
Willy Tarreau
f17b0a994b BUILD: tools: fix build on BSD by dropping the ETIME check
Commit 44537379fc ("MINOR: tools: add errname to print errno macro
name") brought a facility to report errno using a symbolic string
when known instead of showing only the value. However, among the
listed options, ETIME is mentioned but is unknown from FreeBSD where
it breaks the build. Let's simply drop it, we don't use ETIME anyway
and even if it would be reported, the default code path still reports
the numeric value so there's no harm. If other ones fail to build in
the future, they could be handled the same way.
2025-01-28 15:58:57 +01:00
Christopher Faulet
36d151dc10 MEDIUM: stream: No longer use TASK_F_UEVT* to shut a stream down
Thanks to the previous patch, it is now possible to explicitly rely on
stream's events to shut it down. The right event is set in
stream_shutdown(), before waking up the stream, via an atomic operation. In
process_stream(), this event will be handled as expected.

Thus, TASK_F_UEVT* are no longer used, but not removed since still usable
for other tasks.

This patch depends on "MEDIUM: stream: Map task wake up reasons to dedicated
stream events".
2025-01-28 14:53:37 +01:00
Christopher Faulet
6048460102 MEDIUM: stream: Map task wake up reasons to dedicated stream events
To fix thread-safety issues when a stream must be shut, three new task
states were added. These states are generic (UEVT1, UEVT2 and UEVT3), the
task callback function is responsible to know what to do with them. However,
it is not really scalable.

The best is to use an atomic field in the stream structure itself to deal
with these dedicated events. There is already the "pending_events" field
that save wake up reasons (TASK_WOKEN_*) to not loose them if
process_stream() is interrupted before it had a chance to handle them.

So the idea is to introduce a new field to handle streams dedicated events
and merged them with the task's wake up reasons used by the stream. This
means a mapping must be performed between some task wake up reasons and
streams events. Note that not all task wake up reasons will be mapped.

In this patch, the "new_events" field is introduced. It is an atomic
bit-field. Streams events (STRM_EVT_*) are also introduced to map the task
wake up reasons used by process_stream(). Only TASK_WOKEN_TIMER and
TASK_WOKEN_MSG are mapped, in addition to TASK_F_UEVT* flags. In
process_stream(), "pending_events" field is now filled with new stream
events and the mapping of the wake up reasons.
2025-01-28 14:53:37 +01:00
Christopher Faulet
0a52a75ef7 BUG/MINOR: stream: Properly handle "on-marked-up shutdown-backup-sessions"
shutdown-backup-sessions action for on-marked-up directive does not work anymore
since the stream_shutdown() function was modified to be async-safe.

When stream_shutdown() was modified to be async-safe, dedicated task events were
added to map the reasons to shut a stream down. SF_ERR_DOWN was mapped to
TASK_F_EVT1 and SF_ERR_KILLED was mapped to TASK_F_EVT2. The reverse mapping was
performed by process_stream() to shut the stream with the appropriate reason.

However, SF_ERR_UP reason, used by shutdown-backup-sessions action to shut a
stream down because a preferred server became available, was not mapped in the
same way. So since commit b8e3b0a18d ("BUG/MEDIUM: stream: make
stream_shutdown() async-safe"), this action is ignored and does not work
anymore.

To fix an issue, and being able to bakcport the fix, a third task event was
added. TASK_F_EVT3 is now mapped on SF_ERR_UP.

This patch should fix the issue #2848. It must be backported as far as 2.6.
2025-01-28 14:53:37 +01:00
Olivier Houchard
26b3e5236f MEDIUM: servers/proxies: Switch to using per-tgroup queues.
For both servers and proxies, use one connection queue per thread-group,
instead of only one. Having only one can lead to severe performance
issues on NUMA machines, it is actually trivial to get the watchdog to
trigger on an AMD machine, having a server with a maxconn of 96, and an
injector that uses 160 concurrent connections.
We now have one queue per thread-group, however when dequeueing, we're
dequeuing MAX_SELF_USE_QUEUE (currently 9) pendconns from our own queue,
before dequeueing one from another thread group, if available, to make
sure everybody is still running.
2025-01-28 12:49:41 +01:00
Olivier Houchard
583303c48b MINOR: proxies/servers: Calculate queueslength and use it.
For both proxies and servers, properly calculates queueslength, which is
the total number of element in each queues (as they currently are only
using one queue, it is equivalent to the number of element of that
queue), and use it instead of the queue's length.
2025-01-28 12:49:41 +01:00
Olivier Houchard
59eddabe16 MINOR: Add fields to the per-thread group field in struct server.
Add a per-thread group queue and associated fields in per-thread group
field in struct server, as well as a new field, queues length.
This is currently unused, so should change nothing.
2025-01-28 12:49:41 +01:00
Olivier Houchard
f879b9a18a MINOR: proxies: Add a per-thread group field to struct proxy.
Add a per-thread group field to struct proxy, that will contain a struct
queue, as well as a new field, "queueslength".
This is currently unused, so should change nothing.
Please note that proxy_init_per_thr() must now be called for each proxy
once the thread groups number is known.
2025-01-28 12:49:41 +01:00
Aurelien DARRAGON
e768a531b7 CLEANUP: tree-wide: define and use acl_match_cond() helper
acl_match_cond() combines acl_exec_cond() + acl_pass() and a check on the
condition->pol (to check if the cond is inverted) in order to return
either 0 if the cond doesn't match or 1 if it matches (or NULL).

Thanks to this we can actually simplify some redundant constructs that
iterate over rules and evaluate if the condition matches or not.

Conditions for tcp-request inspect-content and tcp-response
inspect-content couldn't be simplified because they perform an extra
check for missing data, and thus still need to leverage acl_exec_cond()

It's best to display the patch using "-w", like "git show xxxx -w",
because some blocks had to be re-indented after the cleanup, which
makes the patch hard to review by default.
2025-01-27 11:11:43 +01:00
Valentine Krasnobaeva
94d3b7375a CLEANUP: ssl: move ssl_sock_gencert_load_ca declaration in ssl_gencert.h
As ssl_sock_gencert_load_ca and ssl_sock_gencert_free_ca are compiled only if
SSL_NO_GENERATE_CERTIFICATES is not defined, let's align it and move these
declarations in ssl_gencert.h.
2025-01-24 12:31:07 +01:00
Valentine Krasnobaeva
846819b316 CLEANUP: ssl: rename ssl_sock_load_ca to ssl_sock_gencert_load_ca
ssl_sock_load_ca is defined in ssl_gencert.c and compiled only if
SSL_NO_GENERATE_CERTIFICATES is not defined. It's name is a bit confusing, as
we may think at the first glance, that it's a generic function, which is also
used to load CA file, provided via 'ca-file' keyword.
ssl_set_verify_locations_file is used in this case.

So let's rename ssl_sock_load_ca into ssl_sock_gencert_load_ca. Same is
applied to ssl_sock_free_ca.
2025-01-24 12:31:07 +01:00
Valentine Krasnobaeva
44537379fc MINOR: tools: add errname to print errno macro name
Add helper to print the name of errno's corresponding macro, for example
"EINVAL" for errno=22. This may be helpful for debugging and for using in
some CLI commands output. The switch-case in errname() contains only the
errnos currently used in the code. So, it needs to be extended, if one starts
to use new syscalls.
2025-01-24 09:54:57 +01:00
Amaury Denoyelle
7896edccdc MINOR: quic: remove unused pacing burst in bind_conf/quic_cc_path
Pacing burst size is now dynamic. As such, configuration value has been
removed and related fields in bind_conf and quic_cc_path structures can
be safely removed.

This should be backported up to 3.1.
2025-01-23 17:40:48 +01:00
Amaury Denoyelle
cb91ccd8a8 MEDIUM: quic: use dynamic credit for pacing
Major improvements have been introduced in pacing recently. Most
notably, QMUX schedules emission on a millisecond resolution, which
allow to use passive wait to be much CPU friendly.

However, an issue remains with the pacing max credit. Unless BBR is
used, it is fixed to the configured value from quic-cc-algo bind
statement. This is not practical as if too low, it may drastically
reduce performance due to 1ms sleep resolution. If too high, some
clients will suffer from too much packet loss.

This commit fixes the issue by implementing a dynamic maximum credit
value based on the network condition specific to each clients.
Calculation is done to fix a maximum value which should allow QMUX
current tasklet context to emit enough data to cover the delay with the
next tasklet invokation. As such, avg_loop_us is used to detect the
process load. If too small, 1.5ms is used as minimal value, to cover the
extra delay incurred by the system which will happen for a default 1ms
sleep.

This should be backported up to 3.1.
2025-01-23 17:40:48 +01:00
Amaury Denoyelle
8098be1fdc MEDIUM: mux-quic: reduce pacing CPU usage with passive wait
Pacing algorithm has been revamped in the previous commit to implement a
credit based solution. This is a far more adaptative solution, in
particular which allow to catch up in case pause between pacing emission
was longer than expected.

This allows QMUX to remove the active loop based on tasklet wake-up.
Instead, a new task is used when emission should be paced. The main
advantage is that CPU usage is drastically reduced.

New pacing task timer is reset each time qcc_io_send() is invoked. Timer
will be set only if pacing engine reports that emission must be
interrupted. In this case timer is set via qcc_wakeup_pacing() to the
delay reported by congestion algorithm, or 1ms if delay is too short. At
the end of qcc_io_cb(), pacing task is queued if timer has been set.

Pacing task execution is simple enough : it immediately wakes up QCC I/O
handler.

Note that to have decent performance, it requires to have a large enough
burst defined in configuration of quic-cc-algo. However, this value is
common to every listener clients, which may cause too much loss under
network conditions. This will be address in a future patch.

This should be backported up to 3.1.
2025-01-23 17:40:22 +01:00
Amaury Denoyelle
4489a61585 MEDIUM: quic: implement credit based pacing
Implement a new method for QUIC pacing emission based on credit. This
represents the number of packets which can be emitted in a single burst.
After emission, decrement from the credit the number of emitted packets.
Several emission can be conducted in the same sequence until the credit
is completely decremented.

When a new emission sequence is initiated (i.e. under a new QMUX tasklet
invokation), credit is refilled according to the delay which occured
between the last and current emission context.

This new mechanism main advantage is that it allows to conduct several
emission in the same task context without having to wait between each
invokation. Wait is only forced if pacing is expired, which is now
equivalent to having a null credit.

Furthermore, if delay between two emissions sequence would have been
smaller than expected, credit is only partially refilled. This allows to
restart emission without having to wait for the whole credit to be
available.

On the implementation side, a new field <credit> is avaiable in
quic_pacer structure. It is automatically decremented on
quic_pacing_sent_done() invokation. Also, a new function
quic_pacing_reload() must be used by QUIC MUX when a new emission
sequence is initiated to refill credit. <next> field from quic_pacer has
been removed.

For the moment, credit is based on the burst configured via quic-cc-algo
keyword, or directly reported by BBR.

This should be backported up to 3.1.
2025-01-23 17:40:20 +01:00
Amaury Denoyelle
bbaa7aef7b BUG/MINOR: quic: do not increase congestion window if app limited
Previously, congestion window was increased any time each time a new
acknowledge was received. However, it did not take into account the
window filling level. In a network condition with negligible loss, this
will cause the window to be incremented until the maximum value (by
default 480k), even though the application does not have enough data to
fill it.

In most cases, this issue is not noticeable. However, it may lead to
excessive memory consumption when a QUIC connection is suddendly
interrupted, as in this case haproxy will fill the window with
retransmission. It even has caused OOM crash when thousands of clients
were interrupted at once on a local network benchmark.

Fix this by first checking window level prior to every incrementation
via a new helper function quic_cwnd_may_increase(). It was arbitrarily
decided that the window must be at least 50% full when the ACK is
handled prior to increment it. This value is a good compromise to keep
window in check while still allowing fast increment when needed.

Note that this patch only concerns cubic and newreno algorithm. BBR has
already its notion of application limited which ensures the window is
only incremented when necessary.

This should be backported up to 2.6.
2025-01-23 14:49:35 +01:00
Amaury Denoyelle
7c0820892f MINOR: quic: rename pacing_rate cb to pacing_inter
Rename one of the congestion algorithms pacing callback from pacing_rate
to pacing_inter. This better reflects that this function returns a delay
(in nanoseconds) which should be applied between each packet emission to
fill the congestion window with a perfectly smoothed emission.

This should be backported up to 3.1.
2025-01-23 14:49:35 +01:00
Amaury Denoyelle
2178bf1192 CLEANUP: quic: remove unused prototype
Remove undefined quic_pacing_send() function prototype from quic_pacing
module.

This should be backported up to 3.1.
2025-01-23 14:49:35 +01:00
Frederic Lecaille
4f38c4bfd8 MINOR: quic: Add a BUG_ON() on quic_tx_packet refcount
This is definitively a bug to call quic_tx_packet_refdec() to decrement the reference
counter of a TX packet calling quic_tx_packet_refdec(), and possibly to release its
memory when it is negative or null.

This counter is incremented when a TX frm is attached to it with some allocated memory
and when the packet is inserted into a data structure, if needed (list or tree).

Should be easily backported as far as 2.6 to ease any further backport around
this code part.
2025-01-21 22:01:34 +01:00
Frederic Lecaille
cb729fb64d BUG/MINOR: quic: ensure a detached coalesced packet can't access its neighbours
Reset ->prev and ->next fields of a coalesced TX packet to ensure it cannot access
several times its neighbours after it is supposed to be detached from them calling
quic_tx_packet_dgram_detach().

There are two cases where a packet can be coalesced to another previous built one:
this is when it is built into the same datagrame without GSO (and flagged flag with
QUIC_FL_TX_PACKET_COALESCED) or when sent from the same sendto() syscall with GOS
(not flagged with QUIC_FL_TX_PACKET_COALESCED).

This fix may be in relation with GH #2839.

Must be backported as far as 2.6.
2025-01-21 22:01:34 +01:00
Willy Tarreau
b066c0affb REORG: version: move the remaining BUILD_* stuff from haproxy.c to version.c
version.c tries to centralize all variables conveying version information,
but there's still an issue with the BUILD_* variables which are only
passed to haproxy.o and are only updated when that one is rebuilt. This
is not very logical given that we can end up with values there which
contradict info from version.c.

Better move all of these to version.c which is systematically rebuilt.
Most of these variables only end up as string concatenation at the
moment. Some of them are even duplicated. In version.c we now have one
variable (or constant) for each of them and haproxy.c references them
in messages. This is much more logical and easier to maintain in a
consistent state.

The patch looks a bit large but it really only moves the ifdefed string
assignment from one file to another, placing them into variables.
2025-01-20 17:53:55 +01:00
Amaury Denoyelle
a50dd07c16 MINOR: trace: ensure -dt priority over traces config section
Traces can be activated on startup either via -dt command line argument
or via the traces configuration section. This can caused confusion as it
may not be clear as trace source can be completed or overriden by one or
the other.

Fix the precedence to give the priority to the command line argument.
Now, each trace source configured via -dt is first resetted to a default
state before applying new settings. Then, it is impossible to change a
trace source via the configuration file if it was already targetted via
-dt argument.
2025-01-10 14:50:59 +01:00
Willy Tarreau
b25850f25b MINOR: tools: add a few functions to simply check for a file's existence
At many places we'd like to be able to simply construct a path from a
format string and check if that path corresponds to an existing file,
directory etc. Here we add 3 functions, a generic one to test that a
path corresponds to a given file mode (e.g. S_IFDIR, S_IFREG etc), and
two other ones specifically checking for a file or a dir for easier
use.
2025-01-09 09:18:49 +01:00
Willy Tarreau
bd06502b22 BUILD: makefile: add a qinfo macro to pass info in quiet mode
Some commands such as $(cmd_CC) etc already handle the quiet vs verbose
mode in the makefile, but sometimes we may want to pass other info. The
new "qinfo" macro can be called with a 9-char string argument (spaces
included) as a prefix for some commands, to emit that string when in
quiet mode. The caller must fill the spaces needed for alignment. E.g:

  $(call quinfo,  CC     )$(CC) ...
2025-01-08 11:26:05 +01:00
Amaury Denoyelle
af00be8e0f MINOR: mux-quic: change return value of qcs_attach_sc()
A recent fix was introduced to ensure that a streamdesc instance won't
be attached to an already completed QCS which is eligible to purging.
This was performed by skipping application protocol decoding if a QCS is
in such a state. Here is the patch responsible for this change.
  caf60ac696
  BUG/MEDIUM: mux-quic: do not attach on already closed stream

However, this is too restrictive, in particular for unidirection stream
where no streamdesc is never attached. To fix this behavior, first
qcs_attach_sc() API has been modified. Instead of returning a streamdesc
instance, it returns either 0 on success or a negative error code.

There should be no functional changes with this patch. It is only to be
able to extend qcs_attach_sc() with the possibility of skipping
streamdesc instantiation while still keeping a success return value.

This should be backported wherever the above patch has been merged. For
the record, it was scheduled for immediate backport on 3.1, plus merging
on older releases up to 2.8 after a period of observation.
2025-01-03 17:19:21 +01:00
Willy Tarreau
f486f976c7 BUILD: limits: make normalize_rlim() take an rlim_t to fix build on m68k
As can be seen here, the build fails on m68k since commit 665dde648
("MINOR: debug: use LIM2A to show limits") in 3.1:

  https://github.com/haproxy/haproxy/actions/runs/12440234399/job/34735360177

The reason is the comparison between a ulong limit and RLIM_INFINITY.
Indeed, on m68k, rlim_t is an unsigned long long. Let's just change
the function's input type to take an rlim_t instead. This also allows
to get rid of the casts in the call place.

This can be backported to 3.1 though it's not important given the low
prevalence of this platform for such use cases.
2024-12-25 12:33:06 +01:00
Willy Tarreau
f78121dd32 BUILD: compat: add missing fcntl.h before defining F_SETPIPE_SZ
n 1.5-dev8, 13 years ago, support for setting pipe size was added by
commit bd9a0a778 ("OPTIM/MINOR: make it possible to change pipe size
(tune.pipesize)"). For compatibility purposes, it was defining
F_SETPIPE_SZ in compat.h if it was not set. It apparently always had
F_SETPIPE_SZ defined before being included.

Now in 3.2-dev1, commit fbc534a6f ("REORG: startup: move nofile limit
checks in limits.c") reordered a few includes and ended up with
mworker-prog.c including compat.h before fcntl.h, causing a redefinition
error on certain libcs:

    CC      src/mworker-prog.o
  In file included from /usr/include/bits/fcntl.h:61:0,
                   from /usr/include/fcntl.h:35,
                   from include/haproxy/limits.h:11,
                   from include/haproxy/mworker.h:18,
                   from src/mworker-prog.c:27:
  /usr/include/bits/fcntl-linux.h:203:0: warning: "F_SETPIPE_SZ" redefined [enabled by default]
  In file included from include/haproxy/api-t.h:35:0,
                   from include/haproxy/api.h:33,
                   from src/mworker-prog.c:23:
  include/haproxy/compat.h:161:0: note: this is the location of the previous definition

Let's simply include fcntl.h in compat.h before the macro is redefined.

There's normally no need to backport this, though it's harmless to do
it if needed.
2024-12-25 11:53:11 +01:00
Olivier Houchard
505480eeef CLEANUP: Remove pendconn_must_try_again().
Remove pendconn_must_try_again(), now that it no longer is used.
2024-12-24 14:10:06 +01:00
Olivier Houchard
cda7275ef5 MEDIUM: queue: Handle the race condition between queue and dequeue differently
There is a small race condition, where a server would check if there is
something left in the proxy queue, and adding something to the proxy
queue. If the server checks just before the stream is added to the queue,
and it no longer has any stream to deal with, then nothing will take
care of the stream, that may stay in the queue forever.
This was worked around with commit 5541d4995d, by checking for that exact
condition after adding the stream to the queue, and trying again to get
a server assigned if it is detected.
That fix lead to multiple infinite loops, that got fixed, but it is not
unlikely that it could happen again. So let's fix the initial problem
differently : a single server may mark itself as ready, and it removes
itself once used. The principle is that when we discover that the just
queued stream is alone with no active request anywhere ot dequeue it,
instead of rebalancing it, it will be assigned to that current "ready"
server that is available to handle it. The extra cost of the atomic ops
is negligible since the situation is super rare.
2024-12-24 14:10:06 +01:00
Olivier Houchard
5b8899b6cc BUG/MEDIUM: queue: Make process_srv_queue return the number of streams
Make process_srv_queue() return the number of streams unqueued, as
pendconn_grab_from_px() did, as that number is used by
srv_update_status() to generate logs.

This should be backported up to 2.6 with
111ea83ed4
2024-12-23 15:03:40 +01:00
William Lallemand
056ec51c26 MEDIUM: ssl/ocsp: counters for OCSP stapling
Add 2 counters in the SSL stats module for OCSP stapling.

- ssl_ocsp_staple is the number of OCSP response successfully stapled
  with the handshake
- ssl_failed_ocsp_stapled is the number of OCSP response that we
  couldn't staple, it could be because of an error or because the
  response is expired.

These counters are incremented in the OCSP stapling callback, so if no
OCSP was configured they won't never increase. Also they are only
working in frontends.

This was discussed in github issue #2822.
2024-12-23 11:23:00 +01:00
William Lallemand
0e6af97233 MINOR: ssl: change visibility of ssl_stats_module
In order to add stats from other files, the ssl_stats_module need to be
visible from other files.

This moves the ssl_counters definition in ssl_sock-t.h and removes the
static of ssl_stats_module.
2024-12-23 11:23:00 +01:00
William Lallemand
acb2c9eb8b MINOR: ssl: improve HAVE_SSL_OCSP ifdef
Allow to build correctly without OCSP. It could be disabled easily with
OpenSSL build with OPENSSL_NO_OCSP. Or even with
DEFINE="-DOPENSSL_NO_OCSP" on haproxy make line.
2024-12-19 10:53:05 +01:00
Remi Tricot-Le Breton
93f2c73423 MINOR: ssl/ocsp: Add extra details in error logs when possible
When the ocsp response auto update process fails during insertion or
while validating the received ocsp response, we call
ssl_sock_update_ocsp_response or ssl_ocsp_check_response respectively
and both these functions take an 'err' parameter in which detailed error
messages can be written. Until now, those error messages were discarded
and the only information given to the user was a generic error
(ERR_CHECK or ERR_INSERT) which does not help much.
We now keep a pointer to the last error message in the certificate_ocsp
structure and dump its content in the update logs as well as in the
"show ssl ocsp-updates" cli command.

This issue was raised in GitHub #2817.
2024-12-18 10:41:16 +01:00
Amaury Denoyelle
9d155ca706 MINOR: trace: implement tracing disabling API
Define a set of functions to temporarily disable/reactivate tracing for
the current thread. This could be useful when wanting to quickly remove
tracing output for some code parts.

The API relies on a disable/resume set of functions, with a thread-local
counter. This counter is tested under __trace_enabled(). It is a
cumulative value so that the same count of resume must be issued after
several disable usage. There is also the possibility to force reset the
counter to 0 before restoring the old value.

This should be backported up to 3.1.
2024-12-18 09:52:06 +01:00
Amaury Denoyelle
e296585ae9 MEDIUM/OPTIM: mux-quic: implement purg_list
This commit is part of the current serie which aims to refactor and
improve overall performance of QUIC MUX I/O handler.

qcc_io_process() is responsible to perform some internal operations on
QUIC MUX after I/O completion. It is notably called on every qcc_io_cb()
tasklet handler.

The most intensive work on it is the purging of QCS instances after
transfer completion. This was implemented by looping on QCC streams tree
and inspecting the state of every QCS. The purpose of this commit is to
optimize this processing.

A new purg_list QCC member is defined. It is responsible to list every
QCS instances whose transfer has been completed. It is thus safe to
reuse <el_send> QCS list attach point. Stream purging will thus only
loop on purg_list instead of every known QCS.

This should be backported up to 3.1.
2024-12-18 09:33:52 +01:00
Amaury Denoyelle
4b42dd4ae0 MEDIUM/OPTIM: mux-quic: define a recv_list for demux resumption
This commit is part of the current serie which aims to refactor and
improve overall performance of QUIC MUX I/O handler.

Define a recv_list element into qcc structure. This is used to
registered every instance of qcs which are currently blocked on
demuxing, which happen on no more space in <rx.appbuf>.

The purpose of this patch is to reduce qcc_io_recv() CPU usage. Now,
only recv_list iteration is performed, instead of the previous looping
over every qcs instances. This is useful as qcc_io_recv() is called each
time qcc_io_cb() is scheduled, even if only sending condition was the
wakeup origin.

A qcs is not inserted into recv_list immediately after blocking on demux
full buffer. Instead, this is only done after unblocking via stream
rcv_buf callback, which ensure that new buffer space is available.

This should be backported up to 3.1.
2024-12-18 09:23:41 +01:00
Amaury Denoyelle
0a53a008d0 MINOR: mux-quic: refactor wait-for-handshake support
This commit refactors wait-for-handshake support from QUIC MUX. The flag
logic QC_CF_WAIT_HS is inverted : it is now positionned only if MUX is
instantiated before handshake completion. When the handshake is
completed, the flag is removed.

The flag is now set directly on initialization via qmux_init(). Removal
via qcc_wait_for_hs() is moved from qcc_io_process() to qcc_io_recv().
This is deemed more logical as QUIC MUX is scheduled on RECV to be
notify by the transport layer about handshake termination. Moreover,
qcc_wait_for_hs() is now called if recv subscription is still active.

This commit is the first of a serie which aims to refactor QUIC MUX I/O
handler and improves its overall performance. The ultimate objective is
to be able to stream qcc_io_cb() by removing pacing specific code path
via qcc_purge_sending().

This should be backported up to 3.1.
2024-12-18 09:23:41 +01:00
Amaury Denoyelle
17bfe93768 CLEANUP: mux-quic: remove unused qcc member send_retry_list
Remove unused fields send_retry_list from qcc and its corresponding
attach element el from qcs.

This should be backported up to 3.1.
2024-12-18 09:20:20 +01:00
Willy Tarreau
7b6acb6a51 MINOR: bug: make BUG_ON() fall back to ASSUME
When the strict level is zero and BUG_ON() is not implemented, some
possible null-deref warnings are emitted again because some were
covering for these cases. Let's make it fall back to ASSUME() so that
the compiler continues to know that the tested expression never happens.
It also allows to further optimize certain functions by helping the
compiler eliminate certain tests for impossible values. However it
requires that the expression is really evaluated before passing the
result through ASSUME() otherwise it was shown that gcc-11 and above
will fail to evaluate its implications and will continue to emit the
null-deref warnings in case the expression is non-trivial (e.g. it
has multiple terms).

We don't do it for BUG_ON_HOT() however because the extra cost of
evaluating the condition is generally not welcome in fast paths,
particularly when that BUG_ON_HOT() was kept disabled for
performance reasons.
2024-12-17 17:39:12 +01:00
Willy Tarreau
63798088b3 MINOR: compiler: add ASSUME_NONNULL() to tell the compiler a pointer is valid
At plenty of places we have ALREADY_CHECKED() or DISGUISE() on a pointer
just to avoid "possibly null-deref" warnings. These ones have the side
effect of weakening optimizations by passing through an assembly step.
Using ASSUME_NONNULL() we can avoid that extra step. And when the
__builtin_unreachable() builtin is not present, we fall back to the old
method using assembly. The macro returns the input value so that it may
be used both as a declarative way to claim non-nullity or directly inside
an expression like DISGUISE().
2024-12-17 16:46:46 +01:00
Willy Tarreau
2ce63b7b17 MINOR: compiler: also enable __builtin_assume() for ASSUME()
Clang apparently has __builtin_assume() which does exactly the same
as our macro, since at least v3.8. Let's enable it, in case it may
even better detect assumptions vs unreachable code.
2024-12-17 16:46:46 +01:00
Willy Tarreau
efc897484b MINOR: compiler: add a new "ASSUME" macro to help the compiler
This macro takes an expression, tests it and calls an unreachable
statement if false. This allows the compiler to know that such a
combination does not happen, and totally eliminate tests that would
be related to this condition. When the statement is not available
in the compiler, we just perform a break from a do {} while loop
so that the expression remains evaluated if needed (e.g. function
call).
2024-12-17 16:46:46 +01:00
Willy Tarreau
41fc18b1d1 MINOR: compiler: rely on builtin detection for __builtin_unreachable()
Due to __builtin_unreachable() only being associated to gcc 4.5 and
above, it turns out it was not enabled for clang. It's not used *that*
much but still a little bit, so let's enable it now. This reduces the
code size by 0.2% and makes it a bit more efficient.
2024-12-17 16:46:46 +01:00
Willy Tarreau
96cfcb1df3 MINOR: compiler: add a __has_builtin() macro to detect features more easily
We already have a __has_attribute() macro to detect when the compiler
supports a specific attribute, but we didn't have the equivalent for
builtins. clang-3 and gcc-10 have __has_builtin() for this. Let's just
bring it using the same mechanism as __has_attribute(), which will allow
us to simply define the macro's value for older compilers. It will save
us from keeping that many compiler-specific tests that are incomplete
(e.g. the __builtin_unreachable() test currently doesn't cover clang).
2024-12-17 16:46:46 +01:00
Olivier Houchard
b3cd5a4b86 CLEANUP: queues: Remove pendconn_grab_from_px().
pendconn_grab_from_px() is now unused, so just remove it.
2024-12-17 16:05:44 +01:00
William Lallemand
bb88f68cf7 MINOR: ssl: add utils functions to extract X509 notAfter date
Add ASN1_to_time_t() which converts an ASN1_TIME to a time_t and
x509_get_notafter_time_t() which returns the notAfter date in time_t
format.
2024-12-16 14:54:53 +01:00
Valentine Krasnobaeva
fbc534a6fa REORG: startup: move nofile limit checks in limits.c
Let's encapsulate the code, which checks the applied nofile limit into
a separate helper check_nofile_lim_and_prealloc_fd(). Let's keep in this new
function scope the block, which tries to create a copy of FD with the highest
number, if prealloc-fd is set in the configuration.
2024-12-16 10:44:01 +01:00
Valentine Krasnobaeva
14f5e00d38 REORG: startup: move code that applies limits to limits.c
In step_init_3() we try to apply provided or calculated earlier haproxy
maxsock and memmax limits.

Let's encapsulate these code blocks in dedicated functions:
apply_nofile_limit() and apply_memory_limit() and let's move them into
limits.c. Limits.c gathers now all the logic for calculating and setting
system limits in dependency of the provided configuration.
2024-12-16 10:44:01 +01:00
Valentine Krasnobaeva
1332e9b58d REORG: startup: move global.maxconn calculations in limits.c
Let's encapsulate the code, which calculates global.maxconn and
global.maxsslconn into a dedicated function set_global_maxconn() and let's
move this function in limits.c. In limits.c we keep helpers to calculate and
check haproxy internal limits, based on the system nofile and memory limits.
2024-12-16 10:44:01 +01:00
Frederic Lecaille
e1d25cdbdd CLEANUP: quic: remove a wrong comment about ->app_limited (drs)
->app_limited quic_drs struct member is not a boolean. This is
the index of the last transmitted packet marked as application-limited, or 0 if
the connection is not currently application-limited (see C.app_limited
definition in BBR v3 draft).
2024-12-13 14:42:43 +01:00
Frederic Lecaille
eeaeb412dc MINOR: quic: reduce the private data size of QUIC cc algos
After these commits:

    BUG/MINOR: quic: remove max_bw filter from delivery rate sampling
    BUG/MINOR: quic: fix BBB max bandwidth oscillation issue

where some members were removed from bbr struct, the private data
size of QUIC cc algorithms may be reduced from 160 to 144 uint32_t.

Should be easily backported to 3.1 alonside the commits mentioned above.
2024-12-13 14:42:43 +01:00
Frederic Lecaille
22ab45a3a8 BUG/MINOR: quic: remove max_bw filter from delivery rate sampling
This filter is no more needed after this commit:

 BUG/MINOR: quic: fix BBB max bandwidth oscillation issue.

Indeed, one added this filter at delivery rate sampling level to filter
the BBR max bandwidth estimations and was inspired from ngtcp2 code source when
trying to fix the oscillation issue. But this BBR max bandwidth oscillation issue
was fixed by the aforementioned commit.

Furthermore this code tends to always increment the BBR max bandwidth. From my point
of view, this is not a good idea at all.

Must be backported to 3.1.
2024-12-13 14:42:43 +01:00
Frederic Lecaille
a9a2f98f86 MINOR: window_filter: rely on the time to update the filter samples (QUIC/BBR)
The windowed filters are used only the BBR implementation for QUIC to filter
the maximum bandwidth samples for its estimation over a virtual time interval
tracked by counting the cyclical progression through ProbeBW cycles. ngtcp2
and quiche use such windowed filters in their BBR implementation. But in a
slightly different way. When updating the 2nd or 3rd filter samples, this
is done based on their values in place of the time they have been sampled.
It seems more logical to rely on the sample timestamps even if this has no
implication because when a sample is updated using another sample because it
has the same value, they have both the same timestamps!

This patch modifies two statements which compare two consecutive filter samples
based on their values (smp[]->v) by statements which compare them based on the
virtual time they have been sampled (smp[]->t). This fully complies which the
code used by the Linux kernel in lib/win_minmax.c.

Alo take the opportunity of this patch to shorten some statements using <smp>
local variable value to update smp[2] sample in place of initializing its two
members with the <smp> member values.

This patch SHOULD be easily backported to 3.1 where BBR was first implemented.
2024-12-13 14:42:43 +01:00
Amaury Denoyelle
1f458b3ea8 MINOR: applet: define applet_putchk_stress() alternative
Previous patch introduced stress mode to be able to easily test
alternative code paths.

The first point would be to force interruption of stats dump on every
line and check reentrant patchs, in particular while adding and removing
servers instances.

The purpose of this patch is to be able to use applet_putchk_stress()
during stats dump while not impacting other applets. To support this,
extract applet_putchk() into an internal _applet_putchk() which have a
new argument stress. Define two helpers applet_putchk() and
applet_putchk_stress(), the latter to set the stress argument to true.

For the moment, applet_putchk_stress() is not used. This will be the
subject of the next patch.
2024-12-12 11:26:33 +01:00
Amaury Denoyelle
9d19fc4cf7 MINOR: build: define DEBUG_STRESS
Define a new build mode DEBUG_STRESS. This will be used to stress some
code parts which cannot be reproduce easily with an alternative
suboptimal code.

First, a global <mode_stress> is set either to 1 or 0 depending on
DEBUG_STRESS compilation. A new global keyword "stress-level" is also
defined. It allows to specify a level from 0 to 9, to increase the
stress incurred on the code.

Helper macro STRESS_RUN* are defined for each stress level. This allows
to easily specify an instruction in default execution and a stress
counterpart if running on the corresponding stress level.
2024-12-12 11:19:10 +01:00
Aurelien DARRAGON
358166ae6a BUG/MINOR: hlua_fcn: restore server pairs iterator pointer consistency
Since 9c91b30 ("MINOR: server: remove prev_deleted server list"), hlua
server pair iterator may use and return invalid (stale) server pointer
if multiple servers were deleted between two iterations.

Indeed, the server refcount mechanism (using srv_take()) is no longer
sufficient as the prev_deleted mitigation was removed.

To ensure server pointer consistency between two yields, the new watcher
mechanism must be used (as it already the case for stats dumping).

Thus in this patch we slightly change the server iteration logic:
hlua_server_list_iterator_context struct now stores the next valid server
pointer, and a watcher is added to ensure this pointer is never stale.

Then in hlua_listable_servers_pairs_iterator(), this next pointer is used
to create the Lua server object, and the next valid pointer is obtained by
leveraging watcher_next().

No backport needed unless 9c91b30 ("MINOR: server: remove prev_deleted
server list") is. Please note that dynamic servers were not supported in
Lua prior to 2.8, so it doesn't make sense to backport this patch further
than 2.8.
2024-12-11 10:52:11 +01:00
Amaury Denoyelle
9c91b30139 MINOR: server: remove prev_deleted server list
This patch is a direct follow-up to the previous one. Thanks to watcher
type, it is not safe to assume that servers manipulated via stats dump
were not targetted by a "delete server" CLI command. As such,
prev_deleted list server member is now unneeded. This patch thus removes
any reference to it.
2024-12-10 16:19:33 +01:00
Amaury Denoyelle
071ae8ce3d BUG/MEDIUM: stats/server: use watcher to track server during stats dump
If a server A is deleted while a stats dump is currently on it, deletion
is delayed thanks to reference counting. Server A is nonetheless removed
from the proxy list. However, this list is a single linked list. If the
next server B is deleted and freed immediately, server A would still
point to it. This problem has been solved by the prev_deleted list in
servers.

This model seems correct, but it is difficult to ensure completely its
validity. In particular, it implies when stats dump is resumed, server A
elements will be accessed despite the server being in a half-deleted
state.

Thus, it has been decided to completely ditch the refcount mechanism for
stats dump. Instead, use the watcher element to register every stats
dump currently tracking a server instance. Each time a server is deleted
on the CLI, each stats dump element which may points to it are updated
to access the next server instance, or NULL if this is the last server.
This ensures that a server which was deleted via CLI but not completely
freed is never accessed on stats dump resumption.

Currently, no race condition related to dynamic servers and stats dump
is known. However, as described above, the previous model is deemed too
fragile, as such this patch is labelled as bug-fix. It should be
backported up to 2.6, after a reasonable period of observation. It
relies on the following patch :
  MINOR: list: define a watcher type
2024-12-10 16:19:33 +01:00
Amaury Denoyelle
eafa8a32bb MINOR: list: define a watcher type
Define a new watcher type into list module. This type is similar to bref
and can be used to register an element which is currently tracking a
dynamic target. Contrary to bref, if the target is freed, every watcher
element are updated to point to a next valid entry or NULL.

This type will simplify handling of dynamic servers deletion, in
particular while stats dump are performed.

This patch is not a bug-fix. However, it is mandatory to fix a race
condition in dynamic servers. Thus, it should be backported along the
next commit up to 2.6.
2024-12-10 16:04:11 +01:00
Valentine Krasnobaeva
1f63a53955 BUG/MINOR: mworker: detach from tty when received READY from worker
Some master process' initialization steps are conditioned by receiving the
READY message from worker (pidfile creation, forwarding READY message to the
launching parent). So, master process can not do these initialization routines
before.

If the master process fails, while creating pid or forwarding the READY to the
parent in daemon mode, he exits with a proper alert message. In daemon mode we
no longer see such message, as process is already detached from the tty.

To fix this, as these alerts could be very useful, let's detach the master
process from the tty after his last initialization steps in _send_status.
2024-12-09 21:32:54 +01:00
Valentine Krasnobaeva
663d75e7a0 BUG/MEDIUM: startup: report status if daemonized process fails
Due to master-worker rework, daemonization fork happens now before parsing
and applying the configuration. This makes impossible to report correctly all
warnings and alerts to shell's stdout. Daemonzied process fails, while being
already in background, exit code reported by shell via '$?' equals to 0, as
it's the exit code of his parent.

To fix this, let's create a pipe between parent and daemonized child. The
child will send into this pipe a "READY" message, when it finishes his
initialization. The parent will wait on the "read" end of the pipe until
receiving something. If read() fails, parent obtains the status of the
exited child with waitpid(). So, the parent can correctly report the error to
the stdout and he can exit with child's exitcode.

This fix should be backported only in 3.1.
2024-12-09 21:32:44 +01:00
William Lallemand
5454824e31 MINOR: ssl: add notBefore and notAfter utility functions
Extracting notBefore and notAfter as a string can be bothersome,
add 2 utility functions that returns the value in a static buffer.
2024-12-09 18:29:23 +01:00
Willy Tarreau
c3ee4e375b MINOR: tools: make fddebug() automatically emit the location
fddebug() is sometimes quite helpful, but annoying to use when following
a call path because it's a pain to always repeat the function name and
call place. Let's have it automatically prepend the function name, the
file name and the line number, and make its arguments optional, replacing
them by a simple LF when all absent. This way, simply placing:

    fddebug();

is sufficient to emit a location follocing "[%s@%s:%d]\n". This function
must not be used in production (and even call places with it shouldn't be
committed) and it should only be used by developers, so the simplest the
better.
2024-12-09 18:05:09 +01:00
Willy Tarreau
d6dc8120c0 BUILD: debug: fix build issues in COUNT_IF() with -Wunused-value
Commit 7f64bb79fd ("BUG/MINOR: debug: COUNT_IF() should return true/false")
allowed the COUNT_IF() macro to return the evaluated value. This is handy
to place it in "if ()" conditions and count them at the same time. When
glitches are disabled, the condition is just returned as-is, but most call
places do not use the result, making some compilers complain. In addition,
while reviewing this, it was noticed that when DEBUG_STRICT=0, the macro
would still be replaced by a "do { } while (0)" statement, which not only
does not evaluate the expression, but also cannot return anything. Ditto
for COUNT_IF_HOT().

Let's make sure both are always properly evaluated now.
2024-12-09 18:04:51 +01:00
Willy Tarreau
7f64bb79fd BUG/MINOR: debug: COUNT_IF() should return true/false
The COUNT_IF() macro was initially meant to return true/false to be used
in if() conditions but had an extra do { } while(0) that prevents it from
doing so. Let's get rid of the do { } while(0) before the code generalizes
to too many places. There's no impact on existing code, but may have to be
backported if future fixes rely on it.
2024-12-06 18:45:46 +01:00
Valentine Krasnobaeva
cd0b58e23e BUG/MINOR: startup: fix error path for master, if can't open pidfile
If master process can't open a pidfile, there is no sense to send SIGTTIN to
oldpids, as it will exit. So, old workers will terminate as well. It's better
to send the last alert to the log about unrecoverable error, because master is
already in its polling loop.

For the standalone mode we should keep the previous logic in this case: send
SIGTTIN to old process and unbind listeners for the new one. So, it's better
to put this error path in main(), as it's done when other configuration settings
can't be applied.

This patch should be backported only in 3.1.
2024-12-06 12:00:22 +01:00
Aurelien DARRAGON
ae9d8d40d0 CLEANUP: stktable: add some stktable flags polishing
Better late than never, commit 1f73d35 ("MINOR: stktable: implement
"recv-only" table option") implemented stktable flags and initial
definitions, but it lacks some comments plus the flag is stored as
16bits but the SKT_FL_ definition width allows for only 8bits so
it is a bit confusing, let's fix that
2024-12-05 13:14:21 +01:00
Aurelien DARRAGON
9f44c5f9be CLEANUP: stktable: replace nopurge attribute with flag
Thanks to previous commit stktable struct now have a "flags" struct member

Let's take this opportunity to remove the isolated "nopurge" attribute in
stktable struct and rely on a flag named STK_FL_NOPURGE instead.

This helps to better organize stktable struct members.
2024-12-05 12:15:31 +01:00
Aurelien DARRAGON
1f73d3524d MINOR: stktable: implement "recv-only" table option
When "recv-only" keyword is added on a stick table declaration (in peers
or proxy section), haproxy considers that the table is only used for
data retrieval from a remote location and not used to perform local
updates. As such, it enables the retrieval of local-only values such
as conn_cur that are ignored by default. This can be useful in some
contexts where we want to know about local-values such are conn_cur
from a remote peer.

To do this, add stktable struct flags  which default to NONE and enable
the RECV_ONLY flag on the table then "recv-only" keyword is found in the
table declaration. Then, when in peer_treat_updatemsg(), when handling
table updates, don't ignore data updates for local-only values if the flag
is set.
2024-12-05 12:15:24 +01:00
Willy Tarreau
e6f4f15929 MINOR: tasklet: set TASK_WOKEN_OTHER on tasklets by default
Now when tasklets are woken up via tasklet_wakeup(), tasklet_wakeup_on()
or tasklet_wakeup_after(), either the optional wakeup flags will be used,
or TASK_WOKEN_OTHER will be used.

This allows tasklet handlers waking up for any given cause to notice
whether or not they were also woken for another reason. For example, a
mux handler could skip heavy parts when seeing that TASK_WOKEN_OTHER is
absent, proving that no standard tasklet_wakeup() was done, for example
in response to a subscribe().

The benefit of the TASK_WOKEN_* flags is that they're purged during the
wakeup, and that they're easy to check for using TASK_WOKEN_ANY.
TASK_F_UEVT1 and TASK_F_UEVT2 are also usable for private use (e.g. wakeup
from a stream to a connection inside a mux).

Probably that in the future, code dealing with subscribe events should
start to place TASK_WOKEN_IO like is done for upper layers.
2024-12-03 19:45:08 +01:00
Willy Tarreau
6322c9fbbf MINOR: tools: add a new macro DEFVAL() to provide a default argument
This is like DEFZERO and DEFNULL, but this one allows to specify the
default value to be used as the first argument.
2024-12-03 19:45:08 +01:00
Valentine Krasnobaeva
a33977da48 BUG/MINOR: startup: close pidfd and free global.pidfile in handle_pidfile()
After master-worker mode refactoring, global.pidfile is only used in
handle_pidfile(), which opens the provided file and writes the PID into it. So,
it's more appropriate to perform the close(pidfd) and ha_free(&global.pidfile)
also in this function.

This commit prepares the fix of the pidfile creation, as it's created now very
early, when we are not sure, that process has successfully started. In
master-worker mode handle_pidfile() can be called in the master process context.
So, let's make it accessible from other compilation units via global.h.

This should be backported only in 3.1.
2024-12-02 17:28:04 +01:00
Aurelien DARRAGON
8bce7ff854 MINOR: hlua_fcn: add Patref:commit() method
commit() method may be used to commit pending updates on the local patref
object:

hlua_patref flags were added:
 HLUA_PATREF_FL_GEN means the patref object has been updated
 and it is associated to a new revision (curr_gen) in order to prepare
 and commit the pending updates.

upon commit, the pattern API is leveraged with curr_gen as revision to
commit new object items. Once commit is performed, previous (pending)
revisions that are older than the committed one are cleaned up (similar
to what's done with commit on the cli). Also, Patref function APIs now
take into account curr_gen to perform lookups.
2024-11-29 07:23:08 +01:00
Aurelien DARRAGON
e769d8f426 MINOR: pattern: add pat_ref_may_commit() helper function
pat_ref_may_commit() may be used to know if a given generation ID id still
valid, which means it may still be committed at some point. Else it means
that another pending generation ID older than the tested one was already
committed and thus other generations ID below this one are stale and must
be regenerated.
2024-11-29 07:23:01 +01:00
Aurelien DARRAGON
43ab25f007 MINOR: hlua_fcn: wrap pat_ref struct for patref class
In order to extend the patref class features, let's wrap the pat_ref struct
into hlua_patref struct. This way we may add additional data alongside the
pat_ref pointer to store additional context required for pat_ref data
manipulation from lua.

Since the wrapper (hlua_patref) is an allocated object, we declare the _gc
metamethod for patref class in order to properly cleanup resources when
they are out of scope.
2024-11-29 07:22:54 +01:00
Aurelien DARRAGON
2021072391 MINOR: hlua_fcn: implement index and pair metamethods for patref class
patref object may now leverage index and pair methamethods to list and
access patref elements at a specific index (=key)

Also, patref:is_map() method may be used to know if the patref stores acl
(key only) or map-style (key:value) patterns.
2024-11-29 07:22:46 +01:00
Aurelien DARRAGON
956a25cf60 MINOR: hlua: add patref class
Implement patref class to expose pat_ref struct internal pattern struct
in lua. This is some prerequisite work needed to be able to manipulate
exisiting generic pattern object lists (acl/map) from Lua, because the Map
class can only be used to perform matching ops on Map files.
2024-11-29 07:22:32 +01:00
Aurelien DARRAGON
f72a66eef2 MINOR: pattern: publish event_hdl events on pat_ref updates
Now that PAT_REF events were defined in previous commit, let's actually
publish them from pattern API where relevant. Unlike server events,
pattern reference events are only published in the pat_ref subscriber's
list on purpose, because in some setups patref updates (updates performed
on a map for instance from action or cli) are very frequent, and we don't
want to impact pattern API performance just for that.

Moreover, as the main use case is to be able to subscribe to maps updates
from Lua, allowing a per-pattern reference registration is already enough.

No additional data is provided for such events (also for performance reason)

Care was taken not to publish events when the update doesn't affect the
live subset (the one targeted by curr_gen).
2024-11-29 07:22:25 +01:00
Aurelien DARRAGON
f7267bd315 MINOR: event_hdl: add PAT_REF events
This is some prerequisite work for implementing PAT_REF events.

In this commit we define the PAT_REF event_hdl family (which gets family
slot id #2), with the following supported events:

  - EVENT_HDL_SUB_PAT_REF_ADD: element was added to the current version of
    the pattern ref
  - EVENT_HDL_SUB_PAT_REF_DEL: element was deleted from the current
    version of the pattern ref
  - EVENT_HDL_SUB_PAT_REF_SET: element was modified in the current version
    of the pattern ref
  - EVENT_HDL_SUB_PAT_REF_COMMIT: pending element(s) was/were commited in
    the current version of the pattern ref
  - EVENT_HDL_SUB_PAT_REF_CLEAR: all elements were cleared from the
    current version of the pattern ref

The goal is to be able to track a pat_ref struct in order to be notified
when it is updated. For performance reasons, events from this family won't
provide any additional info, and will only be published in the pat_ref
subscription list. Indeed, pat_ref may be updated at a relatively high
frequency (or worse, batch work), so we cannot afford doing expensive
treatment for each update.
2024-11-29 07:22:18 +01:00
Frederic Lecaille
f8b697c19b BUG/MINOR: improve BBR throughput on very fast links
This patch fixes the loss of information when computing the delivery rate
(quic_cc_drs.c) on links with very low latency due to usage of 32bits
variables with the millisecond as precision.

Initialize the quic_conn task with TASK_F_WANTS_TIME flag ask it to ask
the scheduler to update the call date of this task. This allows this task to get
a nanosecond resolution on the call date calling task_mono_time(). This is enabled
only for congestion control algorithms with delivery rate estimation support
(BBR only at this time).

Store the send date with nanosecond precision of each TX packet into
->time_sent_ns new quic_tx_packet struct member to store the date a packet was
sent in nanoseconds thanks to task_mono_time().

Make use of this new timestamp by the delivery rate estimation algorithm (quic_cc_drs.c).

Rename current ->time_sent member from quic_tx_packet struct to ->time_sent_ms to
distinguish the unit used by this variable (millisecond) and update the code which
uses this variable. The logic found in quic_loss.c is not modified at all.

Must be backported to 3.1.
2024-11-28 21:39:05 +01:00
Christopher Faulet
bc66d31985 MINOR: proxy: Add support of 421-Misdirected-Request in retry-on status
The "421" status can now be specified on retry-on directives. PR_RE_* flags
were updated to remains sorted.

This patch should fix the issue #2794. It is quite simple so it may safely
be backported to 3.1 if necessary.
2024-11-28 11:47:40 +01:00
Willy Tarreau
97d33abb23 MINOR: version: this is development again (3.2)
This basically reverts commit b629f366a7 ("MINOR: version: mention that
3.1 is stable now").
2024-11-26 17:21:16 +01:00
Aurelien DARRAGON
4792f27892 MINOR: pattern: add pat_ref_gen_delete() function
pat_ref_gen_delete(ref, gen_id, key) tries to delete all samples belonging
to <gen_id> and matching <key> under <ref>

The goal is to be able to target a single subset from <ref>
2024-11-26 16:12:21 +01:00
Aurelien DARRAGON
a131c542a6 MINOR: pattern: add pat_ref_gen_find_elt() function
pat_ref_gen_find_elt(ref, gen_id, key) tries to find <elt> element
belonging to <gen_id> and matching <key> in <ref> reference.

The goal is to be able to target a single subset from <ref>
2024-11-26 16:12:16 +01:00
Aurelien DARRAGON
c9d6af3c6d MINOR: pattern: add pat_ref_gen_set() function
pat_ref_gen_set(ref, gen_id, value, err) modifies to <value> the sample
of all patterns matching <key> and belonging to <gen_id> (generation id)
under <ref>

The goal is to be able to target a single subset from <ref>
2024-11-26 16:12:11 +01:00
Aurelien DARRAGON
3d250b3be8 MINOR: pattern: split pat_ref_set()
split pat_ref_set() function in 2 distinct functions. Indeed, since
0844bed7d3 ("MEDIUM: map/acl: Improve pat_ref_set() efficiency (for
"set-map", "add-acl" action perfs)"), pat_ref_set() prototype was updated
to include an extra <elt> argument. But the logic behind is not explicit
because the function will not only try to set <elt>, but also its
duplicate (unlike pat_ref_set_elt() which only tries to update <elt>).

Thus, to make it clearer and better distinguish between the key-based
lookup version and the elt-based one, restotre pat_ref_set() previous
prototype and add a dedicated pat_ref_set_elt_duplicate() that takes
<elt> as argument and tries to update <elt> and all duplicates.
2024-11-26 16:12:05 +01:00
Willy Tarreau
4d58f521ee [RELEASE] Released version 3.2-dev0
Released version 3.2-dev0 with the following main changes :
    - exact copy of 3.1.0
2024-11-26 15:33:57 +01:00
Christopher Faulet
b629f366a7 MINOR: version: mention that 3.1 is stable now
This version will be maintained up to around Q1 2026. The INSTALL file
also mentions it.
2024-11-26 15:23:54 +01:00
Amaury Denoyelle
2fffd85b97 BUG/MEDIUM: quic: prevent EMSGSIZE with GSO for larger bufsize
A UDP datagram cannot be greater than 65535 bytes, as UDP length header
field is encoded on 2 bytes. As such, sendmsg() will reject a bigger
input with error EMSGSIZE. By default, this does not cause any issue as
QUIC datagrams are limited to 1.252 bytes and sent individually.

However, with GSO support, value bigger than 1.252 bytes are specified
on sendmsg(). If using a bufsize equal to or greater than 65535, syscall
could reject the input buffer with EMSGSIZE. As this value is not
expected, the connection is immediately closed by haproxy and the
transfer is interrupted.

This bug can easily reproduced by requesting a large object on loopback
interface and using a bufsize of 65535 bytes. In fact, the limit is
slightly less than 65535, as extra room is also needed for IP + UDP
headers.

Fix this by reducing the count of datagrams encoded in a single GSO
invokation via qc_prep_pkts(). Previously, it was set to 64 as specified
by man 7 udp. However, with 1252 datagrams, this is still too many.
Reduce it to a value of 52. Input to sendmsg will thus be restricted to
at most 65.104 bytes if last datagram is full.

If there is still data available for encoding in qc_prep_pkts(), they
will be written in a separate batch of datagrams. qc_send_ppkts() will
then loop over the whole QUIC Tx buffer and call sendmsg() for each
series of at most 52 datagrams.

This does not need to be backported.
2024-11-26 11:49:30 +01:00
Valentine Krasnobaeva
3500865bc1 REORG: startup: move mworker_apply_master_worker_mode in mworker.c
mworker_apply_master_worker_mode() is called only in master-worker mode, so
let's move it mworker.c
2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva
dee247c14e REORG: startup: move mworker_reexec and mworker_reload in mworker.c
Let's move mworker_reexec() and mworker_reload() in mworker.c. mworker_reload()
is called only within the functions, which are already in mworker.c. So, this
reorganization allows to declare mworker_reload() as a static.
2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva
0c7b93eb1d REORG: startup: move mworker_run_master and mworker_loop in mworker.c
mworker_run_master() is called only in master mode. mworker_loop() is static
and called only in mworker_run_master(). So let's move these both functions in
mworker.c.

We also need here to make run_thread_poll_loop() accessible from other units,
as it's used in mworker_loop().
2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva
7974089ac6 REORG: startup: move mworker_prepare_master in mworker.c
mworker_prepare_master() performs some preparation routines for the new worker
process, which will be forked during the startup. It's called only in
master-worker mode, so let's move it in mworker.c.
2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva
af642420b4 REORG: startup: move on_new_child_failure in mworker.c
mworker_on_new_child_failure() performs some routines for the worker process,
if it has failed the reload. As it's called only in mworker_catch_sigchld()
from mworker.c, let's move mworker_on_new_child_failure() in mworker.c as well.
Like this it could also be declared as a static.
2024-11-25 15:20:24 +01:00
Valentine Krasnobaeva
321c021a83 MINOR: startup: rename on_new_child_failure to mworker_on_new_child_failure
This patch prepares the moving of on_new_child_failure definition into
mworker.c. So, let's rename it accordingly and let's also update its
description.
2024-11-25 15:20:24 +01:00
Amaury Denoyelle
22bd92a87f MINOR: mux-quic: use sched call time for pacing
QUIC pacing was recently implemented to limit burst and improve overall
bandwidth. This is used only for MUX STREAM emission. Pacing requires
nanosecond resolution. As such, it used now_cpu_time() which relies on
clock_gettime() syscall.

The usage of clock_gettime() has several drawbacks :
* it is a syscall and thus requires a context-switch which may hurt
  performance
* it is not be available on all systems
* timestamp is retrieved multiple times during a single task execution,
  thus yielding different values which may tamper pacing calculation

Improve this by using task_mono_time() instead. This returns task call
time from the scheduler thread context. It requires the flag
TASK_F_WANTS_TIME on QUIC MUX tasklet to force the scheduler to update
call time with now_mono_time(). This solves every limitations listed
above :
* syscall invokation is only performed once before tasklet execution,
  thus reducing context-switch impact
* on non compatible system, a millisecond timer is used as a fallback
  which should ensure that pacing works decently for them
* timer value is now guaranteed to be fixed duing task execution
2024-11-25 11:21:45 +01:00
Willy Tarreau
670507a66e MINOR: tools: add a new function "resolve_dso_name" to find a symbol's DSO
In the memprofile summary per DSO, we currently have to pay a high price
by calling dladdr() on each symbol when doing the summary per DSO at the
end, while we're not interested in these details, we just want the DSO
name which can be made cheaper to obtain, and easier to manipulate. So
let's create resolve_dso_name() to only extract minimal information from
an address. At the moment it still uses dladdr() though it avoids all the
extra expensive work, and will further be able to leverage the same
mechanism as "show libs" to instantly spot DSO from address ranges.
2024-11-21 19:58:06 +01:00
Willy Tarreau
5ddc8b3ad4 MINOR: activity/memprofile: monitor non-portable calls as well
Some dependencies might very well rely on posix_memalign(), strndup()
or other less portable callsn making us miss them when chasing memory
leaks, resulting in negative global allocation counters. Let's provide
the handlers for the following functions:

  strndup()        // _POSIX_C_SOURCE >= 200809L || glibc >= 2.10
  valloc()         // _BSD_SOURCE || _XOPEN_SOURCE>=500 || glibc >= 2.12
  aligned_alloc()  // _ISOC11_SOURCE
  posix_memalign() // _POSIX_C_SOURCE >= 200112L
  memalign()       // obsolete
  pvalloc()        // obsolete

This time we don't fail if they're not found, we just silently forward
the calls.
2024-11-21 19:58:06 +01:00
Willy Tarreau
33c0ce299d MINOR: activity/memprofile: also monitor strdup() activity
Some memory profiling outputs have showed negative counters, very likely
due to some libs calling strdup(). Let's add it to the list of monitored
activities.

Actually even haproxy itself uses some. Having "profiling.memory on" in
the config reveals 35 call places.
2024-11-21 19:58:06 +01:00
Willy Tarreau
623a2c4e19 CLEANUP: activity: better use a mask to tests freeing methods
In "show profiling memory", we need to distinguish methods which really
free memory from those which do not so that we don't account for the
free value twice. However for now it's done using multiple tests, which
are going to complicate the addition of new methods. Let's switch to a
bit field defined as a mask in a single place instead, as we don't
intend to use more than 32/64 methods!
2024-11-21 19:58:06 +01:00
Willy Tarreau
859341c1ec MINOR: activity/memprofile: offer a function to unregister stale info
There's actually a problem with memprofiles: the pool pointer is stored
in ->info but some pools are replaced during startup, such as the trash
pool, leaving a dangling pointer there.

Let's complete the API with a new function memprof_remove_stale_info()
that will remove all stale references to this info pointer. It's also
present when USE_MEMORY_PROFILING is not set so as to ease the job on
callers.
2024-11-21 19:58:06 +01:00
Valentine Krasnobaeva
bfe0f9d02d MINOR: startup: use global progname variable
Let's store progname in the global variable, as it is handy to use it in
different parts of code to format messages sent to stdout.

This reduces the number of arguments, which we should pass to some functions.
2024-11-21 19:55:21 +01:00
Valentine Krasnobaeva
ef154a49e1 MINOR: capabilities: rename program_name argument to progname
This commit prepares the usage of the global progname variable.
prepare_caps_from_permitted_set() use progname value in warning messages. So,
let's rename program_name argument to progname.
2024-11-21 19:55:21 +01:00
Frederic Lecaille
01fcbd6c08 BUG/MINOR: quic: Missing application limitations tracking for BBR
The ->app_limited member of the delivery rate struct (quic_cc_drs) aim is to
store the index of the last transmitted byte marked as application-limited
so that to track the application-limited phases. During these phases,
BBR must ignore delivery rate samples to properly estimate the delivery rate.

Without such a patch, the Startup phase could be exited very quickly with
a very low estimated bottleneck bandwidth. This had a very bad impact
on little objects with download times smaller than the expected Startup phase
duration. For such objects, with enough bandwith, BBR should stay in the Startup
state.

No need to be backported, as BBR is implemented in the current developement version.
2024-11-21 19:23:53 +01:00
Amaury Denoyelle
95d3edd68f MINOR: quic: support pacing for newreno and nocc
Extend extra pacing support for newreno and nocc congestion algorithms,
as with cubic.

For better extensibility of cc algo definition, define a new flags field
in quic_cc_algo structure. For now, the only value is
QUIC_CC_ALGO_FL_OPT_PACING which is set if pacing support can be
optionally activated. Both cubic, newreno and nocc now supports this.

This new flag is then reused by QUIC config parser. If set, extra
quic-cc-algo burst parameter is taken into account. If positive, this
will activate pacing support on top of the congestion algorithm. As with
cubic previously, pacing is only supported if running under experimental
mode.

Only BBR is not flagged with this new value as pacing is directly
builtin in the algorithm and cannot be turn off. Furthermore, BBR
calculates automatically its value for maximum burst. As such, any
quic-cc-algo burst argument used with BBR is still ignored with a
warning.
2024-11-21 11:33:44 +01:00
Christopher Faulet
e58a30d369 MINOR: cfgparse: Emit a warning for misplaced "tcp-response content" rules
When a "tcp-response content" rule is placed after a "http-response" rule, a
warning is now emitted, just like for rules applied on the requests.
2024-11-21 09:55:04 +01:00
Christopher Faulet
5dcd3b0d99 CLEANUP: cfgparse: Add direction in functions name that warn on misplaced rules
This only concerns functions emitting warnings about misplaced tcp-request
rules. The direction is now specified in the functions name. For instance
"warnif_misplaced_tcp_conn" is replaced by "warnif_misplaced_tcp_req_conn".
2024-11-21 09:51:37 +01:00
Christopher Faulet
7710580428 MINOR: config: Improve warnings on misplaced rules by adding an optional arg
In warnings about misplaced rules, only the first keyword is mentionned. It
works well for http-request or quic-initial rules for instance. But it is a
bit confusing for tcp-request rules, because the layer is missing (session
or content).

To make it a bit systematic (and genric), the second argument can now be
provided. It can be set to NULL if there is no layer or scope. But
otherwise, it may be specified and it will be reported in the warning.

So the following snippet:

    tcp-request content reject if FALSE
    tcp-request session reject if FALSE
    tcp-request connection reject if FALSE

Will now emit the following warnings:

  a 'tcp-request session' rule placed after a 'tcp-request content' rule will still be processed before.
  a 'tcp-request connection' rule placed after a 'tcp-request session' rule will still be processed before.

This patch should fix the issue #2596.
2024-11-21 09:28:42 +01:00
Frederic Lecaille
d85eb127e9 MINOR: quic: quic_loss modifications to support BBR
qc_packet_loss_lookup() aim is to detect the packet losses. This is this function
which must called ->on_pkt_lost() BBR specific callback. It also set
<bytes_lost> passed parameter to the total number of bytes detected as lost upon
an ACK frame receipt for its caller.
Modify qc_release_lost_pkts() to call ->congestion_event() with the send time
from the newest packet detected as lost.
Modify qc_release_lost_pkts() to call ->slow_start() callback only if define
by the congestion control algorithm. This is not the case for BBR.
2024-11-20 17:34:22 +01:00
Frederic Lecaille
af75665cb7 MINOR: quic: quic_cc modifications to support BBR
Add several callbacks to quic_cc_algo struct which are only called by BBR.
->get_drs() may be used to retrieve the delivery rate sampling information
from an congestion algorithm struct (quic_cc).
->on_transmit() must be called before sending any packet a QUIC sender.
->on_ack_rcvd() must be called after having received an ACK.
->on_pkt_lost() must be called after having detected a packet loss.
->congestion_event() must be called after any congestion event detection
Modify quic_cc.c to call ->event only if defined. This is not the case
for BBR.
2024-11-20 17:34:22 +01:00
Frederic Lecaille
d04adf44dc MINOR: quic: implement BBR congestion control algorithm for QUIC
Implement the version 3 of BBR for QUIC specified by the IETF in this draft:

https://datatracker.ietf.org/doc/draft-ietf-ccwg-bbr/

Here is an extract from the Abstract part to sum up the the capabilities of BBR:

BBR ("Bottleneck Bandwidth and Round-trip propagation time") uses recent
measurements of a transport connection's delivery rate, round-trip time, and
packet loss rate to build an explicit model of the network path. BBR then uses
this model to control both how fast it sends data and the maximum volume of data
it allows in flight in the network at any time. Relative to loss-based congestion
control algorithms such as Reno [RFC5681] or CUBIC [RFC9438], BBR offers
substantially higher throughput for bottlenecks with shallow buffers or random
losses, and substantially lower queueing delays for bottlenecks with deep buffers
(avoiding "bufferbloat"). BBR can be implemented in any transport protocol that
supports packet-delivery acknowledgment. Thus far, open source implementations
are available for TCP [RFC9293] and QUIC [RFC9000].

In haproxy, this implementation is considered as still experimental. It depends
on the newly implemented pacing feature.

BBR was asked in GH #2516 by @KazuyaKanemura, @osevan and @kennyZ96.
2024-11-20 17:34:22 +01:00
Frederic Lecaille
472d575950 MINOR: quic: implement delivery rate sampling algorithm
This patch implements an algorithm which may be used by congestion algorithms
for QUIC to estimate the current delivery rate of a sender. It is at least used
by BBR and could be used by others congestion algorithms as cubic.

This algorithm was specified by an RFC draft here:
https://datatracker.ietf.org/doc/html/draft-cheng-iccrg-delivery-rate-estimation
before being merged into BBR v3 here:
https://datatracker.ietf.org/doc/html/draft-cardwell-ccwg-bbr#section-4.5.2.2
2024-11-20 17:34:22 +01:00
Frederic Lecaille
c08b877657 MINOR: window_filter: Implement windowed filter (only max)
Implement the Kathleen Nichols' algorithm used by several congestion control
algorithm implementation (TCP/BBR in Linux kernel, QUIC/BBR in quiche) to track
the maximum value of a data type during a fixe time interval.
In this implementation, counters which are periodically reset are used in place
of timestamps.
Only the max part has been implemented.
(see lib/minmax.c implemenation for Linux kernel).
2024-11-20 17:34:22 +01:00
Frederic Lecaille
7bbe8828ba MINOR: quic: Add the congestion window initial value to QUIC path
Add ->initial_wnd new member to quic_cc_path struct to keep the initial value
of the congestion window. This member is initialized as soon as a QUIC connection
is allocated. This modification is required for BBR congestion control algorithm.
2024-11-20 17:34:22 +01:00
Willy Tarreau
1171a23aec BUILD: makefile: make ERR apply to build options as well
Once in a while we find some makefiles ignoring some outdated arguments
and just emit a warning. What's annoying is that if users (say, distro
packagers), have purposely added ERR=1 to their build scripts to make
sure to fail on any warning, these ones will be ignored and the build
can continue with invalid or missing options.

William rightfully suggested that ERR=1 should also catch make's warnings
so this patch implements this, by creating a new "complain" variable that
points either to "error" or "warning" depending on $(ERR), and that is
used to send the messages using $(call $(complain),...). This does the
job right at little effort (tested from GNU make 3.82 to 4.3).

Note that for this purpose the ERR declaration was upped in the makefile
so that it appears before the new errors.mk file is included.
2024-11-20 14:58:35 +01:00
Willy Tarreau
12fcd65468 MINOR: tasklet: support an optional set of wakeup flags to tasklet_wakeup_on()
tasklet_wakeup_on() and its derivates (tasklet_wakeup_after() and
tasklet_wakeup()) do not support passing a wakeup cause like
task_wakeup(). This is essentially due to an API limitation cause by
the fact that for a very long time the only reason for waking up was
to process pending I/O. But with the growing complexity of mux tasks,
it is becoming important to be able to skip certain heavy processing
when not strictly needed.

One possibility is to permit the caller of tasklet_wakeup() to pass
flags like task_wakeup(). Instead of going with a complex naming scheme,
let's simply make the flags optional and be zero when not specified. This
means that tasklet_wakeup_on() now takes either 2 or 3 args, and that the
third one is the optional flags to be passed to the callee. Eligible flags
are essentially the non-persistent ones (TASK_F_UEVT* and TASK_WOKEN_*)
which are cleared when the tasklet is executed. This way the handler
will find them in its <state> argument and will be able to distinguish
various causes for the call.
2024-11-19 20:13:41 +01:00
Willy Tarreau
0334cb28a9 MINOR: tasklet: make the low-level tasklet API take a flag
Everything in the tasklet layer supports flags, except that they are
just not implemented in the wakeup functions, while they are in the
task_wakeup functions. Initially it was not considered useful to pass
wakeup causes because these were essentially I/O, but with the growing
number of I/O handlers having to deal with various types of operations
(typically cheap I/O notifications on subscribe vs heavy parsing on
application-level wakeups), it would be nice to start to make this
distinction possible.

This commit extends _tasklet_wakeup_on() and _tasklet_wakeup_after()
to pass a set of flags that continues to be set as zero. For now this
changes nothing, but new functions will come.
2024-11-19 20:13:41 +01:00
Willy Tarreau
e57581d76d MINOR: tools: add new macro DEFZERO to provide a default zero argument
This is the equivalent of DEFNULL except that it sets a zero value instead
of a NULL for a missing argument.
2024-11-19 20:13:41 +01:00
Willy Tarreau
c5052bad8a MINOR: sched: add TASK_F_WANTS_TIME to make the scheduler update the call date
Currently tasks being profiled have th_ctx->sched_call_date set to the
current nanosecond in monotonic time. But there's no other way to have
this, despite the scheduler being capable of it. Let's just declare a
new task flag, TASK_F_WANTS_TIME, that makes the scheduler take the time
just before calling the handler. This way, a task that needs nanosecond
resolution on the call date will be able to be called with an up-to-date
date without having to abuse now_mono_time() if not needed. In addition,
if CLOCK_MONOTONIC is not supported (now_mono_time() always returns 0),
the date is set to the most recently known now_ns, which is guaranteed
to be atomic and is only updated once per poll loop.

This date can be more conveniently retrieved using task_mono_time().

This can be useful, e.g. for pacing. The code was slightly adjusted so
as to merge the common parts between the profiling case and this one.
2024-11-19 20:13:41 +01:00
Willy Tarreau
12969c1b17 MINOR: tinfo/clock: turn sched_call_date to 64-bits
We used to store it in 32-bits since we'd only use it for latency and CPU
usage calculation but usages will evolve so let's not truncate the value
anymore. Now we store the full 64 bits. Note that this doesn't even
increase the storage size due to alignment. The 3 usage places were
verified to still be valid (most were already cast to 32 bits anyway).
2024-11-19 20:13:41 +01:00
Willy Tarreau
973c81ceec CLEANUP: tinfo: move sched_*_date/*_mono_time to the thread-local area
These ones are never atomically accessed, they have nothing to do in
the atomic ops cache line, let's move them to the thread-local area.
2024-11-19 20:13:41 +01:00
Amaury Denoyelle
5a29fd6c61 MINOR: mux_quic/pacing: display pacing info on show quic
To improve debugging, extend "show quic" output to report if pacing is
activated on a connection. Two values will be displayed for pacing :

* a new counter paced_sent_ctr is defined in QCC structure. It will be
  incremented each time an emission is interrupted due to pacing.

* pacing engine now saves the number of datagrams sent in the last paced
  emission. This will be helpful to ensure burst parameter is valid.
2024-11-19 16:21:05 +01:00
Amaury Denoyelle
24cea66e07 MEDIUM: quic: define cubic-pacing congestion algorithm
Define a new QUIC congestion algorithm token 'cubic-pacing' for
quic-cc-algo bind keyword. This is identical to default cubic
implementation, except that pacing is used for STREAM frames emission.

This algorithm supports an extra argument to specify a burst size. This
is stored into a new bind_conf member named quic_pacing_burst which can
be reuse to initialize quic path.

Pacing support is still considered experimental. As such, 'cubic-pacing'
can only be used with expose-experimental-directives set.
2024-11-19 16:20:58 +01:00
Amaury Denoyelle
796446a15e MAJOR: mux-quic: support pacing emission
Support pacing emission for STREAM frames at the QUIC MUX layer. This is
implemented by adding a quic_pacer engine into QCC structure.

The main changes have been written into qcc_io_send(). It now
differentiates cases when some frames have been rejected by transport
layer. This can occur as previously due to congestion or FD buffer full,
which requires subscribing on transport layer. The new case is when
emission has been interrupted due to pacing timing. In this case, QUIC
MUX I/O tasklet is rescheduled to run with the flag TASK_F_USR1.

On tasklet execution, if TASK_F_USR1 is set, all standard processing for
emission and reception is skipped. Instead, a new function
qcc_purge_sending() is called. Its purpose is to retry emission with the
saved STREAM frames list. Either all remaining frames can now be send,
subscribe is done on transport error or tasklet must be rescheduled for
pacing purging.

In the meantime, if tasklet is rescheduled due to other conditions,
TASK_F_USR1 is reset. This will trigger a full regeneration of STREAM
frames. In this case, pacing expiration must be check before calling
qcc_send_frames() to ensure emission is now allowed.
2024-11-19 16:16:48 +01:00
Amaury Denoyelle
ede4cd4c2e MINOR: mux-quic: encapsulate QCC tasklet wakeup
QUIC MUX will be responsible to drive emission with pacing. This will be
implemented via setting TASK_F_USR1 before I/O tasklet wakeup. To
prepare this, encapsulate each I/O tasklet wakeup into a new function
qcc_wakeup().

This commit is purely refactoring prior to pacing implementation into
QUIC MUX.
2024-11-19 16:16:48 +01:00
Amaury Denoyelle
4a94a018f0 MINOR: mux-quic: define a tx STREAM frame list member
For STREAM emission, MUX QUIC previously used a local list defined under
qcc_io_send(). This was suitable as either all frames were sent, or
emission must be interrupted due to transport congestion or fatal error.
In the latter case, the list was emptied anyway and a new frame list was
built on future qcc_io_send() invokation.

For pacing, MUX QUIC may have to save the frame list if pacing should be
applied across emission. This is necessary to avoid to unnecessarily
rebuilt stream frame list between each paced emission. To support this,
STREAM list is now stored as a member of QCC structure.

Ensure frame list is always deleted, even on QCC release, using newly
defined utility function qcc_tx_frms_free().
2024-11-19 16:16:48 +01:00
Amaury Denoyelle
886a7c475c MINOR: quic/pacing: add burst support
qc_send_mux() has been extended previously to support pacing emission.
This will ensure that no more than one datagram will be emitted during
each invokation. However, to achieve better performance, it may be
necessary to emit a batch of several datagrams one one turn.

A so-called burst value can be specified by the user in the
configuration. However, some congestion control algos may defined their
owned dynamic value. As such, a new CC callback pacing_burst is defined.

quic_cc_default_pacing_burst() can be used for algo without pacing
interaction, such as cubic. It will returns a static value based on user
selected configuration.
2024-11-19 16:16:48 +01:00