Commit graph

20432 commits

Author SHA1 Message Date
Willy Tarreau
5574163073 BUG/MINOR: acme: avoid a possible crash on error paths
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
In acme_EVP_PKEY_gen(), an error message is printed if *errmsg is set,
however, since commit 546c67d13 ("MINOR: acme: generate a temporary key
pair"), errmsg is passed as NULL in at least one occurrence, leading
the compiler to issue a NULL deref warning at -O3. And indeed, if the
errors are encountered, a crash will occur. No backport is needed.
2025-11-07 22:27:25 +01:00
Willy Tarreau
fb8edd0ce6 BUG/MEDIUM: proxy: use aligned allocations for struct proxy_per_tgroup
In 3.2, commit f879b9a18 ("MINOR: proxies: Add a per-thread group field
to struct proxy") introduced struct proxy_per_tgroup that is declared as
thread_aligned, but is allocated using calloc(). Thus it is at risk of
crashing on machines using instructions requiring 64-byte alignment such
as AVX512. Let's use ha_aligned_zalloc_typed() instead of malloc().

For 3.2, we don't have aligned allocations, so instead the THREAD_ALIGNED()
will have to be removed from the struct definition. Alternately, we could
manually align it as is done for fdtab.
2025-11-07 22:22:55 +01:00
Willy Tarreau
df9eb2e7b6 BUG/MEDIUM: proxy: use aligned allocations for struct proxy
Commit fd012b6c5 ("OPTIM: proxy: move atomically access fields out of
the read-only ones") caused the proxy struct to be 64-byte aligned,
which allows the compiler to use optimizations such as AVX512 to zero
certain fields. However the struct was allocated using calloc() so it
was not necessarily aligned, causing segv on startup on compatible
machines. Let's just use ha_aligned_zalloc_typed() to allocate the
struct.

No backport is needed.
2025-11-07 22:22:55 +01:00
Olivier Houchard
c26bcfc1e3 BUG/MEDIUM: stick-tables: Make sure updates are seen as local
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
In stktable_touch_with_exp, if it is a local update, add it to the
pending update list even if it's already in the tree as a remote update,
otherwise it will never be communicated to other peers;
It used to work before 3.2 because of the ordering of operations, but
it's been broken by adding an extra step with the pending update list,
so we now have to explicitely check for that.

This should be backported to 3.2.
2025-11-07 16:23:21 +01:00
Christopher Faulet
7d1787ba8e MINOR: sample/stats: Add "bytes" in req_{in,out} and res_{in,out} names
Number of bytes received or sent by a client or a server are now
saved. Sample fetches and stats fields to retrieve these informations are
renamed to add "bytes" in names to avoid any ambiguity with number of
requests and responses.
2025-11-07 14:09:48 +01:00
Christopher Faulet
f12252c7a5 BUG/MEDIUM: peers: Fix update message parsing during a full resync
The commit 590c5ff2e ("MEDIUM: peers: No longer ack updates during a full
resync") introduced a regression. During a full resync, the ID of an update
message is not parsed at all. Thus, the parsing of the whole message in
desynchronized.

On full resync the update id itself is ignored, to not be acked, but it must
be parsed. It is now fixed.

It is a 3.3-specific bug, no backport needed.
2025-11-07 12:47:34 +01:00
Christopher Faulet
ecc2c3a35d MEDIUM: peers: Remove commitupdate field on stick-tables
This stick-table field was atomically updated with the last update id pushed
and dumped on the CLI but never used otherwise. And all peer sessions share
the same id because it is a stick-table info. So the info in peers dump is
pretty limited.

So, let's remove it.
2025-11-07 12:17:53 +01:00
Christopher Faulet
590c5ff2ed MEDIUM: peers: No longer ack updates during a full resync
ACK messages received by a peer sending updates during a full resync are
ignored. So, on the other side, there is no reason to still send these ACK
messages. Let's skip them.

In addition, the received updates during this stage are not considered as to
be acked. It is important to be sure to properly emit ACK messages once the
full sync finished.
2025-11-07 11:50:13 +01:00
Christopher Faulet
383bf11306 MINOR: peers: Improve traces for peers
Trace messages for peers were only protocol oriented and information
provided were quite light. With this patch, the traces were
improved. information about the peer, its applet and the section are
dumped. Several verbosities are now available and messages are dumped at
different levels depending on the context. It should easier to track issues
in the peers.
2025-11-07 11:50:13 +01:00
Olivier Houchard
25559e7055 MEDIUM: backend: Defer conn_xprt_start() after mux creation
In connect_server(), defer the call to conn_xprt_start() until after we
had a chance to create the mux. The xprt can behave differently
depending on if a mux is or is not available at this point, as if it is,
it may want to wait until some data comes from the mux.

This does not need to be backported.
2025-11-07 11:40:52 +01:00
William Lallemand
3bc90d01d1 BUG/MINOR: acme: wrong dns-01 challenge in the log
Since 861fe53204 ("MINOR: acme: add the dns-01-record field to the
sink"), the dns-01 challenge is output in the dns_record trash, instead
of the global trash.

The send_log string was never updated with this change, and dumps some
data from the global trash instead. Since the last data emitted in the
trash seems to be the dns-01 token from the authorization object, it
looks like the response to the challenge.

This must be backported to 3.2.
2025-11-07 09:49:04 +01:00
Willy Tarreau
4c3351fd63 MINOR: cfgparse: try to suggest correct variable names on errors
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
When an empty argument comes from the use of a non-existing variable,
we'll now detect the difference with an empty variable (error pointer
points to the variable's name instead), and submit it to env_suggest()
to see if another variable looks likely to be the right one or not.

This can be quite useful to quickly figure how to fix misspelled variable
names. Currently only series of letters, digits and underscores are
attempted to be resolved as a name. A typical example is:

   peer "${HAPROXY_LOCAL_PEER}" 127.0.0.1:10000

which produces:

  [ALERT]    (24231) : config : parsing [bug-argv4.cfg:2]: argument number 1 at position 13 is empty and marks the end of the argument list:
    peer "${HAPROXY_LOCAL_PEER}" 127.0.0.1:10000
            ^
  [NOTICE]   (24231) : config : Hint: maybe you meant HAPROXY_LOCALPEER instead ?
2025-11-06 19:57:44 +01:00
Willy Tarreau
49585049b9 MINOR: tools: have parse_line's error pointer point to unknown variable names
When an argument is empty, parse_line() currently returns a pointer to
the empty string itself. This is convenient, but it's only actionable by
the user who will see for example "${HAPROXY_LOCALPEER}" and figure what
is wrong. Here we slightly change the reported pointer so that if an empty
argument results from the evaluation of an empty variable (meaning that
all variables in string are empty and no other char is present), then
instead of pointing to the opening quote, we'll return a pointer to the
first character of the variable's name. This will allow to make a
difference between an empty variable and an unknown variable, and for
the caller to take action based on this.

I.e. before we would get:

    log "${LOG_SERVER_IP}" local0
        ^

if LOG_SERVER_IP is not set, and now instead we'll get this:

    log "${LOG_SERVER_IP}" local0
           ^
2025-11-06 19:57:44 +01:00
Willy Tarreau
14087e48b9 MINOR: tools: add env_suggest() to suggest alternate variable names
The purpose here is to look in the environment for a variable whose
name looks like the provided one. This will be used to try to auto-
correct misspelled environment variables that would silently be turned
to an empty string.
2025-11-06 19:57:44 +01:00
Willy Tarreau
a4d78dd4f5 MINOR: tools: add support for ist to the word fingerprinting functions
The word fingerprinting functions are used to compare similar words to
suggest a correctly spelled one that looks like what the user proposed.
Currently the functions only support const char*, but there's no reason
for this, and it would be convenient to support substrings extracted
from random pieces of configurations. Here we're adding new variants
"_with_len" that take these ISTs and which are in fact a slight change
of the original ones that the old ones now rely on.
2025-11-06 19:57:44 +01:00
Willy Tarreau
d9d0721bc9 MEDIUM: config: now reject configs with empty arguments
As prepared during 3.2, we must error on empty arguments because they
mark the end of the line and cause subsequent arguments to be silently
ignored. It was too late in 3.2 to turn that into an error so it's a
warning, but for 3.3 it needed to be an alert.

This patch does that. It doesn't instantly break, instead it counts
one fatal error per violating line. This allows to emit several errors
at once, which can often be caused by the same variable being missed,
or a group of variables sharing a same misspelled prefix for example.
Tests show that it helps locate them better. It also explains what to
look for in the config manual for help with variables expansion.
2025-11-06 19:57:44 +01:00
Willy Tarreau
1968731765 BUG/MEDIUM: config: solve the empty argument problem again
This mostly reverts commit ff8db5a85 ("BUG/MINOR: config: Stopped parsing
upon unmatched environment variables").

As explained in commit #2367, finally the fix above was incorrect because
it causes other trouble such as this:

     log "192.168.100.${NODE}" "local0"

being resolved to this:

     log 192.168.100.local0

when NODE does not exist due to the loss of the spaces. In fact, while none
of us was well aware of this, when the user had:

     server app 127.0.0.1:80 "${NO_CHECK}" weight 123

in fact they should have written it this way:

     server app 127.0.0.1:80 "${NO_CHECK[*]}" weight 123

so that the variable is expanded to zero, one or multiple words, leaving
no empty arg (like in shell). This is supported since 2.3 with commit
fa41cb6 so the right fix is in the config, let's revert the fix and
properly address the issue.

Some changes are necessary however, since after that patch, the in_arg
checks were added and are now inserting an empty argument even for
proper error reporting. For example, the following statement:

    acl foo path "/a" "${FOO[*]}" "/b"

would complain about an empty arg at FOO due to in_arg=1, while dropping
this in_arg=1 with the following config:

    acl foo path "/a" "${FOO}" "/b"

would silently stop after "/a" instead of complaining about an empty
field. So the approach here consists in noting whether or not something
was written since the quotes were emitted, in order to decide whether
or not to produce an argument. This way, "" continues to be an explicitly
empty arg, just like the same with an unknown variable, while "${FOO[*]}"
is allowed to prevent the creation of an argument if empty.

This should be backported to *some* versions, but the risk that some
configs were altered to rely on the broken fix is not null. At least
recent LTS should be reverted. Note that this requires previous commit:

    BUG/MINOR: config: emit warning for empty args when *not* in discovery mode

otherwise this will break again configs relying on HAPROXY_LOCALPEER and
maybe a few other variables set at the end of discovery.
2025-11-06 19:57:44 +01:00
Willy Tarreau
004e1be48e BUG/MINOR: config: emit warning for empty args when *not* in discovery mode
This actually reverses the condition of commit 5f1fad1690 ("BUG/MINOR:
config: emit warning for empty args only in discovery mode"). Indeed,
some variables are not known in discovery mode (e.g. HAPROXY_LOCALPEER),
and statements like:

   peer "${HAPROXY_LOCALPEER}" 127.0.0.1:10000

are broken during discovery mode. It turns out that the warning is
currently hidden by commit ff8db5a85d ("BUG/MINOR: config: Stopped
parsing upon unmatched environment variables") since it silently drops
empty args which is sufficient to hide the warning, but it also breaks
other configs and needs to be reverted, which will break configs like
above again.

In issue #2995 we were not fully decided about discovery mode or not,
and already suspected some possible issues without being able to guess
which ones. The only downside of not displaying them in discovery mode
is that certain empty fields on the rare keywords specific to master
mode might remain silent until used. Let's just flip the condition to
check for empty args in normal mode only.

This should be backported to 3.2 after some time of observation.
2025-11-06 19:57:44 +01:00
Willy Tarreau
0144426dfb BUG/MEDIUM: server: close a race around ready_srv when deleting a server
When a server is being disabled or deleted, in case it matches the
backend's ready_srv, this one is reset. However it's currently done in
a non-atomic way when the server goes down, and that could occasionally
reset the entry matching another server, but more importantly if in
parallel some requests are dequeued for that server, it may re-appear
there after having been removed, leading to a possible crash once it
is fully removed, as shown in issue #3177.

Let's make sure we reset the pointer when detaching the server from
the proxy, and use a CAS in both cases to only reset this server.

This fix needs to be backported to 3.2. There, srv_detach() is in
server.c instead of server.h. Thanks to Basha Mougamadou for the
detailed report and the useful backtraces.
2025-11-06 19:57:44 +01:00
Christopher Faulet
c6f68901cc BUG/MINOR: config: Limit "tune.maxpollevents" parameter to 1000000
"tune.maxpollevents" global parameter was not limited. It was possible to
set any integer value. But this value is used to allocate the array of
events used by epoll. With a huge value, it seems the allocation silently
fail, making haproxy totally unresponsive.

So let's to limit its value to 1 million. It is pretty high and it should
not be an issue to forbid greater values. The documentation was updated
accordingly.

This patch could be backported to all stable branches.
2025-11-06 15:56:21 +01:00
Christopher Faulet
80edbad4f9 MEDIUM: stktables: Limit the number of stick counters to 100
"tune.stick-counters" global parameter was accepting any positive integer
value. But the maximum value is incredibly high. Setting a huge value has
signitifcant impact on memory and CPU usage. To avoid any issue, this value
is now limited to 100. It should be greater enough to all usage.

It can be seen as a breaking change.
2025-11-06 15:01:29 +01:00
Christopher Faulet
949199a2f4 DEBUG: stream: Add bytes_in/bytes_out value for both SC in session dump
It could be handy to have these infos in the full session dump. So let's
dump it now.
2025-11-06 15:01:29 +01:00
Christopher Faulet
a1b5325a7a MINOR: channel: Remove total field from channels
The <total> field in the channel structure is now useless, so it can be
removed. The <bytes_in> field from the SC is used instead.

This patch is related to issue #1617.
2025-11-06 15:01:29 +01:00
Christopher Faulet
4991a51208 MINOR: stats: Add stats about request and response bytes received and sent
In previous patches, these counters were added per frontend, backend, server
and listener. With this patch, these counters are reported on stats,
including promex.

Note that the stats file minor version was incremented by one because the
shm_stats_file_object struct size has changed.

This patch is related to issue #1617.
2025-11-06 15:01:29 +01:00
Christopher Faulet
0084baa6ba MINOR: counters: Remove bytes_in and bytes_out counter from fe/be/srv/li
bytes_in and bytes_out counters per frontend, backend, listener and server
were removed and we now rely on, respectively on, req_in and res_in
counters.

This patch is related to issue #1617.
2025-11-06 15:01:29 +01:00
Christopher Faulet
567df50d91 MINOR: stream: Remove bytes_in and bytes_out counters from stream
per-stream bytes_in and bytes_out counters was removed and replaced by
req.in and res.in. Coorresponding samples still exists but replies on new
counters.

This patch is related to issue #1617.
2025-11-06 15:01:29 +01:00
Christopher Faulet
1c62a6f501 MINOR: counters: Add req_in/req_out/res_in/res_out counters for fe/be/srv/li
Thanks to the previous patch, and based on info available on the stream, it
is now possible to have counters for frontends, backends, servers and
listeners to report number of bytes received and sent on both sides.

This patch is related to issue #1617.
2025-11-06 15:01:29 +01:00
Christopher Faulet
ac9201f929 MINOR: stream: Add samples to get number of bytes received or sent on each side
req.in and req.out samples can now be used to get the number of bytes
received by a client and send to the server. And res.in and res.out samples
can be used to get the number of bytes received by a server and send to the
client. These info are stored in the logs structure inside a stream.

This patch is related to issue #1617.
2025-11-06 15:01:28 +01:00
Christopher Faulet
629fbbce19 MINOR: stconn: Add counters to SC to know number of bytes received and sent
<bytes_in> and <bytes_out> counters were added to SC to count, respectively,
the number of bytes received from an endpoint or sent to an endpoint. These
counters are updated for connections and applets.

This patch is related to issue #1617.
2025-11-06 15:01:28 +01:00
William Lallemand
094baa1cc0 BUG/MINOR: acme: allow 'key' when generating cert
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
Allow to use the 'key' keyword when 'crt' was generated with both a crt
and a key.

No backport needed.
2025-11-06 14:11:43 +01:00
Willy Tarreau
5fe4677231 MINOR: server: move the lock inside srv_add_idle()
Almost all callers of _srv_add_idle() lock the list then call the
function. It's not the most efficient and it requires some care from
the caller to take care of that lock. Let's change this a little bit by
having srv_add_idle() that takes the lock and calls _srv_add_idle() that
is now inlined. This way callers don't have to handle the lock themselves
anymore, and the lock is only taken around the sensitive parts, not the
function call+return.

Interestingly, perf tests show a small perf increase from 2.28-2.32M RPS
to 2.32-2.37M RPS on a 128-thread system.
2025-11-06 13:16:24 +01:00
William Lallemand
a8498cde74 BUILD: ssl/ckch: fix ckch_conf_kws parsing without ACME
Without ACME, the tmp_pkey and tmp_x509 functions are not available, the
patch checks HAVE_ACME to use them.
2025-11-06 12:27:27 +01:00
William Lallemand
22f92804d6 BUG/MINOR: acme: fix initialization issue in acme_gen_tmp_x509()
src/acme.c: In function ‘acme_gen_tmp_x509’:
src/acme.c:2685:15: error: ‘digest’ may be used uninitialized [-Werror=maybe-uninitialized]
 2685 |         if (!(X509_sign(newcrt, pkey, digest)))
      |              ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/acme.c:2628:23: note: ‘digest’ was declared here
 2628 |         const EVP_MD *digest;
      |                       ^~~~~~
2025-11-06 12:12:18 +01:00
William Lallemand
0524af034f BUILD: acme: acme_gen_tmp_x509() signedness and unused variables
Fix compilation issues in acme_gen_tmp_x509().

src/acme.c:2665:66: warning: pointer targets in passing argument 4 of ‘X509_NAME_add_entry_by_txt’ differ in signedness [-Wpointer-sign]
 2665 |         if (X509_NAME_add_entry_by_txt(name, "CN", MBSTRING_ASC, "expired",
      |                                                                  ^~~~~~~~~
      |                                                                  |
      |                                                                  char *
In file included from /usr/include/openssl/ssl.h:32,
                 from include/haproxy/openssl-compat.h:19,
                 from include/haproxy/acme-t.h:6,
                 from src/acme.c:16:
/usr/include/openssl/x509.h:1074:53: note: expected ‘const unsigned char *’ but argument is of type ‘char *’
 1074 |                                const unsigned char *bytes, int len, int loc,
      |                                ~~~~~~~~~~~~~~~~~~~~~^~~~~
src/acme.c:2630:23: warning: unused variable ‘i’ [-Wunused-variable]
 2630 |         unsigned int  i;
      |                       ^
src/acme.c:2629:23: warning: unused variable ‘ctx’ [-Wunused-variable]
 2629 |         X509V3_CTX    ctx;
      |                       ^~~
2025-11-06 12:08:04 +01:00
William Lallemand
a15d4f5b19 BUILD: ssl/ckch: wrong function name in ckch_conf_kws
ckch_conf_load_pem does not exist anymore and
ckch_conf_load_pem_or_generate must be used instead
2025-11-06 12:03:29 +01:00
William Lallemand
582a1430b2 MEDIUM: acme: generate a key pair when no file are available
When an acme keyword is associated to a crt and key, and the corresponding
files does not exist, HAProxy would not start.

This patch allows to configure acme without pre-generating a keypair before
starting HAProxy. If the files does not exist, it tries to generate a unique
keypair in memory, that will be used for every ACME certificates that don't
have a file on the disk yet.
2025-11-06 11:56:27 +01:00
William Lallemand
546c67d137 MINOR: acme: generate a temporary key pair
This patch provides two functions acme_gen_tmp_pkey() and
acme_gen_tmp_x509().

These functions generates a unique keypair and X509 certificate that
will be stored in tmp_x509 and tmp_pkey. If the key pair or certificate
was already generated they will return the existing one.

The key is an RSA2048 and the X509 is generated with a expiration in the
past. The CN is "expired".

These are just placeholders to be used if we don't have files.
2025-11-06 11:56:27 +01:00
William Lallemand
1df55b441b MEDIUM: ssl/ckch: use ckch_store instead of ckch_data for ckch_conf_kws
This is an API change, instead of passing a ckch_data alone, the
ckch_conf_kws.func() is called with a ckch_store.

This allows the callback to access the whole ckch_store, with the
ckch_conf and the ckch_data. But it requires the ckch_conf to be
actually put in the ckch_store before.
2025-11-06 11:56:27 +01:00
Olivier Houchard
201971ec5f MEDIUM: stick-tables: Optimize the expiration process a bit.
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
In process_tables_expire(), if the table we're analyzing still has
entries, and thus should be put back into the tree, do not put it in the
mt_list, to have it put back into the tree the next time the task runs.
There is no problem with putting it in the tree right away, as either
the next expiration is in the future, or we handled the maximum number
of expirations per task call and we're about to stop, anyway.

This does not need to be backported.
2025-11-05 19:22:11 +01:00
Olivier Houchard
93f994e8b1 BUG/MEDIUM: stick-tables: Make sure we handle expiration on all tables
In process_tables_expire(), when parsing all the tables with expiration
set, to check if the any entry expired, make sure we start from the
oldest one, we can't just rely on eb32_first(), because of sign issues
on the timestamp.
Not doing that may mean some tables are not considered for expiration.

This does not need to be backported.
2025-11-05 19:22:11 +01:00
Amaury Denoyelle
b9809fe0d0 MINOR: quic: remove <mux_state> field
Some checks failed
Contrib / build (push) Has been cancelled
alpine/musl / gcc (push) Has been cancelled
VTest / Generate Build Matrix (push) Has been cancelled
Windows / Windows, gcc, all features (push) Has been cancelled
VTest / (push) Has been cancelled
This patch removes <mux_state> field from quic_conn structure. The
purpose of this field was to indicate if MUX layer above quic_conn is
not yet initialized, active, or already released.

It became tedious to properly set it as initialization order of the
various quic_conn/conn/MUX layers now differ between the frontend and
backend sides, and also depending if 0-RTT is used or not. Recently, a
new change introduced in connect_server() will allow to initialize QUIC
MUX earlier if ALPN is cached on the server structure. This had another
level of complexity.

Thus, this patch removes <mux_state> field completely. Instead, a new
flag QUIC_FL_CONN_XPRT_CLOSED is defined. It is set at a single place
only on close XPRT callback invokation. It can be mixed with the new
utility functions qc_wait_for_conn()/qc_is_conn_ready() to determine the
status of conn/MUX layers now without an extra quic_conn field.
2025-11-05 14:03:34 +01:00
William Lallemand
4f978325ac MEDIUM: cfgparse: 'daemon' not compatible with -Ws
Emit a warning when the 'daemon' keyword is used in master-worker mode
for systemd (-Ws). This never worked and was always ignored by setting
MODE_FOREGROUND during cmdline parsing.
2025-11-05 11:49:11 +01:00
William Lallemand
631233e9ec MEDIUM: cfgparse: deprecate 'master-worker' keyword alone
Warn when the 'master-worker' keyword is used without
'no-exit-on-failure'.

Warn when the 'master-worker' keyword is used and -W and -Ws already set
the mode.
2025-11-05 11:49:11 +01:00
Willy Tarreau
096999ee20 BUG/MEDIUM: connections: permit to permanently remove an idle conn
There's currently a function conn_delete_from_tree() which is used to
detach an idle connection from the tree it's currently attached to so
that it is no longer found. This function is used in three circumstances:
  - when picking a new connection that no longer has any avail stream
  - when temporarily working on the connection from an I/O handler,
    in which case it's re-added at the end
  - when killing a connection

The 2nd case above is quite specific, as it requires to preserve the
CO_FL_LIST_MASK flags so that the connection can be re-inserted into
the proper tree when leaving the handler. However, there's a catch.
When killing a connection, we want to be certain it will not be
reinserted into the tree. The flags preservation is causing a tiny
race if an I/O happens while the connection is in the kill list,
because in this case the I/O handler will note the connection flags,
do its work, then reinsert the connection where it believed it was,
then the connection gets purged, and another user can find it in the
tree.

The issue is very difficult to reproduce. On a 128-thread machine it
happens in H2 around 500k req/s after around 50M requests. In H1 it
happens after around 1 billion requests.

The fix here consists in passing an extra argument to the function to
indicate if the removal is permanent or not. When it's permanent, the
function will clear the associated flags. The callers were adjusted
so that all those dequeuing a connection in order to kill it do it
permanently and all other ones do it only temporarily.

A slightly different approach could have worked: the function could
always remove all flags, and the callers would need to restore them.
But this would require trickier modifications of the various call
places, compared to only passing 0/1 to indicate the permanent status.

This will need to be backported to all stable versions. The issue was
at least reproduced since 3.1 (not tested before). The patch will need
to be adjusted for 3.2 and older, because a 2nd argument "thr" was
added in 3.3, so the patch will not apply to older versions as-is.
2025-11-05 11:08:25 +01:00
Willy Tarreau
59c599f3f0 BUG/MEDIUM: mux-h2: make sure not to move a dead connection to idle
In h2_detach(), it looks possible to place a dead connection back to
the idle list, and to later call h2_release() on it once detected as
dead. It's not certain that it happens but nothing in the code shows
it is not possible, so better make sure it cannot happen.

This should be preventively backported to all versions.
2025-11-05 11:08:25 +01:00
Maximilian Moehl
0799fd1072 BUG/MEDIUM: mux-h1: fix 414 / 431 status code reporting
The more detailed status code reporting introduced with bc967758a2 is
checking against the error state to determine whether it is a too long
URL or too large headers. The check used always returns true which
results in a 414 as the error state is only set at a later point.

This commit adjusts the check to use the current state instead to return
the intended status code.

This patch must be backported as far as 3.1.
2025-11-05 10:55:18 +01:00
Olivier Houchard
06821dc189 BUG/MEDIUM: server: Also call srv_reset_path_parameters() on srv up
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
Also call srv_reset_path_parameters() when the server changed states,
and got up. It is not enough to do it when the server goes down, because
there's a small race condition, and a connection could get established
just after we did it, and could have set the path parameters.

This does not need to be backported.
2025-11-04 18:47:34 +01:00
Olivier Houchard
7d4aa7b22b BUG/MEDIUM: server: Add a rwlock to path parameter
Add a rwlock to control the server's path_parameter, to make sure
multiple threads don't set it at the same time, and it can't be seen in
an inconsistent state.
Also don't set the parameter every time, only set them if they have
changed, to prevent needless writes.

This does not need to be backported.
2025-11-04 18:47:34 +01:00
Amaury Denoyelle
efe60745b3 MINOR: quic: remove connection arg from qc_new_conn()
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
This patch is similar to the previous one, this time dealing with
qc_new_conn(). This function was asymetric on frontend and backend side,
as connection argument was set only in the latter case.

This was required prior due to qc_alloc_ssl_sock_ctx() signature. This
has changed with the previous patch, thus qc_new_conn() can also be
realigned on both FE and BE sides. <conn> member of quic_conn instance
is always set outside it, in qc_xprt_start() on the backend case.
2025-11-04 17:47:42 +01:00
Amaury Denoyelle
5a17cade4f MINOR: quic: do not set conn member if ssl_sock_ctx
ssl_sock_ctx is a generic object used both on TCP/SSL and QUIC stacks.
Most notably it contains a <conn> member which is a pointer to struct
connection.

On QUIC frontend side, this member is always set to NULL. Indeed,
connection is only created after handshake completion. However, this has
changed for backend side, where the connection is instantiated prior to
its quic_conn counterpart. Thus, ssl_sock_ctx member would be set in
this case as a convenience for use later in qc_ssl_do_hanshake().

However, this method was unsafe as the connection can be released,
without resetting ssl_sock_ctx member. Thus, the previous patch fixes
this by using on <conn> member through the quic_conn instance which is
the proper way.

Thus, this patch resets ssl_sock_ctx <conn> member to NULL. This is
deemed the cleanest method as it ensures that both frontend and backend
sides must not use it anymore.
2025-11-04 17:38:09 +01:00
Amaury Denoyelle
69de7ec14e BUG/MINOR: quic: fix crash on client handshake abort
On backend side, a connection can be aborted and released prior to
handshake completion. This causes a crash in qc_ssl_do_hanshake() as
<conn> member of ssl_sock_ctx is not reset in this case.

To fix this, use <conn> member of quic_conn instead. This is safe as it
is properly set to NULL when a connection is released.

No impact on the frontend side as <conn> member is not accessed. Indeed,
in this case connection is most of the times allocated after handshake
completion.

No need to be backported.
2025-11-04 17:33:42 +01:00
Amaury Denoyelle
6bfabfdc77 OPTIM: backend: skip conn reuse for incompatible proxies
Some checks failed
Contrib / build (push) Has been cancelled
alpine/musl / gcc (push) Has been cancelled
VTest / Generate Build Matrix (push) Has been cancelled
Windows / Windows, gcc, all features (push) Has been cancelled
VTest / (push) Has been cancelled
When trying to reuse a backend connection, a connection hash is
calculated to match an entry with similar parameters. Previously, this
operation was skipped if the stream content wasn't based on HTTP, as it
would have been incompatible with http-reuse.

With the introduction of SPOP backends, this condition was removed, so
that it can also benefit from connection reuse. However, this means that
now hash calcul is always performed when connecting to a server, even
for TCP or log backends. This is unnecessary as these proxies cannot
perform connection reuse.

Note also that reuse mode is resetted on postparsing for incompatible
backends. This at least guarantees that no tree lookup will be performed
via be_reuse_connection(). However, connection lookup is still performed
in the session via session_get_conn() which is another unnecessary
operation.

Thus, this patch restores the condition so that reuse operations are now
entirely skipped if a backend mode is incompatible. This is implemented
via a new utility function named be_supports_conn_reuse().

This could be backported up to 3.1, as this commit could be considered
as a performance regression for tcp/log backend modes.
2025-11-03 10:43:50 +01:00
Willy Tarreau
ad1bdc3364 BUG/MAJOR: stats-file: fix crash on non-x86 platform caused by unaligned cast
Since commit d655ed5f14 ("BUG/MAJOR: stats-file: ensure
shm_stats_file_object struct mapping consistency (2nd attempt)"), the
last_state_change field in the counters is a uint (to match how it's
reported). However, it happens that there are explicit casts in function
me_generate_field() to retrieve the value, and which cause crashes on
aarch64 and likely other non-x86 64-bit platforms due to atomically
reading an unaligned 64-bit value, and may even randomly crash other
64-bit platforms when reading past the end of the structure.

The fix for now adapts the cast to match the one used by the accessed
type (i.e. unsigned int), but the approach must change, as there's
nothing there which allows to figure whether or not the type is correct
by just reading the code. At minima a typeof() on a named field is
needed, but this requires more invasive changes, hence this temporary
fix.

No backport is needed, as stats-file is only in 3.3.
2025-11-03 07:33:11 +01:00
Damien Claisse
561dc127bd BUG/MINOR: resolvers: ensure fair round robin iteration
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
Previous fixes restored round robin iteration, but an imbalance remains
when the response tree contains record types other than A or AAAA. Let's
take the following example: the DNS answers two A records and a CNAME.
The response "tree" (which is actually flat, more like a list) may look
as follows, ordered by hash:
- 1st item: first A record with IP 1
- 2nd item: second A record with IP 2
- 3rd item: CNAME record
As a consequence, resolv_get_ip_from_response will iterate as follows,
while the TTL is still valid:
- 1st call: DNS request is done, response tree is created, iteration
  starts at the first item, IP 1 is returned.
- 2nd call: cached response tree is used, iteration starts at the second
  item, IP 2 is returned.
- 3rd call: cached response tree is used, iteration starts at the third
  item, but it's a CNAME, so we continue to the next item, which restarts
  iteration at the first item, and IP 1 is returned.
- 4th call: cached response tree is used and iteration restarts at the
  beginning, returning IP 1 again.
The 1-2-1-1-2-1-1-2 sequence will repeat, so IP 1 will be used twice as
often as IP 2, creating a strong imbalance. Even with more IP addresses,
the first one by hashing order in the tree will always receive twice the
traffic of the others.
To fix this, set the next iteration item to the one following the selected
IP record, if any. This ensures we never use the same IP twice in a row.

This commit should be backported where 3023e9819 ("BUG/MINOR: resolvers:
Restore round-robin selection on records in DNS answers") is, so as far
as 2.6.
2025-11-02 17:28:32 +01:00
William Lallemand
1d859bdaa2 MINOR: sample: optional AAD parameter support to aes_gcm_enc/dec
The aes_gcm_enc() and aes_gcm_dec() sample converters now accept an
optional fifth argument for Additional Authenticated Data (AAD). When
provided, the AAD value is base64-decoded and used during AES-GCM
encryption or decryption. Both string and variable forms are supported.

This enables use cases that require authentication of additional data.
2025-10-31 12:27:38 +01:00
Amaury Denoyelle
73b5d331cc OPTIM: quic: adjust automatic ALPN setting for QUIC servers
If a QUIC server is declared without ALPN, "h3" value is automatically
set during _srv_parse_finalize().

This patch adjusts this operation. Instead of relying on
ssl_sock_parse_alpn(), a plain strdup() is used. This is considered more
efficient as the ALPN string is constant in this case. This method is
already used for listeners on the frontend side.
2025-10-31 11:32:20 +01:00
Amaury Denoyelle
14a6468df5 MINOR: quic: reject conf with QUIC servers if not compiled
Ensure that QUIC support is compiled into haproxy when a QUIC server is
configured. This check is performed during _srv_parse_finalize() so that
it is detected both on configuration parsing and when adding a dynamic
server via the CLI.

Note that this changes the behavior of srv_is_quic() utility function.
Previously, it always returned false when QUIC support wasn't compiled.
With this new check introduced, it is now guaranteed that a QUIC server
won't exist if compilation support is not active. Hence srv_is_quic()
does not rely anymore on USE_QUIC define.
2025-10-31 11:32:20 +01:00
Amaury Denoyelle
1af3caae7d MINOR: quic: enable SSL on QUIC servers automatically
Previously, QUIC servers were rejected if SSL was not explicitely
activated using 'ssl' configuration keyword.

Change this behavior : now SSL is automatically activated for QUIC
servers when the keyword is missing. A warning is displayed as it is
considered better to explicitely note that SSL is in use.
2025-10-31 11:32:14 +01:00
Willy Tarreau
a1f26ca307 BUG/MINOR: mux-h2: send the preface along with the first request if needed
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
Tests involving 0-RTT and H2 on the backend show that 0-RTT is being
partially used but does not work. The analysis shows that only the
preface and settings are sent using early-data and the request is sent
separately. As explained in the previous patch, this is caused by the
fact that a wakeup of the iocb is needed just to send the preface, then
a new call to process_stream is needed to try sending again.

Here with this patch, we're making h2_snd_buf() able to send the preface
if it was not yet sent. Thanks to this, the preface, settings and first
request can now leave as a single TCP segment. In case of TLS with 0-RTT,
it now allows all the block to leave in early data.

Even in clear-text H2, we're now seeing a 15% lower context-switch count,
and the number of calls to process_stream() per connection dropped from 3
to 2. The connection rate increased by an extra 9.5%. Compared to without
the last 3 patches, this is a 22% reduction of context-switches, 33%
reduction of process_stream() calls, and 15.7% increase in connection
rate. And more importantly, 0-RTT now really works with H2 on the
backend, saving one full RTT on the first request.

This fix is only for a missed optimization and a non-functional 0-RTT
on the backend. It's worth backporting it, but it doesn't cause enough
harm to hurry a backport. Better wait for it to live a little bit in
3.3 (till at least a week or two after the final release) before
backporting it. It's not sure that it's worth going beyond 3.2 in any
case. It depends on the these two previous commits:

  MEDIUM: mux-h2: do not needlessly refrain from sending data early
  MINOR: mux-h2: extract the code to send preface+settings into its own function
2025-10-30 18:16:54 +01:00
Willy Tarreau
d5aa3e19cc MINOR: mux-h2: extract the code to send preface+settings into its own function
The code that deals with sending preface + settings and changing the
state currently is in h2_process_mux(), but we'll want to do it as
well from h2_snd_buf(), so let's move it to a dedicate function first.
At this point there is no functional change.
2025-10-30 18:16:54 +01:00
Willy Tarreau
b0e8edaef2 MEDIUM: mux-h2: do not needlessly refrain from sending data early
The mux currently refrains from sending data before H2_CS_FRAME_H, i.e.
before the peer's SETTINGS frame was received. While it makes sense on
the frontend, it's causing harm on the backend because it forces the
first request to be sent in two halves over an extra RTT: first the
preface and settings, second the request once the settings are received.
This is totally contrary to the philosophy of the H2 protocol, consisting
in permitting the client to send as soon as possible.

Actually what happens is the following:
  - process_stream() calls connect_server()
  - connect_server() creates a connection, and if the proto/alpn is guessed
    or known, the mux is instantiated for the current request.
  - the H2 init code wakes the h2 tasklet up and returns
  - process_stream() tries to send the request using h2_snd_buf(), but that
    one sees that we're before H2_CS_FRAME_H, refrains from doing so and
    returns.
  - process_stream() subscribes and quits
  - the h2 tasklet can now execute to send the preface and settings, which
    leave as a first TCP segment. The connection is ready.
  - the iocb is woken again once the server's SETTINGS frame is received,
    turning the connection to the H2_CS_FRAME_H state, and the iocb wake
    up process_stream().
  - process_stream() executes again and can try to send again.
  - h2_snd_buf() is called and finally sends the request as a second TCP
    segment.

Not only this is inefficient, but it also renders 0-RTT and TFO impossible
on H2 connections. When 0-RTT is used, only the preface and settings leave
as early data (the very first data of that connection), which is totally
pointless.

In order to fix this, we have to go through a few steps:
  - first we need to let data be sent to a server immediately after the
    SETTINGS frame was sent (i.e. in H2_CS_SETTINGS1 state instead of
    H2_CS_FRAME_H). However, some protocol extensions are advertised by
    the server using SETTINGS (e.g. RFC8441) and some requests might need
    to know the existence of such extensions. For this reason we're adding
    a new h2c flag, H2_CF_SETTINGS_NEEDED, which indicates that some
    operations were not done because a server's SETTINGS frame is needed.
    This is set when trying to send a protocol upgrade or extended CONNECT
    during H2_CS_SETTINGS1, indicating that it's needed to wait for
    H2_CS_FRAME_H in this case. The flag is always set on frontend
    connections. This is what is being done in this patch.

  - second, we need to be able to push the preface opportunistically with
    the first h2_snd_buf() so that it's not needed to wake the tasklet up
    just to send that and wake process_stream() again. This will be in a
    separate patch.

By doing the first step, we're at least saving one needless tasklet
wakeup per connection (~9%), which results in ~5% backend connection
rate increase.
2025-10-30 18:16:54 +01:00
William Lallemand
f6503bd7d3 BUG/MINOR: ech: non destructive parsing in cli_find_ech_specific_ctx()
cli_find_ech_specific_ctx() parses the <frontend>/<bind_conf> and sets
 a \0 in place the '/'. But the originals tring is still used to emit
 messages in the CLI so we only output the frontend part.

 This patch do the parsing in a trash buffer instead.
2025-10-30 11:59:39 +01:00
sftcd
9aacb684cd MINOR: ssl/ech: key management via stats socket
This patch extends the ECH support by adding runtime CLI commands to
view and modify ECH configurations.

New commands are added to the HAProxy CLI:
- "show ssl ech [<name>]" displays all ECH configurations or a specific
  one.
- "add ssl ech <name> <payload>" adds a new PEM-formatted ECH
  configuration.
- "set ssl ech <name> <payload>" replaces all existing ECH
  configurations.
- "del ssl ech <name> [<age-in-secs>]" removes ECH configurations,
  optionally filtered by age.
2025-10-30 10:38:31 +01:00
William Lallemand
1e2f920be6 MINOR: listener: implement bind_conf_find_by_name()
Returns a pointer to the first bind_conf matching <name> in a frontend
<front>.

When name is prefixed by a @ (@<filename>:<linenum>), it tries to look
for the corresponding filename and line of the configuration file.

NULL is returned if no match is found.
2025-10-30 10:37:42 +01:00
sftcd
23f5cbb411 MINOR: ssl/ech: add logging and sample fetches for ECH status and outer SNI
This patch adds functions to expose Encrypted Client Hello (ECH) status
and outer SNI information for logging and sample fetching.

Two new helper functions are introduced in ech.c:
 - conn_get_ech_status() places the ECH processing status string into a
   buffer.
 - conn_get_ech_outer_sni() retrieves the outer SNI value if ECH
   succeeded.

Two new sample fetch keywords are added:
 - "ssl_fc_ech_status" returns the ECH status string.
 - "ssl_fc_ech_outer_sni" returns the outer SNI value seen during ECH.

These allow ECH information to be used in HAProxy logs, ACLs, and
captures.
2025-10-30 10:37:30 +01:00
sftcd
dba4fd248a MEDIUM: ssl/ech: config and load keys
This patch introduces the USE_ECH option in the Makefile to enable
support for Encrypted Client Hello (ECH) with OpenSSL.

A new function, load_echkeys, is added to load ECH keys from a specified
directory. The SSL context initialization process in ssl_sock.c is
updated to load these keys if configured.

A new configuration directive, `ech`, is introduced to allow users to
specify the ECH key  directory in the listener configuration.
2025-10-30 10:37:12 +01:00
William Lallemand
83e3cbc262 BUG/MINOR: ssl: returns when SSL_CTX_new failed during init
In ssl_sock_initial_ctx(), returns when SSL_CTX_new() failed instead of
trying to apply anything on the ctx. This may avoid crashing when
there's not enough memory anymore during configuration parsing.

Could be backported in every haproxy versions
2025-10-30 10:36:56 +01:00
Olivier Houchard
b3d6f44af8 MEDIUM: h1: Immediately try to read data for frontend
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
In h1_init(), if we're a frontend connection, immediately attempt to
read data, if the connection is ready, instead of just subscribing.
There may already be data available, at least if we're using 0RTT.

This may be backported up to 2.8 in a while, after 3.3 is released, so
that if it causes problem, we have a chance to hear about it.
2025-10-29 17:18:26 +01:00
Christopher Faulet
c84c15d393 BUG/MINOR: resolvers: Apply dns-accept-family setting on additional records
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
dns-accept-family setting was only evaluated for responses to A / AAAA DNS
queries. It was ignored when additional records in SRV responses were
parsed.

With this patch, whena SRV responses is parsed, additional records not
matching the dns-accept-family setting are ignored, as expected.

This patch must be backported to 3.2.
2025-10-29 11:20:27 +01:00
Remi Tricot-Le Breton
dc35a3487b MINOR: ssl: Do not dump decrypted privkeys in 'dump ssl cert'
A private keys that is password protected and was decoded during init
thanks to the password obtained thanks to 'ssl-passphrase-cmd' should
not be dumped via 'dump ssl cert' CLI command.
2025-10-29 10:54:17 +01:00
Remi Tricot-Le Breton
5a036d223b MEDIUM: ssl: Add local passphrase cache
Instead of calling the external password command for all loaded
encrypted certificates, we will keep a local password cache.
The passwords won't be stored as plain text, they will be stored
obfuscated into the password cache. The obfuscation is simply based on a
XOR'ing with a random number built during init.
After init is performed, the password cache is overwritten and freed so
that no dangling info allowing to dump the passwords remains.
2025-10-29 10:54:17 +01:00
Remi Tricot-Le Breton
478dd7bad0 MEDIUM: ssl: Add certificate password callback that calls external command
When a certificate is protected by a password, we can provide the
password via the dedicated pem_password_cb param provided to
PEM_read_bio_PrivateKey.
HAProxy will fetch the password automatically during init by calling a
user-defined external command that should dump the right password on its
standard output (see new 'ssl-passphrase-cmd' global option).
2025-10-29 10:54:17 +01:00
Remi Tricot-Le Breton
a011683622 MINOR: init: Use devnullfd in stdio_quiet calls instead of recreating a fd everytime
Since commit "65760d MINOR: init: Make devnullfd global and create it
earlier in init" the devnullfd file descriptor pointing to /dev/null
is created regardless of the process's parameters so we can use it in
all 'stdio_quiet' calls instead or recreating an FD.
2025-10-29 10:54:17 +01:00
Remi Tricot-Le Breton
1ec59d3426 MINOR: init: Make devnullfd global and create it earlier in init
The devnull fd might be needed during configuration parsing, if some
options require to fork/exec for instance. So we now create it much
earlier in the init process and without depending on the '-q' or '-d'
parameters.
2025-10-29 10:54:17 +01:00
Remi Tricot-Le Breton
c606ff45a0 BUG/MINOR: init: Do not close previously created fd in stdio_quiet
During init we were calling 'stdio_quiet' and passing the previously
created 'devnullfd' file descriptor. But the 'stdio_quiet' was also
closed afterwards which raised an error (EBADF).
If we keep from closing FDs that were opened outside of the
'stdio_quiet' function we will let the caller manage its FD and avoid
double close calls.

This patch can be backported to all stable branches.
2025-10-29 10:54:17 +01:00
Huangbin Zhan
ad9a24ee55 MINOR: http: fix 405,431,501 default errorfile
A few typos were present in the default errorfiles for the status codes
above (missing dot at the end of the sentence, extra closing bracket).
This fixes them. This can be backported.
2025-10-29 08:47:19 +01:00
Willy Tarreau
18b27bfec9 MINOR: ssl-sample: add ssl_fc_early_rcvd() to detect use of early data
We currently have ssl_fc_has_early() which says that early data are still
unconfirmed by a final handshake, but nothing to see if a client has been
able to use early data at all, which is a problem because such mechanisms
generally depend on multiple factors and it's hard to know when they start
to work. This new sample fetch function will indicate that some early data
were seen over that front connection, i.e. this can be used to confirm
that at some point the client was able to push some. This is essentially
a debugging tool that has no practical use case other than debugging.
2025-10-29 08:13:29 +01:00
Amaury Denoyelle
7f2ae10920 BUG/MINOR: acl: warn if "_sub" derivative used with an explicit match
Some checks failed
Contrib / build (push) Has been cancelled
alpine/musl / gcc (push) Has been cancelled
VTest / Generate Build Matrix (push) Has been cancelled
Windows / Windows, gcc, all features (push) Has been cancelled
VTest / (push) Has been cancelled
Recently, a new warning is displayed when an ACL derivative match method
is override with another '-m' method. This is implemented via the
following patch :

  6ea50ba462
  MINOR: acl; Warn when matching method based on a suffix is overwritten

However, this warning was not reported when "_sub" suffix was specified.
Fix this by adding PAT_MATCH_SUB in the warning comparison.

No backport needed except if above commit is.
2025-10-28 11:59:32 +01:00
Remi Tricot-Le Breton
89b43740e3 BUG/MINOR: ssl: Remove unreachable code in CLI function
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
Remove unreachable code in 'cli_parse_show_jwt' function.

This bug was raised in GitHub #3159.
This patch does not need to be backported.
2025-10-28 10:44:51 +01:00
Remi Tricot-Le Breton
7482b6ebf0 BUG/MEDIUM: ssl: Crash because of dangling ckch_store reference in a ckch instance
When updating CAs via the CLI, we need to create new copies of all the
impacted ckch instances (as in referenced in the ckch_inst_link list of
the updated CA) in order to use them instead of the old ones once the
updated is completed. This relies on the ckch_inst_rebuild function that
would set the ckch_store field of the ckch_inst. But we forgot to also
add the newly created instances in the ckch_inst list of the
corresponding ckch_store.

When updating a certificate afterwards, we iterate over all the
instances linked in the ckch_inst list of the ckch_store (which is
missing some instances because of the previous command) and rebuild the
instances before replacing the ckch_store. The previous ckch_store,
still referenced by the dangling ckch instance then gets deleted which
means that the instance keeps a reference to a free'd object.

Then if we were to once again update the CA file, we would iterate over
the ckch instances referenced in the cafile_entry's ckch_inst_link list,
which includes the first mentioned ckch instance with the dead
ckch_store reference. This ends up crashing during the ckch_inst_rebuild
operation.

This bug was raised in GitHub #3165.
This patch should be backported to all stable branches.
2025-10-28 10:43:45 +01:00
Willy Tarreau
2d7e3ddd4a BUG/MEDIUM: cli: do not return ACKs one char at a time
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
Since 3.0 where the CLI started to use rcv_buf, it appears that some
external tools sending chained commands are randomly experiencing
failures. Each time this happens when the whole command is sent as a
single packet, immediately followed by a close. This is not a correct
way to use the CLI but this has been working for ages for simple
netcat-based scripts, so we should at least try to preserve this.

The cause of the failure is that the first LF that acks a command is
immediately sent back to the client and rejected due to the closed
connection. This in turn forwards the error back to the applet which
aborts its processing.

Before 3.0 the responses would be queued into the buffer, then sent
back to the channel, and would all fail at once. This changed when
snd_buf/rcv_buf were implemented because the applets are much more
responsive and since they yield between each command, they can
deliver one ACK at a time that is immediately forwarded down the
chain.

An easy way to observe the problem is to send 5 map updates, a shutdown,
and immediately close via tcploop, and in parallel run a periodic
"show map" to count the number of elements:

  $ tcploop -U /tmp/sock1 C S:"add map #0 1 1; add map #0 2 2; add map #0 3 3; add map #0 4 4; add map #0 5 5\n" F K

Before 3.0, there would always be 5 elements. Since 3.0 and before
20ec1de214 ("MAJOR: cli: Refacor parsing and execution of pipelined
commands"), almost always 2. And since that commit above in 3.2, almost
always one. Doing the same using socat or netcat shows almost always 5...
It's entirely timing-dependent, and might even vary based on the RTT
between the client and haproxy!

The approach taken here consists in doing the same principle as MSG_MORE
or Nagle but on the response buffer: the applet doesn't need to send a
single ACK for each command when it has already been woken up and is
scheduled to come back to work. It's fine (and even desirable) that
ACKs are grouped in a single packet as much as possible.

For this reason, this patch implements APPCTX_CLI_ST1_YIELD, a new CLI
flag which indicates that the applet left in yielding condition, i.e.
it has not finished its work. This flag is used by .rcv_buf to hold
pending data. This way we won't return partial responses for no reason,
and we can continue to emulate the previous behavior.

One very nice benefit to this is that it saves huge amounts of CPU on
the client. In the test below that tries to update 1M map entries, the
CPU used by socat went from 100% to 0% and the total transfer time
dropped by 28%:

  before:
    $ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \
         printf "add map #0 %d %d\n",i,i,i }}' | socat /tmp/sock1 - >/dev/null

    real    0m2.407s
    user    0m1.485s
    sys     0m1.682s

  after:
    $ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \
         printf "add map #0 %d %d\n",i,i,i }}' | socat /tmp/sock1 - >/dev/null

    real    0m1.721s
    user    0m0.952s
    sys     0m0.057s

The difference is also quite visible on the number of syscalls during
the test (for 1k updates):

  before:
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.071691           0    100001           sendmsg

  after:
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000011           1         9           sendmsg

This patch will need to be backported to 3.0, and depends on these two
patches to be backported as well:

    MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty
    MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()
2025-10-27 16:57:07 +01:00
Willy Tarreau
f38ea2731b MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()
This is in preparation for a future fix. For now it's simply a pure
copy of the original function, but dedicated to the CLI. It will
have to be backported to 3.0.
2025-10-27 16:57:07 +01:00
Willy Tarreau
35106d65fb MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty
appctx_rcv_buf() prepares all the work to schedule the transfers between
the applet and the channel, and it takes care of setting the various flags
that indicate what condition is blocking the transfer from progressing.

There is one limitation though. In case an applet refrains from sending
data (e.g. rate-limited, prefers to aggregate blocks etc), it will leave
a possibly empty channel buffer, and keep some data in its outbuf. The
data in its outbuf will be seen by the function above as an indication
of a channel full condition, so it will place SE_FL_WANT_ROOM. But later,
sc_applet_recv() will see this flag with a possibly empty channel, and
will rightfully trigger a BUG_ON().

appctx_rcv_buf() should be more accurate in fact. It should only set
SE_FL_RCV_MORE when more data are present in the applet, then it should
either set or clear SE_FL_WANT_ROOM dependingon whether the channel is
empty or not.

Right now it doesn't seem possible to trigger this condition in the
current state of applets, but this will become possible with a future
bugfix that will have to be backported, so this patch will need to be
backported to 3.0.
2025-10-27 16:57:07 +01:00
Olivier Houchard
259b1e1c18 MEDIUM: quic: Fix build with openssl-compat
Some checks failed
Contrib / build (push) Has been cancelled
alpine/musl / gcc (push) Has been cancelled
VTest / Generate Build Matrix (push) Has been cancelled
Windows / Windows, gcc, all features (push) Has been cancelled
VTest / (push) Has been cancelled
As the QUIC options have been split into backend and frontend, there is
no more GTUNE_QUIC_LISTEN_OFF to be found in global.tune.options, look
for QUIC_TUNE_FE_LISTEN_OFF in quic_tune.fe instead.
This should fix the build with USE_QUIC and USE_QUIC_OPENSSL_COMPAT.
2025-10-24 13:51:15 +02:00
Willy Tarreau
1824079fca BUG/MINOR: stick-tables: properly index string-type keys
This is one of the rare pleasant surprises of fixing an almost 16-years
old bug that remained unnoticed since the feature was implemented. In
1.4-dev7, commit 3bd697e071 ("[MEDIUM] Add stick table (persistence)
management functions and types") introduced stick-tables with multiple
key types, including strings, IP addresses and integers. Entries are
coded in binary and their binary representation is indexed. A special
case was made for strings in order to index them as zero-terminated
strings. However, there's one subtlety. While strings indeed have a
zero appended, they're still indexed using ebmb_insert(), which means
that all the bytes till the configured size are indexed as well. And
while these bytes generally come from a temporary storage that often
contains zeroes, or that is longer than the configured string length
and will result in truncation, it's not always the case and certain
traffic patterns with certain configurations manage to occasionally
present unpadded strings resulting in apparent duplicate keys appearing
in the dump, as shown in GH issue #3161. It seems to be essentially
reproducible at boot, and not to be particularly affected by mixed
patterns. These keys are in fact not exact duplicates in memory, but
everywhere they're used (including during synchronization), they are
equal.

What's interesting is that when this happens, one key can be presented
to a peer with its own data and will be indexed as the only one, possibly
replacing contents from the previous key, which might replace them again
later once updated in turn. This is visible in the dump of the issue
above, where key "localhost:8001" was split into two entries, one with a
request count of one and the other with a request count of 499999, and
indeed, all peers see only that last value, which overwrote the first
one.

This fix must be backported to all stable branches. Special kudos to
Mark Wort for undelining that one.
2025-10-24 10:15:11 +02:00
Aurelien DARRAGON
d655ed5f14 BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency (2nd attempt)
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
This is a second attempt at fixing issues on 32bits systems which would
trigger the following BUG_ON() statement:

 FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825 shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this

This is a drop-in replacement for d30b88a6c + 4693ee0ff, as suggested by
Willy.

Indeed, on supported platforms unsigned int can be assumed to be 4 bytes
long, and long can be assumed to be 8 bytes long. As such, the previous
attempt was overkill and added unecessary maintenance complexity which
could result in bugs if not used properly. Moreover, it would only
partially solve the issue, since on little endian vs big endian
architectures, the provisioned memory areas (originating from the same
shm stats file) could be read differently by the host.

Instead we fix the aligments issues, and this alone helps to ensure
struct memory consistency on 64 vs 32bits platforms. It was tested
on both i386 and i586.

last_change and last_sess counters are now stored as unsigned int, as
it helped to fix the alignment issues and they were found to be used
as 32bits integers anyway.

Thanks to Willy for problem analysis and the patch proposal.

No backport needed.
2025-10-24 09:35:38 +02:00
Christopher Faulet
854888497e BUG/MEDIUM: applet: Improve again spinning loops detection with the new API
A first attempt to fix this issue was already pushed (54b7539d6 "BUG/MEDIUM:
apppet: Improve spinning loop detection with the new API"). But it not was
fully accurrate. Indeed, we must check if something was received or sent by
the applet before incrementing the call rate. But we must also take care the
applet is allowed to receive or send data. That is what is performed in this
patch.

This patch must be backported as far as 3.0 with the patch above.
2025-10-24 09:26:10 +02:00
Amaury Denoyelle
7ba4b0ad5f BUG/MINOR: quic: rename and duplicate stream settings
Some checks failed
Contrib / build (push) Has been cancelled
alpine/musl / gcc (push) Has been cancelled
VTest / Generate Build Matrix (push) Has been cancelled
Windows / Windows, gcc, all features (push) Has been cancelled
VTest / (push) Has been cancelled
Several settings can be set to control stream multiplexing and
associated receive window. Previously, all of these settings were
configured using prefix "tune.quic.frontend.", despite being applied
blindly on both sides.

Fix this by duplicating these settings specific to frontend and backend
side. Options are also renamed to use the standardize prefix
"tune.quic.[be|fe].stream." notation.

Also, each option is individually renamed to better reflect its purpose
and hide technical details relative to QUIC transport parameter naming :
* max-data-size -> stream.rxbuf
* max-streams-bidi -> stream.max-concurrent
* stream-data-ratio -> stream.data-ratio

No need to backport.
2025-10-23 16:49:20 +02:00
Amaury Denoyelle
d5142706f8 BUG/MINOR: quic: split option for congestion max window size 2025-10-23 16:49:20 +02:00
Amaury Denoyelle
33afba0dda BUG/MINOR: quic: split max-idle-timeout option for FE/BE usage
Streamline max-idle-timeout option. Rename it to use the newer cohesive
naming scheme 'tune.quic.fe|be.'.

Two different fields were already defined in global struct. These fields
are moved into quic_tune along with other QUIC settings. However, no
parser was defined for backend option, this commit fixes this.

No need to backport this.
2025-10-23 16:49:20 +02:00
Amaury Denoyelle
5bc659a4a2 MINOR: quic: rename frontend sock-per-conn setting
On frontend side, a quic_conn can have a dedicated FD or use the
listener one. These different modes can be activated via a global QUIC
tune setting.

This patch adjusts the option. First, it is renamed to the more
meaningful name 'tune.quic.fe.sock-per-conn'. Also, arguments are now
either 'default-on' or 'force-off'. The objective is to better highlight
reliationship with 'quic-socket' bind option.

The older option is deprecated and will be removed in 3.5.
2025-10-23 16:49:20 +02:00
Amaury Denoyelle
a14c6cee17 MINOR: quic: rename retry-threshold setting
A QUIC global tune setting is defined to be able to force Retry emission
prior to handshake. By definition, this ability is only supported by
QUIC servers, hence it is a frontend option only.

Rename the option to use "fe" prefix. The old option name is deprecated
and will be removed in 3.5
2025-10-23 16:49:20 +02:00
Amaury Denoyelle
d248c5bd21 MINOR: quic: rename max Tx mem setting
QUIC global memory can be limited across the entire process via a global
tune setting. Previously, this setting used to misleading "frontend"
prefix. As this is applied as a sum between all QUIC connections, both
from frontend and backend sides, remove the prefix. The new option name
is "tune.quic.mem.tx-max".

The older option name is deprecated and will be removed in 3.5.
2025-10-23 16:49:20 +02:00
Amaury Denoyelle
9bfe9b9e21 MINOR: quic: split Tx options for FE/BE usage
This patch is similar to the previous one, except that it is focused on
Tx QUIC settings. It is now possible to toggle GSO and pacing on
frontend and backend sides independently.

As with previous patch, option are renamed to use "fe/be" unified
prefixes. This is part of the current serie of commits which unify QUI
settings. Older options are deprecated and will be removed on 3.5
release.
2025-10-23 16:49:20 +02:00
Amaury Denoyelle
33a8cb87a9 MINOR: quic: split congestion controler options for FE/BE usage
Various settings can be configured related to QUIC congestion controler.
This patch duplicates them to be able to set independent values on
frontend and backend sides.

As with previous patch, option are renamed to use "fe/be" unified
prefixes. This is part of the current serie of commits which unify QUIC
settings. Older options are deprecated and will be removed on 3.5
release.
2025-10-23 16:49:20 +02:00
Amaury Denoyelle
7640e9a9ee MINOR: quic: duplicate glitches FE option on BE side
Previously, QUIC glitches support was only implemented for frontend
side. Extend this so that the option can be specified separately both on
frontend and backend sides. Function _qcc_report_glitch() now retrieves
the relevant max value based on connection side.

In addition to this, option has been renamed to use "fe/be" prefixes.
This is part of the current serie of commits which unify QUIC settings.
Older options are deprecated and will be removed on 3.5 release.
2025-10-23 16:49:20 +02:00
Amaury Denoyelle
b34cd0b506 MINOR: quic: rename "no-quic" to "tune.quic.listen"
Rename the option to quickly enable/disable every QUIC listeners. It now
takes an argument on/off. The documentation is extended to reflect the
fact that QUIC backend are not impacted by this option.

The older keyword is simply removed. Deprecation is considered
unnecessary as this setting is only useful during debugging.
2025-10-23 16:47:58 +02:00
Amaury Denoyelle
cf3cf7bdda MINOR: quic: remove unused conn-tx-buffers limit keyword
Remove parsing code for tune.quic.frontend.conn-tx-buffers.limit. This
option was deprecated for some time and in fact was noop and not
mentionned anymore in the documentation.
2025-10-23 15:06:01 +02:00
William Lallemand
d0f9515e5c MINOR: acme: display the complete challenge_ready command in the logs
When using a wildcard DNS domain in the ACME configuration, for example
*.example.com, one might think that it needs to use the challenge_ready
command with this domain. But that's not the case, the challenge_ready
command takes the domain asked by the ACME server, which is stripped of
the wildcard.

In order to be clearer, the log message shows exactly the command the
user should sent, which is clearer.
2025-10-23 11:14:07 +02:00
William Lallemand
861fe53204 MINOR: acme: add the dns-01-record field to the sink
The dns-01-record field in the dpapi sink, output the authentication
token which is needed in the TXT record in order to validate the DNS-01
challenge.
2025-10-23 11:14:07 +02:00
Olivier Houchard
dfe866fa98 BUG/MEDIUM: stick-tables: Don't loop if there's nothing left
Before waking up the expiration task again at the end of it, make sure
the next date is set. If there's nothing left to do, then task_exp will
be TASK_ETERNITY and we then don't want to be waken up again.
2025-10-23 10:51:52 +02:00
Aurelien DARRAGON
1e4dbebef2 MINOR: stats-file: fix typo in shm-stats-file object struct size detection
As reported by @TimWolla on GH #3168, there was a typo in shm stats file
BUG_ON to report that the size of shm_stats_file_object changed.

No backport needed.
2025-10-22 20:52:08 +02:00
Amaury Denoyelle
f50425c021 MINOR: quic: remove received CRYPTO temporary tree storage
Some checks failed
Contrib / build (push) Has been cancelled
alpine/musl / gcc (push) Has been cancelled
VTest / Generate Build Matrix (push) Has been cancelled
Windows / Windows, gcc, all features (push) Has been cancelled
VTest / (push) Has been cancelled
The previous commit switch from ncbuf to ncbmbuf as storage for received
CRYPTO frames. The latter ensures that buffering of such frames cannot
fail anymore due to gaps size.

Previously, extra mechanism were implemented on QUIC frames parsing
function to overcome the limitation of ncbuf on gaps size. Before
insertion, CRYPTO frames were stored in a temporary tree to order their
insertion. As this is not necessary anymore, this commit removes the
temporary tree insertion.

This commit is closely associated to the previous bug fix. As it
provides a neat optimization and code simplication, it can be backported
with it, but not in the next immediate release to spot potential
regression.
2025-10-22 15:24:02 +02:00
Amaury Denoyelle
4c11206395 BUG/MAJOR: quic: use ncbmbuf for CRYPTO handling
In QUIC, TLS handshake messages such as ClientHello are encapsulated in
CRYPTO frames. Each QUIC implementation can split the content in several
frames of random sizes. In fact, this feature is now used by several
clients, based on chrome so-called "Chaos protection" mechanism :

https://quiche.googlesource.com/quiche/+/cb6b51054274cb2c939264faf34a1776e0a5bab7

To support this, haproxy uses a ncbuf storage to store received CRYPTO
frames before passing it to the SSL library. However, this storage
suffers from a limitation as gaps between two filled blocks cannot be
smaller than 8 bytes. Thus, depending on the size of received CRYPTO
frames and their order, ncbuf may not be sufficient. Over time, several
mechanisms were implemented in haproxy QUIC frames parsing to overcome
the ncbuf limitation.

However, reports recently highlight that with some clients haproxy is
not able to deal with CRYPTO frames reception. In particular, this is
the case with the latest ngtcp2 release, which implements a similar
chaos protection mechanism via the following patch. It also seems that
this impacts haproxy interaction with firefox.

commit 89c29fd8611d5e6d2f6b1f475c5e3494c376028c
Author: Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com>
Date:   Mon Aug 4 22:48:06 2025 +0900

    Crumble Client Initial CRYPTO (aka chaos protection)

To fix haproxy CRYPTO frames buffering once and for all, an alternative
non-contiguous buffer named ncbmbuf has been recently implemented. This
type does not suffer from gaps size limitation, albeit at the cost of a
small reduction in the size available for data storage.

Thus, the purpose of this current patch is to replace ncbuf with the
newer ncbmbuf for QUIC CRYPTO frames parsing. Now, ncbmb_add() is used
to buffer received frames which is guaranteed to suceed. The only
remaining case of error is if a received frame offset and length exceed
the ncbmbuf data storage, which would result in a CRYPTO_BUFFER_EXCEEDED
error code.

A notable behavior change when switching to ncbmbuf implementation is
that NCB_ADD_COMPARE mode cannot be used anymore during add. Instead,
crypto frame content received at a similar offset will be overwritten.

A final note regarding STREAM frames parsing. For now, it is considered
unnecessary to switch from ncbuf in this case. Indeed, QUIC clients does
not perform aggressive fragmentation for them. Keeping ncbuf ensure that
the data storage size is bigger than the equivalent ncbmbuf area.

This should fix github issue #3141.

This patch must be backported up to 2.6. It is first necessary to pick
the relevant commits for ncbmbuf implementation prior to it.
2025-10-22 15:04:41 +02:00
Amaury Denoyelle
25e378fa65 MINOR: ncbmbuf: add tests as standalone mode
Write some tests for ncbmbuf buf. These tests should be run each time
ncbmbuf implementation is adjusted. Use the following command :

$ gcc -g -DSTANDALONE -I./include -o ncbmbuf src/ncbmbuf.c && ./ncbmbuf

As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
2025-10-22 15:04:24 +02:00
Amaury Denoyelle
8b8ab2824e MINOR: ncbmbuf: implement advance operation
Implement ncbmb_advance() function for the ncbmbuf type. This allows to
remove bytes in front of the buffer, regardless of the existing gaps.
This is implemented by resetting the corresponding bits of the bitmap.

As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
2025-10-22 15:04:06 +02:00
Amaury Denoyelle
42c495f3d7 MINOR: ncbmbuf: implement ncbmb_data()
Implement ncbmb_data() function for the ncbmbuf type. Its purpose is
similar to its ncbuf counterpart : it returns the size in bytes of data
starting at a specific offset until the next gap.

As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
2025-10-22 15:04:06 +02:00
Amaury Denoyelle
db4a68752d MINOR: ncbmbuf: implement iterator bitmap utilities functions
Extend private API for ncbmbuf type by defining an iterator type for the
buffer bitmap handling. The purpose is to provide a simple method to
iterate over the bitmap one byte at a time, with a proper bitmask set to
hide irrelevant bits.

This internal type is unused for now, but will become useful when
implementing ncb_data() and ncb_advance() functions.

As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
2025-10-22 15:04:06 +02:00
Amaury Denoyelle
1e1a3aa6aa MINOR: ncbmbuf: implement add
This patch implements add operation for ncbmbuf type.

This function is simpler than its ncbuf counterpart. Indeed, for now
only NCB_ADD_OVERWRT mode is supported. This compromise has been chosen
as ncbmbuf will be first used for QUIC CRYPTO frames handling, which
does not mandate to compare existing filled blocks during insertion.

As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
2025-10-22 15:04:06 +02:00
Amaury Denoyelle
b9f91ad3ff MINOR: ncbmbuf: define new ncbmbuf type
Define ncbmbuf which is an alternative non-contiguous buffer
implementation. "bm" abbreviation stands for bitmap, which reflects how
gaps and filled blocks are encoded. The main purpose of this
implementation is to get rid of the ncbuf limitation regarding the
minimal size for gaps between two blocks of data.

This commit adds the new module ncbmbuf. Along with it, some utility
functions such as ncbmb_make(), ncbmb_init() and ncbmb_is_empty() are
defined. Public API of ncbmbuf will be extended in the following
patches.

This patch is not considered a bug fix. However, it will be required to
fix issue encountered on QUIC CRYPTO frames parsing. Thus, it will be
necessary to backport the current patch prior to the fix to come.
2025-10-22 15:04:06 +02:00
Willy Tarreau
f936feb3a9 BUG/MAJOR: pools: fix default pool alignment
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
The doc in commit 977feb5617 ("DOC: api: update the pools API with the
alignment and typed declarations") says that alignment of zero means
the type's alignment. And this is followed by the DECLARE_TYPED_POOL()
macro. Yet this is not what is done in create_pool_from_reg() which
only raises the alignment to a void* if lower, while it should start
from the type's. The effect is haproxy refusing to start on some 32-bit
platforms since that commit, displaying an error such as:

   "BUG in the code: at src/mux_h2.c:454, requested creation of pool
    'h2s' aligned to 4 while type requires alignment of 8! Please
    report to developers. Aborting."

Let's just apply the default type's alignment.

Thanks to @tianon for reporting this in GH issue #3168. No backport is
needed since aligned pools are 3.3-only.
2025-10-22 09:06:20 +02:00
Amaury Denoyelle
bece704128 BUG/MEDIUM: h3: properly encode response after interim one in same buf
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
Recently, proper support for interim responses forwarding to HTTP/3
client has been implemented. However, there was still an issue if two
responses are both encoded in the same snd_buf() iteration.

The issue is caused due to H3 HEADERS frame encoding method : 5 bytes
are reserved in front of the buffer to encode both H3 frame type and
varint length field. After proper headers encoding, output buffer head
is adjusted so that length can be encoded using the minimal varint size.

However, if the buffer is not empty due to a previous response already
encoded but not yet emitted, messing with the buffer head will corrupt
the entire H3 message. This only happens when encoding of both responses
is done in the same snd_buf() iteration, or at least without emission to
quic_conn layer in between.

The result of this bug is that the HTTP/3 client will be unable to parse
the response, most of the time reporting a formatting error. This can
be reproduced using the following netcat as HTTP/1 server to haproxy :

$ while sleep 0.2; do \
    printf "HTTP/1.1 100 continue\r\n\r\nHTTP/1.1 200 ok\r\nContent-length: 5\r\nConnection: close\r\n\r\nblah\n" | nc -lp8002
  done

To fix this, only adjust buffer head if content is empty. If this is not
the case, frame length is simply encoded as a 4-bytes varint size so
that messages are contiguous in the buffer.

This must be backported up to 2.6.
2025-10-21 15:51:48 +02:00
Christopher Faulet
18ece2b424 BUG/MEDIUM: h1-htx: Don't set HTX_FL_EOM flag on 1xx informational messages
1xx informational messages are part of the HTTP response. It is not expected
to have a HX_FL_EOM flag set after parsing such messages when received from
a server. It is espacially important whne an informational messages is
processed on client side while the final response was not recieved yet, to
not erroneously detect the end of the message.

The HTTP multiplexers seem to ignore the HTX_FL_EOM flag for information
messages, but it remains an error from the HTX specification point of
view. So it must be fixed.

While it should theorically be backported as far as 3.0, it is a good idea
to not do so for now because no bug was reported and regressions may happen.
2025-10-21 14:22:26 +02:00
Olivier Houchard
cd92aeb366 MEDIUM: stick-tables: Stop as soon as stktable_trash_oldest succeeds.
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
stktable_trash_oldest() goes through all the shards, trying to free a
number of entries. Going through each shard is expensive, as we have to
take the shard lock, so stop as soon as we free'd at least one entry, as
it is only called when we want to make room for one entry.
2025-10-20 15:04:47 +02:00
Olivier Houchard
7854331c71 MEDIUM: stick-tables: Stop if stktable_trash_oldest() fails.
In stksess_new(), if the table is full, we call stktable_trash_oldest()
to remove a few entries so that we have some room for a new one.
It is unlikely, but possible, that stktable_trash_oldest() will fail. If
so, just give up and do not add the new entry, instead of adding it
anyway.
Give up if stktable_trash_oldest() fails to free any entry
2025-10-20 15:04:47 +02:00
Olivier Houchard
d5562e31bd MEDIUM: stick-tables: Remove the table lock
Remove the table lock, it was only protecting the per-table expiration
date, and that task is gone.
2025-10-20 15:04:47 +02:00
Olivier Houchard
8bc8a21b25 MEDIUM: stick-tables: Use a per-shard expiration task
Instead of having per-table expiration tasks, just use one per shard.
The task will now go through all the tables to expire entries. When a
table gets an expiration earlier than the one previously known, it will
be put in a mt-list, and the task will be responsible to put it into an
eb32, ordered based on the next expiration.
Each per-shard task will run on a different thread, so it should lead to
a better load distribution than the per-table tasks.
2025-10-20 15:04:47 +02:00
Olivier Houchard
945aa0ea82 MINOR: initcalls: Add a new initcall stage, STG_INIT_2
Add a new initcall stage, STG_INIT_2, for stuff to be called after
step_init_2() is called, so after we know for sure that global.nbthread
will be set.
Modify stick-tables stkt_late_init() to run at STG_INIT_2 instead of
STG_INIT, in anticipation for it to be enhanced and have a need for
global.nbthread.
2025-10-20 15:04:41 +02:00
Willy Tarreau
e63e98f1d8 BUG/MEDIUM: cli: also free the trash chunk on the error path
Since commit 20ec1de214 ("MAJOR: cli: Refacor parsing and execution of
pipelined commands"), command not returning any response (e.g. "quit")
don't pass through the free_trash_chunk() call, possibly leaking the
cmdline buffer. A typical way to reproduce it is to loop on "quit" on
the CLI, though it very likely affects other specific commands.

Let's make sure in the release handler that we always release that
chunk in any case. This must be backported to 3.2.
2025-10-20 14:58:53 +02:00
Frederic Lecaille
edd21121d2 BUG/MINOR: quic-be: unchecked connections during handshakes
This bug impacts only the backends.

The ->conn (pointer to struct connection) member validity of the ssl_sock_ctx
struct was not checked before being dereferenced, leading to possible crashes
in qc_ssl_do_hanshake() during handshake.

This was reported by GH #3163 issue.

No need to backport because the QUIC backend support arrived with 3.3
2025-10-20 14:27:12 +02:00
Willy Tarreau
c365e47095 BUG/MEDIUM: threads/config: drop absent threads from thread groups
Thread groups can be assigned arbitrary thread ranges, but if the
mentioned threads do not exist, this causes crashes in listener_accept()
or some connections to be ignored. The reason is that the calculated
mask is derived from the thread group's enabled threads count. Examples:

  global
     nbthread 2
     thread-groups 2
     thread-group 1 1-64
     thread-group 2 65-128

  frontend f-crash
     bind :8001 thread 1/all

  frontend f-freeze
     bind :8002 thread 2/all

This commit removes missing threads, emits a warning when the thread
group just has less threads than requested, and an error when it is
left with no threads at all.

This must be backported to 3.1 since the issue is present there already.
2025-10-17 20:36:00 +02:00
Willy Tarreau
8b7a82cd30 MEDIUM: config: warn when expose-experimental-directives is used for no reason
If users start to enable expose-experimental-directives for the purpose
of testing one specific feature, there are chances that the option remains
forever and hides the experimental status of other options.

Let's emit a warning if the option appears and is not used. This will
remind users that they can now drop it, and help keep configs safe for
future upgrades.
2025-10-17 19:00:21 +02:00
Willy Tarreau
80ed9f9dcf MINOR: tree-wide: add missing TAINTED flags for some experimental directives
We normally taint the process when using experimental directives, but
a handful of places were missed so we don't always know that they are
in use. Let's fix these places (hint for future directives, just look
for places checking for "experimental_directives_allowed", and add
"mark_tainted(TAINTED_CONFIG_EXP_KW_DECLARED);").
2025-10-17 19:00:21 +02:00
Willy Tarreau
d3881e61ac MINOR: config: remove experimental status on tune.disable-fast-forward
The option was turned to off by default in 2.8 with commit 2f7c82bfd
("BUG/MINOR: haproxy: Fix option to disable the fast-forward"), however
at the same time it should have dropped its experimental status since
the feature is enabled by default. The only goal of the option is to
debug something, like many other tune.xxx options. The option should
still normally not be used without being invited to do so by developers
looking for something specific though.

This could be backported if desired to simplify debugging, though this
has never been needed for now.
2025-10-17 18:59:47 +02:00
Frederic Lecaille
51eca5cbce BUG/MINOR: quic: SSL counters not handled
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
The SSL counters were not handled at all for QUIC connections. This patch
implement ssl_sock_update_counters() extracting the code from ssl_sock.c
and call this function where applicable both in TLS/TCP and QUIC parts.

Must be backported as far as 2.8.
2025-10-17 12:13:43 +02:00
Frederic Lecaille
8a8417b54a BUG/MAJOR: quic: do not reset QUIC backends fds in closing state
This bug impacts only the backends.

When entering the closing state, a quic_closed_conn is used to replace the quic_conn.
In this state, the ->fd value was reset to -1 value calling qc_init_fd(). This value
is used by qc_may_use_saddr() which supposes it cannot be -1 for a backend, leading
->li to be dereferencd, which is legal only for a listener.

This bug impacts only the backend but with possible crash when qc_may_use_saddr()
is called: qc_test_fd() is false leading qc->li to be dereferenced. This is legal
only for a listener.

This patch prevents such fd value resettings for backends.

No need to backport because the QUIC backends support arrived with 3.3.
2025-10-17 12:13:43 +02:00
Frederic Lecaille
56d15b2a03 BUG/MAJOR: quic: uninitialized quic_conn_closed struct members
A quic_conn_closed struct is initialized to replace the quic_conn when the
connection enters the closing to reduce the connection memory footprint.
->max_udp_payload quic_conn_close was not initialized leading to possible
BUG_ON()s in qc_rcv_buf() when comparing the RX buf size to this payload.

->cntrs counters were alon not initialized with the only consequence
to generate wrong values for these counters.

Must be backported as far as 2.9.
2025-10-17 12:13:43 +02:00
William Lallemand
b74a437e57 BUILD: ssl: can't build when using -DLISTEN_DEFAULT_CIPHERS
Emeric reported that he can't build haproxy anymore since 9bc6a034
("BUG/MINOR: ssl: Free global_ssl structure contents during deinit").

    src/ssl_sock.c:7020:40: error: comparison with string literal results in unspecified behavior [-Werror=address]
     7020 |  if (global_ssl.listen_default_ciphers != LISTEN_DEFAULT_CIPHERS)
          |                                        ^~
    src/ssl_sock.c:7023:41: error: comparison with string literal results in unspecified behavior [-Werror=address]
     7023 |  if (global_ssl.connect_default_ciphers != CONNECT_DEFAULT_CIPHERS)
          |                                         ^~
    src/ssl_sock.c: At top level:

Indeed the mentionned patch is checking the pointer in order to free
something freeable, but that can't work because these constant are
strings literal which can be passed from the compiler and not pointers.

Also the test is not useful, because these strings are strdup() in
__ssl_sock_init, so they can be free directly.

Must be backported in every stable branches with 9bc6a034.
2025-10-17 09:45:26 +02:00
Amaury Denoyelle
5b04a85bc7 TESTS: quic: fix uninit of quic_cc_path const member
Some checks failed
Contrib / build (push) Has been cancelled
alpine/musl / gcc (push) Has been cancelled
VTest / Generate Build Matrix (push) Has been cancelled
Windows / Windows, gcc, all features (push) Has been cancelled
VTest / (push) Has been cancelled
Fix quic_tx unittest module by adding an explicit define for <mtu> const
member of quic_cc_path.

This should fix coverity report from github issue #3162.

This can be backported up to 3.2.
2025-10-17 09:29:01 +02:00
Amaury Denoyelle
5067a15870 BUG/MINOR: quic: check applet_putchk() for 'show quic' first line
Ensure applet_putchk() return value is checked when outputing via the
CLI 'show quic' header line.

This is only to align with other usages of the same function, as trash
output buffer should always be large enough for it. As such, the command
is simply aborted if this is not the case.

This should fix coverity report from github issue #3139.

This could be backported up to 2.8.
2025-10-17 09:29:01 +02:00
Olivier Houchard
8d31784c0f BUG/MEDIUM: stick-tables: Don't forget to dec count on failure.
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
In stksess_new(), if we failed to allocate memory for the new stksess,
don't forget to decrement the table entry count, as nobody else will
do it for us.
An artificially high count could lead to at least purging entries while
there is no need to.

This should be backported up to 2.8.

WIP decrement current on allocation failure
2025-10-16 23:46:37 +02:00
Willy Tarreau
03e9a5a1e7 BUG/MAJOR: lb-chash: fix key calculation when using default hash-key id
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
A subtle regression was introduced in 3.0 by commit faa8c3e02 ("MEDIUM:
lb-chash: Deterministic node hashes based on server address"). When keys
are calculated from the server's ID (which is the default), due to the
reorganisation of the code, the key ended up being hashed twice instead
of being multiplied by the scaling range.

While most users will never notice it, it is blocking some large cache
users from upgrading from 2.8 to 3.0 or 3.2 because the keys are
redistributed.

After a check with users on the mailing list [1] it was estimated that
keep the current situation is the worst choice because those who have
not yet upgraded will face the problem while by fixing it, those who
already have and for whom it happened smoothly will handle it just
right again.

As such this fix must be backported to 3.0 without waiting (in order
to preserve those who upgrade from two redistributions). Please note
that only configurations featuring "hash-type consistent" and not
having "hash-key" present with a value other than "id" are affected,
others are not (e.g. "hash-key addr" is unaffected).

[1] https://www.mail-archive.com/haproxy@formilux.org/msg46115.html
2025-10-16 10:43:09 +02:00
Willy Tarreau
f263a45ddf BUG/MINOR: pools: don't report "limited to the first X entries" by default
With the fix in commit 982805e6a3 ("BUG/MINOR: pools: Fix the dump of
pools info to deal with buffers limitations"), the max count is now
compared to the number of dumped pools instead of the configured
numbered, and keeping >= is no longer valid because maxcnt is set by
default to the same value when not set, so this means that since this
patch we're always displaying "limited to the first X entries" where X
is the number of dumped entries even in the absence of any limitation.
Let's just fix the comparison to only show this when the limit is lower.

This must be backported to 3.2 where the patch above already is.
2025-10-16 08:41:32 +02:00
Willy Tarreau
ab0c97139f BUG/MEDIUM: pools: fix crash on filtered "show pools" output
The truncation of pools output that was adressed in commit 982805e6a3
("BUG/MINOR: pools: Fix the dump of pools info to deal with buffers
limitations") required to split the pools filling from dumping. However
there is a problem when a limit is passed that is lower than the number
of pools or if a pool name is specified or if pool caches are disabled,
because in this case the number of filled slots will be lower than the
initially allocated one, and empty entries will be visited either by the
sort functions when filling the entries if "byxxx" is specified, or by
the dump function after the last entry, but none of these functions was
expecting to be passed a NULL entry.

Let's just re-adjust nbpools to match the number of filled entries at
the end. Anyway the totals are calculated on the number of dumped
entries.

This must be backported to 3.2 since the fix above was backported there
as well.
2025-10-16 08:41:32 +02:00
Frederic Lecaille
d5f4872ba6 TESTS: quic: useless param for b_quic_dec_int()
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
The third parameter passed to b_quic_dec_int() is unitialized. This is not a bug.
But this disturbs coverity for an unknown reason as revealed by GH issue #3154.

This patch takes the opportunity to use NULL as passed value to avoid using such
an uneeded third parameter.

Should be backported to 3.2 where this unit test was introduced.
2025-10-15 09:58:03 +02:00
Willy Tarreau
fda6dc9597 MINOR: regex: use a thread-local match pointer for pcre2
The pcre2 matching requires an array of matches for grouping, that is
allocated when executing the rule by pre-processing it, and that is
immediately freed after use. This is quite inefficient and results in
annoying patterns in "show profiling" that attribute the allocations
to libpcre2 and the releases to haproxy.

A good suggestion from Dragan is to pre-allocate these per thread,
since the entry is not specific to a regex. In addition we're already
limited to MAX_MATCH matches so we don't even have the problem of
having to grow it while parsing nor processing.

The current patch adds a per-thread pair of init/deinit functions to
allocate a thread-local entry for that, and gets rid of the dynamic
allocations. It will result in cleaner memory management patterns and
slightly higher performance (+2.5%) when using pcre2.
2025-10-13 16:56:43 +02:00
Remi Tricot-Le Breton
6f4ca37880 BUG/MINOR: ssl: Potential NULL deref in trace macro
'ctx' might be NULL when we exit 'ssl_sock_handshake', it can't be
dereferenced without check in the trace macro.

This was found by Coverity andraised in GitHub #3113.
This patch should be backported up to 3.2
2025-10-13 15:44:45 +02:00
Remi Tricot-Le Breton
d4bb9983fa MINOR: jwt: Add new "add/del/show ssl jwt" CLI commands
The new "add/del ssl jwt <file>" commands allow to change the "jwt" flag
of an already loaded certificate. It allows to delete certificates used
for JWT validation, which was not yet possible.
The "show ssl jwt" command iterates over all the ckch_stores and dumps
the ones that have the option set.
2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton
daa36adc6e MINOR: ssl: Dump options in "show ssl cert"
Dump the values of the 'ocsp-update' and 'jwt' flags in the output of
'show ssl cert' CLI command.
2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton
bf5b912a62 MINOR: jwt: Add specific error code for known but unavailable certificate
A certificate that does not have the 'jwt' flag enabled cannot be used
for JWT validation. We now raise a specific return value so that such a
case can be identified.
2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton
18ff130e9d MINOR: jwt: Add new "jwt" certificate option
This option can be used to enable the use of a given certificate for JWT
verification. It defaults to 'off' so certificates that are declared in
a crt-store and will be used for JWT verification must have a
"jwt on" option in the configuration.
2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton
53957c50c3 MINOR: jwt: Do not look into ckch_store for jwt_verify converter
We must not try to load full-on certificates for 'jwt_verify' converter
anymore. 'jwt_verify_cert' is the only one that accepts a certificate.
2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton
f5632fd481 MINOR: jwt: Add new jwt_verify_cert converter
This converter will be in charge of performing the same operation as the
'jwt_verify' one except that it takes a full-on pem certificate path
instead of a public key path as parameter.
The certificate path can be either provided directly as a string or via
a variable. This allows to use certificates that are not known during
init to perform token validation.
2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton
c3c0597a34 MEDIUM: jwt: Remove certificate support in jwt_verify converter
The jwt_verify converter will not take full-on certificates anymore
in favor of a new soon to come jwt_verify_cert. We might end up with a
new jwt_verify_hmac in the future as well which would allow to deprecate
the jwt_verify converter and remove the need for a specific internal
tree for public keys.
The logic to always look into the internal jwt tree by default and
resolve to locking the ckch tree as little as possible will also be
removed. This allows to get rid of the duplicated reference to
EVP_PKEYs, the one in the jwt tree entry and the one in the ckch_store.
2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton
b706f2d092 BUG/MINOR: ssl: Free key_base from global_ssl structure during deinit
Some checks failed
Contrib / build (push) Has been cancelled
alpine/musl / gcc (push) Has been cancelled
VTest / Generate Build Matrix (push) Has been cancelled
Windows / Windows, gcc, all features (push) Has been cancelled
VTest / (push) Has been cancelled
The key_base field of the global_ssl structure is an strdup'ed field
(when set) which was never free'd during deinit.

This patch can be backported up to branch 3.0.
2025-10-10 17:22:48 +02:00
Remi Tricot-Le Breton
9bc6a0349d BUG/MINOR: ssl: Free global_ssl structure contents during deinit
Some fields of the global_ssl structure are strings that are strdup'ed
but never freed. There is only one static global_ssl structure so not
much memory is used but we might as well free it during deinit.

This patch can be backported to all stable branches.
2025-10-10 17:22:48 +02:00
Christopher Faulet
54b7539d64 BUG/MEDIUM: apppet: Improve spinning loop detection with the new API
Conditions to detect the spinning loop for applets based on the new API are
not accurrate. We cannot continue to check the channel's buffers state to
know if an applet has made some progress. At least, we must also check the
applet's buffers.

After digging to find the right way to do, it was clear that the best is to
use something similar to what is performed for the streams, namely, checking
read and write events. And in fact, it is quite easy to do with the new
API. So let's do so.

This patch must be backported as far as 3.0.
2025-10-10 14:41:15 +02:00
William Lallemand
47a93dc750 BUG/MINOR: ssl: leak crtlist_name in ssl-f-use
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
This patch fixes a leak of the temporary variable "crtlist_name" which
is used in the ssl-f-use parser.

Must be backported in 3.2.
2025-10-10 11:22:37 +02:00
William Lallemand
d9365a88a5 BUG/MINOR: ssl: leak in ssl-f-use
Fix the leak of the filename in the struct cfg_crt_node which is a
temporary structure used for ssl-f-use initialization.

Must be backported to 3.2.
2025-10-10 11:22:37 +02:00
Christopher Faulet
cbe5221182 DEBUG: mux-h1: Dump <kip> and <kop> values with sedesc info
It could be handy to debug issues, especially because these values was
recently introduced.
2025-10-10 11:16:21 +02:00
Christopher Faulet
6a0fe6e460 MEDIUM: applet: Forward <kip> to applets
For now, no applets are using the <kop> value when consuming data. At least,
as far as I know. But it remains a good idea to keep the applet API
compatible. So now, the <kip> of the opposite side is properly forwarded to
applets.
2025-10-10 11:11:44 +02:00
Christopher Faulet
4145a61101 BUG/MEDIUM: stconn: Properly forward kip to the opposite SE descriptor
By refactoring the HTX to remove the extra field, a bug was introduced in
the stream-connector part. The <kip> (known input payload) value of a sedesc
was moved to <kop> (knwon output payload) using the same sedesc. Of course,
this is totally wrong. <kip> value of a sedesc must be forwarded to the
opposite side.

In addition, the operation is performed in sc_conn_send(). In this function,
we manipulate the stream-connectors. So se_fwd_kip() function was changed to
use the stream-connectors directely.

Now, the function sc_ep_fwd_kip() is now called with the both
stream-connectors to properly forward <kip> from on side to the opposite
side.

The bug is 3.3-specific. No backport needed.
2025-10-10 11:01:21 +02:00
Willy Tarreau
54f0ab08b8 BUG/MINOR: ssl: always clear the remains of the first hello for the second one
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
William rightfully pointed that despite the ssl capture being a
structure, some of its entries are only set for certain contents,
so we need to always zero it before using it so as to clear any
remains of a previous use, otherwise we could possibly report some
entries that were only present in the first hello and not the second
one. No need to clear the data though, since any remains will not be
referenced by the fields.

This must be backported wherever commit 336170007c ("BUG/MEDIUM: ssl:
take care of second client hello") is backported.
2025-10-09 18:50:30 +02:00
Willy Tarreau
336170007c BUG/MEDIUM: ssl: take care of second client hello
For a long time we've been observing some sporadic leaks of ssl-capture
pool entries on haproxy.org without figuring exactly the root cause. All
that was seen was that less calls to the free callback were made than
calls to the hello parsing callback, and these were never reproduced
locally.

It recently turned out to be triggered by the presence of "curves" or
"ecdhe" on the "bind" line. Captures have shown the presence of a second
client hello, called "Change Cipher Client Hello" in wireshark traces,
that calls the client hello callback again. That one wasn't prepared for
being called twice per connection, so it allocates an ssl-capture entry
and assigns it to the ex_data entry, possibly overwriting the previous
one.

In this case, the fix is super simple, just reuse the current ex_data
if it exists, otherwise allocate a new one. This completely solves the
problem.

Other callbacks have been audited for the same issue and are not
affected: ssl_ini_keylog() already performs this check and ignores
subsequent calls, and other ones do not allocate data.

This must be backported to all supported versions.
2025-10-09 17:06:49 +02:00
William Lallemand
f35caafa6e BUG/MINOR: acme: memory leak from the config parser
This patch fixes some memory leaks in the configuration parser:

- deinit_acme() was never called
- add ha_free() before every strdup() for section overwrite
- lacked some free() in deinit_acme()
2025-10-09 12:04:22 +02:00
William Lallemand
9344ecaade MEDIUM: acme: don't insert acme account key in ckchs_tree
Don't insert the acme account key in the ckchs_tree anymore. ckch_store
are not made to only include a private key. CLI operations are not
possible with them either. That doesn't make much sense to keep it that
way until we rework the ckch_store.
2025-10-09 11:01:58 +02:00
Christopher Faulet
914538cd39 MEDIUM: htx: Remove the HTX extra field
Thanks for previous changes, it is now possible to remove the <extra> field
from the HTX structure. HTX_FL_ALTERED_PAYLOAD flag is also removed because
it is now unsued.
2025-10-08 11:10:42 +02:00
Christopher Faulet
2e2953a3f0 MEDIUM: mux-h1: Stop to use HTX extra value when formatting message
We now rely on the <kop> value to format the message payload before
sending it. It is no longer necessary to use the HTX extra field.
2025-10-08 11:10:42 +02:00
Christopher Faulet
4f40b2de86 MINOR: compression: Use the <kip> value to check body size
When an minimum compression size is defined, we can now use the <kip>
value to skip the compression instead of the HTX extra field.
2025-10-08 11:10:42 +02:00
Christopher Faulet
c0f5b19bc6 MINOR: cache: Use the <kip> value to check too big objects
When an object should be cache, to check if it is too big or not, the
<kip> value is now used instead of the HTX extra field.
2025-10-08 11:10:42 +02:00
Christopher Faulet
f1c659f3ae MINOR: hlua/http-fetch: Use <kip> instead of HTX extra field to get body size
The known input payload length now contains the information. There is no
reason to still rely on the HTX extra field.
2025-10-08 11:10:25 +02:00
Christopher Faulet
be1ce400c4 MINOR: filters: Reset knwon input payload length if a data filter is used
It a data filter is registered on a channel, the corresponding <kip>
field must be reset because the payload may be altered.
2025-10-08 11:01:37 +02:00
Christopher Faulet
30c50e4f19 MINOR: stconn: Move data from kip to kop when data are sent to the consumer
When data are sent to the consumer, the known output payload length is
updated using the known input payload length value and this last one is then
reset. se_fwd_kip() function is used for this purpose.
2025-10-08 11:01:37 +02:00
Christopher Faulet
f6a4d41dd0 MINOR: h3: Set known input payload length of the sedesc
Set <kip> value when data are transfer to the upper layer, in h3_rcv_buf().
The difference between the known length of the payload before and after a
parsing loop is added to <kip> value. When a content-length is specified in
the message, the h3s <body_len> field is used. Otherwise, it is the h3s
<data_len> field.
2025-10-08 11:01:36 +02:00
Christopher Faulet
bc8c6c42f4 MINOR: mux-h2: Set known input payload length of the sedesc
Set <kip> value when data are transfer to the upper layer, in h2_rcv_buf().
The new <body_len> filed of the H2S is used to increment <kip> value and
then it is reset. The patch relies on the previous one ("MINOR: mux-h2: Save
the known length of the payload").
2025-10-08 11:01:36 +02:00
Christopher Faulet
3a6a576e73 MINOR: mux-h2: Use <body_len> H2S field for payload without content-length
Before, the <body_len> H2S field was only use for verity the annonced
content-lenght value was respected. Now, this field is used for all
messages. Messages with a content-length are still handled the same way.
<body_len> is set to the content-length value and decremented by the size of
each DATA frame. For other messages, the value is initialized to ULLONG_MAX
and still decremented by the size of each DATA frame. This change is
mandatory to properly define the known input payload length value of the
sedesc.
2025-10-08 11:01:36 +02:00
Christopher Faulet
4fdc23e648 MINOR: mux-fcgi: Set known input payload length during demux
Set <kip> value during the response parsing. The difference between the body
length before and after a parsing loop is added. The patch relies on the
previous one ("MINOR: h1-htx: Increment body len when parsing a payload with
no xfer length").
2025-10-08 11:01:36 +02:00
Christopher Faulet
2bf2f68cd8 MINOR: mux-h1: Set known input payload length during demux
Set <kip> value during the message parsing. The difference between the body
length before and after a parsing loop is added. The patch relies on the
previous one ("MINOR: h1-htx: Increment body len when parsing a payload with
no xfer length").
2025-10-08 11:01:36 +02:00
Christopher Faulet
c9bc18c0bf MINOR: h1-htx: Increment body len when parsing a payload with no xfer length
In the H1 parseur, the body length was only incremented when the transfer
length was known. So when the content-length was specified or when the
transfer-encoding value was set to "chunk".

Now for messages with unknown transfer length, it is also incremented. It is
mandatory to be able to remove the extra field from the HTX message.
2025-10-08 11:01:36 +02:00
Christopher Faulet
c0b6db2830 MINOR: stconn: Add two fields in sedesc to replace the HTX extra value
For now, the HTX extra value is used to specify the known part, in bytes, of
the HTTP payload we will receive. It may concerne the full payload if a
content-length is specified or the current chunk for a chunk-encoded
message. The main purpose of this value is to be used on the opposite side
to be able to announce chunks bigger than a buffer. It can also be used to
check the validity of the payload on the sending path, to properly detect
too big or too short payload.

However, setting this information in the HTX message itself is not really
appropriate because the information is lost when the HTX message is consumed
and the underlying buffer released. So the producer must take care to always
add it in all HTX messages. it is especially an issue when the payload is
altered by a filter.

So to fix this design issue, the information will be moved in the sedesc. It
is a persistent area to save the information. In addition, to avoid the
ambiguity between what the producer say and what the consumer see, the
information will be splitted in two fields. In this patch, the fields are
added:

 * kip : The known input payload length
 * kop : The known output payload lenght

The producer will be responsible to set <kip> value. The stream will be
responsible to decrement <kip> and increment <kop> accordingly. And the
consumer will be responsible to remove consumed bytes from <kop>.
2025-10-08 11:01:36 +02:00
Christopher Faulet
586511c278 MINOR: h3/qmux: Set QC_SF_UNKNOWN_PL_LENGTH flag on QCS when headers are sent
QC_SF_UNKNOWN_PL_LENGTH flag is set on the qcs to know a payload of message
has an unknown length and not send a RESET_STREAM on shutdown. This flag was
based on the HTX extra field value. However, it is not necessary. When
headers are processed, before sending them, it is possible to check the HTX
start-line to know if the length of the payload is known or not.

So let's do so and don't use anymore the HTX extra field for this purpose.
2025-10-08 11:01:36 +02:00
Willy Tarreau
00b27a993f MAJOR: proxy: enable abortonclose by default on TLS listeners
In the continuity of https://github.com/orgs/haproxy/discussions/3146,
we must also enable abortonclose by default for TLS listeners so as not
to needlessly compute TLS handshakes on dead connections. The change is
very small (just set the default value to 1 in the TLS code when neither
the option nor its opposite were set).

It may possibly cause some TLS handshakes to start failing with 3.3 in
certain legacy environments (e.g. TLS health-checks performed using only
a client hello and closing afterwards), and in this case it is sufficient
to disable the option using "no option abortonclose" in either the
affected frontend or the "defaults" section it derives from.
2025-10-08 10:36:59 +02:00
Willy Tarreau
75103e7701 MINOR: proxy: introduce proxy_abrt_close_def() to pass the desired default
With this function we can now pass the desired default value for the
abortonclose option when neither the option nor its opposite were set.
Let's also take this opportunity for using it directly from the HTTP
analyser since there's no point in re-checking the proxy's mode there.
2025-10-08 10:29:41 +02:00
Willy Tarreau
fe47e8dfc5 MINOR: proxy: only check abortonclose through a dedicated function
In order to prepare for changing the way abortonclose works, let's
replace the direct flag check with a similarly named function
(proxy_abrt_close) which returns the on/off status of the directive
for the proxy. For now it simply reflects the flag's state.
2025-10-08 10:29:41 +02:00
Willy Tarreau
c42e62d890 MINOR: proxy: explicitly permit abortonclose on frontends and clarify the doc
The "abortonclose" option was recently deprecated in frontends because its
action was essentially limited to the backend part (queuing etc). But in
3.3 we started to support it for TLS on frontends, though it would only
work when placed in a defaults section. Let's officially support it in
frontends, and take this opportunity to clarify the documentation on this
topic, which was incomplete regarding frontend and TLS support. Now the
doc tries to better cover the different use cases.
2025-10-08 10:29:41 +02:00
William Lallemand
45fba1db27 BUG/MINOR: acme: avoid overflow when diff > notAfter
Avoid an overflow or a negative value if notAfter < diff.

This is unlikely to provoke any problem.

Fixes issue #3138.

Must be backported to 3.2.
2025-10-07 10:54:58 +02:00
William Lallemand
69bd253b23 CLEANUP: mjson: remove unused defines from mjson.h
This patch removes unused defines from mjson.h.
It also removes unused c++ declarations and includes.

string.h is moved to mjson.c
2025-10-06 09:30:07 +02:00
Christopher Faulet
8219fa1842 BUG/MINOR: http-ana: Reset analyse_exp date after 'wait-for-body' action
'wait-for-body' action set analyse_exp date for the channel to the
configured time. However, when the action is finished, it does not reset
it. It is an issue for some following actions, like 'pause', that also rely
on this date.

To fix the issue, we must take care to reset the analyse_exp date to
TICK_ETERNITY when the 'wait-for-body' action is finished.

This patch should fix the issue #3147. It must be backported to all stable
versions.
2025-10-03 17:09:16 +02:00
William Lallemand
61933a96a6 CLEANUP: mjson: remove unused defines and math.h
Remove unused defines for MSVC which is not used in the case of haproxy,
and remove math.h which is not used as well.
2025-10-03 16:09:51 +02:00
William Lallemand
8ea8aaace2 CLEANUP: mjson: remove MJSON_ENABLE_BASE64 code
Remove the code used under #if MJSON_ENABLE_BASE64, which is not used
within haproxy, to ease the maintenance of mjson.
2025-10-03 16:09:13 +02:00
William Lallemand
4edb05eb12 CLEANUP: mjson: remove MJSON_ENABLE_NEXT code
Remove the code used under #if MJSON_ENABLE_NEXT, which is not used
within haproxy, to ease the maintenance of mjson.
2025-10-03 16:08:17 +02:00
William Lallemand
a4eeeeeb07 CLEANUP: mjson: remove MJSON_ENABLE_PRINT code
Remove the code used under #if MJSON_ENABLE_PRINT, which is not used
within haproxy, to ease the maintenance of mjson.
2025-10-03 16:07:59 +02:00
William Lallemand
d63dfa34a2 CLEANUP: mjson: remove MJSON_ENABLE_RPC code
Remove the code used under #if MJSON_ENABLE_RPC, which is not used
within haproxy, to ease the maintenance of mjson.
2025-10-03 16:06:33 +02:00
Aurelien DARRAGON
c26ac3f5e4 BUG/MINOR: sink: retry attempt for sft server may never occur
Since 9561b9fb6 ("BUG/MINOR: sink: add tempo between 2 connection
attempts for sft servers"), there is a possibility that the tempo we use
to schedule the task expiry may point to TICK_ETERNITY as we add ticks to
tempo with a simple addition that doesn't take care of potential wrapping.

When this happens (although relatively rare, since now_ms only wraps every
49.7 days, but a forced wrap occurs 20 seconds after haproxy is started
so it is more likely to happen there), the process_sink_forward() task
expiry being set to TICK_ETERNITY, it may never be called again, this
is especially true if the ring section only contains a single server.

To fix the issue, we must use tick_add() helper function to set the tempo
value and this way we ensure that the value will never be TICK_ETERNITY.

It must be backported everywhere 9561b9fb6 was backported (up to 2.6
it seems).
2025-10-03 14:31:05 +02:00
Olivier Houchard
b01a00acb1 BUG/MEDIUM: connections: Only avoid creating a mux if we have one
In connect_server(), only avoid creating a mux when we're reusing a
connection, if that connection already has one. We can reuse a
connection with no mux, if we made a first attempt at connecting to the
server and it failed before we could create the mux (or during the mux
creation). The connection will then be reused when trying again.
This fixes a bug where a stream could stall if the first connection
attempt failed before the mux creation. It is easy to reproduce by
creating random memory allocation failure with -dmFail.
This was introduced by commit 4aaf0bfbce,
and thus does not need any backport as long as that commit is not
backported.
2025-10-03 13:13:10 +02:00
Willy Tarreau
ced9784df4 BUG/MEDIUM: resolvers: break an infinite loop in resolv_get_ip_from_response()
The fix in 3023e98199 ("BUG/MINOR: resolvers: Restore round-robin
selection on records in DNS answers") still contained an issue not
addressed f6dfbbe870 ("BUG/MEDIUM: resolvers: Test for empty tree
when getting a record from DNS answer"). Indeed, if the next element
is the same as the first one, then we can end up with an endless loop
because the test at the end compares the next pointer (possibly null)
with the end one (first).

Let's move the null->first transition at the end. This must be
backported where the patches above were backported (3.2 for now).
2025-10-03 09:08:10 +02:00
zhanhb
ad75431b9c BUG/MINOR: h3: forbid 'Z' as well in header field names checks
The current tests in _h3_handle_hdr() and h3_trailers_to_htx() check
for an interval between 'A' and 'Z' for letters in header field names
that should be forbidden, but mistakenly leave the 'Z' out of the
forbidden range, resulting in it being implicitly valid.

This has no real consequences but should be fixed for the sake of
protocol validity checking.

This must be backported to all relevant versions.
2025-10-02 15:30:02 +02:00
zhanhb
7163d9180c BUG/MINOR: h2: forbid 'Z' as well in header field names checks
The current tests in h2_make_htx_request(), h2_make_htx_response()
and h2_make_htx_trailers() check for an interval between 'A' and 'Z'
for letters in header field names that should be forbidden, but
mistakenly leave the 'Z' out of the forbidden range, resulting in it
being implicitly valid.

This has no real consequences but should be fixed for the sake of
protocol validity checking.

This must be backported to all relevant versions.
2025-10-02 15:29:58 +02:00
Willy Tarreau
06675db4bf BUG/CRITICAL: mjson: fix possible DoS when parsing numbers
Mjson comes with its own strtod() implementation for portability
reasons and probably also because many generic strtod() versions as
provided by operating systems do not focus on resource preservation
and may call malloc(), which is not welcome in a parser.

The strtod() implementation used here apparently originally comes from
https://gist.github.com/mattn/1890186 and seems to have purposely
omitted a few parts that were considered as not needed in this context
(e.g. skipping white spaces, or setting errno). But when subject to the
relevant test cases of the designated file above, the current function
provides the same results.

The aforementioned implementation uses pow() to calculate exponents,
but mjson authors visibly preferred not to introduce a libm dependency
and replaced it with an iterative loop in O(exp) time. The problem is
that the exponent is not bounded and that this loop can take a huge
amount of time. There's even an issue already opened on mjson about
this: https://github.com/cesanta/mjson/issues/59. In the case of
haproxy, fortunately, the watchdog will quickly stop a runaway process
but this remains a possible denial of service.

A first approach would consist in reintroducing pow() like in the
original implementation, but if haproxy is built without Lua nor
51Degrees, -lm is not used so this will not work everywhere.

Anyway here we're dealing with integer exponents, so an easy alternate
approach consists in simply using shifts and squares, to compute the
exponent in O(log(exp)) time. Not only it doesn't introduce any new
dependency, but it turns out to be even faster than the generic pow()
(85k req/s per core vs 83.5k on the same machine).

This must be backported as far as 2.4, where mjson was introduced.

Many thanks to Oula Kivalo for reporting this issue.

CVE-2025-11230 was assigned to this issue.
2025-10-02 09:37:43 +02:00
Olivier Houchard
b71bb6c2ae BUG/MEDIUM: fwlc: Handle memory allocation failures.
Properly handle memory allocation failures, by checking the return value
for pool_alloc(), and if it fails, make sure that the caller will take
it into account.
The only use of pool_alloc() in fwlc is to allocate the tree elements in
order to properly queue the server into the ebtree, so if that
allocation fails, just schedule the requeue tasklet, that will try
again, until it hopefully eventually succeeds.

This should be backported to 3.2.
This should fix github issue #3143.
2025-10-01 18:13:33 +02:00
Olivier Houchard
f4a9c6ffae MEDIUM: fwlc: Make it so fwlc_srv_reposition works with unqueued srv
Modify fwlc_srv_reposition() so that it does not assume that the server
was already queued, and so make it so it works even if s->tree_elt is
NULL.
While the server will usually be queued, there is an unlikely
possibility that when the server attempted to get queued when it got up,
it failed due to a memory allocation failure, and it just expect the
server_requeue tasklet to run to take care of that later.

This should be backported to 3.2.
This is part of an attempt to fix github issue #3143
2025-10-01 18:13:33 +02:00
Olivier Houchard
822ee90dc2 MEDIUM: servers: Schedule the server requeue target on creation
On creation, schedule the server requeue once it's been created.
It is possible that when the server went up, it tried to queue itself
into the lb specific code, failed to do so, and expect the tasklet to
run to take care of that.

This should be backported to 3.2.
This is part of an attempt to fix github issue #3143.
2025-10-01 18:13:33 +02:00
Willy Tarreau
7ea80cc5b6 MEDIUM: ssl: don't always process pending handshakes on closed connections
If a client aborts a pending SSL connection for whatever reason (timeout
etc) and the listen queue is large, it may inflict a severe load to a
frontend which will spend the CPU creating new sessions then killing the
connection. This is similar to HTTP requests aborted just after being
sent, except that asymmetric crypto is way more expensive.

Unfortunately "option abortonclose" has no effect on this, because it
only applies at a higher level.

This patch ensures that handshakes being received on a frontend having
"option abortonclose" set will be checked for a pending close, and if
this is the case, then the connection will be aborted before the heavy
calculations. The principle is to use recv(MSG_PEEK) to detect the end,
and to destroy the pending handshake data before returning to the SSL
library so that it cannot start computing, notices the error and stops.
We don't do it without abortonclose though, because this can be used for
health checks from other haproxy nodes or even other components which
just want to see a handshake succeed.

This is in relation with GH issue #3124.
2025-10-01 10:23:04 +02:00
Willy Tarreau
1afaa7b59d MINOR: rawsock: introduce CO_RFL_TRY_HARDER to detect closures on complete reads
Normally, when reading a full buffer, or exactly the requested size, it
is not really possible to know if the peer had closed immediately after,
and usually we don't care. There's a problematic case, though, which is
with SSL: the SSL layer reads in small chunks of a few bytes, and can
consume a client_hello this way, then start computation without knowing
yet that the client has aborted. In order to permit knowing more, we now
introduce a new read flag, CO_RFL_TRY_HARDER, which says that if we've
read up to the permitted limit and the flag is set, then we attempt one
extra byte using MSG_PEEK to detect whether the connection was closed
immediately after that content or not. The first use case will obviously
be related to SSL and client_hello, but it might possibly also make sense
on HTTP responses to detect a pending FIN at the end of a response (e.g.
if a close was already advertised).
2025-10-01 10:23:01 +02:00
Willy Tarreau
dae4cfe8c5 MINOR: ssl: add the ssl_bc_sni sample fetch function to retrieve backend SNI
Sometimes in order to debug certain difficult situations it can be useful
to know what SNI was configured on a connection going to a server, for
example to match it against what the server saw or to detect cases where
a server would route on SNI instead of Host. This sample fetch function
simply retrieves the SNI configured on the backend connection, if any.
2025-10-01 10:18:53 +02:00
Willy Tarreau
205f1cbf4c BUG/MEDIUM: wdt: improve stuck task detection accuracy
The fact that the watchdog timer measures the execution time from the
last return from the poller tends to amplify the impact of multiple
bad tasks, and may explain some of the panics reported by Felipe and
Ricardo in GH issues #3084, #3092 and #3101. The problem is that we
check the time if we see that the scheduler appears not to be moving
anymore, but one situation may still arise and catch a bad task:
  - one slow task takes so long a time that it triggers the watchdog
    twice, emitting a warning the second time (~200ms). The scheduler
    is rightfully marked as stuck.
  - then it completes and the scheduler is no longer stuck. Many other
    tasks run in turn, they all take quite some time but not enough to
    trigger a warning. But collectively their cost adds up.
  - then a task takes more than the warning time (100ms), and causes
    the total execution time to cross the second. The watchdog is
    called, sees that we've spend more than 1 second since we left the
    poller, and marks the thread as stuck.
  - the task is not finished, the watchdog is called again, sees more
    than one second with a stuck thread and panics 100ms later.

The total time away from the poller is indeed more than one second,
which is very bad, but no single task caused this individually, and
while the warnings are OK, the watchdog should not panic in this case.

This patch revisits the approach to store the moment the scheduler was
marked as stuck in the wdt context. The idea is that this date will be
used to detect warnings and panics. And by doing so and exploiting the
new is_sched_alive(thr), we can greatly simplify the mechanism so that
the signal handling thread does the strict minimum (mark the scheduler
as possibly stuck and update the stuck_start date), and only bounces to
the reporting thread if the scheduler made no progress since last call.
This means that without even doing computations in the handing thread,
we can continue to avoid all bounces unless a warning is required. Then
when the reporting thread is signaled, it will check the dates from the
last moment the scheduler was marked, and will decide to warn or panic.

The panic decision continues to pass via a TH_FL_STUCK flag to probe the
code so that exceptionally slow code (e.g. live cert generation etc) can
still find a way to avoid the panic if absolutely certain that things
are still moving.

This means that now we have the guarantee that panics will only happen
if a given task spends more than one full second not moving, and that
warnings will be issued for other calls crossing the warn delay boundary.

This was tested using artificially slow operations, and all combinations
which individually took less than a second only resulted in floods of
warnings even if the total reported time in the warning was much higher,
while those above one second provoked the panic.

One improvement could consist in reporting the time since last stuck
in the thread dumps to differentiate the individual task from the whole
set.

This needs to be backported to 3.2 along with the two previous patches:

    MINOR: sched: let's permit to share the local ctx between threads
    MINOR: sched: pass the thread number to is_sched_alive()
2025-10-01 10:18:53 +02:00
Willy Tarreau
25f5f357cc MINOR: sched: pass the thread number to is_sched_alive()
Now it will be possible to query any thread's scheduler state, not
only the current one. This aims at simplifying the watchdog checks
for reported threads. The operation is now a simple atomic xchg.
2025-10-01 10:18:53 +02:00
Willy Tarreau
7c7e17a605 MINOR: sched: let's permit to share the local ctx between threads
The watchdog timer has to go through complex operations due to not being
able to check if another thread's scheduler is still ticking. This is
simply because the scheduler status is marked as thread-local while it
could in fact also be an array. Let's do that (and align the array to
avoid false sharing) so that it's now possible to check any scheduler's
status.
2025-10-01 10:18:53 +02:00
Olivier Houchard
21ae35dd29 BUG/MEDIUM: stick-tables: Make sure not to free a pending entry
There is a race condition, an entry can be free'd by stksess_kill()
between the time stktable_add_pend_updates() gets the entry from the
mt_list, and the time it adds it to the ebtree.
To prevent this, use the newly implemented MT_LIST_POP_LOCKED() to keep
the stksess locked until it is added to the tree. That way,
__stksess_kill() will wait until we're done with it.

This should be backported to 3.2.
2025-09-30 16:25:07 +02:00
William Lallemand
3ce597bfa2 BUG/MEDIUM: acme: free() of i2d_X509_REQ() with AWS-LC
When using AWS-LC, the free() of the data ptr resulting from
i2d_X509_REQ() might crash, because it uses the free() of the libc
instead of OPENSSL_free().

It does not seems to be a problem on openssl builds.

Must be backported in 3.2.
2025-09-29 13:46:51 +02:00
William Lallemand
b70c7f48fa MINOR: acme: implement "reuse-key" option
The new "reuse-key" option in the "acme" section, allows to keep the
private key instead of generating a new one at each renewal.
2025-09-27 21:41:39 +02:00
William Lallemand
a9ccf692e7 BUG/MEDIUM: acme: cfg_postsection_acme() don't init correctly acme sections
The cfg_postsection_acme() redefines its own cur_acme variable, pointing
to the first acme section created. Meaning that the first section would
be init multiple times, and the next sections won't never be
initialized.

It could result in crashes at the first use of all sections that are not
the first one.

Must be backported in 3.2
2025-09-27 19:58:44 +02:00
William Lallemand
406fd0ceb1 BUG/MINOR: acme: don't unlink from acme_ctx_destroy()
Unlinking the acme_ctx element from acme_ctx_destroy() requires to have
the element unlocked, because MT_LIST_DELETE() locks the element.

acme_ctx_destroy() frees the data from acme_ctx with the ctx still
linked and unlocked, then lock to unlink. So there's a small risk of
accessing acme_ctx from somewhere else. The only way to do that would be
to use the `acme challenge_ready` CLI command at the same time.

Fix the issue by doing a mt_list_unlock_link() and a
mt_list_unlock_self() to unlink the element under the lock, then destroy
the element.

This must be backported in 3.2.
2025-09-27 18:52:56 +02:00
Chris Staite
54f53bc875 MINOR: backend: srv_is_up converter
There is currently an srv_queue converter which is capable of taking the
output of a dynamic name and determining the queue length for a given
server.  In addition there is a sample fetcher for whether a server is
currently up.  This simply combines the two such that srv_is_up can be
used as a converter too.

Future work might extend this to other sample fetchers for servers, but
this is probably the most useful for acl routing.
2025-09-26 10:46:48 +02:00
Chris Staite
faba98c85f MINOR: backend: srv_queue helper
In preparation of providing further server converters, split the code
for finding the server from the sample out.

Additionally, update the documentation for srv_queue converter to note
security concerns.
2025-09-26 10:46:48 +02:00
William Lallemand
b3b910cc3f BUILD: acme: fix false positive null pointer dereference
src/acme.c: In function ‘cfg_parse_acme_vars_provider’:
src/acme.c:471:9: error: potential null pointer dereference [-Werror=null-dereference]
  471 |         free(*dst);
      |         ^~~~~~~~~~

gcc13 on ubuntu 24.04 detects a false positive when building
3e72a9f ("MINOR: acme: provider-name for dpapi sink").
Indeed dst can't be NULL. Clarify the code so gcc don't complain
anymore.
2025-09-26 10:34:35 +02:00
William Lallemand
3e72a9f618 MINOR: acme: provider-name for dpapi sink
Like "acme-vars", the "provider-name" in the acme section is used in
case of DNS-01 challenge and is sent to the dpapi sink.

This is used to pass the name of a DNS provider in order to chose the
DNS API to use.

This patch implements the cfg_parse_acme_vars_provider() which parses
either acme-vars or provider-name options and escape their strings.

Example:

     $ ( echo "@@1 show events dpapi -w -0"; cat - ) | socat /tmp/master.sock -  | cat -e
     <0>2025-09-18T17:53:58.831140+02:00 acme deploy foobpar.pem thumbprint gDvbPL3w4J4rxb8gj20mGEgtuicpvltnTl6j1kSZ3vQ$
     acme-vars "var1=foobar\"toto\",var2=var2"$
     provider-name "godaddy"$
     {$
       "identifier": {$
         "type": "dns",$
         "value": "example.com"$
       },$
       "status": "pending",$
       "expires": "2025-09-25T14:41:57Z",$
       [...]
2025-09-26 10:23:35 +02:00
William Lallemand
c52d69cc78 BUG/MEDIUM: ssl: ca-file directory mode must read every certificates of a file
The httpclient is configured with @system-ca by default, which uses the
directory returned by X509_get_default_cert_dir().

On debian/ubuntu systems, this directory contains multiple certificate
files that are loaded successfully. However it seems that on other
systems the files in this directory is the direct result of
ca-certificates instead of its source. Meaning that you would only have
a bundle file with every certificates in it.

The loading was not done correctly in case of directory loading, and was
only loading the first certificate of each file.

This patch fixes the issue by using X509_STORE_load_locations() on each
file from the scandir instead of trying to load it manually with BIO.

Not that we can't use X509_STORE_load_locations with the `dir` argument,
which would be simpler, because it uses X509_LOOKUP_hash_dir() which
requires a directory in hash form. That wouldn't be suited for this use
case.

Must be backported in every stable branches.

Fix issue #3137.
2025-09-26 09:36:55 +02:00
Christopher Faulet
7aa9f5ec98 BUG/MINOR: pattern: Fix pattern lookup for map with opt@ prefix
When we look for a map file reference, the file@ prefix is removed because
if may be omitted. The same is true with opt@ prefix. However this case was
not properly performed in pat_ref_lookup(). Let's do so.

This patch must be backported as far as 3.0.
2025-09-25 15:28:22 +02:00
William Lallemand
c325e34e6d CLEANUP: acme: acme_will_expire() uses acme_schedule_date()
Date computation between acme_will_expire() and acme_schedule_date() are
the same. Call acme_schedule_date() from acme_will_expire() and put the
functions as static. The patch also move the functions in the right
order.
2025-09-25 15:14:31 +02:00
William Lallemand
f256b5fdf3 BUG/MINOR: acme: possible overflow in acme_will_expire()
acme_will_expire() computes the schedule date using notAfter and
notBefore from the certificate. However notBefore could be greater than
notAfter and could result in an overflow.

This is unlikely to happen and would mean an incorrect certificate.

This patch fixes the issue by checking that notAfter > notBefore.

It also replace the int type by a time_t to avoid overflow on 64bits
architecture which is also unlikely to happen with certificates.

`(date.tv_sec + diff > notAfter)` was also replaced by `if (notAfter -
diff <= date.tv_sec)` to avoid an overflow.

Fix issue #3135.

Need to be backported to 3.2.
2025-09-25 15:12:14 +02:00
William Lallemand
68770479ea BUG/MINOR: acme: possible overflow on scheduling computation
acme_schedule_date() computes the schedule date using notAfter and
notBefore from the certificate. However notBefore could be greater than
notAfter and could result in an overflow.

This is unlikely to happen and would mean an incorrect certificate.

This patch fixes the issue by checking that notAfter > notBefore.

It also replace the int type by a time_t to avoid overflow on 64bits
architecture which is also unlikely to happen with certificates.

Fix issue #3136.

Need to be backported to 3.2.
2025-09-25 15:12:03 +02:00
Christopher Faulet
3be8b06a60 BUG/MINOR: pattern: Properly flag virtual maps as using samples
When a map file is load, internally, the pattern reference is flagged as
based on a sample. However it is not performed for virtual maps. This flag
is only used during startup to check the map compatibility when it used at
different places. At runtime this does not change anything. But errors can
be triggered during configuration parsing. For instance, the following valid
config will trigger an error:

    http-request set-map(virt@test) foo bar if !{ str(foo),map(virt@test) -m found }
    http-request set-var(txn.foo) str(foo),map(virt@test)

The fix is quite obvious. PAT_REF_SMP flag must be set for virtual map as
any other map.

A workaround is to use optional map (opt@...) by checking the map id cannot
reference an existing file.

This patch must be backported as far as 3.0.
2025-09-25 10:16:53 +02:00
Christopher Faulet
23e5d272af BUG/MINOR: compression: Test payload size only if content-length is specified
When a minimum size is defined to performe the comression, the message
payload size is tested. To do so, information from the HTX message a used to
determine the message length. However it is performed regardless the payload
length is fully known or not. Concretely, the test must on be performed when
a content-length value was speficied or when the message was fully received
(EOM flag set). Otherwise, we are unable to really determine the real
payload length.

Because of this bug, compression may be skipped for a large chunked message
because the first chunks received are too small. But this does not mean the
whole message is small.

This patch must be backported to 3.2.
2025-09-25 10:16:53 +02:00
Olivier Houchard
71199e394c BUG/MEDIUM: stick-tables: Don't let table_process_entry() handle refcnt
Instead of having table_process_entry() decrement the session's ref
counter, do it outside, from the caller. Some were missed, such as when
an action was invalid, which would lead to the ref counter not being
decremented, and the session not being destroyable.
It makes more sense to do that from the caller, who just obtained the
ref counter, anyway.
This should be backporter up to 2.8.
2025-09-22 23:14:19 +02:00
William Lallemand
fbffd2e25f BUG/MINOR: acme/cli: wrong description for "acme challenge_ready"
The "acme challenge_ready" command mistakenly use the description of the
"acme status" command. This patch adds the right description.

Must be backported to 3.2.
2025-09-22 19:14:54 +02:00
William Lallemand
34cdc5e191 MINOR: acme: check acme-vars allocation during escaping
Handle allocation properly during acme-vars parsing.
Check if we have a allocation failure in both the malloc and the
realloc and emits an error if that's the case.
2025-09-19 18:11:50 +02:00
William Lallemand
92c31a6fb7 MINOR: acme: acme-vars allow to pass data to the dpapi sink
In the case of the dns-01 challenge, the agent that handles the
challenge might need some extra information which depends on the DNS
provider.

This patch introduces the "acme-vars" option in the acme section, which
allows to pass these data to the dpapi sink. The double quotes will be
escaped when printed in the sink.

Example:

    global
        setenv VAR1 'foobar"toto"'

    acme LE
        directory https://acme-staging-v02.api.letsencrypt.org/directory
        challenge DNS-01
        acme-vars "var1=${VAR1},var2=var2"

Would output:

    $ ( echo "@@1 show events dpapi -w -0"; cat - ) | socat /tmp/master.sock -  | cat -e
    <0>2025-09-18T17:53:58.831140+02:00 acme deploy foobpar.pem thumbprint gDvbPL3w4J4rxb8gj20mGEgtuicpvltnTl6j1kSZ3vQ$
    acme-vars "var1=foobar\"toto\",var2=var2"$
    {$
      "identifier": {$
        "type": "dns",$
        "value": "example.com"$
      },$
      "status": "pending",$
      "expires": "2025-09-25T14:41:57Z",$
      [...]
2025-09-19 16:40:53 +02:00
Christopher Faulet
331689d216 BUG/MEDIUM: http-client: Fix the test on the response start-line
The commit 88aa7a780 ("MINOR: http-client: Trigger an error if first
response block isn't a start-line") introduced a bug. From an endpoint, an
applet or a mux, the <first> index must never be used. It is reserved to the
HTTP analyzers. From endpoint, this value may be undefined or just point on
any other block that the first one. Instead we must always get the head
block.

In taht case, to be sure the first HTX block in a response is a start-line,
we must use htx_get_head_type() function instead of htx_get_first_type().
Otherwise, we can trigger an error while the response is in fact properly
formatted.

It is a 3.3-speific issue. cNo backport needed.
2025-09-19 14:59:28 +02:00
Aurelien DARRAGON
5c299dee5a MEDIUM: stats: consider that shared stats pointers may be NULL
This patch looks huge, but it has a very simple goal: protect all
accessed to shared stats pointers (either read or writes), because
we know consider that these pointers may be NULL.

The reason behind this is despite all precautions taken to ensure the
pointers shouldn't be NULL when not expected, there are still corner
cases (ie: frontends stats used on a backend which no FE cap and vice
versa) where we could try to access a memory area which is not
allocated. Willy stumbled on such cases while playing with the rings
servers upon connection error, which eventually led to process crashes
(since 3.3 when shared stats were implemented)

Also, we may decide later that shared stats are optional and should
be disabled on the proxy to save memory and CPU, and this patch is
a step further towards that goal.

So in essence, this patch ensures shared stats pointers are always
initialized (including NULL), and adds necessary guards before shared
stats pointers are de-referenced. Since we already had some checks
for backends and listeners stats, and the pointer address retrieval
should stay in cpu cache, let's hope that this patch doesn't impact
stats performance much.
2025-09-18 16:49:51 +02:00
Aurelien DARRAGON
40eb1dd135 BUG/MEDIUM: sink: fix unexpected double postinit of sink backend
Willy experienced an unexpected behavior with the config below:

    global
        stats socket :1514

    ring buf1
        server srv1 127.0.0.1:1514

Indeed, haproxy would connect to the ring server twice since commit 23e5f18b
("MEDIUM: sink: change the sink mode type to PR_MODE_SYSLOG"), and one of the
connection would report errors.

The reason behind is is, despite the above commit saying no change of behavior
is expected, with the sink forward_px proxy now being set with PR_MODE_SYSLOG,
postcheck_log_backend() was being automatically executed in addition to the
manual cfg_post_parse_ring() function for each "ring" section. The consequence
is that sink_finalize() was called twice for a given "ring" section, which
means the connection init would be triggered twice.. which in turn resulted in
the behavior described above, plus possible unexpected side-effects.

To fix the issue, when we create the forward_px proxy, we now set the
PR_CAP_INT capability on it to tell haproxy not to automatically manage the
proxy (ie: to skip the automatic log backend postinit), because we are about
to manually manage the proxy from the sink API.

No backport needed, this bug is specific to 3.3
2025-09-18 16:49:29 +02:00
Willy Tarreau
79ef362d9e OPTIM: ring: avoid reloading the tail_ofs value before the CAS in ring_write()
The load followed by the CAS seem to cause two bus cycles, one to
retrieve the cache line in shared state and a second one to get
exclusive ownership of it. Tests show that on x86 it's much better
to just rely on the previous value and preset it to zero before
entering the loop. We just mask the ring lock in case of failure
so as to challenge it on next iteration and that's done.

This little change brings 2.3% extra performance (11.34M msg/s) on
a 64-core AMD.
2025-09-18 15:27:32 +02:00
Willy Tarreau
a727c6eaa5 OPTIM: ring: check the queue's owner using a CAS on x86
In the loop where the queue's leader tries to get the tail lock,
we also need to check if another thread took ownership of the queue
the current thread is currently working for. This is currently done
using an atomic load.

Tests show that on x86, using a CAS for this is much more efficient
because it allows to keep the cache line in exclusive state for a
few more cycles that permit the queue release call after the loop
to be done without having to wait again. The measured gain is +5%
for 128 threads on a 64-core AMD system (11.08M msg/s vs 10.56M).
However, ARM loses about 1% on this, and we cannot afford that on
machines without a fast CAS anyway, so the load is performed using
a CAS only on x86_64. It might not be as efficient on low-end models
but we don't care since they are not the ones dealing with high
contention.
2025-09-18 15:08:12 +02:00
Willy Tarreau
d25099b359 OPTIM: ring: always relax in the ring lock and leader wait loop
Tests have shown that AMD systems really need to use a cpu_relax()
in these two loops. The performance improves from 10.03 to 10.56M
messages per second (+5%) on a 128-thread system, without affecting
intel nor ARM, so let's do this.
2025-09-18 15:07:56 +02:00
Willy Tarreau
eca1f90e16 CLEANUP: ring: rearrange the wait loop in ring_write()
The loop is constructed in a complicated way with a single break
statement in the middle and many continue statements everywhere,
making it hard to better factor between variants. Let's first
reorganize it so as to make it easier to escape when the ring
tail lock is obtained. The sequence of instrucitons remains the
same, it's only better organized.
2025-09-18 14:58:38 +02:00
Willy Tarreau
4431e3bd26 OPTIM: sink: reduce contention on sink_announce_dropped()
perf top shows that sink_announce_dropped() consumes most of the CPU
on a 128-thread x86 system. Digging further reveals that the atomic
fetch_or() on the dropped field used to detect the presence of another
thread is entirely responsible for this. Indeed, the compiler implements
it using a CAS that loops without relaxing and makes all threads wait
until they can synchronize on this one, only to discover later that
another thread is there and they need to give up.

Let's just replace this with a hand-crafted CAS loop that will detect
*before* attempting the CAS if another thread is there. Doing so
achieves the same goal without forcing threads to agree. With this
simple change, the sustained request rate on h1 with all traces on
bumped from 110k/s to 244k/s!

This should be backported to stable releases where it's often needed
to help debugging.
2025-09-18 08:38:34 +02:00
Willy Tarreau
361c227465 MINOR: trace: don't call strlen() on the function's name
Currently there's a small mistake in the way the trace function and
macros. The calling function name is known as a constant until the
macro and passed as-is to the __trace() function. That one needs to
know its length and will call ist() on it, resulting in a real call
to strlen() while that length was known before the call. Let's use
an ist instead of a const char* for __trace() and __trace_enabled()
so that we can now completely avoid calling strlen() during this
operation. This has significantly reduced the importance of
__trace_enabled() in perf top.
2025-09-18 08:31:57 +02:00
Willy Tarreau
06fa9f717f MINOR: trace: don't call strlen() on the thread-id numeric encoding
In __trace(), we're making an integer for the thread id but this one
is passed through strlen() in the call to ist() because it's not a
constant. We do know that it's exactly 3 chars long so we can manage
this using ist2() and pass it the length instead in order to reduce
the number of calls to strlen().

Also let's note that the thread number will no longer be numeric for
thread numbers above 100.
2025-09-18 08:02:59 +02:00
Willy Tarreau
d53ad49ad1 BUG/MEDIUM: ring: invert the length check to avoid an int overflow
Vincent Gramer reported in GH issue #3125 a case of crash on a BUG_ON()
condition in the rings. What happens is that a message that is one byte
less than the maximum ring size is emitted, and it passes all the checks,
but once inflated by the extra +1 for the refcount, it can no longer. But
the check was made based on message size compared to space left, except
that this space left can now be negative, which is a high positive for
size_t, so the check remained valid and triggered a BUG_ON() later.

Let's compute the size the other way around instead (i.e. current +
needed) since we can't have rings as large as half of the memory space
anyway, thus we have no risk of overflow on this one.

This needs to be backported to all versions supporting multi-threaded
rings (3.0 and above).

Thanks to Vincent for the easy and working reproducer.
2025-09-17 18:45:13 +02:00
Willy Tarreau
8c077c17eb MINOR: server: add the "cc" keyword to set the TCP congestion controller
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with outgoing connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Unknown or
forbidden algorithms will be ignored and the default one will continue to
be used.
2025-09-17 17:19:33 +02:00
Willy Tarreau
4ed3cf295d MINOR: listener: add the "cc" bind keyword to set the TCP congestion controller
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with incoming connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Permission issues
might be reported (as warnings).
2025-09-17 17:03:42 +02:00
Willy Tarreau
fef4cfbd21 IMPORT: ebtree: only use __builtin_prefetch() when supported
It looks like __builtin_prefetch() appeared in gcc-3.1 as there's no
mention of it in 3.0's doc. Let's replace it with eb_prefetch() which
maps to __builtin_prefetch() on supported compilers and falls back to
the usual do{}while(0) on other ones. It was tested to properly build
with tcc as well as gcc-2.95.

This is ebtree commit 7ee6ede56a57a046cb552ed31302b93ff1a21b1a.
2025-09-17 14:30:32 +02:00
Willy Tarreau
6c54bf7295 IMPORT: eb32/eb64: place an unlikely() on the leaf test
In the loop we can help the compiler build slightly more efficient code
by placing an unlikely() around the leaf test. This shows a consistent
0.5% performance gain both on eb32 and eb64.

This is ebtree commit 6c9cdbda496837bac1e0738c14e42faa0d1b92c4.
2025-09-17 14:30:32 +02:00
Willy Tarreau
6af17d491f IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs
The current code calculates the next troot based on a calculation.
This was efficient when the algorithm was developed many years ago
on K6 and K7 CPUs running at low frequencies with few registers and
limited branch prediction units but nowadays with ultra-deep pipelines
and high latency memory that's no longer efficient, because the CPU
needs to have completed multiple operations before knowing which
address to start fetching from. It's sad because we only have two
branches each time but the CPU cannot know it. In addition, the
calculation is performed late in the loop, which does not help the
address generation unit to start prefetching next data.

Instead we should help the CPU by preloading data early from the node
and calculing troot as soon as possible. The CPU will be able to
postpone that processing until the dependencies are available and it
really needs to dereference it. In addition we must absolutely avoid
serializing instructions such as "(a >> b) & 1" because there's no
way for the compiler to parallelize that code nor for the CPU to pre-
process some early data.

What this patch does is relatively simple:

  - we try to prefetch the next two branches as soon as the
    node is known, which will help dereference the selected node in
    the next iteration; it was shown that it only works with the next
    changes though, otherwise it can reduce the performance instead.
    In practice the prefetching will start a bit later once the node
    is really in the cache, but since there's no dependency between
    these instructions and any other one, we let the CPU optimize as
    it wants.

  - we preload all important data from the node (next two branches,
    key and node.bit) very early even if not immediately needed.
    This is cheap, it doesn't cause any pipeline stall and speeds
    up later operations.

  - we pre-calculate 1<<bit that we assign into a register, so as
    to avoid serializing instructions when deciding which branch to
    take.

  - we assign the troot based on a ternary operation (or if/else) so
    that the CPU knows upfront the two possible next addresses without
    waiting for the end of a calculation and can prefetch their contents
    every time the branch prediction unit guesses right.

Just doing this provides significant gains at various tree sizes on
random keys (in million lookups per second):

  eb32   1k:  29.07 -> 33.17  +14.1%
        10k:  14.27 -> 15.74  +10.3%
       100k:   6.64 ->  8.00  +20.5%
  eb64   1k:  27.51 -> 34.40  +25.0%
        10k:  13.54 -> 16.17  +19.4%
       100k:   7.53 ->  8.38  +11.3%

The performance is now much closer to the sequential keys. This was
done for all variants ({32,64}{,i,le,ge}).

Another point, the equality test in the loop improves the performance
when looking up random keys (since we don't need to reach the leaf),
but is counter-productive for sequential keys, which can gain ~17%
without that test. However sequential keys are normally not used with
exact lookups, but rather with lookup_ge() that spans a time frame,
and which does not have that test for this precise reason, so in the
end both use cases are served optimally.

It's interesting to note that everything here is solely based on data
dependencies, and that trying to perform *less* operations upfront
always ends up with lower performance (typically the original one).

This is ebtree commit 05a0613e97f51b6665ad5ae2801199ad55991534.
2025-09-17 14:30:31 +02:00
Willy Tarreau
dcd4d36723 IMPORT: ebtree: delete unusable ebpttree.c
Since commit 21fd162 ("[MEDIUM] make ebpttree rely solely on eb32/eb64
trees") it was no longer used and no longer builds. The commit message
mentions that the file is no longer needed, probably that a rebase failed
and left the file there.

This is ebtree commit fcfaf8df90e322992f6ba3212c8ad439d3640cb7.
2025-09-17 14:30:31 +02:00
Aurelien DARRAGON
31b3be7aae CLEANUP: log: remove deadcode in px_parse_log_steps()
When logsteps proxy storage was migrated from eb nodes to bitmasks in
6a92b14 ("MEDIUM: log/proxy: store log-steps selection using a bitmask,
not an eb tree"), some unused eb node related code was left over in
px_parse_log_steps()

Not only this code is unused, it also resulted in wasted memory since
an eb node was allocated for nothing.

This should fix GH #3121
2025-09-17 11:31:17 +02:00
Willy Tarreau
3d73e6c818 BUG/MEDIUM: pattern: fix possible infinite loops on deletion (try 2)
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.

While this bug is 3.3 only, it turns out that 3.2 is also affected by
the incorrect loop construct in pat_ref_set_from_node(), where it's
possible to run an infinite loop since commit 010c34b8c7 ("MEDIUM:
pattern: consider gen_id in pat_ref_set_from_node()") due to the
"continue" statement being placed before the ebmb_next_dup() call.

As such the relevant part of this fix (pat_ref_set_from_elt) will
need to be backported to 3.2.
2025-09-16 16:32:39 +02:00
Willy Tarreau
f1b1d3682a Revert "BUG/MEDIUM: pattern: fix possible infinite loops on deletion"
This reverts commit 359a829ccb.
The fix is neither sufficient nor correct (it triggers ASAN). Better
redo it cleanly rather than accumulate invalid fixes.
2025-09-16 16:32:39 +02:00
Willy Tarreau
359a829ccb BUG/MEDIUM: pattern: fix possible infinite loops on deletion
Commit e36b3b60b3 ("MEDIUM: migrate the patterns reference to cebs_tree")
changed the construction of the loops used to look up matching nodes, and
since we don't need two elements anymore, the "continue" statement now
loops on the same element when deleting. Let's fix this to make sure it
passes through the next one.

No backport is needed, this is only 3.3.
2025-09-16 11:49:01 +02:00
Willy Tarreau
4edff4a2cc CLEANUP: vars: use the item API for the variables trees
The variables trees use the immediate cebtree API, better use the
item one which is more expressive and safer. The "node" field was
renamed to "name_node" to avoid any ambiguity.
2025-09-16 10:51:23 +02:00
Willy Tarreau
c058cc5ddf CLEANUP: tools: use the item API for the file names tree
The file names tree uses the immediate cebtree API, better use the
item one which is more expressive and safer.
2025-09-16 10:41:19 +02:00
Willy Tarreau
2d6b5c7a60 MEDIUM: connection: reintegrate conn_hash_node into connection
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.

With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
We're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes per total backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.

Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
2025-09-16 09:23:46 +02:00
Willy Tarreau
ceaf8c1220 MEDIUM: connection: move idle connection trees to ceb64
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.

This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):

     struct       before  after  delta
  conn_hash_nodea   56     32     -24
  srv_per_thread    96     72     -24

The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
2025-09-16 09:23:46 +02:00
Willy Tarreau
95b8adff67 MINOR: connection: pass the thread number to conn_delete_from_tree()
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
2025-09-16 09:23:46 +02:00
Willy Tarreau
efe519ab89 CLEANUP: backend: use a single variable for removed in srv_cleanup_idle_conns()
Probably due to older code, there's a boolean variable used to set
another one which is then checked. Also the first check is made under
the lock, which is unnecessary. Let's simplify this and use a single
variable. This only makes the code clearer, it doesn't change the output
code.
2025-09-16 09:23:46 +02:00
Willy Tarreau
f7d1fc2b08 MINOR: server: pass the server and thread to srv_migrate_conns_to_remove()
We'll need to have access to the srv_per_thread element soon from this
function, and there's no particular reason for passing it list pointers
so let's pass the server and the thread so that it is autonomous. It
also makes the calling code simpler.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d1c5df6866 CLEANUP: server: use eb64_entry() not ebmb_entry() to convert an eb64
There were a few leftovers from an earlier version of the conn_hash_node
that was using ebmb nodes. A few calls to ebmb_first() and ebmb_entry()
were still present while acting on an eb64 tree. These are harmless as
one is just eb_first() and the other container_of(), but it's confusing
so let's clean them up.
2025-09-16 09:23:46 +02:00
Willy Tarreau
3d18a0d4c2 CLEANUP: backend: factor the connection lookup loop
The connection lookup loop is made of two identical blocks, one looking
in the idle or safe lists and the other one looking into the safe list
only. The second one is skipped if a connection was found or if the request
looks for a safe one (since already done). Also the two are slightly
different due to leftovers from earlier versions in that the second one
checks for safe connections and not the first one, and the second one
sets is_safe which is not used later.

Let's just rationalize all this by placing them in a loop which checks
first from the idle conns and second from the safe ones, or skips the
first step if the request wants a safe connection. This reduces the
code and shortens the time spent under the lock.
2025-09-16 09:23:46 +02:00
Willy Tarreau
d18d972b1f MEDIUM: server: index server ID using compact trees
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
2025-09-16 09:23:46 +02:00
Willy Tarreau
66191584d1 MEDIUM: listener: index listener ID using compact trees
The listener ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct listener for this node, plus 8 bytes
in struct proxy. The struct listener is now 704 bytes large, and the
struct proxy 3080.
2025-09-16 09:23:46 +02:00