The shard keyword is already used by the peers and on the server lines. And
it is unrelated with the session keys distribution. So instead of talking
about shard for the session key hashing, we now use the term "bucket".
The purpose of this new BUG_ON is beyond BUG_ON_HOT(). While BUG_ON_HOT()
is meant to be light but placed on very hot code paths, BUG_ON_STRESS()
might be heavy and only used under stress-testing, to try to detect early
that something bad is starting to happen. This one is not even type-checked
when not defined because we don't want to risk the compiler emitting the
slightest piece of code there in production mode, so as to give enough
freedom to the developers.
DEBUG_STRESS is currently used only to expose "stress-level". With this
patch, we go a bit further, by automatically forcing DEBUG_STRICT and
DEBUG_STRICT_ACTION to their highest values in order to enable all
BUG_ON levels, and make all of them result in a crash. In addition,
care is taken to always only have 0 or 1 in the macro, so that it can be
tested using "#if DEBUG_STRESS > 0" as well as "if (DEBUG_STRESS) { }"
everywhere.
The goal will be to ease insertion of extra tests for builds dedicated
to stress-testing that enable possibly expensive extra checks on certain
code paths that cannot reasonably be compiled in for production code
right now.
The target patch fixes a rare race condition which happen when a MUX IO
handler is working on a connection already moved into the purge list. In
this case, the handler will incorrectly moved back the connection into
the idle list.
To fix this, conn_delete_from_tree() was extended to remove flags along
with the connection from the idle list. This was performed when the
connection is moved into the purge list. However, it introduces another
issue related to the idle server connection accounting. Thus it is
necessary to revert it prior to the incoming newer fix.
This patch must be backported to every version where the original commit
is.
AWS-LC features are not easily tested with just the openssl version
constant. AWS-LC uses its own API versioning stored in the
AWSLC_API_VERSION constant.
This patch add the two awslc_api_atleast and awslc_api_before predicates
that help to check the AWS-LC API.
check-reuse-pool can only perform as expected if reuse policy on the
backend is set to aggressive or higher. Update the documentation to
reflect this and implement a server diag warning.
The current ACME scheduler suffers from problems due to the way the
tasks are stored:
- MT_LIST are not scalables when having a lot of ACME tasks and having
to look for a specific one.
- the acme_task pointer was stored in the ckch_store in order to not
passing through the whole list. But a ckch_store can be updated and
the pointer lost in the previous one.
- when a task fails, the ptr in the ckch_store was not removed because
we only work with a copy of the original ckch_store, it would need to
lock the ckchs_tree and remove this pointer.
This patch fixes the issues by removing the MT_LIST-based architecture,
and replacing it by a simple ebmbtree + rwlock design.
The pointer to the task is not stored anymore in the ckch_store, but
instead it is stored in the acme_tasks tree. Finding a task is done by
doing a lookup on this tree with a RDLOCK.
Instead of checking if store->acme_task is not NULL, a lookup is also
done.
This allow to remove the stuck "acme_task" pointer in the store, which
was preventing to restart an acme task when the previous failed for this
specific certificate.
Must be backported in 3.2.
During 0-RTT sessions, some server transport parameters are reused after having
been save from previous sessions. These parameters must not be reduced
when it resends them. The client must check this is the case when some early data
are accepted by the server. This is what is implemented by this patch.
Implement qc_early_tranport_params_validate() which checks the new server parameters
are not reduced.
Also implement qc_ssl_eary_data_accepted() which was not implemented for TLS
stack without 0-RTT support (for instance wolfssl). That said this function
was no more used. This is why the compilation against wolfssl could not fail.
This patch allows the use of 0-RTT feature on QUIC server lines with "allow-0rtt"
option. In fact 0-RTT is really enabled only if ssl_sock_srv_try_reuse_sess()
successfully manages to reuse the SSL session and the chosen application protocol
from previous connections.
Note that, at this time, 0-RTT works only with quictls and aws-lc as TLS stack.
(0-RTT does not work at all (even for QUIC frontends) with libressl).
This patch is required to make 0-RTT work. It modifies the prototype of
quic_build_post_handshake_frames() to send post handshake frames from a
list of frames in place of the application encryption level (used
as <qc->ael> local variable).
This patch does not modify at all the current QUIC stack behavior (even for
QUIC frontends). It must be considered as a preparation for the code
to come about 0-RTT support for QUIC backends.
This function is called for both TCP and QUIC connections to reuse SSL sessions
saved by ssl_sess_new_srv_cb() callback called upon new SSL session creation.
In addition to this, a QUIC SSL session must reuse the ALPN and some specific QUIC
transport parameters. This is what is added by this patch for QUIC 0-RTT sessions.
Note that for now on, ssl_sock_srv_try_reuse_sess() may fail for QUIC connections
if it did not managed to reuse the ALPN. The caller must be informed of such an
issue. It must not enable 0-RTT for the current session in this case. This is
impossible without ALPN which is required to start a mux.
ssl_sock_srv_try_reuse_sess() is modified to always succeeds for TCP connections.
For both TCP and QUIC connections, this is ssl_sess_new_srv_cb() callback which
is called when a new SSL session is created. Its role is to save the session to
be reused for the next sessions.
This patch modifies this callback to save the QUIC parameters to be reused
for the next 0-RTT sessions (or during SSL session resumption).
The already existing path_params->nego_alpn member is used to store the ALPN as
this is done for TCP alongside path_params->tps new quic_early_transport_params
struct used to save the QUIC transport parameters to be reused for 0-RTT sessions.
Implement quic_reuse_srv_params() whose role is to reuse the ALPN negotiated
during a first connection to a QUIC backend alongside its transport parameters.
Define quic_early_transport_params new struct for QUIC transport parameters
in relation with 0-RTT. This parameters must be saved during a first session to
be reused for 0-RTT next sessions.
qc_early_transport_params_cpy() copies the 0-RTT transport parameters to be
saved during a first connection to a backend. The copy is made from
a quic_transport_params struct to a quic_ealy_transport_params struct.
On the contrary, qc_early_transport_params_reuse() copies the transport parameters
to be reused for a 0-RTT session from a previous one. The copy is made
from a quic_early_transport_params strcut to a quic_transport_params struct.
Also add QUIC_EV_EARLY_TRANSP_PARAMS trace event to dump such 0-RTT
transport parameters from traces.
Add a per thread ist struct to srv_per_thread struct to store the QUIC token to
be reused for subsequent sessions.
Parse at packet level (from qc_parse_ptk_frms()) these tokens and store
them calling qc_try_store_new_token() newly implemented function. This is
this new function which does its best (may fail) to update the tokens.
Modify qc_do_build_pkt() to resend these tokens calling quic_enc_token()
implemented by this patch.
Rename ->data qf_new_token struct field to ->w_data to distinguish it from
->r_data new field used to parse the NEW_TOKEN frame. Indeed to build the
NEW_TOKEN we need to write it to a static buffer into the frame struct. To
parse it we only need to store the address of the token field into the
RX buffer.
On Initial packet parsing, a new quic_conn instance is allocated via
qc_new_conn(). Then a CID is allocated with its value derivated from
client ODCID. On CID tree insert, a collision can occur if another
thread was already parsing an Initial packet from the same client. In
this case, the connection is released and the packet will be requeued to
the other thread.
Originally, CID collision check was performed prior to quic_conn
allocation. This was changed by the commit below, as this could cause
issue on quic_conn alloc failure.
commit 4ae29be18c
BUG/MINOR: quic: Possible endless loop in quic_lstnr_dghdlr()
However, this procedure is less optimal. Indeed, qc_new_conn() performs
many steps, thus it could be better to skip it on Initial CID collision,
which can happen frequently. This patch restores the older order of
operations, with CID collision check prior to quic_conn allocation.
To ensure this does not cause again the same bug, the CID is removed in
case of quic_conn alloc failure. This should prevent any loop as it
ensures that a CID found in the global tree does not point to a NULL
quic_conn, unless if CID is attach to a foreign thread. When this thread
will parse a re-enqueued packet, either the quic_conn is already
allocated or the CID has been removed, triggering a fresh CID and
quic_conn allocation procedure.
CIDs are provided by haproxy so that the peer can use them as DCID of
its packets. Their value is set via a random generator. It happens on
several occasions during connection lifetime:
* via ODCID derivation if haproxy is the server
* on quic_conn init if haproxy is the client
* during post-handshake if haproxy is the server
* on RETIRE_CONNECTION_ID frame parsing
CIDs are stored in a global tree. On ODCID derivation, a check is
performed to ensure the CID is not a duplicate value. This is mandatory
to properly handle multiple INITIAL packets from the same client on
different thread.
However, for the other cases, no check is performed for CID collision.
As _quic_cid_insert() is silent, the issue is not detected at all. This
results in a CID advertized to the peer but not stored in the global
one. In the end, this may cause two issues. The first one is that
packets from the client which use the new CID will be rejected by
haproxy, most probably with a STATELESS_RESET. The second issue is that
it can cause a crash during quic_conn release. Indeed, the CID is stored
in the quic_conn local tree and thus eb_delete() for the global tree
will be performed. As <leaf_p> member is uninit, this results in a
segfault.
Note that this issue is pretty rare. It can only be observed if running
with a high number of concurrent connections in parallel, so that the
random generator will provide duplicate values. Patch is still labelled
as MEDIUM as this modifies code paths used frequently.
To fix this, _quic_cid_insert() unsafe function is completely removed.
Instead, quic_cid_insert() can be used, which reports an error code if a
collision happens. CID are then stored in the quic_conn tree only after
global tree insert success. Here is the solution for each steps if a
collision occurs :
* on init as client: the connection is completely released
* post-handshake: the CID is immediately released. The connection is
kept, but it will miss an extra CID.
* on RETIRE_CONNECTION_ID parsing: a loop is implemented to retry random
generation. It it fails several times, the connection is closed in
error.
A small convenience change is made to quic_cid_insert(). Output
parameter <new_tid> can now be NULL, which is useful as most of the
times caller do not care about it.
This must be backported up to 2.6.
Split new_quic_cid() function into multiple ones. This patch should not
introduce any visible change. The objective is to render CID allocation
and generation more modular.
The first advantage of this patch is to bring code simplication. In
particular, conn CID sequence number increment and insertion into
connection tree is simpler than before. Another improvment is also that
errors could now be handled easier at each different steps of the CID
init.
This patch is a prerequisite for the fix on CID collision, thus it must
be backported prior to it to every affected version.
This stick-table field was atomically updated with the last update id pushed
and dumped on the CLI but never used otherwise. And all peer sessions share
the same id because it is a stick-table info. So the info in peers dump is
pretty limited.
So, let's remove it.
This is the same as the equivalent fix in ebtree:
The C standard specifies that it's undefined behavior to dereference
NULL (even if you use & right after). The hand-rolled offsetof idiom
&(((s*)NULL)->f) is thus technically undefined. This clutters the
output of UBSan and is simple to fix: just use the real offsetof when
it's available.
This is cebtree commit 2d08958858c2b8a1da880061aed941324e20e748.
The purpose here is to look in the environment for a variable whose
name looks like the provided one. This will be used to try to auto-
correct misspelled environment variables that would silently be turned
to an empty string.
The word fingerprinting functions are used to compare similar words to
suggest a correctly spelled one that looks like what the user proposed.
Currently the functions only support const char*, but there's no reason
for this, and it would be convenient to support substrings extracted
from random pieces of configurations. Here we're adding new variants
"_with_len" that take these ISTs and which are in fact a slight change
of the original ones that the old ones now rely on.
When a server is being disabled or deleted, in case it matches the
backend's ready_srv, this one is reset. However it's currently done in
a non-atomic way when the server goes down, and that could occasionally
reset the entry matching another server, but more importantly if in
parallel some requests are dequeued for that server, it may re-appear
there after having been removed, leading to a possible crash once it
is fully removed, as shown in issue #3177.
Let's make sure we reset the pointer when detaching the server from
the proxy, and use a CAS in both cases to only reset this server.
This fix needs to be backported to 3.2. There, srv_detach() is in
server.c instead of server.h. Thanks to Basha Mougamadou for the
detailed report and the useful backtraces.
The <total> field in the channel structure is now useless, so it can be
removed. The <bytes_in> field from the SC is used instead.
This patch is related to issue #1617.
The helper function applet_output_data() returns the amount of data in the
output buffer of an applet. For applets using the new API, it is based on
data present in the outbuf buffer. For legacy applets, it is based on input
data present in the input channel's buffer. The HTX version,
applet_htx_output_data(), is also available
This patch is related to issue #1617.
In previous patches, these counters were added per frontend, backend, server
and listener. With this patch, these counters are reported on stats,
including promex.
Note that the stats file minor version was incremented by one because the
shm_stats_file_object struct size has changed.
This patch is related to issue #1617.
bytes_in and bytes_out counters per frontend, backend, listener and server
were removed and we now rely on, respectively on, req_in and res_in
counters.
This patch is related to issue #1617.
per-stream bytes_in and bytes_out counters was removed and replaced by
req.in and res.in. Coorresponding samples still exists but replies on new
counters.
This patch is related to issue #1617.
Thanks to the previous patch, and based on info available on the stream, it
is now possible to have counters for frontends, backends, servers and
listeners to report number of bytes received and sent on both sides.
This patch is related to issue #1617.
req.in and req.out samples can now be used to get the number of bytes
received by a client and send to the server. And res.in and res.out samples
can be used to get the number of bytes received by a server and send to the
client. These info are stored in the logs structure inside a stream.
This patch is related to issue #1617.
<bytes_in> and <bytes_out> counters were added to SC to count, respectively,
the number of bytes received from an endpoint or sent to an endpoint. These
counters are updated for connections and applets.
This patch is related to issue #1617.
Almost all callers of _srv_add_idle() lock the list then call the
function. It's not the most efficient and it requires some care from
the caller to take care of that lock. Let's change this a little bit by
having srv_add_idle() that takes the lock and calls _srv_add_idle() that
is now inlined. This way callers don't have to handle the lock themselves
anymore, and the lock is only taken around the sensitive parts, not the
function call+return.
Interestingly, perf tests show a small perf increase from 2.28-2.32M RPS
to 2.32-2.37M RPS on a 128-thread system.
This patch provides two functions acme_gen_tmp_pkey() and
acme_gen_tmp_x509().
These functions generates a unique keypair and X509 certificate that
will be stored in tmp_x509 and tmp_pkey. If the key pair or certificate
was already generated they will return the existing one.
The key is an RSA2048 and the X509 is generated with a expiration in the
past. The CN is "expired".
These are just placeholders to be used if we don't have files.
This is an API change, instead of passing a ckch_data alone, the
ckch_conf_kws.func() is called with a ckch_store.
This allows the callback to access the whole ckch_store, with the
ckch_conf and the ckch_data. But it requires the ckch_conf to be
actually put in the ckch_store before.
This patch removes <mux_state> field from quic_conn structure. The
purpose of this field was to indicate if MUX layer above quic_conn is
not yet initialized, active, or already released.
It became tedious to properly set it as initialization order of the
various quic_conn/conn/MUX layers now differ between the frontend and
backend sides, and also depending if 0-RTT is used or not. Recently, a
new change introduced in connect_server() will allow to initialize QUIC
MUX earlier if ALPN is cached on the server structure. This had another
level of complexity.
Thus, this patch removes <mux_state> field completely. Instead, a new
flag QUIC_FL_CONN_XPRT_CLOSED is defined. It is set at a single place
only on close XPRT callback invokation. It can be mixed with the new
utility functions qc_wait_for_conn()/qc_is_conn_ready() to determine the
status of conn/MUX layers now without an extra quic_conn field.
There's currently a function conn_delete_from_tree() which is used to
detach an idle connection from the tree it's currently attached to so
that it is no longer found. This function is used in three circumstances:
- when picking a new connection that no longer has any avail stream
- when temporarily working on the connection from an I/O handler,
in which case it's re-added at the end
- when killing a connection
The 2nd case above is quite specific, as it requires to preserve the
CO_FL_LIST_MASK flags so that the connection can be re-inserted into
the proper tree when leaving the handler. However, there's a catch.
When killing a connection, we want to be certain it will not be
reinserted into the tree. The flags preservation is causing a tiny
race if an I/O happens while the connection is in the kill list,
because in this case the I/O handler will note the connection flags,
do its work, then reinsert the connection where it believed it was,
then the connection gets purged, and another user can find it in the
tree.
The issue is very difficult to reproduce. On a 128-thread machine it
happens in H2 around 500k req/s after around 50M requests. In H1 it
happens after around 1 billion requests.
The fix here consists in passing an extra argument to the function to
indicate if the removal is permanent or not. When it's permanent, the
function will clear the associated flags. The callers were adjusted
so that all those dequeuing a connection in order to kill it do it
permanently and all other ones do it only temporarily.
A slightly different approach could have worked: the function could
always remove all flags, and the callers would need to restore them.
But this would require trickier modifications of the various call
places, compared to only passing 0/1 to indicate the permanent status.
This will need to be backported to all stable versions. The issue was
at least reproduced since 3.1 (not tested before). The patch will need
to be adjusted for 3.2 and older, because a 2nd argument "thr" was
added in 3.3, so the patch will not apply to older versions as-is.
Add a rwlock to control the server's path_parameter, to make sure
multiple threads don't set it at the same time, and it can't be seen in
an inconsistent state.
Also don't set the parameter every time, only set them if they have
changed, to prevent needless writes.
This does not need to be backported.
This patch is similar to the previous one, this time dealing with
qc_new_conn(). This function was asymetric on frontend and backend side,
as connection argument was set only in the latter case.
This was required prior due to qc_alloc_ssl_sock_ctx() signature. This
has changed with the previous patch, thus qc_new_conn() can also be
realigned on both FE and BE sides. <conn> member of quic_conn instance
is always set outside it, in qc_xprt_start() on the backend case.
ssl_sock_ctx is a generic object used both on TCP/SSL and QUIC stacks.
Most notably it contains a <conn> member which is a pointer to struct
connection.
On QUIC frontend side, this member is always set to NULL. Indeed,
connection is only created after handshake completion. However, this has
changed for backend side, where the connection is instantiated prior to
its quic_conn counterpart. Thus, ssl_sock_ctx member would be set in
this case as a convenience for use later in qc_ssl_do_hanshake().
However, this method was unsafe as the connection can be released,
without resetting ssl_sock_ctx member. Thus, the previous patch fixes
this by using on <conn> member through the quic_conn instance which is
the proper way.
Thus, this patch resets ssl_sock_ctx <conn> member to NULL. This is
deemed the cleanest method as it ensures that both frontend and backend
sides must not use it anymore.
Perf top showed that h1_snd_buf() was having great difficulties accessing
the proxy's server_id_hdr_name field in the middle of the headers loop.
Moving the assignment out of the loop to a local variable moved the
problem there as well:
| if (!(h1m->flags & H1_MF_RESP) && isttest(h1c->px->server_id_hdr_n
0.10 |20b0: mov -0x120(%rbp),%rdi
1.33 | mov 0x60(%rdi),%r10
0.01 | test %eax,%eax
0.18 | jne 2118
12.87 | mov 0x350(%r10),%rdi
0.01 | test %rdi,%rdi
0.05 | je 2118
| mov 0x358(%r10),%r11
It turns out that there are several atomically accessed fields in its
vicinity, causing the cache line to bounce all the time. Let's collect
the few frequently changed fields and place them together at the end
of the structure, and plug the 32-bit hole with another isolated field.
Doing so also reduced a little bit the cost of decrementing be->be_conn
in process_stream(), and overall the HTTP/1 performance increased by
about 1% both on ARM and x86_64.
When trying to reuse a backend connection, a connection hash is
calculated to match an entry with similar parameters. Previously, this
operation was skipped if the stream content wasn't based on HTTP, as it
would have been incompatible with http-reuse.
With the introduction of SPOP backends, this condition was removed, so
that it can also benefit from connection reuse. However, this means that
now hash calcul is always performed when connecting to a server, even
for TCP or log backends. This is unnecessary as these proxies cannot
perform connection reuse.
Note also that reuse mode is resetted on postparsing for incompatible
backends. This at least guarantees that no tree lookup will be performed
via be_reuse_connection(). However, connection lookup is still performed
in the session via session_get_conn() which is another unnecessary
operation.
Thus, this patch restores the condition so that reuse operations are now
entirely skipped if a backend mode is incompatible. This is implemented
via a new utility function named be_supports_conn_reuse().
This could be backported up to 3.1, as this commit could be considered
as a performance regression for tcp/log backend modes.
Ensure that QUIC support is compiled into haproxy when a QUIC server is
configured. This check is performed during _srv_parse_finalize() so that
it is detected both on configuration parsing and when adding a dynamic
server via the CLI.
Note that this changes the behavior of srv_is_quic() utility function.
Previously, it always returned false when QUIC support wasn't compiled.
With this new check introduced, it is now guaranteed that a QUIC server
won't exist if compilation support is not active. Hence srv_is_quic()
does not rely anymore on USE_QUIC define.
The mux currently refrains from sending data before H2_CS_FRAME_H, i.e.
before the peer's SETTINGS frame was received. While it makes sense on
the frontend, it's causing harm on the backend because it forces the
first request to be sent in two halves over an extra RTT: first the
preface and settings, second the request once the settings are received.
This is totally contrary to the philosophy of the H2 protocol, consisting
in permitting the client to send as soon as possible.
Actually what happens is the following:
- process_stream() calls connect_server()
- connect_server() creates a connection, and if the proto/alpn is guessed
or known, the mux is instantiated for the current request.
- the H2 init code wakes the h2 tasklet up and returns
- process_stream() tries to send the request using h2_snd_buf(), but that
one sees that we're before H2_CS_FRAME_H, refrains from doing so and
returns.
- process_stream() subscribes and quits
- the h2 tasklet can now execute to send the preface and settings, which
leave as a first TCP segment. The connection is ready.
- the iocb is woken again once the server's SETTINGS frame is received,
turning the connection to the H2_CS_FRAME_H state, and the iocb wake
up process_stream().
- process_stream() executes again and can try to send again.
- h2_snd_buf() is called and finally sends the request as a second TCP
segment.
Not only this is inefficient, but it also renders 0-RTT and TFO impossible
on H2 connections. When 0-RTT is used, only the preface and settings leave
as early data (the very first data of that connection), which is totally
pointless.
In order to fix this, we have to go through a few steps:
- first we need to let data be sent to a server immediately after the
SETTINGS frame was sent (i.e. in H2_CS_SETTINGS1 state instead of
H2_CS_FRAME_H). However, some protocol extensions are advertised by
the server using SETTINGS (e.g. RFC8441) and some requests might need
to know the existence of such extensions. For this reason we're adding
a new h2c flag, H2_CF_SETTINGS_NEEDED, which indicates that some
operations were not done because a server's SETTINGS frame is needed.
This is set when trying to send a protocol upgrade or extended CONNECT
during H2_CS_SETTINGS1, indicating that it's needed to wait for
H2_CS_FRAME_H in this case. The flag is always set on frontend
connections. This is what is being done in this patch.
- second, we need to be able to push the preface opportunistically with
the first h2_snd_buf() so that it's not needed to wake the tasklet up
just to send that and wake process_stream() again. This will be in a
separate patch.
By doing the first step, we're at least saving one needless tasklet
wakeup per connection (~9%), which results in ~5% backend connection
rate increase.
Returns a pointer to the first bind_conf matching <name> in a frontend
<front>.
When name is prefixed by a @ (@<filename>:<linenum>), it tries to look
for the corresponding filename and line of the configuration file.
NULL is returned if no match is found.
This patch adds functions to expose Encrypted Client Hello (ECH) status
and outer SNI information for logging and sample fetching.
Two new helper functions are introduced in ech.c:
- conn_get_ech_status() places the ECH processing status string into a
buffer.
- conn_get_ech_outer_sni() retrieves the outer SNI value if ECH
succeeded.
Two new sample fetch keywords are added:
- "ssl_fc_ech_status" returns the ECH status string.
- "ssl_fc_ech_outer_sni" returns the outer SNI value seen during ECH.
These allow ECH information to be used in HAProxy logs, ACLs, and
captures.
This patch introduces the USE_ECH option in the Makefile to enable
support for Encrypted Client Hello (ECH) with OpenSSL.
A new function, load_echkeys, is added to load ECH keys from a specified
directory. The SSL context initialization process in ssl_sock.c is
updated to load these keys if configured.
A new configuration directive, `ech`, is introduced to allow users to
specify the ECH key directory in the listener configuration.
A private keys that is password protected and was decoded during init
thanks to the password obtained thanks to 'ssl-passphrase-cmd' should
not be dumped via 'dump ssl cert' CLI command.
When a certificate is protected by a password, we can provide the
password via the dedicated pem_password_cb param provided to
PEM_read_bio_PrivateKey.
HAProxy will fetch the password automatically during init by calling a
user-defined external command that should dump the right password on its
standard output (see new 'ssl-passphrase-cmd' global option).
The devnull fd might be needed during configuration parsing, if some
options require to fork/exec for instance. So we now create it much
earlier in the init process and without depending on the '-q' or '-d'
parameters.
Since 3.0 where the CLI started to use rcv_buf, it appears that some
external tools sending chained commands are randomly experiencing
failures. Each time this happens when the whole command is sent as a
single packet, immediately followed by a close. This is not a correct
way to use the CLI but this has been working for ages for simple
netcat-based scripts, so we should at least try to preserve this.
The cause of the failure is that the first LF that acks a command is
immediately sent back to the client and rejected due to the closed
connection. This in turn forwards the error back to the applet which
aborts its processing.
Before 3.0 the responses would be queued into the buffer, then sent
back to the channel, and would all fail at once. This changed when
snd_buf/rcv_buf were implemented because the applets are much more
responsive and since they yield between each command, they can
deliver one ACK at a time that is immediately forwarded down the
chain.
An easy way to observe the problem is to send 5 map updates, a shutdown,
and immediately close via tcploop, and in parallel run a periodic
"show map" to count the number of elements:
$ tcploop -U /tmp/sock1 C S:"add map #0 1 1; add map #0 2 2; add map #0 3 3; add map #0 4 4; add map #0 5 5\n" F K
Before 3.0, there would always be 5 elements. Since 3.0 and before
20ec1de214 ("MAJOR: cli: Refacor parsing and execution of pipelined
commands"), almost always 2. And since that commit above in 3.2, almost
always one. Doing the same using socat or netcat shows almost always 5...
It's entirely timing-dependent, and might even vary based on the RTT
between the client and haproxy!
The approach taken here consists in doing the same principle as MSG_MORE
or Nagle but on the response buffer: the applet doesn't need to send a
single ACK for each command when it has already been woken up and is
scheduled to come back to work. It's fine (and even desirable) that
ACKs are grouped in a single packet as much as possible.
For this reason, this patch implements APPCTX_CLI_ST1_YIELD, a new CLI
flag which indicates that the applet left in yielding condition, i.e.
it has not finished its work. This flag is used by .rcv_buf to hold
pending data. This way we won't return partial responses for no reason,
and we can continue to emulate the previous behavior.
One very nice benefit to this is that it saves huge amounts of CPU on
the client. In the test below that tries to update 1M map entries, the
CPU used by socat went from 100% to 0% and the total transfer time
dropped by 28%:
before:
$ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \
printf "add map #0 %d %d\n",i,i,i }}' | socat /tmp/sock1 - >/dev/null
real 0m2.407s
user 0m1.485s
sys 0m1.682s
after:
$ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \
printf "add map #0 %d %d\n",i,i,i }}' | socat /tmp/sock1 - >/dev/null
real 0m1.721s
user 0m0.952s
sys 0m0.057s
The difference is also quite visible on the number of syscalls during
the test (for 1k updates):
before:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.071691 0 100001 sendmsg
after:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000011 1 9 sendmsg
This patch will need to be backported to 3.0, and depends on these two
patches to be backported as well:
MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty
MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()
As a folow-up to f40f5401b9, explicitely
use atomic operations to set the prev and next fields, to make sure the
compiler can't assume anything about it, and just does it.
This should be backported after f40f5401b9 up to 2.8.
The USE_KTLS test is currently being done outside of the USE_OPENSSL
guard so disabling USE_OPENSSL still results in build failures on
libcs built with support for kernels before 4.17, because we enable
KTLS by default on linux. Let's move the KTLS block inside the
USE_OPENSSL guard instead.
No backport is needed since KTLS is only in 3.3.
This is a second attempt at fixing issues on 32bits systems which would
trigger the following BUG_ON() statement:
FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825 shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this
This is a drop-in replacement for d30b88a6c + 4693ee0ff, as suggested by
Willy.
Indeed, on supported platforms unsigned int can be assumed to be 4 bytes
long, and long can be assumed to be 8 bytes long. As such, the previous
attempt was overkill and added unecessary maintenance complexity which
could result in bugs if not used properly. Moreover, it would only
partially solve the issue, since on little endian vs big endian
architectures, the provisioned memory areas (originating from the same
shm stats file) could be read differently by the host.
Instead we fix the aligments issues, and this alone helps to ensure
struct memory consistency on 64 vs 32bits platforms. It was tested
on both i386 and i586.
last_change and last_sess counters are now stored as unsigned int, as
it helped to fix the alignment issues and they were found to be used
as 32bits integers anyway.
Thanks to Willy for problem analysis and the patch proposal.
No backport needed.
This reverts commit 466a603b59.
Due to the last 2 commits, this macro is now unused, and will probably
never be used, so let's get rid of that for now.
Several settings can be set to control stream multiplexing and
associated receive window. Previously, all of these settings were
configured using prefix "tune.quic.frontend.", despite being applied
blindly on both sides.
Fix this by duplicating these settings specific to frontend and backend
side. Options are also renamed to use the standardize prefix
"tune.quic.[be|fe].stream." notation.
Also, each option is individually renamed to better reflect its purpose
and hide technical details relative to QUIC transport parameter naming :
* max-data-size -> stream.rxbuf
* max-streams-bidi -> stream.max-concurrent
* stream-data-ratio -> stream.data-ratio
No need to backport.
Streamline max-idle-timeout option. Rename it to use the newer cohesive
naming scheme 'tune.quic.fe|be.'.
Two different fields were already defined in global struct. These fields
are moved into quic_tune along with other QUIC settings. However, no
parser was defined for backend option, this commit fixes this.
No need to backport this.
On frontend side, a quic_conn can have a dedicated FD or use the
listener one. These different modes can be activated via a global QUIC
tune setting.
This patch adjusts the option. First, it is renamed to the more
meaningful name 'tune.quic.fe.sock-per-conn'. Also, arguments are now
either 'default-on' or 'force-off'. The objective is to better highlight
reliationship with 'quic-socket' bind option.
The older option is deprecated and will be removed in 3.5.
A QUIC global tune setting is defined to be able to force Retry emission
prior to handshake. By definition, this ability is only supported by
QUIC servers, hence it is a frontend option only.
Rename the option to use "fe" prefix. The old option name is deprecated
and will be removed in 3.5
QUIC global memory can be limited across the entire process via a global
tune setting. Previously, this setting used to misleading "frontend"
prefix. As this is applied as a sum between all QUIC connections, both
from frontend and backend sides, remove the prefix. The new option name
is "tune.quic.mem.tx-max".
The older option name is deprecated and will be removed in 3.5.
This patch is similar to the previous one, except that it is focused on
Tx QUIC settings. It is now possible to toggle GSO and pacing on
frontend and backend sides independently.
As with previous patch, option are renamed to use "fe/be" unified
prefixes. This is part of the current serie of commits which unify QUI
settings. Older options are deprecated and will be removed on 3.5
release.
Various settings can be configured related to QUIC congestion controler.
This patch duplicates them to be able to set independent values on
frontend and backend sides.
As with previous patch, option are renamed to use "fe/be" unified
prefixes. This is part of the current serie of commits which unify QUIC
settings. Older options are deprecated and will be removed on 3.5
release.
Previously, QUIC glitches support was only implemented for frontend
side. Extend this so that the option can be specified separately both on
frontend and backend sides. Function _qcc_report_glitch() now retrieves
the relevant max value based on connection side.
In addition to this, option has been renamed to use "fe/be" prefixes.
This is part of the current serie of commits which unify QUIC settings.
Older options are deprecated and will be removed on 3.5 release.
Rename the option to quickly enable/disable every QUIC listeners. It now
takes an argument on/off. The documentation is extended to reflect the
fact that QUIC backend are not impacted by this option.
The older keyword is simply removed. Deprecation is considered
unnecessary as this setting is only useful during debugging.
A major reorganization of QUIC settings is going to be performed. One of
its objective is to clearly define options which can be separately
configured on frontend and backend proxy sides.
To implement this, quic_tune structure is extended to support fe and be
options. A set of macros/functions is also defined : it allows to
retrieve an option defined on both sides with unified code, based on
proxy side of a quic_conn/connection instance.
Avoid setting both el->prev and el->next on the same line.
The goal is to set both el->prev and el->next to el, but a naive
compiler, such as when we're using -O0, will set el->next first, then
will set el->prev to the value of el->next, but if we're unlucky,
el->next will have been set to something else by another thread.
So explicitely set both to what we want.
This should be backported up to 2.8.
As reported by @tianon on GH #3168, running haproxy on 32bits i386
platform would trigger the following BUG_ON() statement:
FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825
shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this
In fact, some efforts were already taken to ensure shm_stats_file_object
struct size remains consistent on 64 vs 32 bits platforms, since
shm_stats_file_object is part of the public API and directly exposed in
the stats file.
However, some parts were overlooked: some structs that are embedded in
shm_stats_file_object struct itself weren't using fixed-width integers,
and would sometime be unaligned. The result of this is that it was
up to the compiler (platform-dependent) to choose how to deal with such
ambiguities, which could cause the struct mapping/size to be inconsistent
from one platform to another.
Hopefully this was caught by the BUG_ON() statement and with the precious
help of @tianon
To fix this, we now use fixed-width integers everywhere for members
(and submembers) of shm_stats_file_object struct, and we use explicit
padding where missing to avoid automatic padding when we don't expect
one. As for the previous commit, we leverage FIXED_SIZE() and
FIXED_SIZE_ARRAY() macro to set the expected width for each integer
without causing build issues on platform that don't support larger
integers.
No backport needed, this feature was introduced during 3.3-dev.
freq-ctr struct is used by the shm_stats_file API, and more precisely,
it is used in the shm_stats_file_object struct for counters.
shm_stats_file_object struct requires to be plateform-independent, thus
we switch to using explicit size types (AKA fixed width integer types)
for freq-ctr, in the attempt to make freq-ctr size and memory mapping
consistent from one platform to another.
We cannot simply use fixed-width integer because some of them are
involved in atomic operations, and forcing a given width could
cause build issues on some platforms where atomic ops are not
implemented for large integers. Instead we leverage the FIXED_SIZE
macro to keep handling the integers as before, but forcing them to
be stored using expected number of bytes (unused bytes will simply
be ignored).
No change of behavior should be expected.
FIXED_SIZE() macro can be used to instruct the compiler that the struct
member named <name>, handled as <type>, must be stored using <size> bytes
and that even if the type used is actualler smaller than the expected size
FIXED_SIZE_ARRAY(), similar to FIXED_SIZE() but for arrays: it takes an
extra argument which is the number of members.
They may be used for portability concerns to ensure a structure mapping
remains consistent between platforms.
The previous commit switch from ncbuf to ncbmbuf as storage for received
CRYPTO frames. The latter ensures that buffering of such frames cannot
fail anymore due to gaps size.
Previously, extra mechanism were implemented on QUIC frames parsing
function to overcome the limitation of ncbuf on gaps size. Before
insertion, CRYPTO frames were stored in a temporary tree to order their
insertion. As this is not necessary anymore, this commit removes the
temporary tree insertion.
This commit is closely associated to the previous bug fix. As it
provides a neat optimization and code simplication, it can be backported
with it, but not in the next immediate release to spot potential
regression.
In QUIC, TLS handshake messages such as ClientHello are encapsulated in
CRYPTO frames. Each QUIC implementation can split the content in several
frames of random sizes. In fact, this feature is now used by several
clients, based on chrome so-called "Chaos protection" mechanism :
https://quiche.googlesource.com/quiche/+/cb6b51054274cb2c939264faf34a1776e0a5bab7
To support this, haproxy uses a ncbuf storage to store received CRYPTO
frames before passing it to the SSL library. However, this storage
suffers from a limitation as gaps between two filled blocks cannot be
smaller than 8 bytes. Thus, depending on the size of received CRYPTO
frames and their order, ncbuf may not be sufficient. Over time, several
mechanisms were implemented in haproxy QUIC frames parsing to overcome
the ncbuf limitation.
However, reports recently highlight that with some clients haproxy is
not able to deal with CRYPTO frames reception. In particular, this is
the case with the latest ngtcp2 release, which implements a similar
chaos protection mechanism via the following patch. It also seems that
this impacts haproxy interaction with firefox.
commit 89c29fd8611d5e6d2f6b1f475c5e3494c376028c
Author: Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com>
Date: Mon Aug 4 22:48:06 2025 +0900
Crumble Client Initial CRYPTO (aka chaos protection)
To fix haproxy CRYPTO frames buffering once and for all, an alternative
non-contiguous buffer named ncbmbuf has been recently implemented. This
type does not suffer from gaps size limitation, albeit at the cost of a
small reduction in the size available for data storage.
Thus, the purpose of this current patch is to replace ncbuf with the
newer ncbmbuf for QUIC CRYPTO frames parsing. Now, ncbmb_add() is used
to buffer received frames which is guaranteed to suceed. The only
remaining case of error is if a received frame offset and length exceed
the ncbmbuf data storage, which would result in a CRYPTO_BUFFER_EXCEEDED
error code.
A notable behavior change when switching to ncbmbuf implementation is
that NCB_ADD_COMPARE mode cannot be used anymore during add. Instead,
crypto frame content received at a similar offset will be overwritten.
A final note regarding STREAM frames parsing. For now, it is considered
unnecessary to switch from ncbuf in this case. Indeed, QUIC clients does
not perform aggressive fragmentation for them. Keeping ncbuf ensure that
the data storage size is bigger than the equivalent ncbmbuf area.
This should fix github issue #3141.
This patch must be backported up to 2.6. It is first necessary to pick
the relevant commits for ncbmbuf implementation prior to it.
Implement ncbmb_advance() function for the ncbmbuf type. This allows to
remove bytes in front of the buffer, regardless of the existing gaps.
This is implemented by resetting the corresponding bits of the bitmap.
As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
Implement ncbmb_data() function for the ncbmbuf type. Its purpose is
similar to its ncbuf counterpart : it returns the size in bytes of data
starting at a specific offset until the next gap.
As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
This patch implements add operation for ncbmbuf type.
This function is simpler than its ncbuf counterpart. Indeed, for now
only NCB_ADD_OVERWRT mode is supported. This compromise has been chosen
as ncbmbuf will be first used for QUIC CRYPTO frames handling, which
does not mandate to compare existing filled blocks during insertion.
As the previous patch, this commit must be backported prior to the fix
to come on QUIC CRYPTO frames parsing.
Define ncbmbuf which is an alternative non-contiguous buffer
implementation. "bm" abbreviation stands for bitmap, which reflects how
gaps and filled blocks are encoded. The main purpose of this
implementation is to get rid of the ncbuf limitation regarding the
minimal size for gaps between two blocks of data.
This commit adds the new module ncbmbuf. Along with it, some utility
functions such as ncbmb_make(), ncbmb_init() and ncbmb_is_empty() are
defined. Public API of ncbmbuf will be extended in the following
patches.
This patch is not considered a bug fix. However, it will be required to
fix issue encountered on QUIC CRYPTO frames parsing. Thus, it will be
necessary to backport the current patch prior to the fix to come.
ncbuf is a module which provide a non-contiguous buffer type
implementation. This patch extracts some basic types related to it into
a new file ncbuf_common.h.
This patch will be useful to provide a new non-contiguous buffer
alternative implementation based on a bitmap.
This patch is not a bug fix. However, it is necessary for ncbmbuf
implementation which will be required to fix a QUIC issue on CRYPTO
frames parsing. This, it will be necessary to backport the current patch
prior to the fix to come.
Instead of having per-table expiration tasks, just use one per shard.
The task will now go through all the tables to expire entries. When a
table gets an expiration earlier than the one previously known, it will
be put in a mt-list, and the task will be responsible to put it into an
eb32, ordered based on the next expiration.
Each per-shard task will run on a different thread, so it should lead to
a better load distribution than the per-table tasks.
Add a new initcall stage, STG_INIT_2, for stuff to be called after
step_init_2() is called, so after we know for sure that global.nbthread
will be set.
Modify stick-tables stkt_late_init() to run at STG_INIT_2 instead of
STG_INIT, in anticipation for it to be enhanced and have a need for
global.nbthread.
In mt_list_delete(), if the element was not in a list, then n and p will
point to it, and so setting n->prev and n->next will be enough to unlock it.
Don't do it twice, as once it's been done the first time, another thread may
be working with it, and may have added it to a list already, and doing it
a second time can lead to list inconsistencies.
This should be backported up to 2.8.
The SSL counters were not handled at all for QUIC connections. This patch
implement ssl_sock_update_counters() extracting the code from ssl_sock.c
and call this function where applicable both in TLS/TCP and QUIC parts.
Must be backported as far as 2.8.
Better check that munmap() always works, otherwise it means we might
have miscalculated an address, and if it fails silently, it will eat
all the memory extremely quickly. Let's add a BUG_ON() on munmap's
return.
As reported by Christopher, in UAF mode memory release of aligned
objects as introduced in commit ef915e672a ("MEDIUM: pools: respect
pool alignment in allocations") does not work. The padding calculation
in the freeing code is no longer correct since it now depends on the
alignment, so munmap() fails on EINVAL. Fortunately we don't care much
about it since we know it's the low bits of the passed address, which
is much simpler to compute, since all mmaps are page-aligned.
There's no need to backport this, as this was introduced in 3.3.
The pcre2 matching requires an array of matches for grouping, that is
allocated when executing the rule by pre-processing it, and that is
immediately freed after use. This is quite inefficient and results in
annoying patterns in "show profiling" that attribute the allocations
to libpcre2 and the releases to haproxy.
A good suggestion from Dragan is to pre-allocate these per thread,
since the entry is not specific to a regex. In addition we're already
limited to MAX_MATCH matches so we don't even have the problem of
having to grow it while parsing nor processing.
The current patch adds a per-thread pair of init/deinit functions to
allocate a thread-local entry for that, and gets rid of the dynamic
allocations. It will result in cleaner memory management patterns and
slightly higher performance (+2.5%) when using pcre2.
A certificate that does not have the 'jwt' flag enabled cannot be used
for JWT validation. We now raise a specific return value so that such a
case can be identified.
This option can be used to enable the use of a given certificate for JWT
verification. It defaults to 'off' so certificates that are declared in
a crt-store and will be used for JWT verification must have a
"jwt on" option in the configuration.
This converter will be in charge of performing the same operation as the
'jwt_verify' one except that it takes a full-on pem certificate path
instead of a public key path as parameter.
The certificate path can be either provided directly as a string or via
a variable. This allows to use certificates that are not known during
init to perform token validation.
The jwt_verify converter will not take full-on certificates anymore
in favor of a new soon to come jwt_verify_cert. We might end up with a
new jwt_verify_hmac in the future as well which would allow to deprecate
the jwt_verify converter and remove the need for a specific internal
tree for public keys.
The logic to always look into the internal jwt tree by default and
resolve to locking the ckch tree as little as possible will also be
removed. This allows to get rid of the duplicated reference to
EVP_PKEYs, the one in the jwt tree entry and the one in the ckch_store.
By refactoring the HTX to remove the extra field, a bug was introduced in
the stream-connector part. The <kip> (known input payload) value of a sedesc
was moved to <kop> (knwon output payload) using the same sedesc. Of course,
this is totally wrong. <kip> value of a sedesc must be forwarded to the
opposite side.
In addition, the operation is performed in sc_conn_send(). In this function,
we manipulate the stream-connectors. So se_fwd_kip() function was changed to
use the stream-connectors directely.
Now, the function sc_ep_fwd_kip() is now called with the both
stream-connectors to properly forward <kip> from on side to the opposite
side.
The bug is 3.3-specific. No backport needed.
Thanks for previous changes, it is now possible to remove the <extra> field
from the HTX structure. HTX_FL_ALTERED_PAYLOAD flag is also removed because
it is now unsued.
For now, the HTX extra value is used to specify the known part, in bytes, of
the HTTP payload we will receive. It may concerne the full payload if a
content-length is specified or the current chunk for a chunk-encoded
message. The main purpose of this value is to be used on the opposite side
to be able to announce chunks bigger than a buffer. It can also be used to
check the validity of the payload on the sending path, to properly detect
too big or too short payload.
However, setting this information in the HTX message itself is not really
appropriate because the information is lost when the HTX message is consumed
and the underlying buffer released. So the producer must take care to always
add it in all HTX messages. it is especially an issue when the payload is
altered by a filter.
So to fix this design issue, the information will be moved in the sedesc. It
is a persistent area to save the information. In addition, to avoid the
ambiguity between what the producer say and what the consumer see, the
information will be splitted in two fields. In this patch, the fields are
added:
* kip : The known input payload length
* kop : The known output payload lenght
The producer will be responsible to set <kip> value. The stream will be
responsible to decrement <kip> and increment <kop> accordingly. And the
consumer will be responsible to remove consumed bytes from <kop>.
With this function we can now pass the desired default value for the
abortonclose option when neither the option nor its opposite were set.
Let's also take this opportunity for using it directly from the HTTP
analyser since there's no point in re-checking the proxy's mode there.
As discussed on https://github.com/orgs/haproxy/discussions/3146 and on
the mailing list, there's a marked preference for having abortonclose
enabled by default when relevant. The point being that with todays'
internet, the large majority of requests sent with a closed input
channel are aborted requests, and that it's pointless to waste resources
processing them.
This patch now considers both "option abortonclose" and its opposite
"no option abortonclose" to figure whether abortonclose is enabled or
disabled in a backend. When neither are set (thus not even inherited
from a defaults section), then it considers the proxy's mode, and HTTP
mode implies abortonclose by default.
This may make some legacy services fail starting with 3.3. In this case
it will be sufficient to add "no option abortonclose" in either the
affected backend or the defaults section it derives from. But for
internet-facing proxies it's better to stay with the option enabled.
In order to prepare for changing the way abortonclose works, let's
replace the direct flag check with a similarly named function
(proxy_abrt_close) which returns the on/off status of the directive
for the proxy. For now it simply reflects the flag's state.
Normally, when reading a full buffer, or exactly the requested size, it
is not really possible to know if the peer had closed immediately after,
and usually we don't care. There's a problematic case, though, which is
with SSL: the SSL layer reads in small chunks of a few bytes, and can
consume a client_hello this way, then start computation without knowing
yet that the client has aborted. In order to permit knowing more, we now
introduce a new read flag, CO_RFL_TRY_HARDER, which says that if we've
read up to the permitted limit and the flag is set, then we attempt one
extra byte using MSG_PEEK to detect whether the connection was closed
immediately after that content or not. The first use case will obviously
be related to SSL and client_hello, but it might possibly also make sense
on HTTP responses to detect a pending FIN at the end of a response (e.g.
if a close was already advertised).
Now it will be possible to query any thread's scheduler state, not
only the current one. This aims at simplifying the watchdog checks
for reported threads. The operation is now a simple atomic xchg.
Implement MT_LIST_POP_LOCKED(), that behaves as MT_LIST_POP() and
removes the first element from the list, if any, but keeps it locked.
This should be backported to 3.2, as it will be use in a bug fix in the
stick tables that affects 3.2 too.
Like "acme-vars", the "provider-name" in the acme section is used in
case of DNS-01 challenge and is sent to the dpapi sink.
This is used to pass the name of a DNS provider in order to chose the
DNS API to use.
This patch implements the cfg_parse_acme_vars_provider() which parses
either acme-vars or provider-name options and escape their strings.
Example:
$ ( echo "@@1 show events dpapi -w -0"; cat - ) | socat /tmp/master.sock - | cat -e
<0>2025-09-18T17:53:58.831140+02:00 acme deploy foobpar.pem thumbprint gDvbPL3w4J4rxb8gj20mGEgtuicpvltnTl6j1kSZ3vQ$
acme-vars "var1=foobar\"toto\",var2=var2"$
provider-name "godaddy"$
{$
"identifier": {$
"type": "dns",$
"value": "example.com"$
},$
"status": "pending",$
"expires": "2025-09-25T14:41:57Z",$
[...]
In the case of the dns-01 challenge, the agent that handles the
challenge might need some extra information which depends on the DNS
provider.
This patch introduces the "acme-vars" option in the acme section, which
allows to pass these data to the dpapi sink. The double quotes will be
escaped when printed in the sink.
Example:
global
setenv VAR1 'foobar"toto"'
acme LE
directory https://acme-staging-v02.api.letsencrypt.org/directory
challenge DNS-01
acme-vars "var1=${VAR1},var2=var2"
Would output:
$ ( echo "@@1 show events dpapi -w -0"; cat - ) | socat /tmp/master.sock - | cat -e
<0>2025-09-18T17:53:58.831140+02:00 acme deploy foobpar.pem thumbprint gDvbPL3w4J4rxb8gj20mGEgtuicpvltnTl6j1kSZ3vQ$
acme-vars "var1=foobar\"toto\",var2=var2"$
{$
"identifier": {$
"type": "dns",$
"value": "example.com"$
},$
"status": "pending",$
"expires": "2025-09-25T14:41:57Z",$
[...]
This patch looks huge, but it has a very simple goal: protect all
accessed to shared stats pointers (either read or writes), because
we know consider that these pointers may be NULL.
The reason behind this is despite all precautions taken to ensure the
pointers shouldn't be NULL when not expected, there are still corner
cases (ie: frontends stats used on a backend which no FE cap and vice
versa) where we could try to access a memory area which is not
allocated. Willy stumbled on such cases while playing with the rings
servers upon connection error, which eventually led to process crashes
(since 3.3 when shared stats were implemented)
Also, we may decide later that shared stats are optional and should
be disabled on the proxy to save memory and CPU, and this patch is
a step further towards that goal.
So in essence, this patch ensures shared stats pointers are always
initialized (including NULL), and adds necessary guards before shared
stats pointers are de-referenced. Since we already had some checks
for backends and listeners stats, and the pointer address retrieval
should stay in cpu cache, let's hope that this patch doesn't impact
stats performance much.
If we see that another thread is already busy trying to announce the
dropped counter, there's no point going there, so let's just skip all
that operation from sink_write() and avoid disturbing the other thread.
This results in a boost from 244 to 262k req/s.
Currently there's a small mistake in the way the trace function and
macros. The calling function name is known as a constant until the
macro and passed as-is to the __trace() function. That one needs to
know its length and will call ist() on it, resulting in a real call
to strlen() while that length was known before the call. Let's use
an ist instead of a const char* for __trace() and __trace_enabled()
so that we can now completely avoid calling strlen() during this
operation. This has significantly reduced the importance of
__trace_enabled() in perf top.
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with outgoing connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Unknown or
forbidden algorithms will be ignored and the default one will continue to
be used.
It is possible on at least Linux and FreeBSD to set the congestion control
algorithm to be used with incoming connections, among the list of supported
and permitted ones. Let's expose this setting with "cc". Permission issues
might be reported (as warnings).
The C standard specifies that it's undefined behavior to dereference
NULL (even if you use & right after). The hand-rolled offsetof idiom
&(((s*)NULL)->f) is thus technically undefined. This clutters the
output of UBSan and is simple to fix: just use the real offsetof when
it's available.
Note that there's no clear statement about this point in the spec,
only several points which together converge to this:
- From N3220, 6.5.3.4:
A postfix expression followed by the -> operator and an identifier
designates a member of a structure or union object. The value is
that of the named member of the object to which the first expression
points, and is an lvalue.
- From N3220, 6.3.2.1:
An lvalue is an expression (with an object type other than void) that
potentially designates an object; if an lvalue does not designate an
object when it is evaluated, the behavior is undefined.
- From N3220, 6.5.4.4 p3:
The unary & operator yields the address of its operand. If the
operand has type "type", the result has type "pointer to type". If
the operand is the result of a unary * operator, neither that operator
nor the & operator is evaluated and the result is as if both were
omitted, except that the constraints on the operators still apply and
the result is not an lvalue. Similarly, if the operand is the result
of a [] operator, neither the & operator nor the unary * that is
implied by the [] is evaluated and the result is as if the & operator
were removed and the [] operator were changed to a + operator.
=> In short, this is saying that C guarantees these identities:
1. &(*p) is equivalent to p
2. &(p[n]) is equivalent to p + n
As a consequence, &(*p) doesn't result in the evaluation of *p, only
the evaluation of p (and similar for []). There is no corresponding
special carve-out for ->.
See also: https://pvs-studio.com/en/blog/posts/cpp/0306/
After this patch, HAProxy can run without crashing after building w/
clang-19 -fsanitize=undefined -fno-sanitize=function,alignment
This is ebtree commit bd499015d908596f70277ddacef8e6fa998c01d5.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 5211c2f71d78bf546f5d01c8d3c1484e868fac13.
We'll use this to improve the definition of container_of(). Let's define
it if it does not exist. We can rely on __builtin_offsetof() on recent
enough compilers.
This is ebtree commit 1ea273e60832b98f552b9dbd013e6c2b32113aa5.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 69b2ef57a8ce321e8de84486182012c954380401.
From 'man gcc': passing 0 as the argument to "__builtin_ctz" or
"__builtin_clz" invokes undefined behavior. This triggers UBsan
in HAProxy.
[wt: tested in treebench and verified not to cause any performance
regression with opstime-u32 nor stress-u32]
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit 8c29daf9fa6e34de8c7684bb7713e93dcfe09029.
Signed-off-by: Willy Tarreau <w@1wt.eu>
This is ebtree commit cf3b93736cb550038325e1d99861358d65f70e9a.
While the previous optimizations couldn't be preserved due to the
possibility of out-of-bounds accesses, at least the prefetch is useful.
A test on treebench shows that for 64k short strings, the lookup time
falls from 276 to 199ns per lookup (28% savings), and the insert falls
from 311 to 296ns (4.9% savings), which are pretty respectable, so
let's do this.
This is ebtree commit b44ea5d07dc1594d62c3a902783ed1fb133f568d.
It looks like __builtin_prefetch() appeared in gcc-3.1 as there's no
mention of it in 3.0's doc. Let's replace it with eb_prefetch() which
maps to __builtin_prefetch() on supported compilers and falls back to
the usual do{}while(0) on other ones. It was tested to properly build
with tcc as well as gcc-2.95.
This is ebtree commit 7ee6ede56a57a046cb552ed31302b93ff1a21b1a.
Similar to previous patches, let's improve the insert() descent loop to
avoid discovering mandatory data too late. The change here is even
simpler than previous ones, a prefetch was installed and troot is
calculated before last instruction in a speculative way. This was enough
to gain +50% insertion rate on random data.
This is ebtree commit e893f8cc4d44b10f406b9d1d78bd4a9bd9183ccf.
This is the same principles as for the latest improvements made on
integer trees. Applying the same recipes made the ebmb_lookup()
function jump from 10.07 to 12.25 million lookups per second on a
10k random values tree (+21.6%).
It's likely that the ebmb_lookup_longest() code could also benefit
from this, though this was neither explored nor tested.
This is ebtree commit a159731fd6b91648a2fef3b953feeb830438c924.
In the loop we can help the compiler build slightly more efficient code
by placing an unlikely() around the leaf test. This shows a consistent
0.5% performance gain both on eb32 and eb64.
This is ebtree commit 6c9cdbda496837bac1e0738c14e42faa0d1b92c4.
This one was previously used to preload from the node and keep a copy
in a register on i386 machines with few registers. With the new more
optimal code it's totally useless, so let's get rid of it. By the way
the 64 bit code didn't use that at all already.
This is ebtree commit 1e219a74cfa09e785baf3637b6d55993d88b47ef.
Instead of shifting the XOR value right and comparing it to 1, which
roughly requires 2 sequential instructions, better test if the XOR has
any bit above the current bit, which means any bit set among those
strictly higher, or in other words that XOR & (-bit << 1) is non-zero.
This is one less instruction in the fast path and gives another nice
performance gain on random keys (in million lookups/s):
eb32 1k: 33.17 -> 37.30 +12.5%
10k: 15.74 -> 17.08 +8.51%
100k: 8.00 -> 9.00 +12.5%
eb64 1k: 34.40 -> 38.10 +10.8%
10k: 16.17 -> 17.10 +5.75%
100k: 8.38 -> 8.87 +5.85%
This is ebtree commit c942a2771758eed4f4584fe23cf2914573817a6b.
The current code calculates the next troot based on a calculation.
This was efficient when the algorithm was developed many years ago
on K6 and K7 CPUs running at low frequencies with few registers and
limited branch prediction units but nowadays with ultra-deep pipelines
and high latency memory that's no longer efficient, because the CPU
needs to have completed multiple operations before knowing which
address to start fetching from. It's sad because we only have two
branches each time but the CPU cannot know it. In addition, the
calculation is performed late in the loop, which does not help the
address generation unit to start prefetching next data.
Instead we should help the CPU by preloading data early from the node
and calculing troot as soon as possible. The CPU will be able to
postpone that processing until the dependencies are available and it
really needs to dereference it. In addition we must absolutely avoid
serializing instructions such as "(a >> b) & 1" because there's no
way for the compiler to parallelize that code nor for the CPU to pre-
process some early data.
What this patch does is relatively simple:
- we try to prefetch the next two branches as soon as the
node is known, which will help dereference the selected node in
the next iteration; it was shown that it only works with the next
changes though, otherwise it can reduce the performance instead.
In practice the prefetching will start a bit later once the node
is really in the cache, but since there's no dependency between
these instructions and any other one, we let the CPU optimize as
it wants.
- we preload all important data from the node (next two branches,
key and node.bit) very early even if not immediately needed.
This is cheap, it doesn't cause any pipeline stall and speeds
up later operations.
- we pre-calculate 1<<bit that we assign into a register, so as
to avoid serializing instructions when deciding which branch to
take.
- we assign the troot based on a ternary operation (or if/else) so
that the CPU knows upfront the two possible next addresses without
waiting for the end of a calculation and can prefetch their contents
every time the branch prediction unit guesses right.
Just doing this provides significant gains at various tree sizes on
random keys (in million lookups per second):
eb32 1k: 29.07 -> 33.17 +14.1%
10k: 14.27 -> 15.74 +10.3%
100k: 6.64 -> 8.00 +20.5%
eb64 1k: 27.51 -> 34.40 +25.0%
10k: 13.54 -> 16.17 +19.4%
100k: 7.53 -> 8.38 +11.3%
The performance is now much closer to the sequential keys. This was
done for all variants ({32,64}{,i,le,ge}).
Another point, the equality test in the loop improves the performance
when looking up random keys (since we don't need to reach the leaf),
but is counter-productive for sequential keys, which can gain ~17%
without that test. However sequential keys are normally not used with
exact lookups, but rather with lookup_ge() that spans a time frame,
and which does not have that test for this precise reason, so in the
end both use cases are served optimally.
It's interesting to note that everything here is solely based on data
dependencies, and that trying to perform *less* operations upfront
always ends up with lower performance (typically the original one).
This is ebtree commit 05a0613e97f51b6665ad5ae2801199ad55991534.
Let's explicitly mention that fe_counters_shared_tg and
be_counters_shared_tg structs are embedded in shm_stats_file_object
struct so any change in those structs will result in shm stats file
incompatibility between processes, thus extra precaution must be
taken when making changes to them.
Note that the provisionning made in shm_stats_file_object struct could
be used to add members to {fe,be}_counters_shared_tg without changing
shm_stats_file_object struct size if needed in order to preserve
shm stats file version.
The variables trees use the immediate cebtree API, better use the
item one which is more expressive and safer. The "node" field was
renamed to "name_node" to avoid any ambiguity.
Previously the conn_hash_node was placed outside the connection due
to the big size of the eb64_node that could have negatively impacted
frontend connections. But having it outside also means that one
extra allocation is needed for each backend connection, and that one
memory indirection is needed for each lookup.
With the compact trees, the tree node is smaller (16 bytes vs 40) so
the overhead is much lower. By integrating it into the connection,
We're also eliminating one pointer from the connection to the hash
node and one pointer from the hash node to the connection (in addition
to the extra object bookkeeping). This results in saving at least 24
bytes per total backend connection, and only inflates connections by
16 bytes (from 240 to 256), which is a reasonable compromise.
Tests on a 64-core EPYC show a 2.4% increase in the request rate
(from 2.08 to 2.13 Mrps).
Idle connection trees currently require a 56-byte conn_hash_node per
connection, which can be reduced to 32 bytes by moving to ceb64. While
ceb64 is theoretically slower, in practice here we're essentially
dealing with trees that almost always contain a single key and many
duplicates. In this case, ceb64 insert and lookup functions become
faster than eb64 ones because all duplicates are a list accessed in
O(1) while it's a subtree for eb64. In tests it is impossible to tell
the difference between the two, so it's worth reducing the memory
usage.
This commit brings the following memory savings to conn_hash_node
(one per backend connection), and to srv_per_thread (one per thread
and per server):
struct before after delta
conn_hash_nodea 56 32 -24
srv_per_thread 96 72 -24
The delicate part is conn_delete_from_tree(), because we need to
know the tree root the connection is attached to. But thanks to
recent cleanups, it's now clear enough (i.e. idle/safe/avail vs
session are easy to distinguish).
We'll soon need to choose the server's root based on the connection's
flags, and for this we'll need the thread it's attached to, which is
not always the current one. This patch simply passes the thread number
from all callers. They know it because they just set the idle_conns
lock on it prior to calling the function.
The proxy struct has several small holes that deserved being plugged by
moving a few fields around. Now we're down to 3056 from 3072 previously,
and the remaining holes are small.
At the moment, compared to before this series, we're seeing these
sizes:
type\size 7d554ca62 current delta
listener 752 704 -48 (-6.4%)
server 4032 3840 -192 (-4.8%)
proxy 3184 3056 -128 (-4%)
stktable 3392 3328 -64 (-1.9%)
Configs with many servers have shrunk by about 4% in RAM and configs
with many proxies by about 3%.
The struct server still has a lot of holes and padding that make it
quite big. By moving a few fields aronud between areas which do not
interact (e.g. boot vs aligned areas), it's quite easy to plug some
of them and/or to arrange larger ones which could be reused later with
a bit more effort. Here we've reduced holes by 40 bytes, allowing the
struct to shrink by one more cache line (64 bytes). The new size is
3840 bytes.
The server ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct server for this node, plus 8 bytes in
struct proxy. The server struct is still 3904 bytes large (due to
alignment) and the proxy struct is 3072.
The listener ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct listener for this node, plus 8 bytes
in struct proxy. The struct listener is now 704 bytes large, and the
struct proxy 3080.
The proxy ID is currently stored as a 32-bit int using an eb32 tree.
It's used essentially to find holes in order to automatically assign IDs,
and to detect duplicates. Let's change this to use compact trees instead
in order to save 24 bytes in struct proxy for this node, plus 8 bytes in
the root (which is static so not much relevant here). Now the proxy is
3088 bytes large.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated server_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
This was previously achieved via the generic get_next_id() but we'll soon
get rid of generic ID trees so let's have a dedicated listener_get_next_id().
As a bonus it reduces the exposure of the tree's root outside of the functions.
This is used to index the proxy's name and it contains a copy of the
pointer to the proxy's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct proxy, which is now 3112
bytes large.
Here we need to continue to support duplicates since they're still
allowed between type-incompatible proxies.
Interestingly, the use of cebis_next_dup() instead of cebis_next() in
proxy_find_by_name() allows us to get rid of an strcmp() that was
performed for each use_backend rule. A test with a large config
(100k backends) shows that we can get 3% extra performance on a
config involving a static use_backend rule (3.09M to 3.18M rps),
and even 4.5% on a dynamic rule selecting a random backend (2.47M
to 2.59M).
This member is used to index the hostname_dn contents for DNS resolution.
Let's replace it with a cebis_tree to save another 32 bytes (24 for the
node + 8 by avoiding the duplication of the pointer). The struct server is
now at 3904 bytes.
This is used to index the server name and it contains a copy of the
pointer to the server's name in <id>. Changing that for a ceb_node placed
just before <id> saves 32 bytes to the struct server, which remains 3968
bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144.
It's worth noting that the current way duplicate names are handled remains
based on the previous mechanism where dups were permitted. Ideally we
should now reject them during insertion and use unique key trees instead.
This contains the text representation of the server's address, for use
with stick-tables with "srvkey addr". Switching them to a compact node
saves 24 more bytes from this structure. The key was moved to an external
pointer "addr_key" right after the node.
The server struct is now 3968 bytes (down from 4032) due to alignment, and
the proxy struct shrinks by 8 bytes to 3152.
The current guid struct size is 56 bytes. Once reduced using compact
trees, it goes down to 32 (almost half). We're not on a critical path
and size matters here, so better switch to this.
It's worth noting that the name part could also be stored in the
guid_node at the end to save 8 extra byte (no pointer needed anymore),
however the purpose of this struct is to be embedded into other ones,
which is not compatible with having a dynamic size.
Affected struct sizes in bytes:
Before After Diff
server 4032 4032 0*
proxy 3184 3160 -24
listener 752 728 -24
*: struct server is full of holes and padding (176 bytes) and is
64-byte aligned. Moving the guid_node elsewhere such as after sess_conn
reduces it to 3968, or one less cache line. There's no point in moving
anything now because forthcoming patches will arrange other parts.
cebs_tree are 24 bytes smaller than ebst_tree (16B vs 40B), and pattern
references are only used during map/acl updates, so their storage is
pure loss between updates (which most of the time never happen). By
switching their indexing to compact trees, we can save 16 to 24 bytes
per entry depending on alightment (here it's 24 per struct but 16
practical as malloc's alignment keeps 8 unused).
Tested on core i7-8650U running at 3.0 GHz, with a file containing
17.7M IP addresses (16.7M different):
$ time ./haproxy -c -f acl-ip.cfg
Save 280 MB RAM for 17.7M IP addresses, and slightly speeds up the
startup (5.8%, from 19.2s to 18.2s), a part of which possible being
attributed to having to write less memory. Note that this is on small
strings. On larger ones such as user-agents, ebtree doesn't reread
the whole key and might be more efficient.
Before:
RAM (VSZ/RSS): 4443912 3912444
real 0m19.211s
user 0m18.138s
sys 0m1.068s
Overhead Command Shared Object Symbol
44.79% haproxy haproxy [.] ebst_insert
25.07% haproxy haproxy [.] ebmb_insert_prefix
3.44% haproxy libc-2.33.so [.] __libc_calloc
2.71% haproxy libc-2.33.so [.] _int_malloc
2.33% haproxy haproxy [.] free_pattern_tree
1.78% haproxy libc-2.33.so [.] inet_pton4
1.62% haproxy libc-2.33.so [.] _IO_fgets
1.58% haproxy libc-2.33.so [.] _int_free
1.56% haproxy haproxy [.] pat_ref_push
1.35% haproxy libc-2.33.so [.] malloc_consolidate
1.16% haproxy libc-2.33.so [.] __strlen_avx2
0.79% haproxy haproxy [.] pat_idx_tree_ip
0.76% haproxy haproxy [.] pat_ref_read_from_file
0.60% haproxy libc-2.33.so [.] __strrchr_avx2
0.55% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.54% haproxy libc-2.33.so [.] __memchr_avx2
0.46% haproxy haproxy [.] pat_ref_append
After:
RAM (VSZ/RSS): 4166108 3634768
real 0m18.114s
user 0m17.113s
sys 0m0.996s
Overhead Command Shared Object Symbol
38.99% haproxy haproxy [.] cebs_insert
27.09% haproxy haproxy [.] ebmb_insert_prefix
3.63% haproxy libc-2.33.so [.] __libc_calloc
3.18% haproxy libc-2.33.so [.] _int_malloc
2.69% haproxy haproxy [.] free_pattern_tree
1.99% haproxy libc-2.33.so [.] inet_pton4
1.74% haproxy libc-2.33.so [.] _IO_fgets
1.73% haproxy libc-2.33.so [.] _int_free
1.57% haproxy haproxy [.] pat_ref_push
1.48% haproxy libc-2.33.so [.] malloc_consolidate
1.22% haproxy libc-2.33.so [.] __strlen_avx2
1.05% haproxy libc-2.33.so [.] __strcmp_avx2
0.80% haproxy haproxy [.] pat_idx_tree_ip
0.74% haproxy libc-2.33.so [.] __memchr_avx2
0.69% haproxy libc-2.33.so [.] __strrchr_avx2
0.69% haproxy libc-2.33.so [.] _IO_getline_info
0.62% haproxy haproxy [.] pat_ref_read_from_file
0.56% haproxy libc-2.33.so [.] unlink_chunk.constprop.0
0.56% haproxy libc-2.33.so [.] cfree@GLIBC_2.2.5
0.46% haproxy haproxy [.] pat_ref_append
If the addresses are totally disordered (via "shuf" on the input file),
we see both implementations reach exactly 68.0s (slower due to much
higher cache miss ratio).
On large strings such as user agents (1 million here), it's now slightly
slower (+9%):
Before:
real 0m2.475s
user 0m2.316s
sys 0m0.155s
After:
real 0m2.696s
user 0m2.544s
sys 0m0.147s
But such patterns are much less common than short ones, and the memory
savings do still count.
Note that while it could be tempting to get rid of the list that chains
all these pat_ref_elt together and only enumerate them by walking along
the tree to save 16 extra bytes per entry, that's not possible due to
the problem that insertion ordering is critical (think overlapping regex
such as /index.* and /index.html). Currently it's not possible to proceed
differently because patterns are first pre-loaded into the pat_ref via
pat_ref_read_from_file_smp() and later indexed by pattern_read_from_file(),
which has to only redo the second part anyway for maps/acls declared
multiple times.
The support for duplicates is necessary for various use cases related
to config names, so let's upgrade to the latest version which brings
this support. This updates the cebtree code to commit 808ed67 (tag
0.5.0). A few tiny adaptations were needed:
- replace a few ceb_node** with ceb_root** since pointers are now
tagged ;
- replace cebu*.h with ceb*.h since both are now merged in the same
include file. This way we can drop the unused cebu*.h files from
cebtree that are provided only for compatibility.
- rename immediate storage functions to cebXX_imm_XXX() as per the API
change in 0.5 that makes immediate explicit rather than implicit.
This only affects vars and tools.c:copy_file_name().
The tests continue to work.
If an ocsp response is set to be updated automatically and some
certificate or CA updates are performed on the CLI, if the CLI update
happens while the OCSP response is being updated and is then detached
from the udapte tree, it might be wrongly inserted into the update tree
in 'ssl_sock_load_ocsp', and then reinserted when the update finishes.
The update tree then gets corrupted and we could end up crashing when
accessing other nodes in the ocsp response update tree.
This patch must be backported up to 2.8.
This patch fixes GitHub #3100.