haproxy

mirror of https://github.com/haproxy/haproxy.git synced 2026-04-15 21:59:41 -04:00

Author	SHA1	Message	Date
Christopher Faulet	fc6e3e9081	MINOR: stick-tables: Rename stksess shards to use buckets The shard keyword is already used by the peers and on the server lines. And it is unrelated with the session keys distribution. So instead of talking about shard for the session key hashing, we now use the term "bucket".	2025-11-17 07:42:51 +01:00
Willy Tarreau	675c86c4aa	DEBUG: add BUG_ON_STRESS(): a BUG_ON() implemented only when DEBUG_STRESS > 0 The purpose of this new BUG_ON is beyond BUG_ON_HOT(). While BUG_ON_HOT() is meant to be light but placed on very hot code paths, BUG_ON_STRESS() might be heavy and only used under stress-testing, to try to detect early that something bad is starting to happen. This one is not even type-checked when not defined because we don't want to risk the compiler emitting the slightest piece of code there in production mode, so as to give enough freedom to the developers.	2025-11-14 16:42:53 +01:00
Willy Tarreau	3d441e78e5	DEBUG: extend DEBUG_STRESS to ease testing and turn on extra checks DEBUG_STRESS is currently used only to expose "stress-level". With this patch, we go a bit further, by automatically forcing DEBUG_STRICT and DEBUG_STRICT_ACTION to their highest values in order to enable all BUG_ON levels, and make all of them result in a crash. In addition, care is taken to always only have 0 or 1 in the macro, so that it can be tested using "#if DEBUG_STRESS > 0" as well as "if (DEBUG_STRESS) { }" everywhere. The goal will be to ease insertion of extra tests for builds dedicated to stress-testing that enable possibly expensive extra checks on certain code paths that cannot reasonably be compiled in for production code right now.	2025-11-14 16:38:04 +01:00
Amaury Denoyelle	d79295d89b	Revert "BUG/MEDIUM: connections: permit to permanently remove an idle conn" The target patch fixes a rare race condition which happen when a MUX IO handler is working on a connection already moved into the purge list. In this case, the handler will incorrectly moved back the connection into the idle list. To fix this, conn_delete_from_tree() was extended to remove flags along with the connection from the idle list. This was performed when the connection is moved into the purge list. However, it introduces another issue related to the idle server connection accounting. Thus it is necessary to revert it prior to the incoming newer fix. This patch must be backported to every version where the original commit is.	2025-11-14 16:06:34 +01:00
William Lallemand	3d15c07ed0	MINOR: cfgcond: add "awslc_api_atleast" and "awslc_api_before" AWS-LC features are not easily tested with just the openssl version constant. AWS-LC uses its own API versioning stored in the AWSLC_API_VERSION constant. This patch add the two awslc_api_atleast and awslc_api_before predicates that help to check the AWS-LC API.	2025-11-14 11:01:45 +01:00
Amaury Denoyelle	8415254cea	MINOR: check: clarify check-reuse-pool interaction with reuse policy check-reuse-pool can only perform as expected if reuse policy on the backend is set to aggressive or higher. Update the documentation to reflect this and implement a server diag warning.	2025-11-14 10:44:05 +01:00
William Lallemand	2bdf5a7937	BUG/MEDIUM: acme: move from mt_list to a rwlock + ebmbtree The current ACME scheduler suffers from problems due to the way the tasks are stored: - MT_LIST are not scalables when having a lot of ACME tasks and having to look for a specific one. - the acme_task pointer was stored in the ckch_store in order to not passing through the whole list. But a ckch_store can be updated and the pointer lost in the previous one. - when a task fails, the ptr in the ckch_store was not removed because we only work with a copy of the original ckch_store, it would need to lock the ckchs_tree and remove this pointer. This patch fixes the issues by removing the MT_LIST-based architecture, and replacing it by a simple ebmbtree + rwlock design. The pointer to the task is not stored anymore in the ckch_store, but instead it is stored in the acme_tasks tree. Finding a task is done by doing a lookup on this tree with a RDLOCK. Instead of checking if store->acme_task is not NULL, a lookup is also done. This allow to remove the stuck "acme_task" pointer in the store, which was preventing to restart an acme task when the previous failed for this specific certificate. Must be backported in 3.2.	2025-11-13 15:18:12 +01:00
Frederic Lecaille	d84463f9f6	MINOR: quic-be: validate the 0-RTT transport parameters During 0-RTT sessions, some server transport parameters are reused after having been save from previous sessions. These parameters must not be reduced when it resends them. The client must check this is the case when some early data are accepted by the server. This is what is implemented by this patch. Implement qc_early_tranport_params_validate() which checks the new server parameters are not reduced. Also implement qc_ssl_eary_data_accepted() which was not implemented for TLS stack without 0-RTT support (for instance wolfssl). That said this function was no more used. This is why the compilation against wolfssl could not fail.	2025-11-13 14:04:31 +01:00
Frederic Lecaille	6419b9f204	MEDIUM: quic-be: enable the use of 0-RTT This patch allows the use of 0-RTT feature on QUIC server lines with "allow-0rtt" option. In fact 0-RTT is really enabled only if ssl_sock_srv_try_reuse_sess() successfully manages to reuse the SSL session and the chosen application protocol from previous connections. Note that, at this time, 0-RTT works only with quictls and aws-lc as TLS stack. (0-RTT does not work at all (even for QUIC frontends) with libressl).	2025-11-13 14:04:31 +01:00
Frederic Lecaille	a4bbbc75db	MINOR: quic-be: Send post handshake frames from list of frames (0-RTT) This patch is required to make 0-RTT work. It modifies the prototype of quic_build_post_handshake_frames() to send post handshake frames from a list of frames in place of the application encryption level (used as <qc->ael> local variable). This patch does not modify at all the current QUIC stack behavior (even for QUIC frontends). It must be considered as a preparation for the code to come about 0-RTT support for QUIC backends.	2025-11-13 14:04:31 +01:00
Frederic Lecaille	6e14365a5b	MEDIUM: quic-be: modify ssl_sock_srv_try_reuse_sess() to reuse backend sessions (0-RTT) This function is called for both TCP and QUIC connections to reuse SSL sessions saved by ssl_sess_new_srv_cb() callback called upon new SSL session creation. In addition to this, a QUIC SSL session must reuse the ALPN and some specific QUIC transport parameters. This is what is added by this patch for QUIC 0-RTT sessions. Note that for now on, ssl_sock_srv_try_reuse_sess() may fail for QUIC connections if it did not managed to reuse the ALPN. The caller must be informed of such an issue. It must not enable 0-RTT for the current session in this case. This is impossible without ALPN which is required to start a mux. ssl_sock_srv_try_reuse_sess() is modified to always succeeds for TCP connections.	2025-11-13 14:04:31 +01:00
Frederic Lecaille	5309dfb56b	MINOR: quic-be: Save the backend 0-RTT parameters For both TCP and QUIC connections, this is ssl_sess_new_srv_cb() callback which is called when a new SSL session is created. Its role is to save the session to be reused for the next sessions. This patch modifies this callback to save the QUIC parameters to be reused for the next 0-RTT sessions (or during SSL session resumption). The already existing path_params->nego_alpn member is used to store the ALPN as this is done for TCP alongside path_params->tps new quic_early_transport_params struct used to save the QUIC transport parameters to be reused for 0-RTT sessions.	2025-11-13 14:04:31 +01:00
Frederic Lecaille	41e40eb431	MINOR: quic-be: helper quic_reuse_srv_params() function to reuse server params (0-RTT) Implement quic_reuse_srv_params() whose role is to reuse the ALPN negotiated during a first connection to a QUIC backend alongside its transport parameters.	2025-11-13 14:04:31 +01:00
Frederic Lecaille	33564ca54c	MINOR: quic-be: helper functions to save/restore transport params (0-RTT) Define quic_early_transport_params new struct for QUIC transport parameters in relation with 0-RTT. This parameters must be saved during a first session to be reused for 0-RTT next sessions. qc_early_transport_params_cpy() copies the 0-RTT transport parameters to be saved during a first connection to a backend. The copy is made from a quic_transport_params struct to a quic_ealy_transport_params struct. On the contrary, qc_early_transport_params_reuse() copies the transport parameters to be reused for a 0-RTT session from a previous one. The copy is made from a quic_early_transport_params strcut to a quic_transport_params struct. Also add QUIC_EV_EARLY_TRANSP_PARAMS trace event to dump such 0-RTT transport parameters from traces.	2025-11-13 14:04:31 +01:00
Frederic Lecaille	80070fe51c	MEDIUM: quic-be: Parse, store and reuse tokens provided by NEW_TOKEN Add a per thread ist struct to srv_per_thread struct to store the QUIC token to be reused for subsequent sessions. Parse at packet level (from qc_parse_ptk_frms()) these tokens and store them calling qc_try_store_new_token() newly implemented function. This is this new function which does its best (may fail) to update the tokens. Modify qc_do_build_pkt() to resend these tokens calling quic_enc_token() implemented by this patch.	2025-11-13 14:04:31 +01:00
Frederic Lecaille	8f23d4d287	MINOR: quic-be: Parse the NEW_TOKEN frame Rename ->data qf_new_token struct field to ->w_data to distinguish it from ->r_data new field used to parse the NEW_TOKEN frame. Indeed to build the NEW_TOKEN we need to write it to a static buffer into the frame struct. To parse it we only need to store the address of the token field into the RX buffer.	2025-11-13 14:04:31 +01:00
Amaury Denoyelle	5a8728d03a	MEDIUM/OPTIM: quic: alloc quic_conn after CID collision check Some checks failed Contrib / build (push) Has been cancelled Details alpine/musl / gcc (push) Has been cancelled Details VTest / Generate Build Matrix (push) Has been cancelled Details Windows / Windows, gcc, all features (push) Has been cancelled Details VTest / (push) Has been cancelled Details On Initial packet parsing, a new quic_conn instance is allocated via qc_new_conn(). Then a CID is allocated with its value derivated from client ODCID. On CID tree insert, a collision can occur if another thread was already parsing an Initial packet from the same client. In this case, the connection is released and the packet will be requeued to the other thread. Originally, CID collision check was performed prior to quic_conn allocation. This was changed by the commit below, as this could cause issue on quic_conn alloc failure. commit `4ae29be18c` BUG/MINOR: quic: Possible endless loop in quic_lstnr_dghdlr() However, this procedure is less optimal. Indeed, qc_new_conn() performs many steps, thus it could be better to skip it on Initial CID collision, which can happen frequently. This patch restores the older order of operations, with CID collision check prior to quic_conn allocation. To ensure this does not cause again the same bug, the CID is removed in case of quic_conn alloc failure. This should prevent any loop as it ensures that a CID found in the global tree does not point to a NULL quic_conn, unless if CID is attach to a foreign thread. When this thread will parse a re-enqueued packet, either the quic_conn is already allocated or the CID has been removed, triggering a fresh CID and quic_conn allocation procedure.	2025-11-10 12:10:14 +01:00
Amaury Denoyelle	2623e0a0b7	BUG/MEDIUM: quic: handle collision on CID generation CIDs are provided by haproxy so that the peer can use them as DCID of its packets. Their value is set via a random generator. It happens on several occasions during connection lifetime: * via ODCID derivation if haproxy is the server * on quic_conn init if haproxy is the client * during post-handshake if haproxy is the server * on RETIRE_CONNECTION_ID frame parsing CIDs are stored in a global tree. On ODCID derivation, a check is performed to ensure the CID is not a duplicate value. This is mandatory to properly handle multiple INITIAL packets from the same client on different thread. However, for the other cases, no check is performed for CID collision. As _quic_cid_insert() is silent, the issue is not detected at all. This results in a CID advertized to the peer but not stored in the global one. In the end, this may cause two issues. The first one is that packets from the client which use the new CID will be rejected by haproxy, most probably with a STATELESS_RESET. The second issue is that it can cause a crash during quic_conn release. Indeed, the CID is stored in the quic_conn local tree and thus eb_delete() for the global tree will be performed. As <leaf_p> member is uninit, this results in a segfault. Note that this issue is pretty rare. It can only be observed if running with a high number of concurrent connections in parallel, so that the random generator will provide duplicate values. Patch is still labelled as MEDIUM as this modifies code paths used frequently. To fix this, _quic_cid_insert() unsafe function is completely removed. Instead, quic_cid_insert() can be used, which reports an error code if a collision happens. CID are then stored in the quic_conn tree only after global tree insert success. Here is the solution for each steps if a collision occurs : * on init as client: the connection is completely released * post-handshake: the CID is immediately released. The connection is kept, but it will miss an extra CID. * on RETIRE_CONNECTION_ID parsing: a loop is implemented to retry random generation. It it fails several times, the connection is closed in error. A small convenience change is made to quic_cid_insert(). Output parameter <new_tid> can now be NULL, which is useful as most of the times caller do not care about it. This must be backported up to 2.6.	2025-11-10 12:10:14 +01:00
Amaury Denoyelle	419e5509d8	MINOR: quic: split CID alloc/generation function Split new_quic_cid() function into multiple ones. This patch should not introduce any visible change. The objective is to render CID allocation and generation more modular. The first advantage of this patch is to bring code simplication. In particular, conn CID sequence number increment and insertion into connection tree is simpler than before. Another improvment is also that errors could now be handled easier at each different steps of the CID init. This patch is a prerequisite for the fix on CID collision, thus it must be backported prior to it to every affected version.	2025-11-10 12:10:14 +01:00
Christopher Faulet	ecc2c3a35d	MEDIUM: peers: Remove commitupdate field on stick-tables This stick-table field was atomically updated with the last update id pushed and dumped on the CLI but never used otherwise. And all peer sessions share the same id because it is a stick-table info. So the info in peers dump is pretty limited. So, let's remove it.	2025-11-07 12:17:53 +01:00
Ben Kallus	d5ca3bb3b4	IMPORT: cebtree: Replace offset calculation with offsetof to avoid UB Some checks are pending Contrib / build (push) Waiting to run Details alpine/musl / gcc (push) Waiting to run Details VTest / Generate Build Matrix (push) Waiting to run Details VTest / (push) Blocked by required conditions Details Windows / Windows, gcc, all features (push) Waiting to run Details This is the same as the equivalent fix in ebtree: The C standard specifies that it's undefined behavior to dereference NULL (even if you use & right after). The hand-rolled offsetof idiom &(((s*)NULL)->f) is thus technically undefined. This clutters the output of UBSan and is simple to fix: just use the real offsetof when it's available. This is cebtree commit 2d08958858c2b8a1da880061aed941324e20e748.	2025-11-07 07:32:58 +01:00
Willy Tarreau	14087e48b9	MINOR: tools: add env_suggest() to suggest alternate variable names The purpose here is to look in the environment for a variable whose name looks like the provided one. This will be used to try to auto- correct misspelled environment variables that would silently be turned to an empty string.	2025-11-06 19:57:44 +01:00
Willy Tarreau	a4d78dd4f5	MINOR: tools: add support for ist to the word fingerprinting functions The word fingerprinting functions are used to compare similar words to suggest a correctly spelled one that looks like what the user proposed. Currently the functions only support const char*, but there's no reason for this, and it would be convenient to support substrings extracted from random pieces of configurations. Here we're adding new variants "_with_len" that take these ISTs and which are in fact a slight change of the original ones that the old ones now rely on.	2025-11-06 19:57:44 +01:00
Willy Tarreau	0144426dfb	BUG/MEDIUM: server: close a race around ready_srv when deleting a server When a server is being disabled or deleted, in case it matches the backend's ready_srv, this one is reset. However it's currently done in a non-atomic way when the server goes down, and that could occasionally reset the entry matching another server, but more importantly if in parallel some requests are dequeued for that server, it may re-appear there after having been removed, leading to a possible crash once it is fully removed, as shown in issue #3177. Let's make sure we reset the pointer when detaching the server from the proxy, and use a CAS in both cases to only reset this server. This fix needs to be backported to 3.2. There, srv_detach() is in server.c instead of server.h. Thanks to Basha Mougamadou for the detailed report and the useful backtraces.	2025-11-06 19:57:44 +01:00
Christopher Faulet	a1b5325a7a	MINOR: channel: Remove total field from channels The <total> field in the channel structure is now useless, so it can be removed. The <bytes_in> field from the SC is used instead. This patch is related to issue #1617.	2025-11-06 15:01:29 +01:00
Christopher Faulet	1effe0fc0a	MINOR: applet: Add function to get amount of data in the output buffer The helper function applet_output_data() returns the amount of data in the output buffer of an applet. For applets using the new API, it is based on data present in the outbuf buffer. For legacy applets, it is based on input data present in the input channel's buffer. The HTX version, applet_htx_output_data(), is also available This patch is related to issue #1617.	2025-11-06 15:01:29 +01:00
Christopher Faulet	4991a51208	MINOR: stats: Add stats about request and response bytes received and sent In previous patches, these counters were added per frontend, backend, server and listener. With this patch, these counters are reported on stats, including promex. Note that the stats file minor version was incremented by one because the shm_stats_file_object struct size has changed. This patch is related to issue #1617.	2025-11-06 15:01:29 +01:00
Christopher Faulet	0084baa6ba	MINOR: counters: Remove bytes_in and bytes_out counter from fe/be/srv/li bytes_in and bytes_out counters per frontend, backend, listener and server were removed and we now rely on, respectively on, req_in and res_in counters. This patch is related to issue #1617.	2025-11-06 15:01:29 +01:00
Christopher Faulet	567df50d91	MINOR: stream: Remove bytes_in and bytes_out counters from stream per-stream bytes_in and bytes_out counters was removed and replaced by req.in and res.in. Coorresponding samples still exists but replies on new counters. This patch is related to issue #1617.	2025-11-06 15:01:29 +01:00
Christopher Faulet	1c62a6f501	MINOR: counters: Add req_in/req_out/res_in/res_out counters for fe/be/srv/li Thanks to the previous patch, and based on info available on the stream, it is now possible to have counters for frontends, backends, servers and listeners to report number of bytes received and sent on both sides. This patch is related to issue #1617.	2025-11-06 15:01:29 +01:00
Christopher Faulet	ac9201f929	MINOR: stream: Add samples to get number of bytes received or sent on each side req.in and req.out samples can now be used to get the number of bytes received by a client and send to the server. And res.in and res.out samples can be used to get the number of bytes received by a server and send to the client. These info are stored in the logs structure inside a stream. This patch is related to issue #1617.	2025-11-06 15:01:28 +01:00
Christopher Faulet	629fbbce19	MINOR: stconn: Add counters to SC to know number of bytes received and sent <bytes_in> and <bytes_out> counters were added to SC to count, respectively, the number of bytes received from an endpoint or sent to an endpoint. These counters are updated for connections and applets. This patch is related to issue #1617.	2025-11-06 15:01:28 +01:00
Willy Tarreau	5fe4677231	MINOR: server: move the lock inside srv_add_idle() Almost all callers of _srv_add_idle() lock the list then call the function. It's not the most efficient and it requires some care from the caller to take care of that lock. Let's change this a little bit by having srv_add_idle() that takes the lock and calls _srv_add_idle() that is now inlined. This way callers don't have to handle the lock themselves anymore, and the lock is only taken around the sensitive parts, not the function call+return. Interestingly, perf tests show a small perf increase from 2.28-2.32M RPS to 2.32-2.37M RPS on a 128-thread system.	2025-11-06 13:16:24 +01:00
William Lallemand	546c67d137	MINOR: acme: generate a temporary key pair This patch provides two functions acme_gen_tmp_pkey() and acme_gen_tmp_x509(). These functions generates a unique keypair and X509 certificate that will be stored in tmp_x509 and tmp_pkey. If the key pair or certificate was already generated they will return the existing one. The key is an RSA2048 and the X509 is generated with a expiration in the past. The CN is "expired". These are just placeholders to be used if we don't have files.	2025-11-06 11:56:27 +01:00
William Lallemand	1df55b441b	MEDIUM: ssl/ckch: use ckch_store instead of ckch_data for ckch_conf_kws This is an API change, instead of passing a ckch_data alone, the ckch_conf_kws.func() is called with a ckch_store. This allows the callback to access the whole ckch_store, with the ckch_conf and the ckch_data. But it requires the ckch_conf to be actually put in the ckch_store before.	2025-11-06 11:56:27 +01:00
Amaury Denoyelle	b9809fe0d0	MINOR: quic: remove <mux_state> field Some checks failed Contrib / build (push) Has been cancelled Details alpine/musl / gcc (push) Has been cancelled Details VTest / Generate Build Matrix (push) Has been cancelled Details Windows / Windows, gcc, all features (push) Has been cancelled Details VTest / (push) Has been cancelled Details This patch removes <mux_state> field from quic_conn structure. The purpose of this field was to indicate if MUX layer above quic_conn is not yet initialized, active, or already released. It became tedious to properly set it as initialization order of the various quic_conn/conn/MUX layers now differ between the frontend and backend sides, and also depending if 0-RTT is used or not. Recently, a new change introduced in connect_server() will allow to initialize QUIC MUX earlier if ALPN is cached on the server structure. This had another level of complexity. Thus, this patch removes <mux_state> field completely. Instead, a new flag QUIC_FL_CONN_XPRT_CLOSED is defined. It is set at a single place only on close XPRT callback invokation. It can be mixed with the new utility functions qc_wait_for_conn()/qc_is_conn_ready() to determine the status of conn/MUX layers now without an extra quic_conn field.	2025-11-05 14:03:34 +01:00
Willy Tarreau	096999ee20	BUG/MEDIUM: connections: permit to permanently remove an idle conn There's currently a function conn_delete_from_tree() which is used to detach an idle connection from the tree it's currently attached to so that it is no longer found. This function is used in three circumstances: - when picking a new connection that no longer has any avail stream - when temporarily working on the connection from an I/O handler, in which case it's re-added at the end - when killing a connection The 2nd case above is quite specific, as it requires to preserve the CO_FL_LIST_MASK flags so that the connection can be re-inserted into the proper tree when leaving the handler. However, there's a catch. When killing a connection, we want to be certain it will not be reinserted into the tree. The flags preservation is causing a tiny race if an I/O happens while the connection is in the kill list, because in this case the I/O handler will note the connection flags, do its work, then reinsert the connection where it believed it was, then the connection gets purged, and another user can find it in the tree. The issue is very difficult to reproduce. On a 128-thread machine it happens in H2 around 500k req/s after around 50M requests. In H1 it happens after around 1 billion requests. The fix here consists in passing an extra argument to the function to indicate if the removal is permanent or not. When it's permanent, the function will clear the associated flags. The callers were adjusted so that all those dequeuing a connection in order to kill it do it permanently and all other ones do it only temporarily. A slightly different approach could have worked: the function could always remove all flags, and the callers would need to restore them. But this would require trickier modifications of the various call places, compared to only passing 0/1 to indicate the permanent status. This will need to be backported to all stable versions. The issue was at least reproduced since 3.1 (not tested before). The patch will need to be adjusted for 3.2 and older, because a 2nd argument "thr" was added in 3.3, so the patch will not apply to older versions as-is.	2025-11-05 11:08:25 +01:00
Olivier Houchard	7d4aa7b22b	BUG/MEDIUM: server: Add a rwlock to path parameter Add a rwlock to control the server's path_parameter, to make sure multiple threads don't set it at the same time, and it can't be seen in an inconsistent state. Also don't set the parameter every time, only set them if they have changed, to prevent needless writes. This does not need to be backported.	2025-11-04 18:47:34 +01:00
Amaury Denoyelle	efe60745b3	MINOR: quic: remove connection arg from qc_new_conn() Some checks are pending Contrib / build (push) Waiting to run Details alpine/musl / gcc (push) Waiting to run Details VTest / Generate Build Matrix (push) Waiting to run Details VTest / (push) Blocked by required conditions Details Windows / Windows, gcc, all features (push) Waiting to run Details This patch is similar to the previous one, this time dealing with qc_new_conn(). This function was asymetric on frontend and backend side, as connection argument was set only in the latter case. This was required prior due to qc_alloc_ssl_sock_ctx() signature. This has changed with the previous patch, thus qc_new_conn() can also be realigned on both FE and BE sides. <conn> member of quic_conn instance is always set outside it, in qc_xprt_start() on the backend case.	2025-11-04 17:47:42 +01:00
Amaury Denoyelle	5a17cade4f	MINOR: quic: do not set conn member if ssl_sock_ctx ssl_sock_ctx is a generic object used both on TCP/SSL and QUIC stacks. Most notably it contains a <conn> member which is a pointer to struct connection. On QUIC frontend side, this member is always set to NULL. Indeed, connection is only created after handshake completion. However, this has changed for backend side, where the connection is instantiated prior to its quic_conn counterpart. Thus, ssl_sock_ctx member would be set in this case as a convenience for use later in qc_ssl_do_hanshake(). However, this method was unsafe as the connection can be released, without resetting ssl_sock_ctx member. Thus, the previous patch fixes this by using on <conn> member through the quic_conn instance which is the proper way. Thus, this patch resets ssl_sock_ctx <conn> member to NULL. This is deemed the cleanest method as it ensures that both frontend and backend sides must not use it anymore.	2025-11-04 17:38:09 +01:00
Willy Tarreau	fd012b6c59	OPTIM: proxy: move atomically access fields out of the read-only ones Some checks are pending Contrib / build (push) Waiting to run Details alpine/musl / gcc (push) Waiting to run Details VTest / Generate Build Matrix (push) Waiting to run Details VTest / (push) Blocked by required conditions Details Windows / Windows, gcc, all features (push) Waiting to run Details Perf top showed that h1_snd_buf() was having great difficulties accessing the proxy's server_id_hdr_name field in the middle of the headers loop. Moving the assignment out of the loop to a local variable moved the problem there as well: \| if (!(h1m->flags & H1_MF_RESP) && isttest(h1c->px->server_id_hdr_n 0.10 \|20b0: mov -0x120(%rbp),%rdi 1.33 \| mov 0x60(%rdi),%r10 0.01 \| test %eax,%eax 0.18 \| jne 2118 12.87 \| mov 0x350(%r10),%rdi 0.01 \| test %rdi,%rdi 0.05 \| je 2118 \| mov 0x358(%r10),%r11 It turns out that there are several atomically accessed fields in its vicinity, causing the cache line to bounce all the time. Let's collect the few frequently changed fields and place them together at the end of the structure, and plug the 32-bit hole with another isolated field. Doing so also reduced a little bit the cost of decrementing be->be_conn in process_stream(), and overall the HTTP/1 performance increased by about 1% both on ARM and x86_64.	2025-11-03 13:54:49 +01:00
Amaury Denoyelle	6bfabfdc77	OPTIM: backend: skip conn reuse for incompatible proxies Some checks failed Contrib / build (push) Has been cancelled Details alpine/musl / gcc (push) Has been cancelled Details VTest / Generate Build Matrix (push) Has been cancelled Details Windows / Windows, gcc, all features (push) Has been cancelled Details VTest / (push) Has been cancelled Details When trying to reuse a backend connection, a connection hash is calculated to match an entry with similar parameters. Previously, this operation was skipped if the stream content wasn't based on HTTP, as it would have been incompatible with http-reuse. With the introduction of SPOP backends, this condition was removed, so that it can also benefit from connection reuse. However, this means that now hash calcul is always performed when connecting to a server, even for TCP or log backends. This is unnecessary as these proxies cannot perform connection reuse. Note also that reuse mode is resetted on postparsing for incompatible backends. This at least guarantees that no tree lookup will be performed via be_reuse_connection(). However, connection lookup is still performed in the session via session_get_conn() which is another unnecessary operation. Thus, this patch restores the condition so that reuse operations are now entirely skipped if a backend mode is incompatible. This is implemented via a new utility function named be_supports_conn_reuse(). This could be backported up to 3.1, as this commit could be considered as a performance regression for tcp/log backend modes.	2025-11-03 10:43:50 +01:00
Amaury Denoyelle	14a6468df5	MINOR: quic: reject conf with QUIC servers if not compiled Ensure that QUIC support is compiled into haproxy when a QUIC server is configured. This check is performed during _srv_parse_finalize() so that it is detected both on configuration parsing and when adding a dynamic server via the CLI. Note that this changes the behavior of srv_is_quic() utility function. Previously, it always returned false when QUIC support wasn't compiled. With this new check introduced, it is now guaranteed that a QUIC server won't exist if compilation support is not active. Hence srv_is_quic() does not rely anymore on USE_QUIC define.	2025-10-31 11:32:20 +01:00
Willy Tarreau	b0e8edaef2	MEDIUM: mux-h2: do not needlessly refrain from sending data early The mux currently refrains from sending data before H2_CS_FRAME_H, i.e. before the peer's SETTINGS frame was received. While it makes sense on the frontend, it's causing harm on the backend because it forces the first request to be sent in two halves over an extra RTT: first the preface and settings, second the request once the settings are received. This is totally contrary to the philosophy of the H2 protocol, consisting in permitting the client to send as soon as possible. Actually what happens is the following: - process_stream() calls connect_server() - connect_server() creates a connection, and if the proto/alpn is guessed or known, the mux is instantiated for the current request. - the H2 init code wakes the h2 tasklet up and returns - process_stream() tries to send the request using h2_snd_buf(), but that one sees that we're before H2_CS_FRAME_H, refrains from doing so and returns. - process_stream() subscribes and quits - the h2 tasklet can now execute to send the preface and settings, which leave as a first TCP segment. The connection is ready. - the iocb is woken again once the server's SETTINGS frame is received, turning the connection to the H2_CS_FRAME_H state, and the iocb wake up process_stream(). - process_stream() executes again and can try to send again. - h2_snd_buf() is called and finally sends the request as a second TCP segment. Not only this is inefficient, but it also renders 0-RTT and TFO impossible on H2 connections. When 0-RTT is used, only the preface and settings leave as early data (the very first data of that connection), which is totally pointless. In order to fix this, we have to go through a few steps: - first we need to let data be sent to a server immediately after the SETTINGS frame was sent (i.e. in H2_CS_SETTINGS1 state instead of H2_CS_FRAME_H). However, some protocol extensions are advertised by the server using SETTINGS (e.g. RFC8441) and some requests might need to know the existence of such extensions. For this reason we're adding a new h2c flag, H2_CF_SETTINGS_NEEDED, which indicates that some operations were not done because a server's SETTINGS frame is needed. This is set when trying to send a protocol upgrade or extended CONNECT during H2_CS_SETTINGS1, indicating that it's needed to wait for H2_CS_FRAME_H in this case. The flag is always set on frontend connections. This is what is being done in this patch. - second, we need to be able to push the preface opportunistically with the first h2_snd_buf() so that it's not needed to wake the tasklet up just to send that and wake process_stream() again. This will be in a separate patch. By doing the first step, we're at least saving one needless tasklet wakeup per connection (~9%), which results in ~5% backend connection rate increase.	2025-10-30 18:16:54 +01:00
William Lallemand	1e2f920be6	MINOR: listener: implement bind_conf_find_by_name() Returns a pointer to the first bind_conf matching <name> in a frontend <front>. When name is prefixed by a @ (@<filename>:<linenum>), it tries to look for the corresponding filename and line of the configuration file. NULL is returned if no match is found.	2025-10-30 10:37:42 +01:00
sftcd	23f5cbb411	MINOR: ssl/ech: add logging and sample fetches for ECH status and outer SNI This patch adds functions to expose Encrypted Client Hello (ECH) status and outer SNI information for logging and sample fetching. Two new helper functions are introduced in ech.c: - conn_get_ech_status() places the ECH processing status string into a buffer. - conn_get_ech_outer_sni() retrieves the outer SNI value if ECH succeeded. Two new sample fetch keywords are added: - "ssl_fc_ech_status" returns the ECH status string. - "ssl_fc_ech_outer_sni" returns the outer SNI value seen during ECH. These allow ECH information to be used in HAProxy logs, ACLs, and captures.	2025-10-30 10:37:30 +01:00
sftcd	dba4fd248a	MEDIUM: ssl/ech: config and load keys This patch introduces the USE_ECH option in the Makefile to enable support for Encrypted Client Hello (ECH) with OpenSSL. A new function, load_echkeys, is added to load ECH keys from a specified directory. The SSL context initialization process in ssl_sock.c is updated to load these keys if configured. A new configuration directive, `ech`, is introduced to allow users to specify the ECH key directory in the listener configuration.	2025-10-30 10:37:12 +01:00
Remi Tricot-Le Breton	dc35a3487b	MINOR: ssl: Do not dump decrypted privkeys in 'dump ssl cert' A private keys that is password protected and was decoded during init thanks to the password obtained thanks to 'ssl-passphrase-cmd' should not be dumped via 'dump ssl cert' CLI command.	2025-10-29 10:54:17 +01:00
Remi Tricot-Le Breton	478dd7bad0	MEDIUM: ssl: Add certificate password callback that calls external command When a certificate is protected by a password, we can provide the password via the dedicated pem_password_cb param provided to PEM_read_bio_PrivateKey. HAProxy will fetch the password automatically during init by calling a user-defined external command that should dump the right password on its standard output (see new 'ssl-passphrase-cmd' global option).	2025-10-29 10:54:17 +01:00
Remi Tricot-Le Breton	1ec59d3426	MINOR: init: Make devnullfd global and create it earlier in init The devnull fd might be needed during configuration parsing, if some options require to fork/exec for instance. So we now create it much earlier in the init process and without depending on the '-q' or '-d' parameters.	2025-10-29 10:54:17 +01:00
Willy Tarreau	2d7e3ddd4a	BUG/MEDIUM: cli: do not return ACKs one char at a time Some checks are pending Contrib / build (push) Waiting to run Details alpine/musl / gcc (push) Waiting to run Details VTest / Generate Build Matrix (push) Waiting to run Details VTest / (push) Blocked by required conditions Details Windows / Windows, gcc, all features (push) Waiting to run Details Since 3.0 where the CLI started to use rcv_buf, it appears that some external tools sending chained commands are randomly experiencing failures. Each time this happens when the whole command is sent as a single packet, immediately followed by a close. This is not a correct way to use the CLI but this has been working for ages for simple netcat-based scripts, so we should at least try to preserve this. The cause of the failure is that the first LF that acks a command is immediately sent back to the client and rejected due to the closed connection. This in turn forwards the error back to the applet which aborts its processing. Before 3.0 the responses would be queued into the buffer, then sent back to the channel, and would all fail at once. This changed when snd_buf/rcv_buf were implemented because the applets are much more responsive and since they yield between each command, they can deliver one ACK at a time that is immediately forwarded down the chain. An easy way to observe the problem is to send 5 map updates, a shutdown, and immediately close via tcploop, and in parallel run a periodic "show map" to count the number of elements: $ tcploop -U /tmp/sock1 C S:"add map #0 1 1; add map #0 2 2; add map #0 3 3; add map #0 4 4; add map #0 5 5\n" F K Before 3.0, there would always be 5 elements. Since 3.0 and before `20ec1de214` ("MAJOR: cli: Refacor parsing and execution of pipelined commands"), almost always 2. And since that commit above in 3.2, almost always one. Doing the same using socat or netcat shows almost always 5... It's entirely timing-dependent, and might even vary based on the RTT between the client and haproxy! The approach taken here consists in doing the same principle as MSG_MORE or Nagle but on the response buffer: the applet doesn't need to send a single ACK for each command when it has already been woken up and is scheduled to come back to work. It's fine (and even desirable) that ACKs are grouped in a single packet as much as possible. For this reason, this patch implements APPCTX_CLI_ST1_YIELD, a new CLI flag which indicates that the applet left in yielding condition, i.e. it has not finished its work. This flag is used by .rcv_buf to hold pending data. This way we won't return partial responses for no reason, and we can continue to emulate the previous behavior. One very nice benefit to this is that it saves huge amounts of CPU on the client. In the test below that tries to update 1M map entries, the CPU used by socat went from 100% to 0% and the total transfer time dropped by 28%: before: $ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \ printf "add map #0 %d %d\n",i,i,i }}' \| socat /tmp/sock1 - >/dev/null real 0m2.407s user 0m1.485s sys 0m1.682s after: $ time awk 'BEGIN{ printf "prompt i\n"; for (i=0;i<1000000;i++) { \ printf "add map #0 %d %d\n",i,i,i }}' \| socat /tmp/sock1 - >/dev/null real 0m1.721s user 0m0.952s sys 0m0.057s The difference is also quite visible on the number of syscalls during the test (for 1k updates): before: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 0.071691 0 100001 sendmsg after: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000011 1 9 sendmsg This patch will need to be backported to 3.0, and depends on these two patches to be backported as well: MINOR: applet: do not put SE_FL_WANT_ROOM on rcv_buf() if the channel is empty MINOR: cli: create cli_raw_rcv_buf() from the generic applet_raw_rcv_buf()	2025-10-27 16:57:07 +01:00
Olivier Houchard	837351245a	BUG/MEDIUM: mt_list: Use atomic operations to prevent compiler optims As a folow-up to `f40f5401b9`, explicitely use atomic operations to set the prev and next fields, to make sure the compiler can't assume anything about it, and just does it. This should be backported after `f40f5401b9` up to 2.8.	2025-10-24 13:34:41 +02:00
Willy Tarreau	2ec6df59bf	BUILD: openssl-compat: fix build failure with OPENSSL=0 and KTLS=1 The USE_KTLS test is currently being done outside of the USE_OPENSSL guard so disabling USE_OPENSSL still results in build failures on libcs built with support for kernels before 4.17, because we enable KTLS by default on linux. Let's move the KTLS block inside the USE_OPENSSL guard instead. No backport is needed since KTLS is only in 3.3.	2025-10-24 10:45:02 +02:00
Aurelien DARRAGON	d655ed5f14	BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency (2nd attempt) Some checks are pending Contrib / build (push) Waiting to run Details alpine/musl / gcc (push) Waiting to run Details VTest / Generate Build Matrix (push) Waiting to run Details VTest / (push) Blocked by required conditions Details Windows / Windows, gcc, all features (push) Waiting to run Details This is a second attempt at fixing issues on 32bits systems which would trigger the following BUG_ON() statement: FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825 shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this This is a drop-in replacement for `d30b88a6c` + `4693ee0ff`, as suggested by Willy. Indeed, on supported platforms unsigned int can be assumed to be 4 bytes long, and long can be assumed to be 8 bytes long. As such, the previous attempt was overkill and added unecessary maintenance complexity which could result in bugs if not used properly. Moreover, it would only partially solve the issue, since on little endian vs big endian architectures, the provisioned memory areas (originating from the same shm stats file) could be read differently by the host. Instead we fix the aligments issues, and this alone helps to ensure struct memory consistency on 64 vs 32bits platforms. It was tested on both i386 and i586. last_change and last_sess counters are now stored as unsigned int, as it helped to fix the alignment issues and they were found to be used as 32bits integers anyway. Thanks to Willy for problem analysis and the patch proposal. No backport needed.	2025-10-24 09:35:38 +02:00
Aurelien DARRAGON	a931779dde	Revert "MINOR: compiler: add FIXED_SIZE(size, type, name) macro" This reverts commit `466a603b59`. Due to the last 2 commits, this macro is now unused, and will probably never be used, so let's get rid of that for now.	2025-10-24 09:35:34 +02:00
Aurelien DARRAGON	8277f891d2	Revert "MEDIUM: freq-ctr: use explicit-size types for freq-ctr struct" This reverts commit `4693ee0ff7`. As discussed in GH #3168, this works but it is not the proper way to fix the issue. See following commits.	2025-10-24 09:35:29 +02:00
Aurelien DARRAGON	c0d952ccc1	Revert "BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency" This reverts commit `d30b88a6cc`. As discussed in GH #3168, this works but it is not the proper way to fix the issue. See following commits.	2025-10-24 09:35:25 +02:00
Amaury Denoyelle	7ba4b0ad5f	BUG/MINOR: quic: rename and duplicate stream settings Some checks failed Contrib / build (push) Has been cancelled Details alpine/musl / gcc (push) Has been cancelled Details VTest / Generate Build Matrix (push) Has been cancelled Details Windows / Windows, gcc, all features (push) Has been cancelled Details VTest / (push) Has been cancelled Details Several settings can be set to control stream multiplexing and associated receive window. Previously, all of these settings were configured using prefix "tune.quic.frontend.", despite being applied blindly on both sides. Fix this by duplicating these settings specific to frontend and backend side. Options are also renamed to use the standardize prefix "tune.quic.[be\|fe].stream." notation. Also, each option is individually renamed to better reflect its purpose and hide technical details relative to QUIC transport parameter naming : * max-data-size -> stream.rxbuf * max-streams-bidi -> stream.max-concurrent * stream-data-ratio -> stream.data-ratio No need to backport.	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	d5142706f8	BUG/MINOR: quic: split option for congestion max window size	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	33afba0dda	BUG/MINOR: quic: split max-idle-timeout option for FE/BE usage Streamline max-idle-timeout option. Rename it to use the newer cohesive naming scheme 'tune.quic.fe\|be.'. Two different fields were already defined in global struct. These fields are moved into quic_tune along with other QUIC settings. However, no parser was defined for backend option, this commit fixes this. No need to backport this.	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	5bc659a4a2	MINOR: quic: rename frontend sock-per-conn setting On frontend side, a quic_conn can have a dedicated FD or use the listener one. These different modes can be activated via a global QUIC tune setting. This patch adjusts the option. First, it is renamed to the more meaningful name 'tune.quic.fe.sock-per-conn'. Also, arguments are now either 'default-on' or 'force-off'. The objective is to better highlight reliationship with 'quic-socket' bind option. The older option is deprecated and will be removed in 3.5.	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	a14c6cee17	MINOR: quic: rename retry-threshold setting A QUIC global tune setting is defined to be able to force Retry emission prior to handshake. By definition, this ability is only supported by QUIC servers, hence it is a frontend option only. Rename the option to use "fe" prefix. The old option name is deprecated and will be removed in 3.5	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	d248c5bd21	MINOR: quic: rename max Tx mem setting QUIC global memory can be limited across the entire process via a global tune setting. Previously, this setting used to misleading "frontend" prefix. As this is applied as a sum between all QUIC connections, both from frontend and backend sides, remove the prefix. The new option name is "tune.quic.mem.tx-max". The older option name is deprecated and will be removed in 3.5.	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	9bfe9b9e21	MINOR: quic: split Tx options for FE/BE usage This patch is similar to the previous one, except that it is focused on Tx QUIC settings. It is now possible to toggle GSO and pacing on frontend and backend sides independently. As with previous patch, option are renamed to use "fe/be" unified prefixes. This is part of the current serie of commits which unify QUI settings. Older options are deprecated and will be removed on 3.5 release.	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	33a8cb87a9	MINOR: quic: split congestion controler options for FE/BE usage Various settings can be configured related to QUIC congestion controler. This patch duplicates them to be able to set independent values on frontend and backend sides. As with previous patch, option are renamed to use "fe/be" unified prefixes. This is part of the current serie of commits which unify QUIC settings. Older options are deprecated and will be removed on 3.5 release.	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	7640e9a9ee	MINOR: quic: duplicate glitches FE option on BE side Previously, QUIC glitches support was only implemented for frontend side. Extend this so that the option can be specified separately both on frontend and backend sides. Function _qcc_report_glitch() now retrieves the relevant max value based on connection side. In addition to this, option has been renamed to use "fe/be" prefixes. This is part of the current serie of commits which unify QUIC settings. Older options are deprecated and will be removed on 3.5 release.	2025-10-23 16:49:20 +02:00
Amaury Denoyelle	b34cd0b506	MINOR: quic: rename "no-quic" to "tune.quic.listen" Rename the option to quickly enable/disable every QUIC listeners. It now takes an argument on/off. The documentation is extended to reflect the fact that QUIC backend are not impacted by this option. The older keyword is simply removed. Deprecation is considered unnecessary as this setting is only useful during debugging.	2025-10-23 16:47:58 +02:00
Amaury Denoyelle	42e5ec6519	MINOR: quic: prepare support for options on FE/BE side A major reorganization of QUIC settings is going to be performed. One of its objective is to clearly define options which can be separately configured on frontend and backend proxy sides. To implement this, quic_tune structure is extended to support fe and be options. A set of macros/functions is also defined : it allows to retrieve an option defined on both sides with unified code, based on proxy side of a quic_conn/connection instance.	2025-10-23 15:06:01 +02:00
Olivier Houchard	f40f5401b9	BUG/MEDIUM: mt_lists: Avoid el->prev = el->next = el Avoid setting both el->prev and el->next on the same line. The goal is to set both el->prev and el->next to el, but a naive compiler, such as when we're using -O0, will set el->next first, then will set el->prev to the value of el->next, but if we're unlucky, el->next will have been set to something else by another thread. So explicitely set both to what we want. This should be backported up to 2.8.	2025-10-23 14:43:51 +02:00
Aurelien DARRAGON	d30b88a6cc	BUG/MAJOR: stats-file: ensure shm_stats_file_object struct mapping consistency Some checks are pending Contrib / build (push) Waiting to run Details alpine/musl / gcc (push) Waiting to run Details VTest / Generate Build Matrix (push) Waiting to run Details VTest / (push) Blocked by required conditions Details Windows / Windows, gcc, all features (push) Waiting to run Details As reported by @tianon on GH #3168, running haproxy on 32bits i386 platform would trigger the following BUG_ON() statement: FATAL: bug condition "sizeof(struct shm_stats_file_object) != 544" matched at src/stats-file.c:825 shm_stats_file_object struct size changed, is is part of the exported API: ensure all precautions were taken (ie: shm_stats_file version change) before adjusting this In fact, some efforts were already taken to ensure shm_stats_file_object struct size remains consistent on 64 vs 32 bits platforms, since shm_stats_file_object is part of the public API and directly exposed in the stats file. However, some parts were overlooked: some structs that are embedded in shm_stats_file_object struct itself weren't using fixed-width integers, and would sometime be unaligned. The result of this is that it was up to the compiler (platform-dependent) to choose how to deal with such ambiguities, which could cause the struct mapping/size to be inconsistent from one platform to another. Hopefully this was caught by the BUG_ON() statement and with the precious help of @tianon To fix this, we now use fixed-width integers everywhere for members (and submembers) of shm_stats_file_object struct, and we use explicit padding where missing to avoid automatic padding when we don't expect one. As for the previous commit, we leverage FIXED_SIZE() and FIXED_SIZE_ARRAY() macro to set the expected width for each integer without causing build issues on platform that don't support larger integers. No backport needed, this feature was introduced during 3.3-dev.	2025-10-22 20:52:22 +02:00
Aurelien DARRAGON	4693ee0ff7	MEDIUM: freq-ctr: use explicit-size types for freq-ctr struct freq-ctr struct is used by the shm_stats_file API, and more precisely, it is used in the shm_stats_file_object struct for counters. shm_stats_file_object struct requires to be plateform-independent, thus we switch to using explicit size types (AKA fixed width integer types) for freq-ctr, in the attempt to make freq-ctr size and memory mapping consistent from one platform to another. We cannot simply use fixed-width integer because some of them are involved in atomic operations, and forcing a given width could cause build issues on some platforms where atomic ops are not implemented for large integers. Instead we leverage the FIXED_SIZE macro to keep handling the integers as before, but forcing them to be stored using expected number of bytes (unused bytes will simply be ignored). No change of behavior should be expected.	2025-10-22 20:52:18 +02:00
Aurelien DARRAGON	466a603b59	MINOR: compiler: add FIXED_SIZE(size, type, name) macro FIXED_SIZE() macro can be used to instruct the compiler that the struct member named <name>, handled as <type>, must be stored using <size> bytes and that even if the type used is actualler smaller than the expected size FIXED_SIZE_ARRAY(), similar to FIXED_SIZE() but for arrays: it takes an extra argument which is the number of members. They may be used for portability concerns to ensure a structure mapping remains consistent between platforms.	2025-10-22 20:52:12 +02:00
Amaury Denoyelle	f50425c021	MINOR: quic: remove received CRYPTO temporary tree storage Some checks failed Contrib / build (push) Has been cancelled Details alpine/musl / gcc (push) Has been cancelled Details VTest / Generate Build Matrix (push) Has been cancelled Details Windows / Windows, gcc, all features (push) Has been cancelled Details VTest / (push) Has been cancelled Details The previous commit switch from ncbuf to ncbmbuf as storage for received CRYPTO frames. The latter ensures that buffering of such frames cannot fail anymore due to gaps size. Previously, extra mechanism were implemented on QUIC frames parsing function to overcome the limitation of ncbuf on gaps size. Before insertion, CRYPTO frames were stored in a temporary tree to order their insertion. As this is not necessary anymore, this commit removes the temporary tree insertion. This commit is closely associated to the previous bug fix. As it provides a neat optimization and code simplication, it can be backported with it, but not in the next immediate release to spot potential regression.	2025-10-22 15:24:02 +02:00
Amaury Denoyelle	4c11206395	BUG/MAJOR: quic: use ncbmbuf for CRYPTO handling In QUIC, TLS handshake messages such as ClientHello are encapsulated in CRYPTO frames. Each QUIC implementation can split the content in several frames of random sizes. In fact, this feature is now used by several clients, based on chrome so-called "Chaos protection" mechanism : https://quiche.googlesource.com/quiche/+/cb6b51054274cb2c939264faf34a1776e0a5bab7 To support this, haproxy uses a ncbuf storage to store received CRYPTO frames before passing it to the SSL library. However, this storage suffers from a limitation as gaps between two filled blocks cannot be smaller than 8 bytes. Thus, depending on the size of received CRYPTO frames and their order, ncbuf may not be sufficient. Over time, several mechanisms were implemented in haproxy QUIC frames parsing to overcome the ncbuf limitation. However, reports recently highlight that with some clients haproxy is not able to deal with CRYPTO frames reception. In particular, this is the case with the latest ngtcp2 release, which implements a similar chaos protection mechanism via the following patch. It also seems that this impacts haproxy interaction with firefox. commit 89c29fd8611d5e6d2f6b1f475c5e3494c376028c Author: Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com> Date: Mon Aug 4 22:48:06 2025 +0900 Crumble Client Initial CRYPTO (aka chaos protection) To fix haproxy CRYPTO frames buffering once and for all, an alternative non-contiguous buffer named ncbmbuf has been recently implemented. This type does not suffer from gaps size limitation, albeit at the cost of a small reduction in the size available for data storage. Thus, the purpose of this current patch is to replace ncbuf with the newer ncbmbuf for QUIC CRYPTO frames parsing. Now, ncbmb_add() is used to buffer received frames which is guaranteed to suceed. The only remaining case of error is if a received frame offset and length exceed the ncbmbuf data storage, which would result in a CRYPTO_BUFFER_EXCEEDED error code. A notable behavior change when switching to ncbmbuf implementation is that NCB_ADD_COMPARE mode cannot be used anymore during add. Instead, crypto frame content received at a similar offset will be overwritten. A final note regarding STREAM frames parsing. For now, it is considered unnecessary to switch from ncbuf in this case. Indeed, QUIC clients does not perform aggressive fragmentation for them. Keeping ncbuf ensure that the data storage size is bigger than the equivalent ncbmbuf area. This should fix github issue #3141. This patch must be backported up to 2.6. It is first necessary to pick the relevant commits for ncbmbuf implementation prior to it.	2025-10-22 15:04:41 +02:00
Amaury Denoyelle	8b8ab2824e	MINOR: ncbmbuf: implement advance operation Implement ncbmb_advance() function for the ncbmbuf type. This allows to remove bytes in front of the buffer, regardless of the existing gaps. This is implemented by resetting the corresponding bits of the bitmap. As the previous patch, this commit must be backported prior to the fix to come on QUIC CRYPTO frames parsing.	2025-10-22 15:04:06 +02:00
Amaury Denoyelle	42c495f3d7	MINOR: ncbmbuf: implement ncbmb_data() Implement ncbmb_data() function for the ncbmbuf type. Its purpose is similar to its ncbuf counterpart : it returns the size in bytes of data starting at a specific offset until the next gap. As the previous patch, this commit must be backported prior to the fix to come on QUIC CRYPTO frames parsing.	2025-10-22 15:04:06 +02:00
Amaury Denoyelle	1e1a3aa6aa	MINOR: ncbmbuf: implement add This patch implements add operation for ncbmbuf type. This function is simpler than its ncbuf counterpart. Indeed, for now only NCB_ADD_OVERWRT mode is supported. This compromise has been chosen as ncbmbuf will be first used for QUIC CRYPTO frames handling, which does not mandate to compare existing filled blocks during insertion. As the previous patch, this commit must be backported prior to the fix to come on QUIC CRYPTO frames parsing.	2025-10-22 15:04:06 +02:00
Amaury Denoyelle	b9f91ad3ff	MINOR: ncbmbuf: define new ncbmbuf type Define ncbmbuf which is an alternative non-contiguous buffer implementation. "bm" abbreviation stands for bitmap, which reflects how gaps and filled blocks are encoded. The main purpose of this implementation is to get rid of the ncbuf limitation regarding the minimal size for gaps between two blocks of data. This commit adds the new module ncbmbuf. Along with it, some utility functions such as ncbmb_make(), ncbmb_init() and ncbmb_is_empty() are defined. Public API of ncbmbuf will be extended in the following patches. This patch is not considered a bug fix. However, it will be required to fix issue encountered on QUIC CRYPTO frames parsing. Thus, it will be necessary to backport the current patch prior to the fix to come.	2025-10-22 15:04:06 +02:00
Amaury Denoyelle	59f0bafef2	MINOR: ncbuf: extract common types ncbuf is a module which provide a non-contiguous buffer type implementation. This patch extracts some basic types related to it into a new file ncbuf_common.h. This patch will be useful to provide a new non-contiguous buffer alternative implementation based on a bitmap. This patch is not a bug fix. However, it is necessary for ncbmbuf implementation which will be required to fix a QUIC issue on CRYPTO frames parsing. This, it will be necessary to backport the current patch prior to the fix to come.	2025-10-22 11:11:20 +02:00
Olivier Houchard	d5562e31bd	MEDIUM: stick-tables: Remove the table lock Remove the table lock, it was only protecting the per-table expiration date, and that task is gone.	2025-10-20 15:04:47 +02:00
Olivier Houchard	8bc8a21b25	MEDIUM: stick-tables: Use a per-shard expiration task Instead of having per-table expiration tasks, just use one per shard. The task will now go through all the tables to expire entries. When a table gets an expiration earlier than the one previously known, it will be put in a mt-list, and the task will be responsible to put it into an eb32, ordered based on the next expiration. Each per-shard task will run on a different thread, so it should lead to a better load distribution than the per-table tasks.	2025-10-20 15:04:47 +02:00
Olivier Houchard	945aa0ea82	MINOR: initcalls: Add a new initcall stage, STG_INIT_2 Add a new initcall stage, STG_INIT_2, for stuff to be called after step_init_2() is called, so after we know for sure that global.nbthread will be set. Modify stick-tables stkt_late_init() to run at STG_INIT_2 instead of STG_INIT, in anticipation for it to be enhanced and have a need for global.nbthread.	2025-10-20 15:04:41 +02:00
Olivier Houchard	7a33b90b3c	BUG/MEDIUM: mt_list: Make sure not to unlock the element twice Some checks are pending Contrib / build (push) Waiting to run Details alpine/musl / gcc (push) Waiting to run Details VTest / Generate Build Matrix (push) Waiting to run Details VTest / (push) Blocked by required conditions Details Windows / Windows, gcc, all features (push) Waiting to run Details In mt_list_delete(), if the element was not in a list, then n and p will point to it, and so setting n->prev and n->next will be enough to unlock it. Don't do it twice, as once it's been done the first time, another thread may be working with it, and may have added it to a list already, and doing it a second time can lead to list inconsistencies. This should be backported up to 2.8.	2025-10-19 23:21:42 +02:00
Frederic Lecaille	51eca5cbce	BUG/MINOR: quic: SSL counters not handled Some checks are pending Contrib / build (push) Waiting to run Details alpine/musl / gcc (push) Waiting to run Details VTest / Generate Build Matrix (push) Waiting to run Details VTest / (push) Blocked by required conditions Details Windows / Windows, gcc, all features (push) Waiting to run Details The SSL counters were not handled at all for QUIC connections. This patch implement ssl_sock_update_counters() extracting the code from ssl_sock.c and call this function where applicable both in TLS/TCP and QUIC parts. Must be backported as far as 2.8.	2025-10-17 12:13:43 +02:00
Willy Tarreau	17930edecc	MEDIUM: pools: detect() when munmap() fails in UAF mode Some checks failed Contrib / build (push) Has been cancelled Details alpine/musl / gcc (push) Has been cancelled Details VTest / Generate Build Matrix (push) Has been cancelled Details Windows / Windows, gcc, all features (push) Has been cancelled Details VTest / (push) Has been cancelled Details Better check that munmap() always works, otherwise it means we might have miscalculated an address, and if it fails silently, it will eat all the memory extremely quickly. Let's add a BUG_ON() on munmap's return.	2025-10-13 19:22:31 +02:00
Willy Tarreau	0e6a233217	BUG/MEDIUM: pools: fix bad freeing of aligned pools in UAF mode As reported by Christopher, in UAF mode memory release of aligned objects as introduced in commit `ef915e672a` ("MEDIUM: pools: respect pool alignment in allocations") does not work. The padding calculation in the freeing code is no longer correct since it now depends on the alignment, so munmap() fails on EINVAL. Fortunately we don't care much about it since we know it's the low bits of the passed address, which is much simpler to compute, since all mmaps are page-aligned. There's no need to backport this, as this was introduced in 3.3.	2025-10-13 19:19:39 +02:00
Willy Tarreau	fda6dc9597	MINOR: regex: use a thread-local match pointer for pcre2 The pcre2 matching requires an array of matches for grouping, that is allocated when executing the rule by pre-processing it, and that is immediately freed after use. This is quite inefficient and results in annoying patterns in "show profiling" that attribute the allocations to libpcre2 and the releases to haproxy. A good suggestion from Dragan is to pre-allocate these per thread, since the entry is not specific to a regex. In addition we're already limited to MAX_MATCH matches so we don't even have the problem of having to grow it while parsing nor processing. The current patch adds a per-thread pair of init/deinit functions to allocate a thread-local entry for that, and gets rid of the dynamic allocations. It will result in cleaner memory management patterns and slightly higher performance (+2.5%) when using pcre2.	2025-10-13 16:56:43 +02:00
Remi Tricot-Le Breton	bf5b912a62	MINOR: jwt: Add specific error code for known but unavailable certificate A certificate that does not have the 'jwt' flag enabled cannot be used for JWT validation. We now raise a specific return value so that such a case can be identified.	2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton	18ff130e9d	MINOR: jwt: Add new "jwt" certificate option This option can be used to enable the use of a given certificate for JWT verification. It defaults to 'off' so certificates that are declared in a crt-store and will be used for JWT verification must have a "jwt on" option in the configuration.	2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton	53957c50c3	MINOR: jwt: Do not look into ckch_store for jwt_verify converter We must not try to load full-on certificates for 'jwt_verify' converter anymore. 'jwt_verify_cert' is the only one that accepts a certificate.	2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton	f5632fd481	MINOR: jwt: Add new jwt_verify_cert converter This converter will be in charge of performing the same operation as the 'jwt_verify' one except that it takes a full-on pem certificate path instead of a public key path as parameter. The certificate path can be either provided directly as a string or via a variable. This allows to use certificates that are not known during init to perform token validation.	2025-10-13 10:38:52 +02:00
Remi Tricot-Le Breton	c3c0597a34	MEDIUM: jwt: Remove certificate support in jwt_verify converter The jwt_verify converter will not take full-on certificates anymore in favor of a new soon to come jwt_verify_cert. We might end up with a new jwt_verify_hmac in the future as well which would allow to deprecate the jwt_verify converter and remove the need for a specific internal tree for public keys. The logic to always look into the internal jwt tree by default and resolve to locking the ckch tree as little as possible will also be removed. This allows to get rid of the duplicated reference to EVP_PKEYs, the one in the jwt tree entry and the one in the ckch_store.	2025-10-13 10:38:52 +02:00
Christopher Faulet	4145a61101	BUG/MEDIUM: stconn: Properly forward kip to the opposite SE descriptor By refactoring the HTX to remove the extra field, a bug was introduced in the stream-connector part. The <kip> (known input payload) value of a sedesc was moved to <kop> (knwon output payload) using the same sedesc. Of course, this is totally wrong. <kip> value of a sedesc must be forwarded to the opposite side. In addition, the operation is performed in sc_conn_send(). In this function, we manipulate the stream-connectors. So se_fwd_kip() function was changed to use the stream-connectors directely. Now, the function sc_ep_fwd_kip() is now called with the both stream-connectors to properly forward <kip> from on side to the opposite side. The bug is 3.3-specific. No backport needed.	2025-10-10 11:01:21 +02:00
Christopher Faulet	914538cd39	MEDIUM: htx: Remove the HTX extra field Thanks for previous changes, it is now possible to remove the <extra> field from the HTX structure. HTX_FL_ALTERED_PAYLOAD flag is also removed because it is now unsued.	2025-10-08 11:10:42 +02:00
Christopher Faulet	c0b6db2830	MINOR: stconn: Add two fields in sedesc to replace the HTX extra value For now, the HTX extra value is used to specify the known part, in bytes, of the HTTP payload we will receive. It may concerne the full payload if a content-length is specified or the current chunk for a chunk-encoded message. The main purpose of this value is to be used on the opposite side to be able to announce chunks bigger than a buffer. It can also be used to check the validity of the payload on the sending path, to properly detect too big or too short payload. However, setting this information in the HTX message itself is not really appropriate because the information is lost when the HTX message is consumed and the underlying buffer released. So the producer must take care to always add it in all HTX messages. it is especially an issue when the payload is altered by a filter. So to fix this design issue, the information will be moved in the sedesc. It is a persistent area to save the information. In addition, to avoid the ambiguity between what the producer say and what the consumer see, the information will be splitted in two fields. In this patch, the fields are added: * kip : The known input payload length * kop : The known output payload lenght The producer will be responsible to set <kip> value. The stream will be responsible to decrement <kip> and increment <kop> accordingly. And the consumer will be responsible to remove consumed bytes from <kop>.	2025-10-08 11:01:36 +02:00
Willy Tarreau	75103e7701	MINOR: proxy: introduce proxy_abrt_close_def() to pass the desired default With this function we can now pass the desired default value for the abortonclose option when neither the option nor its opposite were set. Let's also take this opportunity for using it directly from the HTTP analyser since there's no point in re-checking the proxy's mode there.	2025-10-08 10:29:41 +02:00
Willy Tarreau	644b3dc7d8	MAJOR: proxy: enable abortonclose by default on HTTP proxies As discussed on https://github.com/orgs/haproxy/discussions/3146 and on the mailing list, there's a marked preference for having abortonclose enabled by default when relevant. The point being that with todays' internet, the large majority of requests sent with a closed input channel are aborted requests, and that it's pointless to waste resources processing them. This patch now considers both "option abortonclose" and its opposite "no option abortonclose" to figure whether abortonclose is enabled or disabled in a backend. When neither are set (thus not even inherited from a defaults section), then it considers the proxy's mode, and HTTP mode implies abortonclose by default. This may make some legacy services fail starting with 3.3. In this case it will be sufficient to add "no option abortonclose" in either the affected backend or the defaults section it derives from. But for internet-facing proxies it's better to stay with the option enabled.	2025-10-08 10:29:41 +02:00
Willy Tarreau	fe47e8dfc5	MINOR: proxy: only check abortonclose through a dedicated function In order to prepare for changing the way abortonclose works, let's replace the direct flag check with a similarly named function (proxy_abrt_close) which returns the on/off status of the directive for the proxy. For now it simply reflects the flag's state.	2025-10-08 10:29:41 +02:00
William Lallemand	69bd253b23	CLEANUP: mjson: remove unused defines from mjson.h This patch removes unused defines from mjson.h. It also removes unused c++ declarations and includes. string.h is moved to mjson.c	2025-10-06 09:30:07 +02:00
William Lallemand	8ea8aaace2	CLEANUP: mjson: remove MJSON_ENABLE_BASE64 code Remove the code used under #if MJSON_ENABLE_BASE64, which is not used within haproxy, to ease the maintenance of mjson.	2025-10-03 16:09:13 +02:00
William Lallemand	4edb05eb12	CLEANUP: mjson: remove MJSON_ENABLE_NEXT code Remove the code used under #if MJSON_ENABLE_NEXT, which is not used within haproxy, to ease the maintenance of mjson.	2025-10-03 16:08:17 +02:00
William Lallemand	a4eeeeeb07	CLEANUP: mjson: remove MJSON_ENABLE_PRINT code Remove the code used under #if MJSON_ENABLE_PRINT, which is not used within haproxy, to ease the maintenance of mjson.	2025-10-03 16:07:59 +02:00
William Lallemand	d63dfa34a2	CLEANUP: mjson: remove MJSON_ENABLE_RPC code Remove the code used under #if MJSON_ENABLE_RPC, which is not used within haproxy, to ease the maintenance of mjson.	2025-10-03 16:06:33 +02:00
Willy Tarreau	1afaa7b59d	MINOR: rawsock: introduce CO_RFL_TRY_HARDER to detect closures on complete reads Normally, when reading a full buffer, or exactly the requested size, it is not really possible to know if the peer had closed immediately after, and usually we don't care. There's a problematic case, though, which is with SSL: the SSL layer reads in small chunks of a few bytes, and can consume a client_hello this way, then start computation without knowing yet that the client has aborted. In order to permit knowing more, we now introduce a new read flag, CO_RFL_TRY_HARDER, which says that if we've read up to the permitted limit and the flag is set, then we attempt one extra byte using MSG_PEEK to detect whether the connection was closed immediately after that content or not. The first use case will obviously be related to SSL and client_hello, but it might possibly also make sense on HTTP responses to detect a pending FIN at the end of a response (e.g. if a close was already advertised).	2025-10-01 10:23:01 +02:00
Willy Tarreau	25f5f357cc	MINOR: sched: pass the thread number to is_sched_alive() Now it will be possible to query any thread's scheduler state, not only the current one. This aims at simplifying the watchdog checks for reported threads. The operation is now a simple atomic xchg.	2025-10-01 10:18:53 +02:00
Olivier Houchard	cf26745857	MINOR: mt_list: Implement MT_LIST_POP_LOCKED() Implement MT_LIST_POP_LOCKED(), that behaves as MT_LIST_POP() and removes the first element from the list, if any, but keeps it locked. This should be backported to 3.2, as it will be use in a bug fix in the stick tables that affects 3.2 too.	2025-09-30 16:25:07 +02:00
William Lallemand	b70c7f48fa	MINOR: acme: implement "reuse-key" option The new "reuse-key" option in the "acme" section, allows to keep the private key instead of generating a new one at each renewal.	2025-09-27 21:41:39 +02:00
William Lallemand	3e72a9f618	MINOR: acme: provider-name for dpapi sink Like "acme-vars", the "provider-name" in the acme section is used in case of DNS-01 challenge and is sent to the dpapi sink. This is used to pass the name of a DNS provider in order to chose the DNS API to use. This patch implements the cfg_parse_acme_vars_provider() which parses either acme-vars or provider-name options and escape their strings. Example: $ ( echo "@@1 show events dpapi -w -0"; cat - ) \| socat /tmp/master.sock - \| cat -e <0>2025-09-18T17:53:58.831140+02:00 acme deploy foobpar.pem thumbprint gDvbPL3w4J4rxb8gj20mGEgtuicpvltnTl6j1kSZ3vQ$ acme-vars "var1=foobar\"toto\",var2=var2"$ provider-name "godaddy"$ {$ "identifier": {$ "type": "dns",$ "value": "example.com"$ },$ "status": "pending",$ "expires": "2025-09-25T14:41:57Z",$ [...]	2025-09-26 10:23:35 +02:00
William Lallemand	92c31a6fb7	MINOR: acme: acme-vars allow to pass data to the dpapi sink In the case of the dns-01 challenge, the agent that handles the challenge might need some extra information which depends on the DNS provider. This patch introduces the "acme-vars" option in the acme section, which allows to pass these data to the dpapi sink. The double quotes will be escaped when printed in the sink. Example: global setenv VAR1 'foobar"toto"' acme LE directory https://acme-staging-v02.api.letsencrypt.org/directory challenge DNS-01 acme-vars "var1=${VAR1},var2=var2" Would output: $ ( echo "@@1 show events dpapi -w -0"; cat - ) \| socat /tmp/master.sock - \| cat -e <0>2025-09-18T17:53:58.831140+02:00 acme deploy foobpar.pem thumbprint gDvbPL3w4J4rxb8gj20mGEgtuicpvltnTl6j1kSZ3vQ$ acme-vars "var1=foobar\"toto\",var2=var2"$ {$ "identifier": {$ "type": "dns",$ "value": "example.com"$ },$ "status": "pending",$ "expires": "2025-09-25T14:41:57Z",$ [...]	2025-09-19 16:40:53 +02:00
Aurelien DARRAGON	5c299dee5a	MEDIUM: stats: consider that shared stats pointers may be NULL This patch looks huge, but it has a very simple goal: protect all accessed to shared stats pointers (either read or writes), because we know consider that these pointers may be NULL. The reason behind this is despite all precautions taken to ensure the pointers shouldn't be NULL when not expected, there are still corner cases (ie: frontends stats used on a backend which no FE cap and vice versa) where we could try to access a memory area which is not allocated. Willy stumbled on such cases while playing with the rings servers upon connection error, which eventually led to process crashes (since 3.3 when shared stats were implemented) Also, we may decide later that shared stats are optional and should be disabled on the proxy to save memory and CPU, and this patch is a step further towards that goal. So in essence, this patch ensures shared stats pointers are always initialized (including NULL), and adds necessary guards before shared stats pointers are de-referenced. Since we already had some checks for backends and listeners stats, and the pointer address retrieval should stay in cpu cache, let's hope that this patch doesn't impact stats performance much.	2025-09-18 16:49:51 +02:00
Willy Tarreau	08c6bbb542	OPTIM: sink: don't waste time calling sink_announce_dropped() if busy If we see that another thread is already busy trying to announce the dropped counter, there's no point going there, so let's just skip all that operation from sink_write() and avoid disturbing the other thread. This results in a boost from 244 to 262k req/s.	2025-09-18 09:07:35 +02:00
Willy Tarreau	361c227465	MINOR: trace: don't call strlen() on the function's name Currently there's a small mistake in the way the trace function and macros. The calling function name is known as a constant until the macro and passed as-is to the __trace() function. That one needs to know its length and will call ist() on it, resulting in a real call to strlen() while that length was known before the call. Let's use an ist instead of a const char* for __trace() and __trace_enabled() so that we can now completely avoid calling strlen() during this operation. This has significantly reduced the importance of __trace_enabled() in perf top.	2025-09-18 08:31:57 +02:00
Willy Tarreau	8c077c17eb	MINOR: server: add the "cc" keyword to set the TCP congestion controller It is possible on at least Linux and FreeBSD to set the congestion control algorithm to be used with outgoing connections, among the list of supported and permitted ones. Let's expose this setting with "cc". Unknown or forbidden algorithms will be ignored and the default one will continue to be used.	2025-09-17 17:19:33 +02:00
Willy Tarreau	4ed3cf295d	MINOR: listener: add the "cc" bind keyword to set the TCP congestion controller It is possible on at least Linux and FreeBSD to set the congestion control algorithm to be used with incoming connections, among the list of supported and permitted ones. Let's expose this setting with "cc". Permission issues might be reported (as warnings).	2025-09-17 17:03:42 +02:00
Ben Kallus	31d0695a6a	IMPORT: ebtree: replace hand-rolled offsetof to avoid UB The C standard specifies that it's undefined behavior to dereference NULL (even if you use & right after). The hand-rolled offsetof idiom &(((s)NULL)->f) is thus technically undefined. This clutters the output of UBSan and is simple to fix: just use the real offsetof when it's available. Note that there's no clear statement about this point in the spec, only several points which together converge to this: - From N3220, 6.5.3.4: A postfix expression followed by the -> operator and an identifier designates a member of a structure or union object. The value is that of the named member of the object to which the first expression points, and is an lvalue. - From N3220, 6.3.2.1: An lvalue is an expression (with an object type other than void) that potentially designates an object; if an lvalue does not designate an object when it is evaluated, the behavior is undefined. - From N3220, 6.5.4.4 p3: The unary & operator yields the address of its operand. If the operand has type "type", the result has type "pointer to type". If the operand is the result of a unary operator, neither that operator nor the & operator is evaluated and the result is as if both were omitted, except that the constraints on the operators still apply and the result is not an lvalue. Similarly, if the operand is the result of a [] operator, neither the & operator nor the unary * that is implied by the [] is evaluated and the result is as if the & operator were removed and the [] operator were changed to a + operator. => In short, this is saying that C guarantees these identities: 1. &(p) is equivalent to p 2. &(p[n]) is equivalent to p + n As a consequence, &(p) doesn't result in the evaluation of *p, only the evaluation of p (and similar for []). There is no corresponding special carve-out for ->. See also: https://pvs-studio.com/en/blog/posts/cpp/0306/ After this patch, HAProxy can run without crashing after building w/ clang-19 -fsanitize=undefined -fno-sanitize=function,alignment This is ebtree commit bd499015d908596f70277ddacef8e6fa998c01d5. Signed-off-by: Willy Tarreau <w@1wt.eu> This is ebtree commit 5211c2f71d78bf546f5d01c8d3c1484e868fac13.	2025-09-17 14:30:32 +02:00
Willy Tarreau	a31da78685	IMPORT: ebtree: add a definition of offsetof() We'll use this to improve the definition of container_of(). Let's define it if it does not exist. We can rely on __builtin_offsetof() on recent enough compilers. This is ebtree commit 1ea273e60832b98f552b9dbd013e6c2b32113aa5. Signed-off-by: Willy Tarreau <w@1wt.eu> This is ebtree commit 69b2ef57a8ce321e8de84486182012c954380401.	2025-09-17 14:30:32 +02:00
Ben Kallus	ddbff4e235	IMPORT: ebtree: Fix UB from clz(0) From 'man gcc': passing 0 as the argument to "__builtin_ctz" or "__builtin_clz" invokes undefined behavior. This triggers UBsan in HAProxy. [wt: tested in treebench and verified not to cause any performance regression with opstime-u32 nor stress-u32] Signed-off-by: Willy Tarreau <w@1wt.eu> This is ebtree commit 8c29daf9fa6e34de8c7684bb7713e93dcfe09029. Signed-off-by: Willy Tarreau <w@1wt.eu> This is ebtree commit cf3b93736cb550038325e1d99861358d65f70e9a.	2025-09-17 14:30:32 +02:00
Willy Tarreau	52c6dd773d	IMPORT: ebst: use prefetching in lookup() and insert() While the previous optimizations couldn't be preserved due to the possibility of out-of-bounds accesses, at least the prefetch is useful. A test on treebench shows that for 64k short strings, the lookup time falls from 276 to 199ns per lookup (28% savings), and the insert falls from 311 to 296ns (4.9% savings), which are pretty respectable, so let's do this. This is ebtree commit b44ea5d07dc1594d62c3a902783ed1fb133f568d.	2025-09-17 14:30:32 +02:00
Willy Tarreau	fef4cfbd21	IMPORT: ebtree: only use __builtin_prefetch() when supported It looks like __builtin_prefetch() appeared in gcc-3.1 as there's no mention of it in 3.0's doc. Let's replace it with eb_prefetch() which maps to __builtin_prefetch() on supported compilers and falls back to the usual do{}while(0) on other ones. It was tested to properly build with tcc as well as gcc-2.95. This is ebtree commit 7ee6ede56a57a046cb552ed31302b93ff1a21b1a.	2025-09-17 14:30:32 +02:00
Willy Tarreau	3dda813d54	IMPORT: eb32/64: optimize insert for modern CPUs Similar to previous patches, let's improve the insert() descent loop to avoid discovering mandatory data too late. The change here is even simpler than previous ones, a prefetch was installed and troot is calculated before last instruction in a speculative way. This was enough to gain +50% insertion rate on random data. This is ebtree commit e893f8cc4d44b10f406b9d1d78bd4a9bd9183ccf.	2025-09-17 14:30:32 +02:00
Willy Tarreau	61654c07bd	IMPORT: ebmb: optimize the lookup for modern CPUs This is the same principles as for the latest improvements made on integer trees. Applying the same recipes made the ebmb_lookup() function jump from 10.07 to 12.25 million lookups per second on a 10k random values tree (+21.6%). It's likely that the ebmb_lookup_longest() code could also benefit from this, though this was neither explored nor tested. This is ebtree commit a159731fd6b91648a2fef3b953feeb830438c924.	2025-09-17 14:30:32 +02:00
Willy Tarreau	6c54bf7295	IMPORT: eb32/eb64: place an unlikely() on the leaf test In the loop we can help the compiler build slightly more efficient code by placing an unlikely() around the leaf test. This shows a consistent 0.5% performance gain both on eb32 and eb64. This is ebtree commit 6c9cdbda496837bac1e0738c14e42faa0d1b92c4.	2025-09-17 14:30:32 +02:00
Willy Tarreau	384907f4e7	IMPORT: eb32: drop the now useless node_bit variable This one was previously used to preload from the node and keep a copy in a register on i386 machines with few registers. With the new more optimal code it's totally useless, so let's get rid of it. By the way the 64 bit code didn't use that at all already. This is ebtree commit 1e219a74cfa09e785baf3637b6d55993d88b47ef.	2025-09-17 14:30:31 +02:00
Willy Tarreau	c9e4adf608	IMPORT: eb32/eb64: use a more parallelizable check for lack of common bits Instead of shifting the XOR value right and comparing it to 1, which roughly requires 2 sequential instructions, better test if the XOR has any bit above the current bit, which means any bit set among those strictly higher, or in other words that XOR & (-bit << 1) is non-zero. This is one less instruction in the fast path and gives another nice performance gain on random keys (in million lookups/s): eb32 1k: 33.17 -> 37.30 +12.5% 10k: 15.74 -> 17.08 +8.51% 100k: 8.00 -> 9.00 +12.5% eb64 1k: 34.40 -> 38.10 +10.8% 10k: 16.17 -> 17.10 +5.75% 100k: 8.38 -> 8.87 +5.85% This is ebtree commit c942a2771758eed4f4584fe23cf2914573817a6b.	2025-09-17 14:30:31 +02:00
Willy Tarreau	6af17d491f	IMPORT: eb32/eb64: reorder the lookup loop for modern CPUs The current code calculates the next troot based on a calculation. This was efficient when the algorithm was developed many years ago on K6 and K7 CPUs running at low frequencies with few registers and limited branch prediction units but nowadays with ultra-deep pipelines and high latency memory that's no longer efficient, because the CPU needs to have completed multiple operations before knowing which address to start fetching from. It's sad because we only have two branches each time but the CPU cannot know it. In addition, the calculation is performed late in the loop, which does not help the address generation unit to start prefetching next data. Instead we should help the CPU by preloading data early from the node and calculing troot as soon as possible. The CPU will be able to postpone that processing until the dependencies are available and it really needs to dereference it. In addition we must absolutely avoid serializing instructions such as "(a >> b) & 1" because there's no way for the compiler to parallelize that code nor for the CPU to pre- process some early data. What this patch does is relatively simple: - we try to prefetch the next two branches as soon as the node is known, which will help dereference the selected node in the next iteration; it was shown that it only works with the next changes though, otherwise it can reduce the performance instead. In practice the prefetching will start a bit later once the node is really in the cache, but since there's no dependency between these instructions and any other one, we let the CPU optimize as it wants. - we preload all important data from the node (next two branches, key and node.bit) very early even if not immediately needed. This is cheap, it doesn't cause any pipeline stall and speeds up later operations. - we pre-calculate 1<<bit that we assign into a register, so as to avoid serializing instructions when deciding which branch to take. - we assign the troot based on a ternary operation (or if/else) so that the CPU knows upfront the two possible next addresses without waiting for the end of a calculation and can prefetch their contents every time the branch prediction unit guesses right. Just doing this provides significant gains at various tree sizes on random keys (in million lookups per second): eb32 1k: 29.07 -> 33.17 +14.1% 10k: 14.27 -> 15.74 +10.3% 100k: 6.64 -> 8.00 +20.5% eb64 1k: 27.51 -> 34.40 +25.0% 10k: 13.54 -> 16.17 +19.4% 100k: 7.53 -> 8.38 +11.3% The performance is now much closer to the sequential keys. This was done for all variants ({32,64}{,i,le,ge}). Another point, the equality test in the loop improves the performance when looking up random keys (since we don't need to reach the leaf), but is counter-productive for sequential keys, which can gain ~17% without that test. However sequential keys are normally not used with exact lookups, but rather with lookup_ge() that spans a time frame, and which does not have that test for this precise reason, so in the end both use cases are served optimally. It's interesting to note that everything here is solely based on data dependencies, and that trying to perform less operations upfront always ends up with lower performance (typically the original one). This is ebtree commit 05a0613e97f51b6665ad5ae2801199ad55991534.	2025-09-17 14:30:31 +02:00
Aurelien DARRAGON	644b6b9925	MINOR: counters: document that tg shared counters are tied to shm-stats-file mapping Let's explicitly mention that fe_counters_shared_tg and be_counters_shared_tg structs are embedded in shm_stats_file_object struct so any change in those structs will result in shm stats file incompatibility between processes, thus extra precaution must be taken when making changes to them. Note that the provisionning made in shm_stats_file_object struct could be used to add members to {fe,be}_counters_shared_tg without changing shm_stats_file_object struct size if needed in order to preserve shm stats file version.	2025-09-17 11:31:29 +02:00
Willy Tarreau	4edff4a2cc	CLEANUP: vars: use the item API for the variables trees The variables trees use the immediate cebtree API, better use the item one which is more expressive and safer. The "node" field was renamed to "name_node" to avoid any ambiguity.	2025-09-16 10:51:23 +02:00
Willy Tarreau	2d6b5c7a60	MEDIUM: connection: reintegrate conn_hash_node into connection Previously the conn_hash_node was placed outside the connection due to the big size of the eb64_node that could have negatively impacted frontend connections. But having it outside also means that one extra allocation is needed for each backend connection, and that one memory indirection is needed for each lookup. With the compact trees, the tree node is smaller (16 bytes vs 40) so the overhead is much lower. By integrating it into the connection, We're also eliminating one pointer from the connection to the hash node and one pointer from the hash node to the connection (in addition to the extra object bookkeeping). This results in saving at least 24 bytes per total backend connection, and only inflates connections by 16 bytes (from 240 to 256), which is a reasonable compromise. Tests on a 64-core EPYC show a 2.4% increase in the request rate (from 2.08 to 2.13 Mrps).	2025-09-16 09:23:46 +02:00
Willy Tarreau	ceaf8c1220	MEDIUM: connection: move idle connection trees to ceb64 Idle connection trees currently require a 56-byte conn_hash_node per connection, which can be reduced to 32 bytes by moving to ceb64. While ceb64 is theoretically slower, in practice here we're essentially dealing with trees that almost always contain a single key and many duplicates. In this case, ceb64 insert and lookup functions become faster than eb64 ones because all duplicates are a list accessed in O(1) while it's a subtree for eb64. In tests it is impossible to tell the difference between the two, so it's worth reducing the memory usage. This commit brings the following memory savings to conn_hash_node (one per backend connection), and to srv_per_thread (one per thread and per server): struct before after delta conn_hash_nodea 56 32 -24 srv_per_thread 96 72 -24 The delicate part is conn_delete_from_tree(), because we need to know the tree root the connection is attached to. But thanks to recent cleanups, it's now clear enough (i.e. idle/safe/avail vs session are easy to distinguish).	2025-09-16 09:23:46 +02:00
Willy Tarreau	95b8adff67	MINOR: connection: pass the thread number to conn_delete_from_tree() We'll soon need to choose the server's root based on the connection's flags, and for this we'll need the thread it's attached to, which is not always the current one. This patch simply passes the thread number from all callers. They know it because they just set the idle_conns lock on it prior to calling the function.	2025-09-16 09:23:46 +02:00
Willy Tarreau	7773d87ea6	CLEANUP: proxy: slightly reorganize fields to plug some holes The proxy struct has several small holes that deserved being plugged by moving a few fields around. Now we're down to 3056 from 3072 previously, and the remaining holes are small. At the moment, compared to before this series, we're seeing these sizes: type\size `7d554ca62` current delta listener 752 704 -48 (-6.4%) server 4032 3840 -192 (-4.8%) proxy 3184 3056 -128 (-4%) stktable 3392 3328 -64 (-1.9%) Configs with many servers have shrunk by about 4% in RAM and configs with many proxies by about 3%.	2025-09-16 09:23:46 +02:00
Willy Tarreau	8df81b6fcc	CLEANUP: server: slightly reorder fields in the struct to plug holes The struct server still has a lot of holes and padding that make it quite big. By moving a few fields aronud between areas which do not interact (e.g. boot vs aligned areas), it's quite easy to plug some of them and/or to arrange larger ones which could be reused later with a bit more effort. Here we've reduced holes by 40 bytes, allowing the struct to shrink by one more cache line (64 bytes). The new size is 3840 bytes.	2025-09-16 09:23:46 +02:00
Willy Tarreau	d18d972b1f	MEDIUM: server: index server ID using compact trees The server ID is currently stored as a 32-bit int using an eb32 tree. It's used essentially to find holes in order to automatically assign IDs, and to detect duplicates. Let's change this to use compact trees instead in order to save 24 bytes in struct server for this node, plus 8 bytes in struct proxy. The server struct is still 3904 bytes large (due to alignment) and the proxy struct is 3072.	2025-09-16 09:23:46 +02:00
Willy Tarreau	66191584d1	MEDIUM: listener: index listener ID using compact trees The listener ID is currently stored as a 32-bit int using an eb32 tree. It's used essentially to find holes in order to automatically assign IDs, and to detect duplicates. Let's change this to use compact trees instead in order to save 24 bytes in struct listener for this node, plus 8 bytes in struct proxy. The struct listener is now 704 bytes large, and the struct proxy 3080.	2025-09-16 09:23:46 +02:00
Willy Tarreau	1a95bc42c7	MEDIUM: proxy: index proxy ID using compact trees The proxy ID is currently stored as a 32-bit int using an eb32 tree. It's used essentially to find holes in order to automatically assign IDs, and to detect duplicates. Let's change this to use compact trees instead in order to save 24 bytes in struct proxy for this node, plus 8 bytes in the root (which is static so not much relevant here). Now the proxy is 3088 bytes large.	2025-09-16 09:23:46 +02:00
Willy Tarreau	eab5b89dce	MINOR: proxy: add proxy_index_id() to index a proxy by its ID This avoids needlessly exposing the tree's root and the mechanics outside of the low-level code.	2025-09-16 09:23:46 +02:00
Willy Tarreau	5e4b6714e1	MINOR: listener: add listener_index_id() to index a listener by its ID This avoids needlessly exposing the tree's root and the mechanics outside of the low-level code.	2025-09-16 09:23:46 +02:00
Willy Tarreau	5a5cec4d7a	MINOR: server: add server_index_id() to index a server by its ID This avoids needlessly exposing the tree's root and the mechanics outside of the low-level code.	2025-09-16 09:23:46 +02:00
Willy Tarreau	0b0aefe19b	MINOR: server: add server_get_next_id() to find next free server ID This was previously achieved via the generic get_next_id() but we'll soon get rid of generic ID trees so let's have a dedicated server_get_next_id(). As a bonus it reduces the exposure of the tree's root outside of the functions.	2025-09-16 09:23:46 +02:00
Willy Tarreau	23605eddb1	MINOR: listener: add listener_get_next_id() to find next free listener ID This was previously achieved via the generic get_next_id() but we'll soon get rid of generic ID trees so let's have a dedicated listener_get_next_id(). As a bonus it reduces the exposure of the tree's root outside of the functions.	2025-09-16 09:23:46 +02:00
Willy Tarreau	b2402d67b7	MINOR: proxy: add proxy_get_next_id() to find next free proxy ID This was previously achieved via the generic get_next_id() but we'll soon get rid of generic ID trees so let's have a dedicated proxy_get_next_id().	2025-09-16 09:23:46 +02:00
Willy Tarreau	f4059ea42f	MEDIUM: stktable: index table names using compact trees Here we're saving 64 bytes per stick-table, from 3392 to 3328, and the change was really straightforward so there's no reason not to do it.	2025-09-16 09:23:46 +02:00
Willy Tarreau	d0d60a007d	MEDIUM: proxy: switch conf.name to cebis_tree This is used to index the proxy's name and it contains a copy of the pointer to the proxy's name in <id>. Changing that for a ceb_node placed just before <id> saves 32 bytes to the struct proxy, which is now 3112 bytes large. Here we need to continue to support duplicates since they're still allowed between type-incompatible proxies. Interestingly, the use of cebis_next_dup() instead of cebis_next() in proxy_find_by_name() allows us to get rid of an strcmp() that was performed for each use_backend rule. A test with a large config (100k backends) shows that we can get 3% extra performance on a config involving a static use_backend rule (3.09M to 3.18M rps), and even 4.5% on a dynamic rule selecting a random backend (2.47M to 2.59M).	2025-09-16 09:23:46 +02:00
Willy Tarreau	fdf6fd5b45	MEDIUM: server: switch the host_dn member to cebis_tree This member is used to index the hostname_dn contents for DNS resolution. Let's replace it with a cebis_tree to save another 32 bytes (24 for the node + 8 by avoiding the duplication of the pointer). The struct server is now at 3904 bytes.	2025-09-16 09:23:46 +02:00
Willy Tarreau	413e903a22	MEDIUM: server: switch conf.name to cebis_tree This is used to index the server name and it contains a copy of the pointer to the server's name in <id>. Changing that for a ceb_node placed just before <id> saves 32 bytes to the struct server, which remains 3968 bytes large due to alignment. The proxy struct shrinks by 8 bytes to 3144. It's worth noting that the current way duplicate names are handled remains based on the previous mechanism where dups were permitted. Ideally we should now reject them during insertion and use unique key trees instead.	2025-09-16 09:23:46 +02:00
Willy Tarreau	0e99f64fc6	MEDIUM: server: switch addr_node to cebis_tree This contains the text representation of the server's address, for use with stick-tables with "srvkey addr". Switching them to a compact node saves 24 more bytes from this structure. The key was moved to an external pointer "addr_key" right after the node. The server struct is now 3968 bytes (down from 4032) due to alignment, and the proxy struct shrinks by 8 bytes to 3152.	2025-09-16 09:23:46 +02:00
Willy Tarreau	91258fb9d8	MEDIUM: guid: switch guid to more compact cebuis_tree The current guid struct size is 56 bytes. Once reduced using compact trees, it goes down to 32 (almost half). We're not on a critical path and size matters here, so better switch to this. It's worth noting that the name part could also be stored in the guid_node at the end to save 8 extra byte (no pointer needed anymore), however the purpose of this struct is to be embedded into other ones, which is not compatible with having a dynamic size. Affected struct sizes in bytes: Before After Diff server 4032 4032 0* proxy 3184 3160 -24 listener 752 728 -24 *: struct server is full of holes and padding (176 bytes) and is 64-byte aligned. Moving the guid_node elsewhere such as after sess_conn reduces it to 3968, or one less cache line. There's no point in moving anything now because forthcoming patches will arrange other parts.	2025-09-16 09:23:46 +02:00
Willy Tarreau	e36b3b60b3	MEDIUM: migrate the patterns reference to cebs_tree cebs_tree are 24 bytes smaller than ebst_tree (16B vs 40B), and pattern references are only used during map/acl updates, so their storage is pure loss between updates (which most of the time never happen). By switching their indexing to compact trees, we can save 16 to 24 bytes per entry depending on alightment (here it's 24 per struct but 16 practical as malloc's alignment keeps 8 unused). Tested on core i7-8650U running at 3.0 GHz, with a file containing 17.7M IP addresses (16.7M different): $ time ./haproxy -c -f acl-ip.cfg Save 280 MB RAM for 17.7M IP addresses, and slightly speeds up the startup (5.8%, from 19.2s to 18.2s), a part of which possible being attributed to having to write less memory. Note that this is on small strings. On larger ones such as user-agents, ebtree doesn't reread the whole key and might be more efficient. Before: RAM (VSZ/RSS): 4443912 3912444 real 0m19.211s user 0m18.138s sys 0m1.068s Overhead Command Shared Object Symbol 44.79% haproxy haproxy [.] ebst_insert 25.07% haproxy haproxy [.] ebmb_insert_prefix 3.44% haproxy libc-2.33.so [.] __libc_calloc 2.71% haproxy libc-2.33.so [.] _int_malloc 2.33% haproxy haproxy [.] free_pattern_tree 1.78% haproxy libc-2.33.so [.] inet_pton4 1.62% haproxy libc-2.33.so [.] _IO_fgets 1.58% haproxy libc-2.33.so [.] _int_free 1.56% haproxy haproxy [.] pat_ref_push 1.35% haproxy libc-2.33.so [.] malloc_consolidate 1.16% haproxy libc-2.33.so [.] __strlen_avx2 0.79% haproxy haproxy [.] pat_idx_tree_ip 0.76% haproxy haproxy [.] pat_ref_read_from_file 0.60% haproxy libc-2.33.so [.] __strrchr_avx2 0.55% haproxy libc-2.33.so [.] unlink_chunk.constprop.0 0.54% haproxy libc-2.33.so [.] __memchr_avx2 0.46% haproxy haproxy [.] pat_ref_append After: RAM (VSZ/RSS): 4166108 3634768 real 0m18.114s user 0m17.113s sys 0m0.996s Overhead Command Shared Object Symbol 38.99% haproxy haproxy [.] cebs_insert 27.09% haproxy haproxy [.] ebmb_insert_prefix 3.63% haproxy libc-2.33.so [.] __libc_calloc 3.18% haproxy libc-2.33.so [.] _int_malloc 2.69% haproxy haproxy [.] free_pattern_tree 1.99% haproxy libc-2.33.so [.] inet_pton4 1.74% haproxy libc-2.33.so [.] _IO_fgets 1.73% haproxy libc-2.33.so [.] _int_free 1.57% haproxy haproxy [.] pat_ref_push 1.48% haproxy libc-2.33.so [.] malloc_consolidate 1.22% haproxy libc-2.33.so [.] __strlen_avx2 1.05% haproxy libc-2.33.so [.] __strcmp_avx2 0.80% haproxy haproxy [.] pat_idx_tree_ip 0.74% haproxy libc-2.33.so [.] __memchr_avx2 0.69% haproxy libc-2.33.so [.] __strrchr_avx2 0.69% haproxy libc-2.33.so [.] _IO_getline_info 0.62% haproxy haproxy [.] pat_ref_read_from_file 0.56% haproxy libc-2.33.so [.] unlink_chunk.constprop.0 0.56% haproxy libc-2.33.so [.] cfree@GLIBC_2.2.5 0.46% haproxy haproxy [.] pat_ref_append If the addresses are totally disordered (via "shuf" on the input file), we see both implementations reach exactly 68.0s (slower due to much higher cache miss ratio). On large strings such as user agents (1 million here), it's now slightly slower (+9%): Before: real 0m2.475s user 0m2.316s sys 0m0.155s After: real 0m2.696s user 0m2.544s sys 0m0.147s But such patterns are much less common than short ones, and the memory savings do still count. Note that while it could be tempting to get rid of the list that chains all these pat_ref_elt together and only enumerate them by walking along the tree to save 16 extra bytes per entry, that's not possible due to the problem that insertion ordering is critical (think overlapping regex such as /index.* and /index.html). Currently it's not possible to proceed differently because patterns are first pre-loaded into the pat_ref via pat_ref_read_from_file_smp() and later indexed by pattern_read_from_file(), which has to only redo the second part anyway for maps/acls declared multiple times.	2025-09-16 09:23:46 +02:00
Willy Tarreau	ddf900a0ce	IMPORT: cebtree: import version 0.5.0 to support duplicates The support for duplicates is necessary for various use cases related to config names, so let's upgrade to the latest version which brings this support. This updates the cebtree code to commit 808ed67 (tag 0.5.0). A few tiny adaptations were needed: - replace a few ceb_node with ceb_root since pointers are now tagged ; - replace cebu.h with ceb.h since both are now merged in the same include file. This way we can drop the unused cebu*.h files from cebtree that are provided only for compatibility. - rename immediate storage functions to cebXX_imm_XXX() as per the API change in 0.5 that makes immediate explicit rather than implicit. This only affects vars and tools.c:copy_file_name(). The tests continue to work.	2025-09-16 09:23:46 +02:00
Remi Tricot-Le Breton	257df69fbd	BUG/MINOR: ocsp: Crash when updating CA during ocsp updates If an ocsp response is set to be updated automatically and some certificate or CA updates are performed on the CLI, if the CLI update happens while the OCSP response is being updated and is then detached from the udapte tree, it might be wrongly inserted into the update tree in 'ssl_sock_load_ocsp', and then reinserted when the update finishes. The update tree then gets corrupted and we could end up crashing when accessing other nodes in the ocsp response update tree. This patch must be backported up to 2.8. This patch fixes GitHub #3100.	2025-09-15 15:34:36 +02:00

1 2 3 4 5 ...

8843 commits