This patch impacts the QUIC frontends. It reverts this patch
MINOR: quic-be: add a "CC connection" backend TX buffer pool
which adds <pool_head_quic_be_cc_buf> new pool to allocate CC (connection closed state)
TX buffers with bigger object size than the one for <pool_head_quic_cc_buf>.
Indeed the QUIC backends must be able to send at least 1200 bytes Initial packets.
For now on, both the QUIC frontends and backend use the same pool with
MAX(QUIC_INITIAL_IPV6_MTU, QUIC_INITIAL_IPV4_MTU)(1252 bytes) as object size.
cpu_dump_topology() prints details about each enabled CPU and a summary with
clusters info and thread-cpu bindings. The latter is often usefull for
debugging and we want to add it in the 'show dev' output.
So, let's split cpu_dump_topology() in two parts: cpu_topo_debug() to print the
details about each enabled CPU; and cpu_topo_dump_summary() to print only the
summary.
In the next commit we will modify cpu_topo_dump_summary() to write into local
trash buffer and it could be easily called from debug_parse_cli_show_dev().
It's super difficult to find the rules that operate idle conns depending
on their idle/safe/avail/private status. Some are in lists, others not.
Some are in trees, others not. Some have a flag set, others not. This
documents the rules before the definitions in connection-t.h. It could
even be backported to help during backport sessions.
Replace all calls to qc_is_listener() (resp. !qc_is_listener()) by calls to
objt_listener() (resp. objt_server()).
Remove qc_is_listener() implement and QUIC_FL_CONN_LISTENER the flag it
relied on.
When an appctx is initialized, there is a BUG_ON() to be sure the appctx is
really initialized on the right thread to avoid bugs on the thread
affinity. However, it is possible to not choose the thread when the appctx
is created and let it starts on any thread. In that case, the thread
affinity is set when the appctx is initialized. So, we must take cate to not
trigger the BUG_ON() in that case.
For now, we never hit the bug because the thread affinity is always set
during the appctx creation.
This patch must be backport as far as 2.8.
On backend side, H3 layer is responsible to decode a HTTP/3 response
into an HTX message. Multiple responses may be received on a single
stream with interim status codes prior to the final one.
h3_resp_headers_to_htx() is the function used solely on backend side
responsible for H3 response to HTX transcoding. This patch extends it to
be able to properly support interim responses. When such a response is
received, the new flag H3_SF_RECV_INTERIM is set. This is converted to
QMUX qcs flag QC_SF_EOI_SUSPENDED.
The objective of this latter flag is to prevent stream EOI to be
reported during stream rcv_buf callback, even if HTX message contains
EOM and is empty. QC_SF_EOI_SUSPENDED will be cleared when the final
response is finally converted, which unblock stream EOI notification for
next rcv_buf invocations. Note however that HTX EOM is untouched : it is
always set for both interim and final response reception.
As a minor adjustment, HTX_SL_F_BODYLESS is always set for interim
responses.
Contrary to frontend interim response handling, a flag is necessary on
QMUX layer. This is because H3 to HTX transcoding and rcv_buf callback
are two distinct operations, called under different context (MUX vs
stream tasklet).
Also note that H3 layer has two distinct flags for interim response
handling, one only used as a server (FE side) and the other as a client
(BE side). It was preferred to used two distinct flags which is
considered less error-prone, contrary to a single unified flag which
would require to always set the proxy side to ensure it is relevant or
not.
No need to backport.
Previous patches have fixes interim response encoding via
h3_resp_headers_send(). However, it is still necessary to adjust h3
layer state-machine so that several successive HTTP responses are
accepted for a single stream.
Prior to this, QMUX was responsible to decree that the final HTX message
was encoded so that FIN stream can be emitted. However, with interim
response, MUX is in fact unable to properly determine this. As such,
this is the responsibility of the application protocol layer. To reflect
this, app_ops snd_buf callback is modified so that a new output argument
<fin> is added to it.
Note that for now this commit does not bring any functional change.
However, it will be necessary for the following patch. As such, it
should be backported prior to it to every versions as necessary.
The server struct has gone huge over time (~3.8kB), and having a copy
of it in the defsrv section of the struct proxy costs a lot of RAM,
that is not needed anymore at run time.
This patch replaces this struct with a dynamically allocated one. The
field is allocated and initialized during alloc_new_proxy() and is
freed when the proxy is destroyed for now. But the goal will be to
support freeing it after parsing the section.
At a few places we're seeing some open-coding of the same function, likely
because it looks overkill for what it's supposed to do, due to extraneous
tests that are not needed (e.g. check of the backend's PR_CAP_BE etc).
Let's just remove all these superfluous tests and inline it so that it
feels more suitable for use everywhere it's needed.
This function doesn't just look at the name but also the ID when the
argument starts with a '#'. So the name is not correct and explains
why this function is not always used when the name only is needed,
and why the list-based findserver() is used instead. So let's just
call the function "server_find()", and rename its generation-id based
cousin "server_find_unique()".
Let's just use the tree-based lookup instead of walking through the list.
This function is used to find duplicates in "track" statements and a few
such places, so it's important not to waste too much time on large setups.
Since 2012, systemd compliant distributions contain
/etc/os-release file. This file has some standardized format, see details at
https://www.freedesktop.org/software/systemd/man/latest/os-release.html.
Let's read it in feed_post_mortem_linux() to gather more info about the
distribution.
(cherry picked from commit f1594c41368baf8f60737b229e4359fa7e1289a9)
Signed-off-by: Willy Tarreau <w@1wt.eu>
The function h1_format_htx_msg() can now be used to convert a valid HTX
message in its H1 representation. No validity test is performed, the HTX
message must be valid. Only trailers are silently ignored if the message is
not chunked. In addition, the destination buffer must be empty. 1XX interim
responses should be supported. But again, there is no validity tests.
When a large request is sent, it is possible to have a response before the
end of the request. It is valid from HTTP perspective but it is an issue
with the current design of the http-client. Indded, the request and the
response are handled sequentially. So the response will be blocked, waiting
for the end of the request. Most of time, it is not an issue, except when
the request transfer is blocked. In that case, the applet is blocked.
With the current API, it is not possible to handle early response and
continue the request transfer. So, this case cannot be handle. In that case,
it seems reasonnable to drain the request if a response is received. This
way, the request transfer, from the caller point of view, is never blocked
and the response can be properly processed.
To do so, the action flag HTTPCLIENT_FA_DRAIN_REQ is added to the
http-client. When it is set, the request payload is just dropped. In that
case, we take care to not report the end of input to properly report the
request was truncated, especially in logs.
It is only an issue with large POSTs, when the payload is streamed.
This patch must be backported as far as 2.6.
Revert this patch which is no more useful since OpenSSL 3.5.1 to remove the
QUIC server callback restoration after SSL context switch:
MINOR: quic: OpenSSL 3.5 internal QUIC custom extension for transport parameters reset
It was required for 3.5.0. That said, there was no CI for OpenSSL 3.5 at the date
of this commit. The CI recently revealed that the QUIC server side could crash
during QUIC reg tests just after having restored the callbacks as implemented by
the commit above.
Also revert this commit which is no more useful because it arrived with the commit
above:
BUG/MEDIUM: quic: SSL/TCP handshake failures with OpenSSL 3.
Must be backported to 3.2.
The QUIC listener part was impacted by the 3.5.0 OpenSSL new QUIC API with several
issues which have been fixed by 3.5.1.
Add a #error to prevent such OpenSSL 3.5 new QUIC API use with version below 3.5.1.
Must be backported to 3.2.
The server's "hostname_dn" is in Domain Name format, not a pure string, as
converted by resolv_str_to_dn_label(). It is made of lower-case string
components delimited by binary lengths, e.g. <0x03>www<0x07>haproxy<0x03)org.
As such it must not be lowercased again in srv_state_srv_update(), because
1) it's useless on the name components since already done, and 2) because
it would replace component lengths 97 and above by 32-char shorter ones.
Granted, not many domain names have that large components so the risk is
very low but the operation is always wrong anyway. This was brought in
2.5 by commit 3406766d57 ("MEDIUM: resolvers: add a ref between servers
and srv request or used SRV record").
In the same vein, let's fix the confusing strcasecmp() that are applied
to this binary format, and use memcmp() instead. Here there's basically
no risk to incorrectly match the wrong record, but that test alone is
confusing enough to provoke the existence of the bug above.
Finally let's update the component for that field to mention that it's
in this format and already lower cased.
Better not backport this, the risk of facing this bug is almost zero, and
every time we touch such files something breaks for bad reasons.
Make the server line parsing fail when a QUIC backend is configured if haproxy
is built to use the OpenSSL stack compatibility module. This latter does not
support the QUIC client part.
This bug impacts both QUIC backends and frontends with OpenSSL 3.5 as QUIC API.
The connections to a haproxy QUIC listener from a haproxy QUIC backend could not
work at all without HelloRetryRequest TLS messages emitted by the backend
asking the QUIC client to restart the handshake followed by TLS alerts:
conn. @(nil) OpenSSL error[0xa000098] read_state_machine: excessive message size
Furthermore, the Initial CRYPTO data sent by the client were big (about two 1252 bytes
packets) (ClientHello TLS message). After analyzing the packets a key_share extension
with <unknown> as value was long (more that 1Ko). This extension is in relation with
the groups but does not belong to the groups supported by QUIC.
That said such connections could work with ngtcp2 as backend built against the same
OSSL TLS stack API but with a HelloRetryRequest.
ngtcp2 always set the QUIC default cipher suites and group, for all the stacks it
supports as implemented by this patch.
So this patch configures both QUIC backend and frontend cipher suites and groups
calling SSL_CTX_set_ciphersuites() and SSL_CTX_set1_groups_list() with the correct
argument, except for SSL_CTX_set1_groups_list() which fails with QUIC TLS for
a unknown reason at this time.
The call to SSL_CTX_set_options() is useless from ssl_quic_initial_ctx() for the QUIC
clients. One relies on ssl_sock_prepare_srv_ssl_ctx() to set them for now on.
This patch is effective for all the supported stacks without impact for AWS-LC,
and QUIC TLS and fixes the connections for haproxy QUIC frontend and backends
when builts against OpenSSL 3.5 QUIC API).
A new define HAVE_OPENSSL_QUICTLS has been added to openssl-compat.h to distinguish
the QUIC TLS stack.
Must be backported to 3.2.
Patterns are allocated when loading maps/acls from a file or dynamically
via the CLI, and are released only from the CLI (e.g. "clear map xxx").
These ones do not use pools and are much harder to monitor, e.g. in case
a script adds many and forgets to clear them, etc.
Let's add a new pair of metrics "PatternsAdded" and "PatternsFreed" that
will report the number of added and freed patterns respectively. This
can allow to simply graph both. The difference between the two normally
represents the number of allocated patterns. If Added grows without
Freed following, it can indicate a faulty script that doesn't perform
the needed cleanup. The metrics are also made available to Prometheus
as patterns_added_total and patterns_freed_total respectively.
This patch adds the support for the RFC2385 (Protection of BGP Sessions via
the + TCP MD5 Signature Option) for the listeners and the servers. The
feature is only available on Linux. Keywords are not exposed otherwise.
By setting "tcp-md5sig <password>" option on a bind line, TCP segments of
all connections instantiated from the listening socket will be signed with a
16-byte MD5 digest. The same option can be set on a server line to protect
outgoing connections to the corresponding server.
The primary use case for this option is to allow BGP to protect itself
against the introduction of spoofed TCP segments into the connection
stream. But it can be useful for any very long-lived TCP connections.
A reg-test was added and it will be executed only on linux. All other
targets are excluded.
Add a HTTPCLIENT_O_RES_HTX flag which allow to store directly the HTX
data in the response buffer instead of extracting the data in raw
format.
This is useful when the data need to be reused in another request.
This patch split the httpclient code to prevent confusion between the
httpclient CLI command and the actual httpclient API.
Indeed there was a confusion between the flag used internally by the
CLI command, and the actual httpclient API.
hc_cli_* functions as well as HC_C_F_* defines were moved to
httpclient_cli.c.
The HC_F_HTTPPROXY flag was wrongly named and does not use the correct
value, indeed this flag was meant to be used for the httpclient API, not
the httpclient CLI.
This patch fixes the problem by introducing HTTPCLIENT_FO_HTTPPROXY
which has must be set in hc->flags.
Also add a member 'options' in the httpclient structure, because the
member flags is reinitialized when starting.
Must be backported as far as 3.0.
Add a fourth character to the second column of the "typed output format"
to indicate whether the value results from a volatile or persistent metric
('V' or 'P' characters respectively). A persistent metric means the value
could possibily be preserved across reloads by leveraging a shared memory
between multiple co-processes. Such metrics are identified as "shared" in
the code (since they are possibly shared between multiple co-processes)
Some reg-tests were updated to take that change into account, also, some
outputs in the configuration manual were updated to reflect current
behavior.
The 'jwt_verify' converter could only be passed public keys as second
parameter instead of full-on public certificates. This patch allows
proper certificates to be used.
Those certificates can be loaded in ckch_stores like any other
certificate which means that all the certificate-related operations that
can be made via the CLI can now benefit JWT validation as well.
We now have two ways JWT validation can work, the legacy one which only
relies on public keys which could not be stored in ckch_stores without
some in depth changes in the way the ckch_stores are built. In this
legacy way, the public keys are fully stored in a cache dedicated to JWT
only which does not have any CLI commands and any way to update them
during runtime. It also requires that all the public keys used are
passed at least once explicitely to the 'jwt_verify' converter so that
they can be loaded during init.
The new way uses actual certificates, either already stored in the
ckch_store tree (if predefined in a crt-store or already used previously
in the configuration) or loaded in the ckch_store tree during init if
they are explicitely used in the configuration like so:
var(txn.bearer),jwt_verify(txn.jwt_alg,"cert.pem")
When using a variable (or any other way that can only be resolved during
runtime) in place of the converter's <key> parameter, the first time we
encounter a new value (for which we don't have any entry in the jwt
tree) we will lock the ckch_store tree and try to perform a lookup in
it. If the lookup fails, an entry will still be inserted into the jwt
tree so that any following call with this value avoids performing the
ckch_store tree lookup.
It was already forbidden to use HTTP sample fetch functions from lua
services. An error is triggered if it happens. However, the error must be
extended to any L6/L7 sample fetch functions.
Indeed, a lua service is an applet. It totally unexepected for an applet to
access to input data in a channel's buffer. These data have not been
analyzed yet and are still subject to any change. An applet, lua or not,
must never access to "not forwarded" data. Only output data are
available. For now, if a lua applet relies on any L6/L7 sampel fetch
functions, the behavior is undefined and not consistent.
So to fix the issue, hlua flag HLUA_F_MAY_USE_HTTP is renamed to
HLUA_F_MAY_USE_CHANNELS_DATA. This flag is used to prevent any lua applet to
use L6/L7 sample fetch functions.
This patch could be backported to all stable versions.
Since proxy and server struct already have an internal last_change
variable and we cannot merge it with the shared counter one, let's
rename the last_change counter to be more specific and prevent the
mixup between the two.
last_change counter is renamed to last_state_change, and unlike the
internal last_change, this one is a shared counter so it is expected
to be updated by other processes in our back.
However, when updating last_state_change counter, we use the value
of the server/proxy last_change as reference value.
Same motivation as previous commit, proxy last_change is "abused" because
it is used for 2 different purposes, one for stats, and the other one
for process-local internal use.
Let's add a separate proxy-only last_change variable for internal use,
and leave the last_change shared (and thread-grouped) counter for
statistics.
last_change server metric is used for 2 separate purposes. First it is
used to report last server state change date for stats and other related
metrics. But it is also used internally, including in sensitive paths,
such as lb related stuff to take decision or perform computations
(ie: in srv_dynamic_maxconn()).
Due to last_change counter now being split over thread groups since 16eb0fa
("MAJOR: counters: dispatch counters over thread groups"), reading the
aggregated value has a cost, and we cannot afford to consult last_change
value from srv_dynamic_maxconn() anymore. Moreover, since the value is
used to take decision for the current process we don't wan't the variable
to be updated by another process in our back.
To prevent performance regression and sharing issues, let's instead add a
separate srv->last_change value, which is not updated atomically (given how
rare the updates are), and only serves for places where the use of the
aggregated last_change counter/stats (split over thread groups) is too
costly.
Now that native mailers configuration is only usable with Lua mailers,
Willy noticed that we lack a way to warn the user if mailers were
previously configured on an older version but Lua mailers were not loaded,
which could trick the user into thinking mailers keep working when
transitionning to 3.2 while it is not.
In this patch we add the 'core.use_native_mailers_config()' Lua function
which should be called in Lua script body before making use of
'Proxy:get_mailers()' function to retrieve legacy mailers configuration
from haproxy main config. This way haproxy effectively knows that the
native mailers config is actually being used from Lua (which indicates
user correctly migrated from native mailers to Lua mailers), else if
mailers are configured but not used from Lua then haproxy warns the user
about the fact that they will be ignored unless they are used from Lua.
(e.g.: using the provided 'examples/lua/mailers.lua' to ease transition)
- Add ->retry_token and ->retry_token_len new quic_conn struct members to store
the retry tokens. These objects are allocated by quic_rx_packet_parse() and
released by quic_conn_release().
- Add <pool_head_quic_retry_token> new pool for these tokens.
- Implement quic_retry_packet_check() to check the integrity tag of these tokens
upon RETRY packets receipt. quic_tls_generate_retry_integrity_tag() is called
by this new function. It has been modified to pass the address where the tag
must be generated
- Add <resend> new parameter to quic_pktns_discard(). This function is called
to discard the packet number spaces where the already TX packets and frames are
attached to. <resend> allows the caller to prevent this function to release
the in flight TX packets/frames. The frames are requeued to be resent.
- Modify quic_rx_pkt_parse() to handle the RETRY packets. What must be done upon
such packets receipt is:
- store the retry token,
- store the new peer SCID as the DCID of the connection. Note that the peer will
modify again its SCID. This is why this SCID is also stored as the ODCID
which must be matched with the peer retry_source_connection_id transport parameter,
- discard the Initial packet number space without flagging it as discarded and
prevent retransmissions calling qc_set_timer(),
- modify the TLS cryptographic cipher contexts (RX/TX),
- wakeup the I/O handler to send new Initial packets asap.
- Modify quic_transport_param_decode() to handle the retry_source_connection_id
transport parameter as a QUIC client. Then its caller is modified to
check this transport parameter matches with the SCID sent by the peer with
the RETRY packet.
A QUIC client must be able to close a connection sending Initial packets. But
QUIC client Initial packets must always be at least 1200 bytes long. To reduce
the memory use of TX buffers of a connection when in "closing" state, a pool
was dedicated for this purpose but with a too much reduced TX buffer size
(QUIC_MAX_CC_BUFSIZE).
This patch adds a "closing state connection" TX buffer pool with the same role
for QUIC backends.
This patch removes completely the support for the program section, the
parsing of the section as well as the internals in the mworker does not
support it anymore.
The program section was considered dysfonctional and not fully
compatible with the "mworker V3" model. Users that want to run an
external program must use their init system.
The documentation is cleaned up in another patch.
This "renegotiate" option can be set on SSL backends to allow secure
renegotiation. It is mostly useful with SSL libraries that disable
secure regotiation by default (such as AWS-LC).
The "no-renegotiate" one can be used the other way around, to disable
secure renegotation that could be allowed by default.
Those two options can be set via "ssl-default-server-options" as well.
As mentioned in 2.8 announce on the mailing list [1] and on the wiki [2]
native mailers were deprecated and planned for removal in 3.3. Now is
the time to drop the legacy code for native mailers which is based on a
tcpcheck "hack" and cannot be maintained. Lua mailers should be used as
a drop in replacement. Indeed, "mailers" and associated config directives
are preserved because mailers config is exposed to Lua, which helps smoothing
the transition from native mailers to Lua based ones.
As a reminder, to keep mailers configuration working as before without
making changes to the config file, simply add the line below to the global
section:
lua-load examples/lua/mailers.lua
mailers.lua script (provided in the git repository, adjust path as needed)
may be customized by users familiar with Lua, by default it emulates the
behavior of the native (now removed) mailers.
[1]: https://www.mail-archive.com/haproxy@formilux.org/msg43600.html
[2]: https://github.com/haproxy/wiki/wiki/Breaking-changes
As reported by Chris Staite in GH #3002, trying to yield from a Lua
action during a client disconnect causes the script to be interrupted
(which is expected) and an alert to be emitted with the error:
"Lua function '%s': yield not allowed".
While this error is well suited for cases where the yield is not expected
at all (ie: when context doesn't allow it) and results from a yield misuse
in the Lua script, it isn't the case when the yield is exceptionnally not
available due to an abort or error in the request/response processing.
Because of that we raise an alert but the user cannot do anything about it
(the script is correct), so it is confusing and polluting the logs.
In this patch we introduce the ACT_OPT_FINAL_EARLY flag which is a
complementary flag to ACT_OPT_FIRST. This flag is set when the
ACT_OPT_FIRST is set earlier than normal (due to error/abort).
hlua_action() then checks for this flag to decide whether an error (alert)
or a simple log message should be emitted when the yield is not available.
It should solve GH #3002. Thanks to Chris Staite (@chrisstaite-menlo) for
having reported the issue and suggested a solution.
For frontend side, quic_conn is only released if MUX wasn't allocated,
either due to handshake abort, in which case upper layer is never
allocated, or after transfer completion when full conn + MUX layers are
already released.
On the backend side, initialization is not performed in the same order.
Indeed, in this case, connection is first instantiated, the nthe
quic_conn is created to execute the handshake, while MUX is still only
allocated on handshake completion. As such, it is not possible anymore
to free immediately quic_conn on handshake failure. Else, this can cause
crash if the connection try to reaccess to its transport layer after
quic_conn release.
Such crash can easily be reproduced in case of connection error to the
QUIC server. Here is an example of an experienced backtrace.
Thread 1 "haproxy" received signal SIGSEGV, Segmentation fault.
0x0000555555739733 in quic_close (conn=0x55555734c0d0, xprt_ctx=0x5555573a6e50) at src/xprt_quic.c:28
28 qc->conn = NULL;
[ ## gdb ## ] bt
#0 0x0000555555739733 in quic_close (conn=0x55555734c0d0, xprt_ctx=0x5555573a6e50) at src/xprt_quic.c:28
#1 0x00005555559c9708 in conn_xprt_close (conn=0x55555734c0d0) at include/haproxy/connection.h:162
#2 0x00005555559c97d2 in conn_full_close (conn=0x55555734c0d0) at include/haproxy/connection.h:206
#3 0x00005555559d01a9 in sc_detach_endp (scp=0x7fffffffd648) at src/stconn.c:451
#4 0x00005555559d05b9 in sc_reset_endp (sc=0x55555734bf00) at src/stconn.c:533
#5 0x000055555598281d in back_handle_st_cer (s=0x55555734adb0) at src/backend.c:2754
#6 0x000055555588158a in process_stream (t=0x55555734be10, context=0x55555734adb0, state=516) at src/stream.c:1907
#7 0x0000555555dc31d9 in run_tasks_from_lists (budgets=0x7fffffffdb30) at src/task.c:655
#8 0x0000555555dc3dd3 in process_runnable_tasks () at src/task.c:889
#9 0x0000555555a1daae in run_poll_loop () at src/haproxy.c:2865
#10 0x0000555555a1e20c in run_thread_poll_loop (data=0x5555569d1c00 <ha_thread_info>) at src/haproxy.c:3081
#11 0x0000555555a1f66b in main (argc=5, argv=0x7fffffffde18) at src/haproxy.c:3671
To fix this, change the condition prior to calling quic_conn release. If
<conn> member is not NULL, delay the release, similarly to the case when
MUX is allocated. This allows connection to be freed first, and detach
from quic_conn layer through close xprt operation.
No need to backport.
Implement support for MAX_STREAMS frame. On frontend, this was mostly
useless as haproxy would never initiate new bidirectional streams.
However, this becomes necessary to control stream flow-control when
using QUIC as a client on the backend side.
Parsing of MAX_STREAMS is implemented via new qcc_recv_max_streams().
This allows to update <ms_uni>/<ms_bidi> QCC fields.
This patch is necessary to achieve QUIC backend connection reuse.
Previously, no check on peer flow-control was implemented prior to open
a local QUIC stream. This was a small problem for frontend
implementation, as in this case haproxy as a server never opens
bidirectional streams.
On frontend, the only stream opened by haproxy in this case is for
HTTP/3 control unidirectional data. If the peer uses an initial value
for max uni streams set to 0, it would violate its flow control, and the
peer will probably close the connection. Note however that RFC 9114
mandates that each peer defines minimal initial value so that at least
the control stream can be created.
This commit improves the situation of too low initial max uni streams
value. Now, on HTTP/3 layer initialization, haproxy preemptively checks
flow control limit on streams via a new function
qcc_fctl_avail_streams(). If credit is already expired due to a too
small initial value, haproxy preemptively closes the connection using
H3_ERR_GENERAL_PROTOCOL_ERROR. This behavior is better as haproxy is now
the initiator of the connection closure.
This should be backported up to 2.8.
Remove avail_streams_bidi/avail_streams_uni mux_ops. These callbacks
were designed to be specific to QUIC. However, they won't be necessary,
as stream layer only cares about bidirectional streams.
Implement proper encoding of HTTP/3 authority pseudo-header during
request transcoding on the backend side. A pseudo-header :authority is
encoded if a value can be extracted from HTX start-line. A special check
is also implemented to ensure that a host header is not encoded if
:authority already is.
A new function qpack_encode_auth() is defined to implement QPACK
encoding of :authority header using literal field line with name ref.
Previously, scheme was always set to https when transcoding an HTX
start-line into a HTTP/3 request. Change this so this conversion is now
fully compliant.
If no scheme is specified by the client, which is what happens most of
the time with HTTP/1, https is set for the HTTP/3 request. Else, reuse
the scheme requested by the client.
If either https or http is set, qpack_encode_scheme will encode it using
entry from QPACK static table. Else, a full literal field line with name
ref is used instead as the scheme value is specified as-is.
On the backend side, HTX start-line is converted into a HTTP/3 request
message. Previously, GET method was hardcoded. Implement proper method
conversion, by extracting it from the HTX start-line.
qpack_encode_method() has also been extended, so that it is able to
encode any method, either using a static table entry, or with a literal
field line with name ref representation.
This commit is the first one of a serie which aim is to implement
transcoding of a HTX request into HTTP/3, which is necessary for QUIC
backend support.
Transcoding is implementing via a new function h3_req_headers_send()
when a HTX start-line is parsed. For now, most of the request fields are
hardcoded, using a GET method. This will be adjusted in the next
following patches.
Mux connection is flagged with new QC_CF_IS_BACK if used on the backend
side. For now the only change is during traces, to be able to
differentiate frontend and backend usage.
Complete document for rcv_buf/snd_buf operations. In particular, return
value is now explicitely defined. For H3 layer, associated functions
documentation is also extended.
Replace ->li quic_conn pointer to struct listener member by ->target which is
an object type enum and adapt the code.
Use __objt_(listener|server)() where the object type is known. Typically
this is were the code which is specific to one connection type (frontend/backend).
Remove <server> parameter passed to qc_new_conn(). It is redundant with the
<target> parameter.
GSO is not supported at this time for QUIC backend. qc_prep_pkts() is modified
to prevent it from building more than an MTU. This has as consequence to prevent
qc_send_ppkts() to use GSO.
ssl_clienthello.c code is run only by listeners. This is why __objt_listener()
is used in place of ->li.
Store the peer connection ID (SCID) as the connection DCID as soon as an Initial
packet is received.
Stop comparing the packet to QUIC_PACKET_TYPE_0RTT is already match as
QUIC_PACKET_TYPE_INITIAL.
A QUIC server must not send too short datagram with ack-eliciting packets inside.
This cannot be done from quic_rx_pkt_parse() because one does not know if
there is ack-eliciting frame into the Initial packets. If the packet must be
dropped, this is after having parsed it!
Modify quic_dgram_parse() to stop passing it a listener as third parameter.
In place the object type address of the connection socket owner is passed
to support the haproxy servers with QUIC as transport protocol.
qc_owner_obj_type() is implemented to return this address.
qc_counters() is also implemented to return the QUIC specific counters of
the proxy of owner of the connection.
quic_rx_pkt_parse() called by quic_dgram_parse() is also modify to use
the object type address used by this latter as last parameter. It is
also modified to send Retry packet only from listeners. A QUIC client
(connection to haproxy QUIC servers) must drop the Initial packets with
non null token length. It is also not supposed to receive O-RTT packets
which are dropped.
This patch only adds <proto_type> new proto_type enum parameter and <sock_type>
socket type parameter to sock_create_server_socket() and adapts its callers.
This is to prepare the use of this function by QUIC servers/backends.
Modify qc_alloc_ssl_sock_ctx() to pass the connection object as parameter. It is
NULL for a QUIC listener, not NULL for a QUIC server. This connection object is
set as value for ->conn quic_conn struct member. Initialise the SSL session object from
this function for QUIC servers.
qc_ssl_set_quic_transport_params() is also modified to pass the SSL object as parameter.
This is the unique parameter this function needs. <qc> parameter is used only for
the trace.
SSL_do_handshake() must be calle as soon as the SSL object is initialized for
the QUIC backend connection. This triggers the TLS CRYPTO data delivery.
tasklet_wakeup() is also called to send asap these CRYPTO data.
Modify the QUIC_EV_CONN_NEW event trace to dump the potential errors returned by
SSL_do_handshake().
Implement ssl_sock_new_ssl_ctx() to allocate a SSL server context as this is currently
done for TCP servers and also for QUIC servers depending on the <is_quic> boolean value
passed as new parameter. For QUIC servers, this function calls ssl_quic_srv_new_ssl_ctx()
which is specific to QUIC.
Add ->quic_params new member to server struct.
Also set the ->xprt member of the server being initialized and initialize asap its
transport parameters from _srv_parse_init().
According to the RFC, a QUIC client must encode the QUIC version it supports
into the "Available Versions" of "Version Information" transport parameter
order by descending preference.
This is done defining <quic_version_2> and <quic_version_draft_29> new variables
pointers to the corresponding version of <quic_versions> array elements.
A client announces its available versions as follows: v1, v2, draft29.
In fd_insert(), use the provided tgid to ghet the thread group info,
instead of using the one of the current thread, as we may call
fd_insert() from a thread of another thread group, that will happen at
least when binding the listeners. Otherwise we'd end up accessing the
thread mask containing enabled thread of the wrong thread group, which
can lead to crashes if we're binding on threads not present in the
thread group.
This should fix Github issue #2991.
This should be backported up to 2.8.
There was already functions to pushed data from the applet to the stream by
inserting them in the right buffer, depending the applet was using or not
the legacy API. Here, functions to retreive data pushed to the applet by the
stream were added:
* applet_getchar : Gets one character
* applet_getblk : Copies a full block of data
* applet_getword : Copies one text block representing a word using a
custom separator as delimiter
* applet_getline : Copies one text line
* applet_getblk_nc : Get one or two blocks of data
* applet_getword_nc: Gets one or two blocks of text representing a word
using a custom separator as delimiter
* applet_getline_nc: Gets one or two blocks of text representing a line
In this patch, some functions were added to ease input and output buffers
manipulation, regardless the corresponding applet is using its own buffers
or it is relying on channels buffers. Following functions were added:
* applet_get_inbuf : Get the buffer containing data pushed to the applet
by the stream
* applet_get_outbuf : Get the buffer containing data pushed by the applet
to the stream
* applet_input_data : Return the amount of data in the input buffer
* applet_skip_input : Skips <len> bytes from the input buffer
* applet_reset_input: Skips all bytes from the input buffer
* applet_output_room: Returns the amout of space available at the output
buffer
* applet_need_room : Indicates that the applet have more data to deliver
and it needs more room in the output buffer to do
so
Most fe and be counters are good candidates for being shared between
processes. They are now grouped inside "shared" struct sub member under
be_counters and fe_counters.
Now they are properly identified, they would greatly benefit from being
shared over thread groups to reduce the cost of atomic operations when
updating them. For this, we take the current tgid into account so each
thread group only updates its own counters. For this to work, it is
mandatory that the "shared" member from {fe,be}_counters is initialized
AFTER global.nbtgroups is known, because each shared counter causes the stat
to be allocated lobal.nbtgroups times. When updating a counter without
concurrency, the first counter from the array may be updated.
To consult the shared counters (which requires aggregation of per-tgid
individual counters), some helper functions were added to counter.h to
ease code maintenance and avoid computing errors.
cps_max (max new connections received per second), sps_max (max new
sessions per second) and http.rps_max (maximum new http requests per
second) all rely on shared counters (namely conn_per_sec, sess_per_sec and
http.req_per_sec). The problem is that shared counters are about to be
distributed over thread groups, and we cannot afford to compute the
total (for all thread groups) each time we update the max counters.
Instead, since such max counters (relying on shared counters) are a very
few exceptions, let's add internal (sess,conn,req) per sec freq counters
that are dedicated to cps_max, sps_max and http.rps_max computing.
Thanks to that, related *_max counters shouldn't be negatively impacted
by the thread-group distribution, yet they will not benefit from it
either. Related internal freq counters are prefixed with "_" to emphasize
the fact that they should not be used for other purpose (the shared ones,
which are about to be distributed over thread groups in upcoming commits
are still available and must be used instead). The internal ones could
eventually be removed at any time if we find another way to compute the
{cps,sps,http.rps)_max counters.
Now that we have a common struct between fe and be shared counters struct
let's perform some cleanup to merge duplicate members into the common
struct part. This will ease code maintenance.
proxies, listeners and server shared counters are now managed via helpers
added in one of the previous commits.
When guid is not set (ie: when not yet assigned), shared counters pointer
is allocated using calloc() (local memory) and a flag is set on the shared
counters struct to know how to manipulate (and free it). Else if guid is
set, then it means that the counters may be shared so while for now we
don't actually use a shared memory location the API is ready for that.
The way it works, for proxies and servers (for which guid is not known
during creation), we first call counters_{fe,be}_shared_get with guid not
set, which results in local pointer being retrieved (as if we just
manually called calloc() to retrieve a pointer). Later (during postparsing)
if guid is set we try to upgrade the pointer from local to shared.
Lastly, since the memory location for some objects (proxies and servers
counters) may change from creation to postparsing, let's update
counters->last_change member directly under counters_{fe,be}_shared_get()
so we don't miss it.
No change of behavior is expected, this is only preparation work.
fe_counters_shared and be_counters_shared may share some common members
since they are quite similar, so we add a common struct part shared
between the two. struct counters_shared is added for convenience as
a generic pointer to manipulate common members from fe or be shared
counters pointer.
Also, the first common member is added: shared fe and be counters now
have a flags member.
create include/haproxy/counters.h and src/counters.c files to anticipate
for further helpers as some counters specific tasks needs to be carried
out and since counters are shared between multiple object types (ie:
listener, proxy, server..) we need generic helpers.
Add some shared counters helper which are not yet used but will be updated
in upcoming commits.
Shareable counters are not tagged as shared counters and are dynamically
allocated in separate memory area as a prerequisite for being stored
in shared memory area. For now, GUID and threads groups are not taken into
account, this is only a first step.
also we ensure all counters are now manipulated using atomic operations,
namely, "last_change" counter is now read from and written to using atomic
ops.
Despite the numerous changes caused by the counters being moved away from
counters struct, no change of behavior should be expected.
These functions were copied from the channel API and modified to work with
applets using the new API or the legacy one. However, the comments were
updated accordingly. It is the purpose of this patch.
rename _srv_postparse() internal function to srv_init() function and group
srv_init_per_thr() plus idle conns list init inside it. This way we can
perform some simplifications as srv_init() performs multiple server
init steps after parsing.
SRV_F_CHECKED flag was added, it is automatically set when srv_init()
runs successfully. If the flag is already set and srv_init() is called
again, nothing is done. This permis to manually call srv_init() earlier
than the default POST_CHECK hook when needed without risking to do things
twice.
while new_server() takes the parent proxy as argument and even assigns
srv->proxy to the parent proxy, it didn't actually inserted the server
to the parent proxy server list on success.
The result is that sometimes we add the server to the list after
new_server() is called, and sometimes we don't.
This is really error-prone and because of that hooks such as
REGISTER_POST_SERVER_CHECK() which as run for all servers listed in
all proxies may not be relied upon for servers which are not actually
inserted in their parent proxy server list. Plus it feels very strange
to have a server that points to a proxy, but then the proxy doesn't know
about it because it cannot find it in its server list.
To prevent errors and make proxy->srv list reliable, we move the insertion
logic directly under new_server(). This requires to know if we are called
during parsing or during runtime to either insert or append the server to
the parent proxy list. For that we use PR_FL_CHECKED flag from the parent
proxy (if the flag is set, then the proxy was checked so we are past the
init phase, thus we assume we are called during runtime)
This implies that during startup if new_server() has to be cancelled on
error paths we need to call srv_detach() (which is now exposed in server.h)
before srv_drop().
The consequence of this commit is that REGISTER_POST_SERVER_CHECK() should
not run reliably on all servers created using new_server() (without having
to manually loop on global servers_list)
We have global proxies_list pointer which is announced as the list of
"all existing proxies", but in fact it only represents regular proxies
declared on the config file through "listen, frontend or backend" keywords
It is ambiguous, and we currently don't have a straightforwrd method to
iterate over all proxies (either public or internal ones) within haproxy
Instead we still have to manually iterate over multiple lists (main
proxies, log-forward proxies, peer proxies..) which is error-prone.
In this patch we add a struct list member (8 bytes) inside struct proxy
in order to store every proxy (except default ones) within a global
"proxies" list which is actually representative for all proxies existing
under haproxy process, like we already have for servers.
It is now possile to set a label on a bind line. All sockets attached to
this bind line inherits from this label. The idea is to be able to groud of
sockets. For now, there is no mechanism to create these groups, this must be
done by hand.
Several users already reported that it would be nice to support
strict-sni in ssl-default-bind-options. However, in order to support
it, we also need an option to disable it.
This patch moves the setting of the option from the strict_sni field
to a flag in the ssl_options field so that it can be inherited from
the default bind options, and adds a new "no-strict-sni" directive to
allow to disable it on a specific "bind" line.
The test file "del_ssl_crt-list.vtc" which already tests both options
was updated to make use of the default option and the no- variant to
confirm everything continues to work.
It was mentioned during the development of glitches that it would be
nice to support not killing misbehaving connections below a certain
CPU usage so that poor implementations that routinely misbehave without
impact are not killed. This is now possible by setting a CPU usage
threshold under which we don't kill them via this parameter. It defaults
to zero so that we continue to kill them by default.
Adjust include list in quic_conn-t.h. This file is included in many QUIC
source, so it is useful to keep as lightweight as possible. Note that
connection/QUIC MUX are transformed into forward declaration for better
layer separation.
Insert some missing includes statement in QUIC source files. This was
detected after the next commit which adjust the include list used in
quic_conn-t.h file.
quic-conn layer has to handle itself STREAM frames after MUX release. If
the stream was already seen, it is probably only a retransmitted frame
which can be safely ignored. For other streams, an active closure may be
needed.
Thus it's necessary that quic-conn layer knows the highest stream ID
already handled by the MUX after its release. Previously, this was done
via <nb_streams> member array in quic-conn structure.
Refactor this by replacing <nb_streams> by two members called
<stream_max_uni>/<stream_max_bidi>. Indeed, it is unnecessary for
quic-conn layer to monitor locally opened uni streams, as the peer
cannot by definition emit a STREAM frame on it. Also, bidirectional
streams are always opened by the remote side.
Previously, <nb_streams> were set by quic-stream layer. Now,
<stream_max_uni>/<stream_max_bidi> members are only set one time, just
prior to QUIC MUX release. This is sufficient as quic-conn do not use
them if the MUX is available.
Note that previously, IDs were used relatively to their type, thus
incremented by 1, after shifting the original value. For simplification,
use the plain stream ID, which is incremented by 4.
Move general function to check if a stream is uni or bidirectional from
QUIC MUX to quic_utils module. This should prevent unnecessary include
of QUIC MUX header file in other sources.
The quic_conn struct is modified for two reasons. The first one is to store
the encoded version of the local tranport parameter as this is done for
USE_QUIC_OPENSSL_COMPAT. Indeed, the local transport parameter "should remain
valid until after the parameters have been sent" as mentionned by
SSL_set_quic_tls_cbs(3) manual. In our case, the buffer is a static buffer
attached to the quic_conn object. qc_ssl_set_quic_transport_params() function
whose role is to call SSL_set_tls_quic_transport_params() (aliased by
SSL_set_quic_transport_params() to set these local tranport parameter into
the TLS stack from the buffer attached to the quic_conn struct.
The second quic_conn struct modification is the addition of the new ->prot_level
(SSL protection level) member added to the quic_conn struct to store "the most
recent write encryption level set via the OSSL_FUNC_SSL_QUIC_TLS_yield_secret_fn
callback (if it has been called)" as mentionned by SSL_set_quic_tls_cbs(3) manual.
This patches finally implements the five remaining callacks to make the haproxy
QUIC implementation work.
OSSL_FUNC_SSL_QUIC_TLS_crypto_send_fn() (ha_quic_ossl_crypto_send) is easy to
implement. It calls ha_quic_add_handshake_data() after having converted
qc->prot_level TLS protection level value to the correct ssl_encryption_level_t
(boringSSL API/quictls) value.
OSSL_FUNC_SSL_QUIC_TLS_crypto_recv_rcd_fn() (ha_quic_ossl_crypto_recv_rcd())
provide the non-contiguous addresses to the TLS stack, without releasing
them.
OSSL_FUNC_SSL_QUIC_TLS_crypto_release_rcd_fn() (ha_quic_ossl_crypto_release_rcd())
release these non-contiguous buffer relying on the fact that the list of
encryption level (qc->qel_list) is correctly ordered by SSL protection level
secret establishements order (by the TLS stack).
OSSL_FUNC_SSL_QUIC_TLS_yield_secret_fn() (ha_quic_ossl_got_transport_params())
is a simple wrapping function over ha_quic_set_encryption_secrets() which is used
by boringSSL/quictls API.
OSSL_FUNC_SSL_QUIC_TLS_got_transport_params_fn() (ha_quic_ossl_got_transport_params())
role is to store the peer received transport parameters. It simply calls
quic_transport_params_store() and set them into the TLS stack calling
qc_ssl_set_quic_transport_params().
Also add some comments for all the OpenSSL 3.5 QUIC API callbacks.
This patch have no impact on the other use of QUIC API provided by the others TLS
stacks.
This patch allows the use of the new OpenSSL 3.5.0 QUIC TLS API when it is
available and detected at compilation time. The detection relies on the presence of the
OSSL_FUNC_SSL_QUIC_TLS_CRYPTO_SEND macro from openssl-compat.h. Indeed this
macro is defined by OpenSSL since 3.5.0 version. It is not defined by quictls.
This helps in distinguishing these two TLS stacks. When the detection succeeds,
HAVE_OPENSSL_QUIC is also defined by openssl-compat.h. Then, this is this new macro
which is used to detect the availability of the new OpenSSL 3.5.0 QUIC TLS API.
Note that this detection is done only if USE_QUIC_OPENSSL_COMPAT is not asked.
So, USE_QUIC_OPENSSL_COMPAT and HAVE_OPENSSL_QUIC are exclusive.
At the same location, from openssl-compat.h, ssl_encryption_level_t enum is
defined. This enum was defined by quictls and expansively used by the haproxy
QUIC implementation. SSL_set_quic_transport_params() is replaced by
SSL_set_quic_tls_transport_params. SSL_set_quic_early_data_enabled() (quictls) is also replaced
by SSL_set_quic_tls_early_data_enabled() (OpenSSL). SSL_quic_read_level() (quictls)
is not defined by OpenSSL. It is only used by the traces to log the current
TLS stack decryption level (read). A macro makes it return -1 which is an
usused values.
The most of the differences between quictls and OpenSSL QUI APIs are in quic_ssl.c
where some callbacks must be defined for these two APIs. This is why this
patch modifies quic_ssl.c to define an array of OSSL_DISPATCH structs: <ha_quic_dispatch>.
Each element of this arry defines a callback. So, this patch implements these
six callabcks:
- ha_quic_ossl_crypto_send()
- ha_quic_ossl_crypto_recv_rcd()
- ha_quic_ossl_crypto_release_rcd()
- ha_quic_ossl_yield_secret()
- ha_quic_ossl_got_transport_params() and
- ha_quic_ossl_alert().
But at this time, these implementations which must return an int return 0 interpreted
as a failure by the OpenSSL QUIC API, except for ha_quic_ossl_alert() which
is implemented the same was as for quictls. The five remaining functions above
will be implemented by the next patches to come.
ha_quic_set_encryption_secrets() and ha_quic_add_handshake_data() have been moved
to be defined for both quictls and OpenSSL QUIC API.
These callbacks are attached to the SSL objects (sessions) calling qc_ssl_set_cbs()
new function. This latter callback the correct function to attached the correct
callbacks to the SSL objects (defined by <ha_quic_method> for quictls, and
<ha_quic_dispatch> for OpenSSL).
The calls to SSL_provide_quic_data() and SSL_process_quic_post_handshake()
have been also disabled. These functions are not defined by OpenSSL QUIC API.
At this time, the functions which call them are still defined when HAVE_OPENSSL_QUIC
is defined.
The current hash involves 3 simple shifts and additions so that it can
be mapped to a multiply on architecures having a fast multiply. This is
indeed what the compiler does on x86_64. A large range of values was
scanned to try to find more optimal factors on machines supporting such
a fast multiply, and it turned out that new factor 0x1af42f resulted in
smoother hashes that provided on average 0.4% better compression on both
the Silesia corpus and an mbox file composed of very compressible emails
and uncompressible attachments. It's even slightly better than CRC32C
while being faster on Skylake. This patch enables this factor on archs
with a fast multiply.
This is slz upstream commit 82ad1e75c13245a835c1c09764c89f2f6e8e2a40.
Building on MIPS64 with clang16 incorrectly reports some uninitialized
value warnings in stats-proxy.c due to some calls to ABORT_NOW() where
the compiler didn't know the code wouldn't return. Let's properly mark
the function as noreturn, and take this opportunity for also marking it
unused to avoid possible warnings depending on the build options (if
ABORT_NOW is not used). No backport needed though it will not harm.
It is especially a problem with Lua filters, but it is important to disable
the 0-copy forwarding if a filter alters the payload, or at least to be able
to disable it. While the filter is registered on the data filtering, it is
not an issue (and it is the common case) because, there is now way to
fast-forward data at all. But it may be an issue if a filter decides to
alter the payload and to unregister from data filtering. In that case, the
0-copy forwarding can be re-enabled in a hardly precdictable state.
To fix the issue, a SC flags was added to do so. The HTTP compression filter
set it and lua filters too if the body length is changed (via
HTTPMessage.set_body_len()).
Note that it is an issue because of a bad design about the HTX. Many info
about the message are stored in the HTX structure itself. It must be
refactored to move several info to the stream-endpoint descriptor. This
should ease modifications at the stream level, from filter or a TCP/HTTP
rules.
This should be backported as far as 3.0. If necessary, it may be backported
on lower versions, as far as 2.6. In that case, it must be reviewed and
adapted.
SPOP_CS_FRAME_H and SPOP_CS_FRAME_P states, that were used to handle frame
parsing, were removed. The demux process now relies on the demux stream ID
to know if it is waiting for the frame header or the frame
payload. Concretly, when the demux stream ID is not set (dsi == -1), the
demuxer is waiting for the next frame header. Otherwise (dsi >= 0), it is
waiting for the frame payload. It is especially important to be able to
properly handle DISCONNECT frames sent by the agents.
SPOP_CS_RUNNING state is introduced to know the hello handshake was finished
and the SPOP connection is able to open SPOP streams and exchange NOTIFY/ACK
frames with the agents.
It depends on the following fixes:
* MINOR: mux-spop: Don't set SPOP connection state to FRAME_H after ACK parsing
* BUG/MINOR: mux-spop: Make the demux stream ID a signed integer
This change will be mandatory for the next fix. It must be backported to 3.1
with the commits above.
A test run on a dual-socket EPYC 9845 (2x160 cores) showed that we'll
be facing new limits during the lifetime of 3.2 with our current 16
groups and 256 threads max:
$ cat test.cfg
global
cpu-policy perforamnce
$ ./haproxy -dc -c -f test.cfg
...
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-32 1-32 32: 0-15,320-335
2/1-32 33-64 32: 16-31,336-351
3/1-32 65-96 32: 32-47,352-367
4/1-32 97-128 32: 48-63,368-383
5/1-32 129-160 32: 64-79,384-399
6/1-32 161-192 32: 80-95,400-415
7/1-32 193-224 32: 96-111,416-431
8/1-32 225-256 32: 112-127,432-447
Raising the default limit to 1024 threads and 32 groups is sufficient
to buy us enough margin for a long time (hopefully, please don't laugh,
you, reader from the future):
$ ./haproxy -dc -c -f test.cfg
...
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-32 1-32 32: 0-15,320-335
2/1-32 33-64 32: 16-31,336-351
3/1-32 65-96 32: 32-47,352-367
4/1-32 97-128 32: 48-63,368-383
5/1-32 129-160 32: 64-79,384-399
6/1-32 161-192 32: 80-95,400-415
7/1-32 193-224 32: 96-111,416-431
8/1-32 225-256 32: 112-127,432-447
9/1-32 257-288 32: 128-143,448-463
10/1-32 289-320 32: 144-159,464-479
11/1-32 321-352 32: 160-175,480-495
12/1-32 353-384 32: 176-191,496-511
13/1-32 385-416 32: 192-207,512-527
14/1-32 417-448 32: 208-223,528-543
15/1-32 449-480 32: 224-239,544-559
16/1-32 481-512 32: 240-255,560-575
17/1-32 513-544 32: 256-271,576-591
18/1-32 545-576 32: 272-287,592-607
19/1-32 577-608 32: 288-303,608-623
20/1-32 609-640 32: 304-319,624-639
We can change this default now because it has no functional effect
without any configured cpu-policy, so this will only be an opt-in
and it's better to do it now than to have an effect during the
maintenance phase. A tiny effect is a doubling of the number of
pool buckets and stick-table shards internally, which means that
aside slightly reducing contention in these areas, a dump of tables
can enumerate keys in a different order (hence the adjustment in the
vtc).
The only really visible effect is a slightly higher static memory
consumption (29->35 MB on a small config), but that difference
remains even with 50k servers so that's pretty much acceptable.
Thanks to Erwan Velu for the quick tests and the insights!
Add counters to measure Rx buffers usage per QCS. This reused the newly
defined bdata_ctr type already used for Tx accounting.
Note that for now, <tot> value of bdata_ctr is not used. This is because
it is not easy to account for data accross contiguous buffers.
These values are displayed both on log/traces and "show quic" output.
Add accounting at qc_stream_desc level to be able to report the number
of allocated Tx buffers and the sum of their data. This represents data
ready for emission or already emitted and waiting on ACK.
To simplify this accounting, a new counter type bdata_ctr is defined in
quic_utils.h. This regroups both buffers and data counter, plus a
maximum on the buffer value.
These values are now displayed on QCS info used both on logline and
traces, and also on "show quic" output.
As discussed here:
https://github.com/httpwg/http2-spec/pull/936https://github.com/haproxy/haproxy/issues/2941
It's important to take care of some special characters in the :authority
pseudo header before reassembling a complete URI, because after assembly
it's too late (e.g. the '/').
This patch adds a specific function which was checks all such characters
and their ranges on an ist, and benefits from modern compilers
optimizations that arrange the comparisons into an evaluation tree for
faster match. That's the version that gave the most consistent performance
across various compilers, though some hand-crafted versions using bitmaps
stored in register could be slightly faster but super sensitive to code
ordering, suggesting that the results might vary with future compilers.
This one takes on average 1.2ns per character at 3 GHz (3.6 cycles per
char on avg). The resulting impact on H2 request processing time (small
requests) was measured around 0.3%, from 6.60 to 6.618us per request,
which is a bit high but remains acceptable given that the test only
focused on req rate.
The code was made usable both for H2 and H3.
IPv6 connectivity might start off (e.g. network not fully up when
haproxy starts), so for features like resolvers, it would be nice to
periodically recheck.
With this change, instead of having the resolvers code rely on a variable
indicating connectivity, it will now call a function that will check for
how long a connectivity check hasn't been run, and will perform a new one
if needed. The age was set to 30s which seems reasonable considering that
the DNS will cache results anyway. There's no saving in spacing it more
since the syscall is very check (just a connect() without any packet being
emitted).
The variables remain exported so that we could present them in show info
or anywhere else.
This way, "dns-accept-family auto" will now stay up to date. Warning
though, it does perform some caching so even with a refreshed IPv6
connectivity, an older record may be returned anyway.
This way we can preserve the entire contents of the released area for
later inspection. This automatically enables comparison at reallocation
time as well (like "integrity" does). If used in combination with
integrity, the comparison is disabled but the check of non-corruption
of the area mangled by integrity is still operated.
The automatic scheduler is useful but sometimes you don't want to use,
or schedule manually.
This patch adds an 'acme.scheduler' option in the global section, which
can be set to either 'auto' or 'off'. (auto is the default value)
This also change the ouput of the 'acme status' command so it does not
shows scheduled values. The state will be 'Stopped' instead of
'Scheduled'.
The new MEM_F_UAF flag can be set just after a pool's creation to make
this pool UAF for debugging purposes. This allows to maintain a better
overall performance required to reproduce issues while still having a
chance to catch UAF. It will only be used by developers who will manually
add it to areas worth being inspected, though.
Extend API used for QUIC transport parameter decoding. This is done via
the introduction of a dedicated enum to report the various error
condition detected. No functional change should occur with this patch,
as the only returned code is QUIC_TP_DEC_ERR_TRUNC, which results in the
connection closure via a TLS alert.
This patch will be necessary to properly reject transport parameters
with the proper CONNECTION_CLOSE error code. As such, it should be
backported up to 2.6 with the following series.
In order to make the lock history a bit more useful, let's try to merge
adjacent lock/unlock sequences that don't change anything for other
threads. For this we can replace the last unlock with the new operation
on the same label, and even just not store it if it was the same as the
one before the unlock, since in the end it's the same as if the unlock
had not been done.
Now loops that used to be filled with "R:LISTENER U:LISTENER" show more
useful info such as:
S:IDLE_CONNS U:IDLE_CONNS S:PEER U:PEER S:IDLE_CONNS U:IDLE_CONNS R:LISTENER U:LISTENER
U:STK_TABLE W:STK_SESS U:STK_SESS R:STK_TABLE U:STK_TABLE W:STK_SESS U:STK_SESS R:STK_TABLE
R:STK_TABLE U:STK_TABLE W:STK_SESS U:STK_SESS W:STK_TABLE_UPDT U:STK_TABLE_UPDT S:PEER
It's worth noting that it can sometimes induce confusion when recursive
locks of the same label are used (a few exist on peers or stick-tables),
as in such a case the two operations would be needed. However these ones
are already undebuggable, so instead they will just have to be renamed
to make sure they use a distinct label.
Most threads are filled with "R:OTHER U:OTHER" in their history. Since
anything non-important can use other it's not observable but it pollutes
the history. Let's just drop OTHER entirely during the recording.
There is a lot of contention trying to add updates to the tree. So
instead of trying to add the updates to the tree right away, just add
them to a mt-list (with one mt-list per thread group, so that the
mt-list does not become the new point of contention that much), and
create a tasklet dedicated to adding updates to the tree, in batchs, to
avoid keeping the update lock for too long.
This helps getting stick tables perform better under heavy load.
quic_conn_release() may, or may not, free the tasklet associated with
the connection. So make it return 1 if it was, and 0 otherwise, so that
if it was called from the tasklet handler itself, the said handler can
act accordingly and return NULL if the tasklet was destroyed.
This should be backported if 9240cd4a27
is backported.
Add an extra parametre to conn_create_mux(), "closed_connection".
If a pointer is provided, then let it know if the connection was closed.
Callers have no way to determine that otherwise, and we need to know
that, at least in ssl_sock_io_cb(), as if the connection was closed we
need to return NULL, as the tasklet was free'd, otherwise that can lead
to memory corruption and crashes.
This should be backported if 9240cd4a27
is backported too.
TASK_QUEUED was used to mean "the task has been scheduled to run",
TASK_IN_LIST was used to mean "the tasklet has been scheduled to run",
remove TASK_IN_LIST and just use TASK_QUEUED for tasklets instead.
This commit is just cosmetic, and should not have any impact.
Implement a way to display the running acme tasks over the CLI.
It currently only displays a "Running" status with the certificate name
and the acme section from the configuration.
The displayed running tasks are limited to the size of a buffer for now,
it will require a backref list later to be called multiple times to
resume the list.
When called, this function will try to enforce a yield (if available) as
soon as possible. Indeed, automatic yield is already enforced every X
Lua instructions. However, there may be some cases where we know after
running heavy operation that we should yield already to avoid taking too
much CPU at once.
This is what this function offers, instead of asking the user to manually
yield using "core.yield()" from Lua itself after using an expensive
Lua method offered by haproxy, we can directly enforce the yield without
the need to do it in the Lua script.
The previous commit has implemented a new calcul method for
MAX_STREAM_DATA frame emission. Now, a frame may be emitted as soon as a
buffer was consumed by a QCS instance.
This will probably increase the number of MAX_STREAM_DATA frame
emission. It may even cause a series of frame emitted for the same
stream with increasing values under high load, which is completely
unnecessary.
To improve this, limit the number of MAX_STREAM_DATA frames built to one
per QCS instance. This is implemented by storing a reference to this
frame in QCS structure via a new member <tx.msd_frm>.
Note that to properly reset QCS msd_frm member, emission of flow-control
frames have been changed. Now, each frame is emitted individually. On
one side, it is better as it prevent to emit frames related to different
streams in a single datagram, which is not desirable in case of packet
loss. However, this can also increase sendto() syscall invocation.
Recently, QCS Rx allocation buffer method has been improved. It is now
possible to allocate multiple buffers per QCS instances, which was
necessary to improve HTTP/3 POST throughput.
However, a limitation remained related to the emission of
MAX_STREAM_DATA. These frames are only emitted once at least half of the
receive capacity has been consumed by its QCS instance. This may be too
restrictive when a client need to upload a large payload.
Improve this by adjusting MAX_STREAM_DATA allocation. If QCS capacity is
still limited to 1 or 2 buffers max, the old calcul is still used. This
is necessary when user has limited upload throughput via their
configuration. If QCS capacity is more than 2 buffers, a new frame is
emitted if at least a buffer was consumed.
This patch has reduced number of STREAM_DATA_BLOCKED frames received in
POST tests with some specific clients.
We had to parse the sigAlg extension by hand in order to properly select
the certificate used by the SSL frontends. These traces allow to dump
the allowed sigAlg list sent by the client in its clientHello.
This callback allows to pick the used certificate on an SSL frontend.
The certificate selection is made according to the information sent by
the client in the clientHello. The traces that were added will allow to
better understand what certificate was chosen and why. It will also warn
us if the chosen certificate was the default one.
The actual certificate parsing happens in ssl_sock_chose_sni_ctx. It's
in this function that we actually get the filename of the certificate
used.
If OCSP stapling fails because of a missing or invalid OCSP response we
used to silently disable stapling for the given session. We can now know
a bit more what happened regarding OCSP stapling.
Those traces dump information about the multiple SSL_do_handshake calls
(renegotiation and regular call). Some errors coud also be dumped in
case of rejected early data.
Depending on the chosen verbosity, some information about the current
handshake can be dumped as well (servername, tls version, chosen cipher
for instance).
In case of failed handshake, the error codes and messages will also be
dumped in the log to ease debugging.
Add a dedicated trace for some unlikely allocation failures and async
errors. Those traces will ostly be used to identify the start and end of
a given SSL connection.
This function can be used to convert a TLSv1.3 sigAlg entry (2bytes)
from the signature_agorithms client hello extension into a string.
In order to ease debugging, some TLSv1.2 combinations can also be
dumped. In TLSv1.2 those signature algorithms pairs were built out of a
one byte signature identifier combined to a one byte hash identifier.
In TLSv1.3 those identifiers are two bytes blocs that must be treated as
such.
In relation to issue #2954, it appears that turning some size_t length
calculations to the int that uses my_strndup() upsets coverity a bit.
Instead of dealing with such warnings each time, better address it at
the root. An inspection of all call places show that the size passed
there is always positive so we can safely use an unsigned type, and
size_t will always suit it like for strndup() where it's available.
when dns session callback (dns_session_release()) is called upon error
(ie: when some pending queries were not sent), we try our best to
re-create the applet in order to preserve the pending queries and give
them a chance to be retried. This is done at the end of
dns_session_release().
However, doing so exposes to an issue: if the error preventing queries
from being sent is still encountered over and over the dns session could
stay there indefinitely. Meanwhile, other dns sessions may be created on
the same dns_stream_server periodically. If previous failing dns sessions
don't terminate but we also keep creating new ones, we end up accumulating
failing sessions on a given dns_stream_server, which can eventually cause
ressource shortage.
This issue was found when trying to address ("BUG/MINOR: dns: add tempo
between 2 connection attempts for dns servers")
To fix it, we track the number of failed consecutive sessions for a given
dns server. When we reach the threshold (set to 100), we consider that the
link to the dns server is broken (at least temporarily) and we force
dns_session_new() to fail, so that we stop creating new sessions until one
of the existing one eventually succeeds.
A workaround for this fix consists in setting the "maxconn" parameter on
nameserver directive (under resolvers section) to a reasonnable value so
that no more than "maxconn" sessions may co-exist on the same server at
a given time.
This may be backported to all stable versions.
("CLEANUP: dns: remove unused dns_stream_server struct member") may be
backported to ease the backport.
The stateless mode which was documented previously in the ACME example
is not convenient for all use cases.
First, when HAProxy generates the account key itself, you wouldn't be
able to put the thumbprint in the configuration, so you will have to get
the thumbprint and then reload.
Second, in the case you are using multiple account key, there are
multiple thumbprint, and it's not easy to know which one you want to use
when responding to the challenger.
This patch allows to configure a map in the acme section, which will be
filled by the acme task with the token corresponding to the challenge,
as the key, and the thumbprint as the value. This way it's easy to reply
the right thumbprint.
Example:
http-request return status 200 content-type text/plain lf-string "%[path,field(-1,/)].%[path,field(-1,/),map(virt@acme)]\n" if { path_beg '/.well-known/acme-challenge/' }
Define a new settings tune.quic.frontend.max-tot-window. It contains a
size argument which can be used to set a limit on the sum of all QUIC
connections congestion window. This is applied both on
quic_cc_path_set() and quic_cc_path_inc().
Note that this limitation cannot reduce a congestion window more than
the minimal limit which is set to 2 datagrams.
Use the newly defined cshared type to account for the sum of congestion
window of every QUIC connection. This value is stored in global counter
quic_mem_global defined in proto_quic module.
Define a new type "struct cshared". This can be used as a tool to
manipulate a global counter with thread-safety ensured. Each thread
would declare its thread-local cshared type, which would point to a
global counter.
Each thread can then add/substract value to their owned thread-local
cshared instance via cshared_add(). If the difference exceed a
configured limit, either positively or negatively, the global counter is
updated and thread-local instance is reset to 0. Each thread can safely
read the global counter value using cshared_read().
Congestion window is limit by a minimal and maximum values which can
never be exceeded. Min value is hardcoded to 2 datagrams as recommended
by the specification. Max value is specified via haproxy configuration.
These values must be respected each time the congestion window size is
adjusted. However, in some rare occasions, limit were not always
enforced. Fix this by implementing wrappers to set or increment the
congestion window. These functions ensure limits are always applied
after the operation.
Additionnally, wrappers also ensure that if window reached a new maximum
value, it is saved in <cwnd_last_max> field.
This should be backported up to 2.6, after a brief period of
observation.
There was some possible confusion between fields related to congestion
window size min and max limit which cannot be exceeded, and the maximum
value previously reached by the window.
Fix this by adopting a new naming scheme. Enforced limit are now renamed
<limit_max>/<limit_min>, while the previously reached max value is
renamed <cwnd_last_max>.
This should be backported up to 3.1.
TCP_NOTSENT_LOWAT is very convenient as it indicates when to report
EAGAIN on the sending side. It takes a margin on top of the estimated
window, meaning that it's no longer needed to store too many data in
socket buffers. Instead there's just enough to fill the send window
and a little bit of margin to cover the scheduling time to restart
sending. Experiments on a 100ms network have shown a 10-fold reduction
in the memory used by socket buffers by just setting this value to
tune.bufsize, without noticing any performance degradation. Theoretically
the responsiveness on multiplexed protocols such as H2 should also be
improved.
The CLI's "prompt" command toggles two distinct things:
- displaying or hiding the prompt at the beginning of the line
- single-command vs interactive mode
These are two independent concepts and the prompt mode doesn't
always cope well with tools that would like to upload data without
having to read the prompt on return. Also, the master command line
works in interactive mode by default with no prompt, which is not
consistent (and not convenient for tools). So let's start by splitting
the bit in two, and have a new APPCTX_CLI_ST1_INTER flag dedicated
to the interactive mode. For now the "prompt" command alone continues
to toggle the two at once.
We add __equals_0(NAME) which is only true if NAME is defined as zero,
and __def_as_empty(NAME) which is only true if NAME is defined as an
empty string.
Setting DEBUG_THREAD to 1 allows recording the lock history for each
thread. Tests have shown that (as predicted) the cost of updating a
single thread-local variable is not perceptible in the noise, especially
when compared to the cost of obtaining a lock. Since this can provide
useful value when debugging deadlocks, let's enable it by default when
threads are enabled.
This will display the lock labels and modes for each non-empty step
at the end of "show threads" when these are defined. This allows to
emit up to the last 8 locking operation for each thread on 64 bit
machines.
by only storing a word in each thread context, we can keep the history
of all taken/dropped locks by label. This is expected to be very cheap
and to permit to store up to 8 consecutive lock operations in 64 bits.
That should significantly help detect recursive locks as well as figure
what thread was likely to hinder another one waiting for a lock.
For now we only store the final state of the lock, we don't store the
attempt to get it. It's just a matter of space since we already need
4 ops (rd,sk,wr,un) which take 2 bits, leaving max 64 labels. We're
already around 45. We could also multiply by 5 and still keep 8 bits
total per lock, that would limit us to 51 locks max. It seems that
most of the time if we get a watchdog panic, anyway the victim thread
will be perfectly located so that we don't need a specific value for
this. Another benefit is that we perform a single memory write per
lock.
We now default the value to zero and make sure all tests properly take
care of values above zero. This is in preparation for supporting several
degrees of debugging.
Parse the Retry-After header in response and store it in order to use
the value as the next delay for the next retry, fallback to 3s if the
value couldn't be parse or does not exist.
Instead of always having to force IPv4 or IPv6, let's now also offer
"auto" which will only enable IPv6 if the system has a default gateway
for it. This means that properly configured dual-stack systems will
default to "ipv4,ipv6" while those lacking a gateway will only use
"ipv4". Note that no real connectivity test is performed, so firewalled
systems may still get it wrong and might prefer to rely on a manual
"ipv4" assignment.
In order to ease dual-stack deployments, we could at least try to
check if ipv6 seems to be reachable. For this we're adding a test
based on a UDP connect (no traffic) on port 53 to the base of
public addresses (2001::) and see if the connect() is permitted,
indicating that the routing table knows how to reach it, or fails.
Based on this result we're setting a global variable that other
subsystems might use to preset their defaults.
In order to ease troubleshooting and testing, the new "-4" command line
argument enforces queries and processing of "A" DNS records only, i.e.
those representing IPv4 addresses. This can be useful when a host lack
end-to-end dual-stack connectivity. This overrides the global
"dns-accept-family" directive and is equivalent to value "ipv4".
By default, DNS resolvers accept both IPv4 and IPv6 addresses. This can be
influenced by the "resolve-prefer" keywords on server lines as well as the
family argument to the "do-resolve" action, but that is only a preference,
which does not block the other family from being used when it's alone. In
some environments where dual-stack is not usable, stumbling on an unreachable
IPv6-only DNS record can cause significant trouble as it will replace a
previous IPv4 one which would possibly have continued to work till next
request. The "dns-accept-family" global option permits to enforce usage of
only one (or both) address families. The argument is a comma-delimited list
of the following words:
- "ipv4": query and accept IPv4 addresses ("A" records)
- "ipv6": query and accept IPv6 addresses ("AAAA" records)
When a single family is used, no request will be sent to resolvers for the
other family, and any response for the othe family will be ignored. The
default value is "ipv4,ipv6", which effectively enables both families.
There are several fields in the appctx structure only used by the CLI. To
make things cleaner, all these fields are now placed in a dedicated context
inside the appctx structure. The final goal is to move it in the service
context and add an API for cli commands to get a command coontext inside the
cli context.
CLI_ST_GETREQ state was renamed into CLI_ST_PARSE_CMDLINE and CLI_ST_PARSEREQ
into CLI_ST_PROCESS_CMDLINE to reflect the real action performed in these
states.
Before this patch, when pipelined commands were received, each command was
parsed and then excuted before moving to the next command. Pending commands
were not copied in the input buffer of the applet. The major issue with this
way to handle commands is the impossibility to consume inputs from commands
with an I/O handler, like "show events" for instance. It was working thanks
to a "bug" if such commands were the last one on the command line. But it
was impossible to use them followed by another command. And this prevents us
to implement any streaming support for CLI commands.
So we decided to refactor the command line parsing to have something similar
to a basic shell. Now an entire line is parsed, including the payload,
before starting commands execution. The command line is copied in a
dedicated buffer. "appctx->chunk" buffer is used for this purpose. It was an
unsed field, so it is safe to use it here. Once the command line copied, the
commands found on this line are executed. Because the applet input buffer
was flushed, any input can be safely consumed by the CLI applet and is
available for the command I/O handler. Thanks to this change, "show event
-w" command can be followed by a command. And in theory, it should be
possible to implement commands supporting input data streaming. For
instance, the Tetris like lua applet can be used on the CLI now.
Note that the payload, if any, is part of the command line and must be fully
received before starting the commands processing. It means there is still
the limitation to a buffer, but not only for the payload but for the whole
command line. The payload is still necessarily at the end of the command
line and is passed as argument to the last command. Internally, the
"appctx->cli_payload" field was introduced to point on the payload in the
command line buffer.
This patch is quite huge but it cannot easily be splitted. It should not
introduced significant changes.
Add an experimental "https" log-format for the httpclient, it is not
used by the httpclient by default, but could be define in a customized
proxy.
The string is basically a httpslog, with some of the fields replaced by
their backend equivalent or - when not available:
"%ci:%cp [%tr] %ft -/- %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %[bc_err]/%[ssl_bc_err,hex]/-/-/%[ssl_bc_is_resumed] -/-/-"
Since the commit "MINOR: hlua/h1: Use http_parse_cont_len_header() to parse
content-length value", this function is no longer used. So it can be safely
removed.
In RFC9110, it is stated that trailers could be merged with the
headers. While it should be performed with a speicial care, it may be a
problem for some applications. To avoid any trouble with such applications,
two new options were added to drop trailers during the message forwarding.
On the backend, "http-drop-request-trailers" option can be enabled to drop
trailers from the requests before sending them to the server. And on the
frontend, "http-drop-response-trailers" option can be enabled to drop
trailers from the responses before sending them to the client. The options
can be defined in defaults sections and disabled with "no" keyword.
This patch should fix the issue #2930.
To handle out-of-order received CRYPTO frames, a ncbuf instance is
allocated. This is done via the helper quic_get_ncbuf().
Buffer allocation was improperly checked. In case b_alloc() fails, it
crashes due to a BUG_ON(). Fix this by removing it. The function now
returns NULL on allocation failure, which is already properly handled in
its caller qc_handle_crypto_frm().
This should fix the last reported crash from github issue #2935.
This must be backported up to 2.6.
When using the round-robin load balancer, the major source of contention
is the lbprm lock, that has to be held every time we pick a server.
To mitigate that, make it so there are one tree per thread-group, and
one lock per thread-group. That means we now have a lb_fwrr_per_tgrp
structure that will contain the two lb_fwrr_groups (active and backup) as well
as the lock to protect them in the per-thread lbprm struct, and all
fields in the struct server are now moved to the per-thread structure
too.
Those changes are mostly mechanical, and brings good performances
improvment, on a 64-cores AMD CPU, with 64 servers configured, we could
process about 620000 requests par second, and we now can process around
1400000 requests per second.
Add a new structure in the per-thread groups proxy structure, that will
contain whatever is per-thread group in lbprm.
It will be accessed as p->per_tgrp[tgid].lbprm.
Move the "next_weight" outside of fwrr_group, and inside struct lb_fwrr
directly, one for the active servers, one for the backup servers.
We will soon have one fwrr_group per thread group, but next_weight will
be global to all of them.
Add a pointer to the server into the struct srv_per_tgroup, so that if
we only have access to that srv_per_tgroup, we can come back to the
corresponding server.
This verifies that the scheduler is still ticking without having to
access the activity[] array nor keeping local copies of the ctxsw
counter. It just tests and sets a flag that is reset after each
return from a ->process() function.
TH_FL_DUMPING_OTHERS was being used to try to perform exclusion between
threads running "show threads" and those producing warnings. Now that it
is much more cleanly handled, we don't need that type of protection
anymore, which was adding to the complexity of the solution. Let's just
get rid of it.
Since we no longer call it with a foreign thread, let's simplify its code
and get rid of the special cases that were relying on ha_thread_dump_fill()
and synchronization with a remote thread. We're not only dumping the
current thread so ha_thread_dump_one() is sufficient.
The goal is to let the caller deal with the pointer so that the function
only has to fill that buffer without worrying about locking. This way,
synchronous dumps from "show threads" are produced and emitted directly
without causing undesired locking of the buffer nor risking causing
confusion about thread_dump_buffer containing bits from an interrupted
dump in progress.
It's only the caller that's responsible for notifying the requester of
the end of the dump by setting bit 0 of the pointer if needed (i.e. it's
only done in the debug handler).
This function was initially designed to dump any threadd into the presented
buffer, but the way it currently works is that it's always called for the
current thread, and uses the distinction between coming from a sighandler
or being called directly to detect which thread is the caller.
Let's simplify all this by replacing thr with tid everywhere, and using
the thread-local pointers where it makes sense (e.g. th_ctx, th_ctx etc).
The confusing "from_signal" argument is now replaced with "is_caller"
which clearly states whether or not the caller declares being the one
asking for the dump (the logic is inverted, but there are only two call
places with a constant).
Instead of using the thread dump buffer for post-mortem analysis, we'll
keep a copy of the assigned pointer whenever it's used, even for warnings
or "show threads". This will offer more opportunities to figure from a
core what happened, and will give us more freedom regarding the value of
the thread_dump_buffer itself. For example, even at the end of the dump
when the pointer is reset, the last used buffer is now preserved.
Some signal handlers rely on these to decide about the level of detail to
provide in dumps, so let's properly fill the info about entering/leaving
idle. Note that for consistency with other tests we're using bitops with
t->ltid_bit, while we could simply assign 0/1 to the fields. But it makes
the code more readable and the whole difference is only 88 bytes on a 3MB
executable.
This bug is not important, and while older versions are likely affected
as well, it's not worth taking the risk to backport this in case it would
wake up an obscure bug.
This commit is the counterpart of the previous one, adapted on the
frontend side. "idle-ping" is added as keyword to bind lines, to be able
to refresh client timeout of idle frontend connections.
H2 MUX behavior remains similar as the previous patch. The only
significant change is in h2c_update_timeout(), as idle-ping is now taken
into account also for frontend connection. The calculated value is
compared with http-request/http-keep-alive timeout value. The shorter
delay is then used as expired date. As hr/ka timeout are based on
idle_start, this allows to run them in parallel with an idle-ping timer.
This commit implements support for idle-ping on the backend side. First,
a new server keyword "idle-ping" is defined in configuration parsing. It
is used to set the corresponding new server member.
The second part of this commit implements idle-ping support on H2 MUX. A
new inlined function conn_idle_ping() is defined to access connection
idle-ping value. Two new connection flags are defined H2_CF_IDL_PING and
H2_CF_IDL_PING_SENT. The first one is set for idle connections via
h2c_update_timeout().
On h2_timeout_task() handler, if first flag is set, instead of releasing
the connection as before, the second flag is set and tasklet is
scheduled. As both flags are now set, h2_process_mux() will proceed to
PING emission. The timer has also been rearmed to the idle-ping value.
If a PING ACK is received before next timeout, connection timer is
refreshed. Else, the connection is released, as with timer expiration.
Also of importance, special care is needed when a backend connection is
going to idle. In this case, idle-ping timer must be rearmed. Thus a new
invokation of h2c_update_timeout() is performed on h2_detach().
This patch registers the task in the ckch_store so we don't run 2 tasks
at the same time for a given certificate.
Move the task creation under the lock and check if there was already a
task under the lock.
src/jws.c: In function '__jws_init':
src/jws.c:594:38: error: passing argument 2 of 'hap_register_unittest' from incompatible pointer type [-Wincompatible-pointer-types]
594 | hap_register_unittest("jwk", jwk_debug);
| ^~~~~~~~~
| |
| int (*)(int, char **)
In file included from include/haproxy/api.h:36,
from include/import/ebtree.h:251,
from include/import/ebmbtree.h:25,
from include/haproxy/jwt-t.h:25,
from src/jws.c:5:
include/haproxy/init.h:37:52: note: expected 'int (*)(void)' but argument is of type 'int (*)(int, char **)'
37 | void hap_register_unittest(const char *name, int (*fct)());
| ~~~~~~^~~~~~
GCC 15 is warning because the function pointer does have its
arguments in the register function.
Should fix issue #2929.
These counters can have a noticeable cost on large machines, though not
dramatic. There's no single good choice to keep them enabled or disabled.
This commit adds multiple choices:
- DEBUG_COUNTERS set to 2 will automatically enable them by default, while
1 will disable them by default
- the global "debug.counters on/off" will allow to change the setting at
boot, regardless of DEBUG_COUNTERS as long as it was at least 1.
- the CLI "debug counters on/off" will also allow to change the value at
run time, allowing to observe a phenomenon while it's happening, or to
disable counters if it's suspected that their cost is too high
Finally, the "debug counters" command will append "(stopped)" at the end
of the CNT lines when these counters are stopped.
Not that the whole mechanism would easily support being extended to all
counter types by specifying the types to apply to, but it doesn't seem
useful at all and would require the user to also type "cnt" on debug
lines. This may easily be changed in the future if it's found relevant.
COUNT_IF() is convenient but can be heavy since some of them were found
to trigger often (roughly 1 counter per request on avg). This might even
have an impact on large setups due to the cost of a shared cache line
bouncing between multiple cores. For now there's no way to disable it,
so let's only enable it when DEBUG_COUNTERS is 1 or above. A future
change will make it configurable.
Till now the per-line glitches counters were only enabled with the
confusingly named DEBUG_GLITCHES (which would not turn glitches off
when disabled). Let's instead change it to DEBUG_COUNTERS and make sure
it's enabled by default (though it can still be disabled with
-DDEBUG_GLITCHES=0 just like for DEBUG_STRICT). It will later be
expanded to cover more counters.
Once the Order status is "valid", the certificate URL is accessible,
this patch implements the retrieval of the certificate which is stocked
in ctx->store.
Generate the X509_REQ using the generated private key and the SAN from
the configuration. This is only done once before the task is started.
It could probably be done at the beginning of the task with the private
key generation once we have a scheduler instead of a CLI command.
This patch implements a check on the challenge URL, once haproxy asked
for the challenge to be verified, it must verify the status of the
challenge resolution and if there weren't any error.
This patch sends the "{}" message to specify that a challenge is ready.
It iterates on every challenge URL in the authorization list from the
acme_ctx.
This allows the ACME server to procede to the challenge validation.
https://www.rfc-editor.org/rfc/rfc8555#section-7.5.1
This patch implements the retrieval of the challenges objects on the
authorizations URLs. The challenges object contains a token and a
challenge url that need to be called once the challenge is setup.
Each authorization URLs contain multiple challenge objects, usually one
per challenge type (HTTP-01, DNS-01, ALPN-01... We only need to keep the
one that is relevent to our configuration.
This patch implements the newOrder action in the ACME task, in order to
ask for a new certificate, a list of SAN is sent as a JWS payload.
the ACME server replies a list of Authorization URLs. One Authorization
is created per SAN on a Order.
The authorization URLs are stored in a linked list of 'struct acme_auth'
in acme_ctx, so we can get the challenge URLs from them later.
The location header is also store as it is the URL of the order object.
https://datatracker.ietf.org/doc/html/rfc8555#section-7.4
This patch implements the retrival of the KID (account identifier) using
the pkey.
A request is sent to the newAccount URL using the onlyReturnExisting
option, which allow to get the kid of an existing account.
acme_jws_payload() implement a way to generate a JWS payload using the
nonce, pkey and provided URI.
ACME requests are supposed to be sent with a Nonce, the first Nonce
should be retrieved using the newNonce URI provided by the directory.
This nonce is stored and must be replaced by the new one received in the
each response.
The first request of the ACME protocol is getting the list of URLs for
the next steps.
This patch implements the first request and the parsing of the response.
The response is a JSON object so mjson is used to parse it.
The "acme renew" command launch the ACME task for a given certificate.
The CLI parser generates a new private key using the parameters from the
acme section..
This commit allows to configure the generated private keys, you can
configure the keytype (RSA/ECDSA), the number of bits or the curves.
Example:
acme LE
uri https://acme-staging-v02.api.letsencrypt.org/directory
account account.key
contact foobar@example.com
challenge HTTP-01
keytype ECDSA
curves P-384
Add new acme keywords for the ckch_conf parsing, which will be used on a
crt-store, a crt line in a frontend, or even a crt-list.
The cfg_postparser_acme() is called in order to check if a section referenced
elsewhere really exists in the config file.
Add a configuration parser for the new acme section, the section is
configured this way:
acme letsencrypt
uri https://acme-staging-v02.api.letsencrypt.org/directory
account account.key
contact foobar@example.com
challenge HTTP-01
When unspecified, the challenge defaults to HTTP-01, and the account key
to "<section_name>.account.key".
Section are stored in a linked list containing acme_cfg structures, the
configuration parsing is mostly resolved in the postsection parser
cfg_postsection_acme() which is called after the parsing of an acme section.
Some rare commands in the worker require to keep their input open and
terminate when it's closed ("show events -w", "wait"). Others maintain
a per-session context ("set anon on"). But in its default operation
mode, the master CLI passes commands one at a time to the worker, and
closes the CLI's input channel so that the command can immediately
close upon response. This effectively prevents these two specific cases
from being used.
Here the approach that we take is to introduce a bidirectional mode to
connect to the worker, where everything sent to the master is immediately
forwarded to the worker (including the raw command), allowing to queue
multiple commands at once in the same session, and to continue to watch
the input to detect when the client closes. It must be a client's choice
however, since doing so means that the client cannot batch many commands
at once to the master process, but must wait for these commands to complete
before sending new ones. For this reason we use the prefix "@@<pid>" for
this. It works exactly like "@" except that it maintains the channel
open during the whole execution. Similarly to "@<pid>" with no command,
"@@<pid>" will simply open an interactive CLI session to the worker, that
will be ended by "quit" or by closing the connection. This can be convenient
for the user, and possibly for clients willing to dedicate a connection to
the worker.
Same as free_proxy(), but does not free the base proxy pointer (ie: the
proxy itself may not be allocated)
Goal is to be able to cleanup statically allocated dummy proxies.
Split alloc_new_proxy() in two functions: the preparing part is now
handled by setup_new_proxy() which can be called individually, while
alloc_new_proxy() takes care of allocating a new proxy struct and then
calling setup_new_proxy() with the freshly allocated proxy.
At the moment it is not supported to produce multi-line events on the
"show events" output, simply because the LF character is used as the
default end-of-event mark. However it could be convenient to produce
well-formatted multi-line events, e.g. in JSON or other formats. UNIX
utilities have already faced similar needs in the past and added
"-print0" to "find" and "-0" to "xargs" to mention that the delimiter
is the NUL character. This makes perfect sense since it's never present
in contents, so let's do exactly the same here.
Thus from now on, "show events <ring> -0" will delimit messages using
a \0 instead of a \n, permitting a better and safer encapsulation.
In order to support delimiting output events with other characters than
just the LF, let's pass the delimiter through the API. The default remains
the LF, used by applet_append_line(), and ignored by the log forwarder.
Commit f435a2e518 ("CLEANUP: atomics: also replace __sync_synchronize()
with __atomic_thread_fence()") replaced the builtins used for barriers,
but the different API required an argument while the macros didn't specify
any, resulting in double parenthesis that were causing obscure build errors
such as "called object type 'void' is not a function or function pointer".
Let's just specify the args for the macro. No backport is needed.
Some notification_* functions were not thread safe by default as they
assumed only one producer would emit events for registered tasks.
While this suited well with the Lua sockets use-case, this proved to
be a limitation with some other event sources (ie: lua Queue class)
instead of having to deal with both the non thread safe and thread
safe variants (_mt suffix), which is error prone, let's make the
entire API thread safe regarding the event list.
Pruning functions still require that only one thread executes them,
with Lua this is always the case because there is one cleanup list
per context.
notification_new and notification_wake were historically meant to be
called by a single thread doing both the init and the wakeup for other
tasks waiting on the signals.
In this patch, we extend the API so that notification_new and
notification_wake have thread-safe variants that can safely be used with
multiple threads registering on the same list of events and multiple
threads pushing updates on the list.
This commit is a direct follow-up of the previous one. It defines a new
server keyword check-pool-conn-name. It is used as the default value for
the name parameter of idle connection hash generation.
Its behavior is similar to server keyword pool-conn-name, but reserved
for checks reuse. If check-pool-conn-name is set, it is used in priority
to match a connection for reuse. If unset, a fallback is performed on
check-sni.
Support for connection reuse during server checks was implemented
recently. This is activated with the server keyword check-reuse-pool.
Similarly to stream processing via connect_backend(), a connection hash
is calculated when trying to perform reuse for checks. This is necessary
to retrieve for a connection which shares the check connect parameters.
However, idle connections can additionnally be tagged using a
pool-conn-name or SNI under connect_backend(). Check reuse does not test
these values, which prevent to retrieve a matching connection.
Improve this by using "check-sni" value as idle connection hash input
for check reuse. be_calculate_conn_hash() API has been adjusted so that
name value can be passed as input, both when using streams or checks.
Even with the current patch, there is still some scenarii which could
not be covered for checks connection reuse. most notably, when using
dynamic pool-conn-name/SNI value. It is however at least sufficient to
cover simpler cases.
The old __sync_* API is no longer necessary since we do not support
gcc before 4.7 anymore. Let's just get rid of this code, the file is
still ugly enough without it.
Implement the possibility to reuse idle connections when performing
server checks. This is done thanks to the recently introduced functions
be_calculate_conn_hash() and be_reuse_connection().
One side effect of this change is that be_calculate_conn_hash() can now
be called with a NULL stream instance. As such, part of the functions
are adjusted accordingly.
Note that to simplify configuration, connection reuse is not performed
if any specific check connection parameters are defined on the server
line or via the tcp-check connect rule. This is performed via newly
defined tcpcheck_use_nondefault_connect().
Define a new server keyword check-reuse-pool, and its counterpart with a
"no" prefix. For the moment, only parsing is implemented. The real
behavior adjustment will be implemented in the next patch.
Adjust newly defined be_reuse_connection() API. The stream argument is
removed. This will allows checks to be able to invoke it without relying
on a stream instance.
Following the previous patch, the part directly related to connection
reuse is extracted from connect_server(). It is now define in a new
function be_reuse_connection().
On connection reuse, a hash is first calculated. It is generated from
various connection parameters, to retrieve a matching connection.
Extract hash calculation from connect_server() into a new dedicated
function be_calculate_conn_hash(). The objective is to be able to
perform connection reuse for checks, without connect_server() invokation
which relies on a stream instance.
As Ilya reported in issue #2911, the CONCAT() macro breaks on NetBSD
which defines its own as __CONCAT() (which is exactly the same). Let's
just undefine it before ours to fix the issue instead of renaming, but
keep ours so that we don't have doubts about what we're running with.
Note that the patch introducing this breaking change was backported
to 3.0.
For leastconn, servers used to just be stored in an ebtree.
Each server would be one node.
Change that so that nodes contain multiple mt_lists. Each list
will contain servers that share the same key (typically meaning
they have the same number of connections). Using mt_lists means
that as long as tree elements already exist, moving a server from
one tree element to another does no longer require the lbprm write
lock.
We use multiple mt_lists to reduce the contention when moving
a server from one tree element to another. A list in the new
element will be chosen randomly.
We no longer remove a tree element as soon as they no longer
contain any server. Instead, we keep a list of all elements,
and when we need a new element, we look at that list only if it
contains a number of elements already, otherwise we'll allocate
a new one. Keeping nodes in the tree ensures that we very
rarely have to take the lbrpm write lock (as it only happens
when we're moving the server to a position for which no
element is currently in the tree).
The number of mt_lists used is defined as FWLC_NB_LISTS.
The number of tree elements we want to keep is defined as
FWLC_MIN_FREE_ENTRIES, both in defaults.h.
The value used were picked afrer experimentation, and
seems to be the best choice of performances vs memory
usage.
Doing that gives a good boost in performances when a lot of
servers are used.
With a configuration using 500 servers, before that patch,
about 830000 requests per second could be processed, with
that patch, about 1550000 requests per second are
processed, on an 64-cores AMD, using 1200 concurrent connections.
Add two new methods to lbprm, server_deinit() and proxy_deinit(),
in case something should be done at the lbprm level when
removing servers and proxies.
Implement mt_list_try_lock_prev(), that does the same thing
as mt_list_lock_prev(), exceot if the list is locked, it
returns { NULL, NULL } instaed of waiting.
jwk_thumbprint() is a function which is a function which implements
RFC7368 and emits a JWK thumbprint using a EVP_PKEY.
EVP_PKEY_EC_to_pub_jwk() and EVP_PKEY_RSA_to_pub_jwk() were changed in
order to match what is required to emit a thumbprint (ie, no spaces or
lines and the lexicographic order of the fields)
The new function "print_cpu_set()" will print cpu sets in a human-friendly
way, with commas and dashes for intervals. The goal is to keep them compact
enough.
GCC 15 throws the following warning on fixed-size char arrays if they do not
contain terminated NUL:
src/tools.c:2041:25: error: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Werror=unterminated-string-initialization]
2041 | const char hextab[16] = "0123456789ABCDEF";
We are using a couple of such definitions for some constants. Converting them
to flexible arrays, like: hextab[] = "0123456789ABCDEF" may have consequences,
as enlarged arrays won't fit anymore where they were possibly located due to
the memory alignement constraints.
GCC adds 'nonstring' variable attribute for such char arrays, but clang and
other compilers don't have it. Let's wrap 'nonstring' with our
__nonstring macro, which will test if the compiler supports this attribute.
This fixes the issue #2910.
By default, pools of comparable sizes are merged together. However, the
current algorithm is dumb: it rounds the requested size to the next
multiple of 16 and compares the sizes like this. This results in many
entries which are already multiples of 16 not being merged, for example
1024 and 1032 are separate, 65536 and 65540 are separate, 48 and 56 are
separate (though 56 merges with 64).
This commit changes this to consider not just the entry size but also the
average entry size, that is, it compares the average size of all objects
sharing the pool with the size of the object looking for a pool. If the
object is not more than 1% bigger nor smaller than the current average
size or if it neither 16 bytes smaller nor larger, then it can be merged.
Also, it always respects exact matches in order to avoid merging objects
into larger pools or worse, extending existing ones for no reason, and
when there's a tie, it always avoids extending an existing pool.
Also, we now visit all existing pools in order to spot the best one, we
do not stop anymore at the smallest one large enough. Theoretically this
could cost a bit of CPU but in practice it's O(N^2) with N quite small
(typically in the order of 100) and the cost at each step is very low
(compare a few integer values). But as a side effect, pools are no
longer sorted by size, "show pools bysize" is needed for this.
This causes the objects to be much better grouped together, accepting to
use a little bit more sometimes to avoid fragmentation, without causing
everyone to be merged into the same pool. Thanks to this we're now
seeing 36 pools instead of 48 by default, with some very nice examples
of compact grouping:
- Pool qc_stream_r (80 bytes) : 13 users
> qc_stream_r : size=72 flags=0x1 align=0
> quic_cstrea : size=80 flags=0x1 align=0
> qc_stream_a : size=64 flags=0x1 align=0
> hlua_esub : size=64 flags=0x1 align=0
> stconn : size=80 flags=0x1 align=0
> dns_query : size=64 flags=0x1 align=0
> vars : size=80 flags=0x1 align=0
> filter : size=64 flags=0x1 align=0
> session pri : size=64 flags=0x1 align=0
> fcgi_hdr_ru : size=72 flags=0x1 align=0
> fcgi_param_ : size=72 flags=0x1 align=0
> pendconn : size=80 flags=0x1 align=0
> capture : size=64 flags=0x1 align=0
- Pool h3s (56 bytes) : 17 users
> h3s : size=56 flags=0x1 align=0
> qf_crypto : size=48 flags=0x1 align=0
> quic_tls_se : size=48 flags=0x1 align=0
> quic_arng : size=56 flags=0x1 align=0
> hlua_flt_ct : size=56 flags=0x1 align=0
> promex_metr : size=48 flags=0x1 align=0
> conn_hash_n : size=56 flags=0x1 align=0
> resolv_requ : size=48 flags=0x1 align=0
> mux_pt : size=40 flags=0x1 align=0
> comp_state : size=40 flags=0x1 align=0
> notificatio : size=48 flags=0x1 align=0
> tasklet : size=56 flags=0x1 align=0
> bwlim_state : size=48 flags=0x1 align=0
> xprt_handsh : size=48 flags=0x1 align=0
> email_alert : size=56 flags=0x1 align=0
> caphdr : size=41 flags=0x1 align=0
> caphdr : size=41 flags=0x1 align=0
- Pool quic_cids (32 bytes) : 13 users
> quic_cids : size=16 flags=0x1 align=0
> quic_tls_ke : size=32 flags=0x1 align=0
> quic_tls_iv : size=12 flags=0x1 align=0
> cbuf : size=32 flags=0x1 align=0
> hlua_queuew : size=24 flags=0x1 align=0
> hlua_queue : size=24 flags=0x1 align=0
> promex_modu : size=24 flags=0x1 align=0
> cache_st : size=24 flags=0x1 align=0
> spoe_appctx : size=32 flags=0x1 align=0
> ehdl_sub_tc : size=32 flags=0x1 align=0
> fcgi_flt_ct : size=16 flags=0x1 align=0
> sig_handler : size=32 flags=0x1 align=0
> pipe : size=24 flags=0x1 align=0
- Pool quic_crypto (1032 bytes) : 2 users
> quic_crypto : size=1032 flags=0x1 align=0
> requri : size=1024 flags=0x1 align=0
- Pool quic_conn_r (65544 bytes) : 2 users
> quic_conn_r : size=65536 flags=0x1 align=0
> dns_msg_buf : size=65540 flags=0x1 align=0
On a very unscientific test consisting in sending 1 million H1 requests
and 1 million H2 requests to the stats page, we're seeing an ~6% lower
memory usage with the patch:
before the patch:
Total: 48 pools, 4120832 bytes allocated, 4120832 used (~3555680 by thread caches).
after the patch:
Total: 36 pools, 3880648 bytes allocated, 3880648 used (~3299064 by thread caches).
This should be taken with care however since pools allocate and release
in batches.
When using hash-based load balancing, requests are always assigned to
the server corresponding to the hash bucket for the balancing key,
without taking maxconn or maxqueue into account, unlike in other load
balancing methods like 'first'. This adds a new backend directive that
can be used to take maxconn and possibly maxqueue in that context. This
can be used when hashing is desired to achieve cache locality, but
sending requests to a different server is preferable to queuing for a
long time or failing requests when the initial server is saturated.
By default, affinity is preserved as was the case previously. When
'hash-preserve-affinity' is set to 'maxqueue', servers are considered
successively in the order of the hash ring until a server that does not
have a full queue is found.
When 'maxconn' is set on a server, queueing cannot be disabled, as
'maxqueue=0' means unlimited. To support picking a different server
when a server is at 'maxconn' irrespective of the queue,
'hash-preserve-affinity' can be set to 'maxconn'.
Define a new global configuration tune.quic.frontend.max-data. This
allows users to explicitely set the value for the corresponding QUIC TP
initial-max-data, with direct impact on haproxy memory consumption.
A new structure quic_tune has recently been defined. Its purpose is to
store global options related to QUIC. Previously, only the tunable to
toggle pacing was stored in it.
This commit moves several QUIC related tunable from global to quic_tune
structure. This better centralizes QUIC configuration option and gives
room for future generic options.
By default, create_pool() tries to merge similar pools into one. But when
dealing with certain bugs, it's hard to say which ones were merged together.
We do have the information at registration time, so let's just create a
list of registrations ("pool_registration") attached to each pool, that
will store that information. It can then be consulted on the CLI using
"show pools detailed", where the names, sizes, alignment and flags are
reported.
alt_name will be used by metric exporters to know how the metric should be
presented to the user. If the alt_name is NULL, the metric should be
ignored. For now only promex exporter will make use of this.
Reduce the max number of loops in the mt_list code while waiting for
a lock to be available with exponential backoff. It's been observed that
the current value led to severe performances degradation at least on
some hardware, hopefully this value will be acceptable everywhere.
Use stat_col storage for stat_cols_info[] array instead of name_desc.
As documented in 65624876f ("MINOR: stats: introduce a more expressive
stat definition method"), stat_col supersedes name_desc storage but
it remains backward compatible. Here we migrate to the new API to be
able to further extend stat_cols_info[] in following patches.
Further extend logic implemented in 65624876 ("MINOR: stats: introduce a
more expressive stat definition method") and 4e9e8418 ("MINOR: stats:
prepare stats-file support for values other than FN_COUNTER"): we don't
rely anymore on the presence of the capability to know if the metric is
generic or not. This is because it prevents us from setting a capability
on static statistics. Yet it could be useful to set the capability even
on static metrics, thus we add a dedicated .generic bit to tell haproxy
that the metric is generic and can be handled automatically by the API.
Also, ME_NEW_* helpers are not explicitly associated to generic metric
definition (as it was already the case before) to avoid ambiguities.
It may change in the future as we may need to use the new definition
method to define static metrics (without the generic bit set). But for
now it isn't the case as this need definition was implemented for generic
metrics support in the first place. If we want to define static metrics
using the API, we could add a new set of helpers for instance.
The ckch_store_load_files() function makes specific processing for
PARSE_TYPE_STR as if it was a type only used for paths.
This patch changes a little bit the way it's done,
PARSE_TYPE_STR is only meant to strdup() a string and stores the
resulting pointer in the ckch_conf structure.
Any processing regarding the path is now done in the callback.
Since the callbacks were basically doing the same thing, they were
transformed into the DECLARE_CKCH_CONF_LOAD() macros which allows to
do some templating of these functions.
The resulting ckch_conf_load_* functions will do the same as before,
except they will also do the path processing instead of letting
ckch_store_load_files() do it, which means we don't need the "base"
member anymore in the struct ckch_conf_kws.
With the SSL configuration, crt-base, key-base are often used, these
keywords concatenates the base path with the path when the path does not
start by '/'.
This is done at several places in the code, so a function to do this
would be better to standardize the code.
This patch implements the function EVP_PKEY_to_jws_algo() which returns
a jwt_alg compatible with the private key.
This value can then be passed to jws_b64_protected() and
jws_b64_signature() which modified to take an jwt_alg instead of a char.
We'll need to let the user decide what's best for their workload, and in
order to do this we'll have to provide tunable options. For that, we're
introducing struct ha_cpu_policy which contains a name, a description
and a function pointer. The purpose will be to use that function pointer
to choose the best CPUs to use and now to set the number of threads and
thread-groups, that will be called during the thread setup phase. The
only supported policy for now is "none" which doesn't set/touch anything
(i.e. all available CPUs are used).
The goal here is to keep an array of the known CPU clusters, because
we'll use that often to decide of the performance of a cluster and
its relevance compared to other ones. We'll store the number of CPUs
in it, the total capacity etc. For the capacity, we count one unit
per core, and 1/3 of it per extra SMT thread, since this is roughly
what has been measured on modern CPUs.
In order to ease debugging, they're also dumped with -dc.
The purpose here is to detect heterogenous clusters which are not
properly reported, based on the exposed information about the cores
capacity. The algorithm here consists in sorting CPUs by capacity
within a cluster, and considering as equal all those which have 5%
or less difference in capacity with the previous one. This allows
large clusters of more than 5% total between extremities, while
keeping apart those where the limit is more pronounced. This is
quite common in embedded environments with big.little systems, as
well as on some laptops.
It's important that we don't leave unassigned IDs in the topology,
because the selection mechanism is based on index-based masks, so an
unassigned ID will never be kept. This is particularly visible on
systems where we cannot access the CPU topology, the package id, node id
and even thread id are set to -1, and all CPUs are evicted due to -1 not
being set in the "only-cpu" sets.
Here in new function "cpu_fixup_topology()", we assign them with the
smallest unassigned value. This function will be used to assign IDs
where missing in general.
Due to the previous commit we can end up with cores not assigned
any cluster ID. For this, at the end we sort the CPUs by topology
and assign cluster IDs to remaining CPUs based on pkg/node/llc.
For example an 14900 now shows 5 clusters, one for the 8 p-cores,
and 4 of 4 e-cores each.
The local cluster numbers are per (node,pkg) ID so that any rule could
easily be applied on them, but we also keep the global numbers that
will help with thread group assignment.
We still need to force to assign distinct cluster IDs to cores
running on a different L3. For example the EPYC 74F3 is reported
as having 8 different L3s (which is true) and only one cluster.
Here we introduce a new function "cpu_compose_clusters()" that is called
from the main init code just after cpu_detect_topology() so that it's
not OS-dependent. It deals with this renumbering of all clusters in
topology order, taking care of considering any distinct LLC as being
on a distinct cluster.
This will be used to detect and fix incorrect setups which report
the same cluster ID for multiple L3 instances.
The arrangement of functions in this file is becoming a real problem.
Maybe we should move all this to cpu_topo for example, and better
distinguish OS-specific and generic code.
Once we've kept only the CPUs we want, the next step will be to form
groups and these ones are based on locality. Thus we'll have to sort by
locality. For now the locality is only inferred by the index. No grouping
is made at this point. For this we add the "cpu_reorder_by_locality"
function with a locality-based comparison function.
CPU selection will be performed by sorting CPUs according to
various criteria. For dumps however, that's really not convenient
and we'll need to reorder the CPUs according to their index only.
This is what the new function cpu_reorder_by_index() does. It's
called in thread_detect_count() before dumping the CPU topology.
By mutually refining the thread count and group count, we can try
to detect the most suitable setup for the current machine. Taskset
is implicitly handled correctly. tgroups automatically adapt to the
configured number of threads. cpu-map manages to limit tgroups to
the smallest supported value.
The thread-limit is enforced. Just like in cfgparse, if the thread
count was forced to a higher value, it's reduced and a warning is
emitted. But if it was not set, the thr_max value is bound to this
limit so that further calculations respect it.
We continue to default to the max number of available threads and 1
tgroup by default, with the limit. This normally allows to get rid
of that test in check_config_validity().
During development, everything related to CPU binding and the CPU topology
is debugged using state dumps at various places, but it does make sense to
have a real command line option so that this remains usable in production
to help users figure why some CPUs are not used by default. Let's add
"-dc" for this. Since the list of global.tune.options values is almost
full and does not 100% match this option, let's add a new "tune.debug"
field for this.
The function is not convenient because it doesn't allow us to undo the
startup changes, and depending on where it's being used, we don't know
whether the values read have already been altered (this is not the case
right now but it's going to evolve).
Let's just compute the status during cpu_detect_usable() and set a
variable accordingly. This way we'll always read the init value, and
if needed we can even afford to reset it. Also, placing it in cpu_topo.c
limits cross-file dependencies (e.g. threads without affinity etc).
This uses the publicly available information from /sys to figure the cache
and package arrangements between logical CPUs and fill ha_cpu_topo[], as
well as their SMT capabilities and relative capacity for those which expose
this. The functions clearly have to be OS-specific.
This adds a generic function ha_cpuset_detect_online() which for now
only supports linux via /sys. It fills a cpuset with the list of online
CPUs that were detected (or returns a failure).
The cpuset files are normally used only for cpu manipulations. It happens
that the initial CPU binding detection was initially placed there since
there was no better place, but in practice, being OS-specific, it should
really be in cpu-topo. This simplifies cpuset which doesn't need to know
about the OS anymore.
Now before trying to resolve the thread assignment to groups, we detect
which CPUs are not bound at boot so that we can mark them with
HA_CPU_F_EXCLUDED. This will be useful to better know on which CPUs we
can count later. Note that we purposely ignore cpu-map here as we
don't know how threads and groups will map to cpu-map entries, hence
which CPUs will really be used.
It's important to proceed this way so that when we have no info we
assume they're all available.
The new function cpu_dump_topology() will centralize most debugging
calls, and it can make efforts of not dumping some possibly irrelevant
fields (e.g. non-existing cache levels).
We don't want to constantly deal with as many CPUs as a cpuset can hold,
so let's first try to trim the value to what the system claims to support
via _SC_NPROCESSORS_CONF. It is obviously still subject to the limit of
the cpuset size though. The value is stored globally so that we can
reuse it elsewhere after initialization.
This structure will be used to store information about each CPU's
topology (package ID, L3 cache ID, NUMA node ID etc). This will be used
in conjunction with CPU affinity setting to try to perform a mostly
optimal binding between threads and CPU numbers by default. Since it
was noticed during tests that absolutely none of the many machines
tested reports different die numbers, the die_id is not stored.
Also, it was found along experiments that the cluster ID will be used
a lot, half of the time as a node-local identifier, and half of the
time as a global identifier. So let's store the two versions at once
(cl_gid, cl_lid).
Some flags are added to indicate causes of exclusion (offline, excluded
at boot, excluded by rules, ignored by policy).
__decl_thread() already exists but is more suited for struct members.
When using it in a variables block, it appends the final trailing
semi-colon which is a statement that ends the variable block. Better
clean this up and have one precisely for variable blocks. In this
case we can simply define an unused enum value that will consume the
semi-colon. That's what the new macro __decl_thread_var() does.
It's often useful to be able to concatenate strings after resolving
them (e.g. __FILE__, __LINE__ etc). Let's just have a CONCAT() macro
to do that, which calls _CONCAT() with the same arguments to make
sure the contents are resolved before being concatenated.
Migrate recently added log-forward section options, currently stored under
proxy->options2 to proxy->options3 since proxy->options2 is running out of
space and we plan on adding more log-forward options.
proxy->options2 is almost full, yet we will add new log-forward options
in upcoming patches so we anticipate that by adding a new {no_}options3
and cfg_opts3[] to further extend proxy options
Support for multiple Rx buffers per QCS instance has been introduced by
previous patches. However, due to flow-control initial values, client
were still unable to fully used this to increase their upload
throughput.
This patch increases max-stream-data-bidi-remote flow-control initial
values. A new define QMUX_STREAM_RX_BUF_FACTOR will fix the number of
concurrent buffers allocable per QCS. It is set to 90.
Note that connection flow-control initial value did not changed. It is
still configured to be equivalent to bufsize multiplied by the maximum
concurrent streams. This ensures that Rx buffers allocation is still
constrained per connection, so that it won't be possible to have all
active QCS instances using in parallel their maximum Rx buffers count.
Convert QCS rx buffer pointer to a tree container. Additionnaly, offset
field of qc_stream_rxbuf is thus transformed into a node tree.
For now, only a single Rx buffer is stored at most in QCS tree. Multiple
Rx buffers will be implemented in a future patch to improve QUIC clients
upload throughput.
Define a new type qc_stream_rxbuf. This is used as a wrapper around QCS
Rx buffer with encapsulation of the ncbuf storage. It is allocated via a
new pool. Several functions are adapted to be able to deal with
qc_stream_rxbuf as a wrapper instead of the previous plain ncbuf
instance.
No functional change should happen with this patch. For now, only a
single qc_stream_rxbuf can be instantiated per QCS. However, this new
type will be useful to implement multiple Rx buffer storage in a future
commit.
QCS uses ncbuf for STREAM data storage. This serves as a limit for
maximum STREAM buffering capacity, advertised via QUIC transport
parameters for initial flow-control values.
Define a new function qmux_stream_rx_bufsz() which can be used to
retrieve this Rx buffer size. This can be used both in MUX/H3 layers and
in QUIC transport parameters.
Previously, a function qcs_http_handle_standalone_fin() was implemented
to handle a received standalone FIN, bypassing app_ops layer decoding.
However, this was removed as app_ops layer interaction is necessary. For
example, HTTP/3 checks that FIN is never sent on the control uni stream.
This patch reintroduces qcs_http_handle_standalone_fin(), albeit in a
slightly diminished version. Most importantly, it is now the
responsibility of the app_ops layer itself to use it, to avoid the
shortcoming described above.
The main objective of this patch is to be able to support standalone FIN
in HTTP/0.9 layer. This is easily done via the reintroduction of
qcs_http_handle_standalone_fin() usage. This will be useful to perform
testing, as standalone FIN is a corner case which can easily be broken.
In check_config_validity() we evaluate some sample fetch expressions
(log-format, server rules, etc). These expressions may use external files like
maps.
If some particular 'default-path' was set in the global section before, it's no
longer applied to resolve file pathes in check_config_validity(). parse_cfg()
at the end of config parsing switches back to the initial cwd.
This fixes the issue #2886.
This patch should be backported in all stable versions since 2.4.0, including
2.4.0.
This commit introduces the dont-parse-log option to disable log message
parsing, allowing raw log data to be forwarded without modification.
Also, it adds the assume-rfc6587-ntf option to frame log messages
using only non-transparent framing as per RFC 6587. This avoids
missparsing in certain cases (mainly with non RFC compliant messages).
The documentation is updated to include details on the new options and
their intended use cases.
This feature was discussed in GH #2856
cfg_parse_listen_match_option() takes cfg_opt array as parameter, as well
current args, expected mode and cap bitfields.
It is expected to be used under cfg_parse_listen() function or similar.
Its goal is to remove code duplication around proxy->options and
proxy->options2 handling, since the same checks are performed for the
two. Also, this function could help to evaluate proxy options for
mode-specific proxies such as log-forward section for instance:
by giving the expected mode and capatiblity as input, the function
would only match compatible options.
Current pr_mode enum is a regular enum because a proxy only supports one
mode at a time. However it can be handy for a function to be given a
list of compatible modes for a proxy, and we can't do that using a
bitfield because pr_mode is not bitfield compatible (values share
the same bits).
In this patch we manually define pr_mode values so that they are all
using separate bits and allows a function to take a bitfield of
compatible modes as parameter.
Add a new macro, REGISTER_UNITTEST(), that will automatically make sure
we call hap_register_unittest(), instead of having to create a function
that will do so.
Doing unit tests with haproxy was always a bit difficult, some of the
function you want to test would depend on the buffer or trash buffer
initialisation of HAProxy, so building a separate main() for them is
quite hard.
This patch adds a way to register a function that can be called with the
"-U" parameter on the command line, will be executed just after
step_init_1() and will exit the process with its return value as an exit
code.
When using the -U option, every keywords after this option is passed to
the callback and could be used as a parameter, letting the capability to
handle complex arguments if required by the test.
HAProxy need to be built with DEBUG_UNIT to activate this feature.
Implement a converter which takes an EVP_PKEY and converts it to a
public JWK key. This is the first step of the JWS implementation.
It supports both EC and RSA keys.
Know to work with:
- LibreSSL
- AWS-LC
- OpenSSL > 1.1.1
As reported in issue #2882, using "no-send-proxy-v2" on a server line does
not properly disable the use of proxy-protocol if it was enabled in a
default-server directive in combination with other PP options. The reason
for this is that the sending of a proxy header is determined by a test on
srv->pp_opts without any distinction, so disabling PPv2 while leaving other
options results in a PPv1 header to be sent.
Let's fix this by explicitly testing for the presence of either send-proxy
or send-proxy-v2 when deciding to send a proxy header.
This can be backported to all versions. Thanks to Andre Sencioles (@asenci)
for reporting the issue and testing the fix.
Maxconn is a bit of a misnomer when it comes to servers, as it doesn't
control the maximum number of connections we establish to a server, but
the maximum number of simultaneous requests. So add "strict-maxconn",
that will make it so we will never establish more connections than
maxconn.
It extends the meaning of the "restricted" setting of
tune.takeover-other-tg-connections, as it will also attempt to get idle
connections from other thread groups if strict-maxconn is set.
Allow haproxy to take over idle connections from other thread groups
than our own. To control that, add a new tunable,
tune.takeover-other-tg-connections. It can have 3 values, "none", where
we won't attempt to get connections from the other thread group (the
default), "restricted", where we only will try to get idle connections
from other thread groups when we're using reverse HTTP, and "full",
where we always try to get connections from other thread groups.
Unless there is a special need, it is advised to use "none" (or
restricted if we're using reverse HTTP) as using connections from other
thread groups may have a performance impact.
Add a fixup_tgid_takeover() method to pollers for which it makes sense
(epoll, kqueue and evport). That method can be called after a takeover
of a fd from a different thread group, to make sure the poller's
internal structure reflects the new state.
While signals are not recursive, one signal (e.g. wdt) may interrupt
another one (e.g. debug). The problem this causes is that when leaving
the inner handler, it removes the outer's flag, hence the protection
that comes with it. Let's just have 3 distinct flags for regular signals,
debug signal and watchdog signal. We add a 4th definition which is an
aggregate of the 3 to ease testing.
This is the introduction of "minsize-req" and "minsize-res".
These two options allow you to set the minimum payload size required for
compression to be applied.
This helps save CPU on both server and client sides when the payload does
not need to be compressed.
Some code called by the debug handlers in the context of a signal handler
accesses to some freq_ctr and occasionally ends up on a locked one from
the same thread that is dumping it. Let's introduce a non-blocking version
that at least allows to return even if the value is in the process of being
updated, it's less problematic than hanging.
Signal handlers must absolutely not change anything, but some long and
complex call chains may look innocuous at first glance, yet result in
some subtle write accesses (e.g. pools) that can conflict with a running
thread being interrupted.
Let's add a new thread flag TH_FL_IN_SIG_HANDLER that is only set when
entering a signal handler and cleared when leaving them. Note, we're
speaking about real signal handlers (synchronous ones), not deferred
ones. This will allow some sensitive call places to act differently
when detecting such a condition, and possibly even to place a few new
BUG_ON().
When the connection for sink_forward_{oc}_applet fails or a previous one
is destroyed, the sft->appctx is instantly released.
However process_sink_forward_task(), which may run at any time, iterates
over all known sfts and tries to create sessions for orphan ones.
It means that instantly after sft->appctx is destroyed, a new one will
be created, thus a new connection attempt will be made.
It can be an issue with tcp log-servers or sink servers, because if the
server is unavailable, process_sink_forward() will keep looping without
any temporisation until the applet survives (ie: connection succeeds),
which results in unexpected CPU usage on the threads responsible for
that task.
Instead, we add a tempo logic so that a delay of 1second is applied
between two retries. Of course the initial attempt is not delayed.
This could be backported to all stable versions.
In the SPOP protocol, ACK frame with empty payload are allowed. However, in
that case, because only the payload is transferred, there is no data to
return to the SPOE applet. Only the end of input is reported. Thus the
applet is never woken up. It means that the SPOE filter will be blocked
during the processing timeout and will finally return an error.
To workaournd this issue, a NOOP action is introduced with the value 0. It
is only an internal action for now. It does not exist in the SPOP
protocol. When an ACK frame with an empy payload is received, this noop
action is transferred to the SPOE applet, instead of nothing. Thanks to this
trick, the applet is properly notified. This works because unknown actions
are ignored by the SPOE filter.
This patch must be backported to 3.1.
Previously, QUIC MUX application layer was installed and initialized via
MUX init. However, the latter stage involve I/O operations, for example
when using HTTP/3 with the emission of a SETTINGS frame.
Change this to prevent any I/O operations during MUX init. As such,
finalize app_ops callback is now called during the first invokation of
qcc_io_send(), in the context of MUX tasklet. To implement this, a new
application state value is added, to detect the transition from NULL to
INIT stage.
Introduce a new QCC field to track the current application layer state.
For the moment, only INIT and SHUT state are defined. This allows to
replace the older flag QC_CF_APP_SHUT.
This commit does not bring major changes. It is only necessary to permit
future evolutions on QUIC MUX. The only noticeable change is that QMUX
traces can now display this new field.
ckch_conf loading is not that simple as it requires to check
- if the cert already exists in the ckchs_tree
- if the ckch_conf is compatible with an existing cert in ckchs_tree
- if the cert is a bundle which need to load multiple ckch_store
This logic could be reuse elsewhere, so this commit introduce the new
crtlist_load_crt() function which does that.
QUIC frame type is encoded as a variable-length integer. Thus, 64-bit
integer should be used for them. Currently, this was not the case as
type was represented as a 1-byte char inside quic_frame structure. This
does not cause any issue with QUIC from RFC9000, as all frame types fit
in this range. Furthermore, a QUIC implementation is required to use the
smallest size varint when encoding a frame type.
However, the current code is unable to accept QUIC extension with bigger
frame types. This is notably the case for quic-on-streams draft. Thus,
this commit readjusts quic_frame architecture to be able to support
higher frame type values.
First, type field of quic_frame is changed to a 64-bits variable. Both
encoding and decoding frame functions uses variable-length integer
helpers to manipulate the frame type field.
Secondly, the quic_frame builders/parsers infrastructure is still
preserved. However, it could be impossible to define new large frame
type as an index into quic_frame_builders / quic_frame_parsers arrays.
Thus, wrapper functions are now provided to access the builders and
parsers. Both qf_builder() and qf_parser() wrappers can then be extended
to return custom builder/parser instances for larger frame type.
Finally, unknown frame type detection also uses the new wrapper
quic_frame_is_known(). As with builders/parsers, for large frame type,
this function must be manually completed to support a new type value.
With this patch, files resulting from a lookup (*.key, *.ocsp,
*.issuer etc) are now stored in the ckch_conf.
It allows to see the original filename from where it was loaded in "show
ssl cert <filename>"
CRYPTO and STREAM frames encoding is similar. If payload is too large,
frame will be splitted and only the first payload part will be written
in the output QUIC packet. This process is complexified by the presence
of a variable-length integer Length field prior to the payload.
This commit aims at refactor these operations. Define two functions to
simplify the code :
* quic_strm_frm_fillbuf() which is used to calculate the optimal frame
length of a STREAM/CRYPTO frame with its payload in a buffer
* quic_strm_frm_split() which is used to split the frame payload if
buffer is too small
With this patch, both functions are now implemented for STREAM encoding.
Before this patch, REGISTER_CONFIG_SECTION() allowed to register one and only
one callback (<post>) called after the parsing of a section.
It was limitating because you couldn't register a post callback from anywhere
else in the code.
This patch introduces the new REGISTER_CONFIG_SECTION_POST() macros which allows
to register a new post callback for a section keyword from anywhere.
This patch introduces the feature by allowing `struct cfg_section` entries that
does not have a `section_parser`, and then iterating on all cfg_section with a
post_section_parser for a keyword.
STREAM and CRYPTO frames have a similar encoding format. In particular,
both of them have a variable-length integer Length field just before the
frame payload.
It is complex to determine the optimal Length value before copying the
payload data in the remaining buffer space. As such, helper functions
were implemented to calculate this. However, CRYPTO and STREAM frames
encoding implementation were not completely aligned, which renders the
code harder to follow.
The purpose of this commit is to simplify CRYPTO and STREAM frames
encoding. First, a new helper quic_int_cap_length() is defined which is
useful to determine the optimal buffer room available if prefixed by a
variable-length integer as Length field. Then, processing of both CRYPTO
and STREAM frames is now nearly identical, based on this new helper
function. Functions max_available_room() and max_stream_data_size() are
now unused and are removed.
This creates a tasklet that only expects to be called when the LB
algorithm is under contention when trying to reposition the server
in its tree. Indeed, that's one of the operations that usually
requires to take a write lock on a highly contended area, often
for very little benefits under contention; indeed, under load, if
a server keeps its previous position for a few extra microseconds,
usually there's no harm. Thus this new tasklet can be woken up by
the LB algo to ask the server to later call lbprm.server_requeue().
It does nothing else.
This callback will be used to reposition a server to its expected
position regardless of the fact that it was taken or dropped. It
will only be used by supporting LB algos. For now, only fwlc defines
it and assigns it to fwlc_srv_reposition(). At the moment it's not
used yet.
Storing only 30 buckets means we only keep 256 bytes per label. This
further simplifies address calculation and reduces the memory used
without complicating the locking code. It means we won't measure wait
times larger than a second but we're not supposed to face this as it
would trigger the watchdog anyway. It may become a little bit just if
measuring using rdtsc() instead of now_mono_time() though (typically
the limit would be around 350ms for a 3 GHz CPU).
It's more convenient (and more readable) to have the lock stats arranged
by operation type (read, seek, write). It will also allow to later simplify
the structure format and the bucket address calculation. Now lock_stat[]
got split into lock_stats_rd[], lock_stats_sk[], lock_stats_wr[].
Now that we have our sums by bucket, the _locked counter is redundant
since it's always equal to the sum of all entries. Let's just get rid
of it and replace its consumption with a loop over all buckets, this
will reduce the overhead of taking each lock at the expense of a tiny
extra effort when dumping all locks, which we don't care about.
In addition to the total/average wait time, we now also store the
wait time in 2^N buckets. There are 32 buckets for each type (read,
seek, write), allowing to store wait times from 1-2ns to 2.1-4.3s,
which is quite sufficient, even if we'd want to switch from NS to
CPU cycles in the future. The counters are only reported for non-
zero buckets so as not to visually pollute the output.
This significantly inflates the lock_stat struct, which is now
aligned to 256 bytes and rounded up to 1kB. But that's not really
a problem, given that there's only one per lock label.
The rework of the thread dumping mechanism in 2.8 with commit 9a6ecbd590
("MEDIUM: debug: simplify the thread dump mechanism") opened a small
race, which is that a thread in the process of dumping other ones may
block the other one from panicing while it's looping at the end of
ha_thread_dump_fill(), or any other sequence involving the currently
dumped one.
This was emphasized in 3.1 with commit 148eb5875f ("DEBUG: wdt: better
detect apparently locked up threads and warn about them") that allowed
to emit warnings about long-stuck threads, because in this case, what
happens is that sometimes a thread starts to emit a warning (or a set
of warnings), and while the warning is being awaited for, a panic
finally happens and interrupts either the dumping thread, which never
finishes and waits for the target's pointer to become NULL which will
never happen since it was supposed to do it itself, or the currently
dumped thread which could wait for the dumping thread to become ready
while this one has not released the former.
In order to address this, first we now make sure never to dump a thread
that is already in the process of dumping another one. We're adding a
new thread flag to know this situation, that is set in ha_thread_dump_fill()
and cleared in ha_thread_dump_done(). And similarly, we don't trigger
the watchdog on a thread waiting for another one to finish its dump,
as it's likely a case of warning (and maybe even a panic) that makes
them wait for each other and we don't want such cases to be reentrant.
Finally, we check in the main polling loop that the flag never accidentally
leaked (e.g. wrong flag manipulation) as this would be difficult to spot
with bad consequences.
This should be backported at least to 2.8, and should resolve github
issue #2860. Thanks to Chris Staite for the very informative backtrace
that exhibited the problem.
This reverts commit 5496d06b2b.
It breaks the build on Windows which apparently doesn't support the weak
attribute well on functions. It's not big deal anyway, playing with build
options while debugging still works though it's less easy to use.
Tests have shown that on modern CPUs it's interesting to wait a bit less
in cpu_relax(). Till now we were looping down to 60 iterations and then
switching to just barriers. Increasing the threshold to 90 iterations
left before getting out of the loop improved the average and max time
to grab a write lock by a few percent (e.g. 10% at 1us, 20% at 256ns
or lower). Higher values tend to progressively lose that gain so let's
stick to this one. This was measured on an EPYC 74F3 like previous
measurements that initially led to this value, and the value might
possibly depend on the mask applied to the loop counter.
This is plock commit 74ca0a7307fa6aec3139f27d3b7e534e1bdb748e.
Along many tests involving both haproxy's scheduler and forwarded
traffic, various exponents and algorithms were attempted for the EBO
and their effects were measured. It was found that a growth in 1.25^N
limited to 128k cycles consistently gives a better latency than 1.5^N
limited to 256k cycles, without degrading general performance. The
measures of the time to grab a write lock on a 48-thread EPYC show
that the number of occurrences of low times was roughly multiplied by
2-3 while the number of occurrences of times above 64us was reduced
by similar factors, to even reach 300 at 64us and limiting the maximum
time by a factor of 4.
The other variants that were experimented with are:
m = ((m + (m >> 1)) + 2) & 0x3ffff; // original
m = ((m + (m >> 1) + (m >> 3)) + 2) & 0x3ffff;
m = ((m + (m >> 1) + (m >> 4)) + 2) & 0x3ffff;
m = ((m + (m >> 1) + (m >> 4)) + 2) & 0x1ffff;
m = ((m + (m >> 1) + (m >> 4)) + 1) & 0x1ffff;
m = ((m + (m >> 2) + (m >> 4)) + 1) & 0x1ffff; // lowest CPU on pl_wr test + good perf
m = ((m + (m >> 2)) + 1) & 0x1ffff; // even lower cpu usage, lowest max
m = ((m + (m >> 1) + (m >> 2)) + 1) & 0x1ffff; // correct but slightly higher maxes
m = ((m + (m >> 1) + (m >> 3)) + 1) & 0x1ffff; // less good than m+m>>2
m = ((m + (m >> 2) + (m >> 3)) + 1) & 0x1ffff; // better but not as good as m+m>>2
m = ((m + (m >> 3) + (m >> 4)) + 1) & 0x1ffff; // less good, lower rates on small coounts.
m = ((m + (m >> 2) + (m >> 3) + (m >> 4)) + 1) & 0x1ffff; // less good as well
m = ((m & 0x7fff) + (m >> 1) + (m >> 4)) + 2;
m = ((m & 0xffff) + (m >> 1) + (m >> 4)) + 2;
This is plock commit dddd9ee01c522da33c353e2e4d4fd743d8336ec3.
It was noticed in haproxy that in certain extreme cases, a write lock
subject to EBO may fail for a very long time in front of a large set
of readers constantly trying to upgrade to the S state. The reason is
that among many readers, one will succeed in its upgrade, and this
situation can last for a very long time with many readers upgrading
in turn, while the writer waits longer and longer before trying again.
Here we're taking a reasonable approach which is that the write lock
should have a higher precedence in its attempt to grab the lock. What
is done is that instead of fully rolling back in case of conflict with
a pure S lock, the writer will only release its read part in order to
let the S upgrade to W if needed, and finish its operations. This
guarantees no other seek/read/write can enter. Once the conflict is
resolved, the writer grabs the read part again and waits for readers
to be gone (in practice it could even return without waiting since we
know that any possible wanderers would leave or even not be there at
all, but it avoids a complicated loop code that wouldn't improve the
practical situation but inflate the code).
Thanks to this change, the maximum write lock latency on a 48 threads
AMD with aheavily loaded scheduler went down from 256 to 64 ms, and the
number of occurrences of 32ms or more was divided by 300, while all
occurrences of 1ms or less were multiplied by up to 3 (3 for the 4-16ns
cases).
This is plock commit b6a28366d156812f59c91346edc2eab6374a5ebd.
The inlining of the lock waiting function was made more easily
configurable with commit 7505c2e ("plock: always expose the inline
version of the lock wait function"). However, the standard one remained
static, but in order to resolve the symbols in "perf top", it's much
better to export it, so let's move "static" with "inline" and leave it
exported when PLOCK_INLINE_EBO is not set.
This is plock commit 3bea7812ec705b9339bbb0ed482a2cd8aa6c185c.
The spop stream now reports the end of input when the ACK is transferred to
the SPOE applet. To do so, the flag SPOP_SF_ACK_RCVD was added. It is set on
the SPOP stream when its ACK is received by the SPOP connection.
In addition when SPOP stream flags are propagated to the SE, the error is
now reported if end of input was not reached instead of testing the
connection error code. It is more accurate.
This patch should be backported to 3.1.
This is the case for AWS-LC which derives from boringssl, where
X509_OBJECT_get0_X509_CRL() is already defined. There is definitively
no more need to define this function to build haproxy against TLS libs derived
from boringssl.
It is not rare to see configurations with a large number of "tcp-request
content" or "http-request" rules for instance. A large number of rules
combined with cpu-demanding actions (e.g.: actions that work on content)
may create thread contention as all the rules from a given ruleset are
evaluated under the same polling loop if the evaluation is not interrupted
Thus, in this patch we add extra logic around "tcp-request content",
"tcp-response content", "http-request" and "http-response" rulesets, so
that when a certain number of rules are evaluated under the single polling
loop, we force the evaluating function to yield. As such, the rule which
was about to be evaluated is saved, and the function starts evaluating
rules from the save pointer when it returns (in the next polling loop).
We use task_wakeup(task, TASK_WOKEN_MSG) to explicitly wake the task so
that no time is wasted and the processing is resumed ASAP. TASK_WOKEN_MSG
is mandatory here because process_stream() expects TASK_WOKEN_MSG for
explicit analyzers re-evaluation.
rules_bcount stream's attribute was added to count how manu rules were
evaluated since last interruption (yield). Also, SF_RULE_FYIELD flag
was added to know that the s->current_rule was assigned due to forced
yield and not regular yield.
By default haproxy will enforce a yield every 50 rules, this behavior
can be configured using the "tune.max-rules-at-once" global keyword.
There is a limitation though: for now, if the ACT_OPT_FINAL flag is set
on act_opts, we consider it is not safe to yield (as it is already the
case for automatic yield). In this case instead of yielding an taking
the risk of not being called back, we skip the yield and hope it will
not create contention. This is something we should ideally try to
improve in order to yield in all conditions.
A Read0 event could be ignored by the FCGI multiplexer if it is blocked on a
partial record. Instead of handling the event, it remained blocked, waiting
for the end of the record.
To fix the issue, the same solution than the H2 multiplexer is used. Two
flags are introduced. The first one, FCGI_CF_END_REACHED, is used to
acknowledge a read0. This flag is set when a read0 was received AND the FCGI
multiplexer must handle it. The second one, FCGI_CF_DEM_SHORT_READ, is set
when the demux is interrupted on a partial record. A short read and a read0
lead to set the FCGI_CF_END_REACHED flag.
With these changes, the FCGI mux should be able to properly handle read0 on
partial records.
This patch should be backported to all stable versions after a period of
observation.
Connection errors can be detected via connect/recv/send syscall, but also
because it was reported by the poller. So dedicated events, at the FD level,
are introduced to make the difference.
term_events tool was updated accordingly.
Enums used to report events were placed in the connection header for
conveniance. But it is not specifically related to connection. So, they are
moved at the end of the file to have a better isolation.
The function is now responsible to handle empty log because no event was
reported. In that case, an empty string is returned. It is also responsible to
handle case where termination events log is not supported for an given entity
(for instance the quic mux for now). In that case, a dash ("-") is returned.
It is hard to never detect the same event several time without painful
tests. In other words, the same termination event can be reported several
time and this must be handled. To do so, "tevt_report_event" macro is
updated to ignore an event if the last reported one is of the same type, for
the same location. Of course, if the same event is reported several times at
different moment, it will not be detected.
If it is the last patch to introduce dedicated termination events for each
location. In this one, events for the stream location are introcued. The old
enum is also removed because it is now unused.
Here, more accurate evets are added. The "intercepted" event was splitted.
Termination events dedicated to mux connection and stream-endpoint
descriptors are added in this patch. Specific events to these locations are
thus added. Changes for the H1 and H2 multiplexers are reviewed to be more
accurate.
To be able to add more accurate termination events for each location, the
enum will be splitted by location. Indeed, there are at most 16 possbile
events. It will be pretty confusing to use same termination events for the
different locations. So the best is to split them.
In this patch, the termination events for the fd, hs and xprt locations are
introduced. For now some holes are added to keep similar events aligned
across enums. But this may change in future.
MUX_CTL_TEVTS command is added to get the termination event logs of a mux
connection and MUX_SCTL_TEVTS command to get the termination event logs of a
mux stream.
In this patch, events for the stream location are reported. These events are
first reported on the corresponding stream-connector. So front events on scf
and back event on scb. Then all events are both merged in the stream. But
only 4 events are saved on the stream.
Several internal events are for now grouped with the type
"tevt_type_intercepted". More events will be added to have a better
resolution. But at least the place to report these events are identified.
For now, when a event is reported on a SC, it is also reported on the stream
and vice versa.
This termination events log will be used to report events from the mux
streams. The location will be "tevt_loc_se" and the muxes will be
responsible to report the corresponding events.
Termination events logs will be used to report the events that led to close
a connection. Unlike flags, that reflect a state, the idea here is to store
a log to preserve the order of the events. Most of time, when debugging an
issue, the order of the events is crucial to be able to understand the root
cause of the issue. The traces are trully heplful to do so. But it is not
always possible to active them because it is pretty verbose. On heavily
loaded platforms, it is not acceptable. We hope that the termination events
logs will help us in that situations.
One termination events log will be be store at each layer (connection, mux
connection, mux stream...) as a 32-bits integer. Each event will be store on
8 bits, 4 bits for the location and 4 bits for the type. So the first four
events will be stored only for each layer. It should be enough why a
connection is closed.
In this patch, the enums defining the termination event locations and types
are added. The macro to report a new event is also added and a function to
convert a termination events log to a string that could be display in log
messages for instance.
It is just a small patch to clean up mux/demux functions. Instead of listing
the H1S errors that must be handled during demux of mux operations, masks of
flags are used. It is more readable.
This patch adds a counter of close() on file descriptors in the fdtab.
The goal is to better detect if reported events concern the current or
a previous file descriptor. For now the counter is only added, and is
showed in "show fd" as "gen". We're reusing unused space at the end of
the struct. If it's needed for something more important later, this
patch can be reverted.
That's essentially in order to help with debugging strange cases like
the occasional epoll issues/races, by keeping a counter of how many
times an FD was taken over since last inserted. The room is available
so let's use it. If it's needed later, this patch can easily be reverted.
The counter is also reported in "show fd" as "tkov".
A new global option was recently introduced to disable pacing. However,
the value used (1<<31) caused issue with some compiler as options field
used for storage is declared as int. Move pacing deactivation flag
outside into the newly defined quic_tune to fix this.
This should be backported up to 3.1 after a period of observation. Note
that it relied on the previous patch which defined new quic_tune type.
Define a new structure quic_tune. It will be useful to regroup various
configuration settings and tunable related to QUIC, instead of defining
them into the global structure.
Pacing support was previously activated on each bind line individually,
via an optional argument of quic-cc-algo keyword. Remove this optional
argument and introduce a global setting to enable/disable pacing. Pacing
activation is still flagged as experimental.
One important change is that previously BBR usage automatically
activated pacing support. This is not the case anymore, so users should
now always explicitely activate pacing if BBR is selected. A new warning
message will be displayed if this is not the case.
Another consequence of this change is that now pacing_inter callback is
always defined for every quic_cc_algo types. As such, QUIC MUX uses
global.tune.options to determine if pacing is required.
This should be backported up to 3.1, after a period of observation.
Patch discussed in https://github.com/wolfSSL/wolfssl/issues/6834
When building Wolfssl without renegotiation options, WolfSSL still
defines the macros about it, which warns during the build.
This patch completes the previous one by undefining the macros so
haproxy could build without any warning.
In ticket https://github.com/wolfSSL/wolfssl/issues/6834, it was
suggested to push --enable-haproxy within --enable-distro.
WolfSSL does not want to include the renegotiation support in
--enable-distro.
To achieve this, let haproxy build without SSL_renegotiate_pending()
when wolfssl does not define HAVE_SECURE_RENEGOCIATION or
HAVE_SERVER_RENEGOCIATION_INFO.
Commit 44537379fc ("MINOR: tools: add errname to print errno macro
name") brought a facility to report errno using a symbolic string
when known instead of showing only the value. However, among the
listed options, ETIME is mentioned but is unknown from FreeBSD where
it breaks the build. Let's simply drop it, we don't use ETIME anyway
and even if it would be reported, the default code path still reports
the numeric value so there's no harm. If other ones fail to build in
the future, they could be handled the same way.
Thanks to the previous patch, it is now possible to explicitly rely on
stream's events to shut it down. The right event is set in
stream_shutdown(), before waking up the stream, via an atomic operation. In
process_stream(), this event will be handled as expected.
Thus, TASK_F_UEVT* are no longer used, but not removed since still usable
for other tasks.
This patch depends on "MEDIUM: stream: Map task wake up reasons to dedicated
stream events".
To fix thread-safety issues when a stream must be shut, three new task
states were added. These states are generic (UEVT1, UEVT2 and UEVT3), the
task callback function is responsible to know what to do with them. However,
it is not really scalable.
The best is to use an atomic field in the stream structure itself to deal
with these dedicated events. There is already the "pending_events" field
that save wake up reasons (TASK_WOKEN_*) to not loose them if
process_stream() is interrupted before it had a chance to handle them.
So the idea is to introduce a new field to handle streams dedicated events
and merged them with the task's wake up reasons used by the stream. This
means a mapping must be performed between some task wake up reasons and
streams events. Note that not all task wake up reasons will be mapped.
In this patch, the "new_events" field is introduced. It is an atomic
bit-field. Streams events (STRM_EVT_*) are also introduced to map the task
wake up reasons used by process_stream(). Only TASK_WOKEN_TIMER and
TASK_WOKEN_MSG are mapped, in addition to TASK_F_UEVT* flags. In
process_stream(), "pending_events" field is now filled with new stream
events and the mapping of the wake up reasons.
shutdown-backup-sessions action for on-marked-up directive does not work anymore
since the stream_shutdown() function was modified to be async-safe.
When stream_shutdown() was modified to be async-safe, dedicated task events were
added to map the reasons to shut a stream down. SF_ERR_DOWN was mapped to
TASK_F_EVT1 and SF_ERR_KILLED was mapped to TASK_F_EVT2. The reverse mapping was
performed by process_stream() to shut the stream with the appropriate reason.
However, SF_ERR_UP reason, used by shutdown-backup-sessions action to shut a
stream down because a preferred server became available, was not mapped in the
same way. So since commit b8e3b0a18d ("BUG/MEDIUM: stream: make
stream_shutdown() async-safe"), this action is ignored and does not work
anymore.
To fix an issue, and being able to bakcport the fix, a third task event was
added. TASK_F_EVT3 is now mapped on SF_ERR_UP.
This patch should fix the issue #2848. It must be backported as far as 2.6.
For both servers and proxies, use one connection queue per thread-group,
instead of only one. Having only one can lead to severe performance
issues on NUMA machines, it is actually trivial to get the watchdog to
trigger on an AMD machine, having a server with a maxconn of 96, and an
injector that uses 160 concurrent connections.
We now have one queue per thread-group, however when dequeueing, we're
dequeuing MAX_SELF_USE_QUEUE (currently 9) pendconns from our own queue,
before dequeueing one from another thread group, if available, to make
sure everybody is still running.
For both proxies and servers, properly calculates queueslength, which is
the total number of element in each queues (as they currently are only
using one queue, it is equivalent to the number of element of that
queue), and use it instead of the queue's length.
Add a per-thread group queue and associated fields in per-thread group
field in struct server, as well as a new field, queues length.
This is currently unused, so should change nothing.
Add a per-thread group field to struct proxy, that will contain a struct
queue, as well as a new field, "queueslength".
This is currently unused, so should change nothing.
Please note that proxy_init_per_thr() must now be called for each proxy
once the thread groups number is known.
acl_match_cond() combines acl_exec_cond() + acl_pass() and a check on the
condition->pol (to check if the cond is inverted) in order to return
either 0 if the cond doesn't match or 1 if it matches (or NULL).
Thanks to this we can actually simplify some redundant constructs that
iterate over rules and evaluate if the condition matches or not.
Conditions for tcp-request inspect-content and tcp-response
inspect-content couldn't be simplified because they perform an extra
check for missing data, and thus still need to leverage acl_exec_cond()
It's best to display the patch using "-w", like "git show xxxx -w",
because some blocks had to be re-indented after the cleanup, which
makes the patch hard to review by default.
As ssl_sock_gencert_load_ca and ssl_sock_gencert_free_ca are compiled only if
SSL_NO_GENERATE_CERTIFICATES is not defined, let's align it and move these
declarations in ssl_gencert.h.
ssl_sock_load_ca is defined in ssl_gencert.c and compiled only if
SSL_NO_GENERATE_CERTIFICATES is not defined. It's name is a bit confusing, as
we may think at the first glance, that it's a generic function, which is also
used to load CA file, provided via 'ca-file' keyword.
ssl_set_verify_locations_file is used in this case.
So let's rename ssl_sock_load_ca into ssl_sock_gencert_load_ca. Same is
applied to ssl_sock_free_ca.
Add helper to print the name of errno's corresponding macro, for example
"EINVAL" for errno=22. This may be helpful for debugging and for using in
some CLI commands output. The switch-case in errname() contains only the
errnos currently used in the code. So, it needs to be extended, if one starts
to use new syscalls.
Pacing burst size is now dynamic. As such, configuration value has been
removed and related fields in bind_conf and quic_cc_path structures can
be safely removed.
This should be backported up to 3.1.
Major improvements have been introduced in pacing recently. Most
notably, QMUX schedules emission on a millisecond resolution, which
allow to use passive wait to be much CPU friendly.
However, an issue remains with the pacing max credit. Unless BBR is
used, it is fixed to the configured value from quic-cc-algo bind
statement. This is not practical as if too low, it may drastically
reduce performance due to 1ms sleep resolution. If too high, some
clients will suffer from too much packet loss.
This commit fixes the issue by implementing a dynamic maximum credit
value based on the network condition specific to each clients.
Calculation is done to fix a maximum value which should allow QMUX
current tasklet context to emit enough data to cover the delay with the
next tasklet invokation. As such, avg_loop_us is used to detect the
process load. If too small, 1.5ms is used as minimal value, to cover the
extra delay incurred by the system which will happen for a default 1ms
sleep.
This should be backported up to 3.1.
Pacing algorithm has been revamped in the previous commit to implement a
credit based solution. This is a far more adaptative solution, in
particular which allow to catch up in case pause between pacing emission
was longer than expected.
This allows QMUX to remove the active loop based on tasklet wake-up.
Instead, a new task is used when emission should be paced. The main
advantage is that CPU usage is drastically reduced.
New pacing task timer is reset each time qcc_io_send() is invoked. Timer
will be set only if pacing engine reports that emission must be
interrupted. In this case timer is set via qcc_wakeup_pacing() to the
delay reported by congestion algorithm, or 1ms if delay is too short. At
the end of qcc_io_cb(), pacing task is queued if timer has been set.
Pacing task execution is simple enough : it immediately wakes up QCC I/O
handler.
Note that to have decent performance, it requires to have a large enough
burst defined in configuration of quic-cc-algo. However, this value is
common to every listener clients, which may cause too much loss under
network conditions. This will be address in a future patch.
This should be backported up to 3.1.
Implement a new method for QUIC pacing emission based on credit. This
represents the number of packets which can be emitted in a single burst.
After emission, decrement from the credit the number of emitted packets.
Several emission can be conducted in the same sequence until the credit
is completely decremented.
When a new emission sequence is initiated (i.e. under a new QMUX tasklet
invokation), credit is refilled according to the delay which occured
between the last and current emission context.
This new mechanism main advantage is that it allows to conduct several
emission in the same task context without having to wait between each
invokation. Wait is only forced if pacing is expired, which is now
equivalent to having a null credit.
Furthermore, if delay between two emissions sequence would have been
smaller than expected, credit is only partially refilled. This allows to
restart emission without having to wait for the whole credit to be
available.
On the implementation side, a new field <credit> is avaiable in
quic_pacer structure. It is automatically decremented on
quic_pacing_sent_done() invokation. Also, a new function
quic_pacing_reload() must be used by QUIC MUX when a new emission
sequence is initiated to refill credit. <next> field from quic_pacer has
been removed.
For the moment, credit is based on the burst configured via quic-cc-algo
keyword, or directly reported by BBR.
This should be backported up to 3.1.
Previously, congestion window was increased any time each time a new
acknowledge was received. However, it did not take into account the
window filling level. In a network condition with negligible loss, this
will cause the window to be incremented until the maximum value (by
default 480k), even though the application does not have enough data to
fill it.
In most cases, this issue is not noticeable. However, it may lead to
excessive memory consumption when a QUIC connection is suddendly
interrupted, as in this case haproxy will fill the window with
retransmission. It even has caused OOM crash when thousands of clients
were interrupted at once on a local network benchmark.
Fix this by first checking window level prior to every incrementation
via a new helper function quic_cwnd_may_increase(). It was arbitrarily
decided that the window must be at least 50% full when the ACK is
handled prior to increment it. This value is a good compromise to keep
window in check while still allowing fast increment when needed.
Note that this patch only concerns cubic and newreno algorithm. BBR has
already its notion of application limited which ensures the window is
only incremented when necessary.
This should be backported up to 2.6.
Rename one of the congestion algorithms pacing callback from pacing_rate
to pacing_inter. This better reflects that this function returns a delay
(in nanoseconds) which should be applied between each packet emission to
fill the congestion window with a perfectly smoothed emission.
This should be backported up to 3.1.
This is definitively a bug to call quic_tx_packet_refdec() to decrement the reference
counter of a TX packet calling quic_tx_packet_refdec(), and possibly to release its
memory when it is negative or null.
This counter is incremented when a TX frm is attached to it with some allocated memory
and when the packet is inserted into a data structure, if needed (list or tree).
Should be easily backported as far as 2.6 to ease any further backport around
this code part.
Reset ->prev and ->next fields of a coalesced TX packet to ensure it cannot access
several times its neighbours after it is supposed to be detached from them calling
quic_tx_packet_dgram_detach().
There are two cases where a packet can be coalesced to another previous built one:
this is when it is built into the same datagrame without GSO (and flagged flag with
QUIC_FL_TX_PACKET_COALESCED) or when sent from the same sendto() syscall with GOS
(not flagged with QUIC_FL_TX_PACKET_COALESCED).
This fix may be in relation with GH #2839.
Must be backported as far as 2.6.
version.c tries to centralize all variables conveying version information,
but there's still an issue with the BUILD_* variables which are only
passed to haproxy.o and are only updated when that one is rebuilt. This
is not very logical given that we can end up with values there which
contradict info from version.c.
Better move all of these to version.c which is systematically rebuilt.
Most of these variables only end up as string concatenation at the
moment. Some of them are even duplicated. In version.c we now have one
variable (or constant) for each of them and haproxy.c references them
in messages. This is much more logical and easier to maintain in a
consistent state.
The patch looks a bit large but it really only moves the ifdefed string
assignment from one file to another, placing them into variables.
Traces can be activated on startup either via -dt command line argument
or via the traces configuration section. This can caused confusion as it
may not be clear as trace source can be completed or overriden by one or
the other.
Fix the precedence to give the priority to the command line argument.
Now, each trace source configured via -dt is first resetted to a default
state before applying new settings. Then, it is impossible to change a
trace source via the configuration file if it was already targetted via
-dt argument.
At many places we'd like to be able to simply construct a path from a
format string and check if that path corresponds to an existing file,
directory etc. Here we add 3 functions, a generic one to test that a
path corresponds to a given file mode (e.g. S_IFDIR, S_IFREG etc), and
two other ones specifically checking for a file or a dir for easier
use.
Some commands such as $(cmd_CC) etc already handle the quiet vs verbose
mode in the makefile, but sometimes we may want to pass other info. The
new "qinfo" macro can be called with a 9-char string argument (spaces
included) as a prefix for some commands, to emit that string when in
quiet mode. The caller must fill the spaces needed for alignment. E.g:
$(call quinfo, CC )$(CC) ...
A recent fix was introduced to ensure that a streamdesc instance won't
be attached to an already completed QCS which is eligible to purging.
This was performed by skipping application protocol decoding if a QCS is
in such a state. Here is the patch responsible for this change.
caf60ac696
BUG/MEDIUM: mux-quic: do not attach on already closed stream
However, this is too restrictive, in particular for unidirection stream
where no streamdesc is never attached. To fix this behavior, first
qcs_attach_sc() API has been modified. Instead of returning a streamdesc
instance, it returns either 0 on success or a negative error code.
There should be no functional changes with this patch. It is only to be
able to extend qcs_attach_sc() with the possibility of skipping
streamdesc instantiation while still keeping a success return value.
This should be backported wherever the above patch has been merged. For
the record, it was scheduled for immediate backport on 3.1, plus merging
on older releases up to 2.8 after a period of observation.
As can be seen here, the build fails on m68k since commit 665dde648
("MINOR: debug: use LIM2A to show limits") in 3.1:
https://github.com/haproxy/haproxy/actions/runs/12440234399/job/34735360177
The reason is the comparison between a ulong limit and RLIM_INFINITY.
Indeed, on m68k, rlim_t is an unsigned long long. Let's just change
the function's input type to take an rlim_t instead. This also allows
to get rid of the casts in the call place.
This can be backported to 3.1 though it's not important given the low
prevalence of this platform for such use cases.
n 1.5-dev8, 13 years ago, support for setting pipe size was added by
commit bd9a0a778 ("OPTIM/MINOR: make it possible to change pipe size
(tune.pipesize)"). For compatibility purposes, it was defining
F_SETPIPE_SZ in compat.h if it was not set. It apparently always had
F_SETPIPE_SZ defined before being included.
Now in 3.2-dev1, commit fbc534a6f ("REORG: startup: move nofile limit
checks in limits.c") reordered a few includes and ended up with
mworker-prog.c including compat.h before fcntl.h, causing a redefinition
error on certain libcs:
CC src/mworker-prog.o
In file included from /usr/include/bits/fcntl.h:61:0,
from /usr/include/fcntl.h:35,
from include/haproxy/limits.h:11,
from include/haproxy/mworker.h:18,
from src/mworker-prog.c:27:
/usr/include/bits/fcntl-linux.h:203:0: warning: "F_SETPIPE_SZ" redefined [enabled by default]
In file included from include/haproxy/api-t.h:35:0,
from include/haproxy/api.h:33,
from src/mworker-prog.c:23:
include/haproxy/compat.h:161:0: note: this is the location of the previous definition
Let's simply include fcntl.h in compat.h before the macro is redefined.
There's normally no need to backport this, though it's harmless to do
it if needed.
There is a small race condition, where a server would check if there is
something left in the proxy queue, and adding something to the proxy
queue. If the server checks just before the stream is added to the queue,
and it no longer has any stream to deal with, then nothing will take
care of the stream, that may stay in the queue forever.
This was worked around with commit 5541d4995d, by checking for that exact
condition after adding the stream to the queue, and trying again to get
a server assigned if it is detected.
That fix lead to multiple infinite loops, that got fixed, but it is not
unlikely that it could happen again. So let's fix the initial problem
differently : a single server may mark itself as ready, and it removes
itself once used. The principle is that when we discover that the just
queued stream is alone with no active request anywhere ot dequeue it,
instead of rebalancing it, it will be assigned to that current "ready"
server that is available to handle it. The extra cost of the atomic ops
is negligible since the situation is super rare.
Make process_srv_queue() return the number of streams unqueued, as
pendconn_grab_from_px() did, as that number is used by
srv_update_status() to generate logs.
This should be backported up to 2.6 with
111ea83ed4
Add 2 counters in the SSL stats module for OCSP stapling.
- ssl_ocsp_staple is the number of OCSP response successfully stapled
with the handshake
- ssl_failed_ocsp_stapled is the number of OCSP response that we
couldn't staple, it could be because of an error or because the
response is expired.
These counters are incremented in the OCSP stapling callback, so if no
OCSP was configured they won't never increase. Also they are only
working in frontends.
This was discussed in github issue #2822.
In order to add stats from other files, the ssl_stats_module need to be
visible from other files.
This moves the ssl_counters definition in ssl_sock-t.h and removes the
static of ssl_stats_module.
Allow to build correctly without OCSP. It could be disabled easily with
OpenSSL build with OPENSSL_NO_OCSP. Or even with
DEFINE="-DOPENSSL_NO_OCSP" on haproxy make line.
When the ocsp response auto update process fails during insertion or
while validating the received ocsp response, we call
ssl_sock_update_ocsp_response or ssl_ocsp_check_response respectively
and both these functions take an 'err' parameter in which detailed error
messages can be written. Until now, those error messages were discarded
and the only information given to the user was a generic error
(ERR_CHECK or ERR_INSERT) which does not help much.
We now keep a pointer to the last error message in the certificate_ocsp
structure and dump its content in the update logs as well as in the
"show ssl ocsp-updates" cli command.
This issue was raised in GitHub #2817.
Define a set of functions to temporarily disable/reactivate tracing for
the current thread. This could be useful when wanting to quickly remove
tracing output for some code parts.
The API relies on a disable/resume set of functions, with a thread-local
counter. This counter is tested under __trace_enabled(). It is a
cumulative value so that the same count of resume must be issued after
several disable usage. There is also the possibility to force reset the
counter to 0 before restoring the old value.
This should be backported up to 3.1.
This commit is part of the current serie which aims to refactor and
improve overall performance of QUIC MUX I/O handler.
qcc_io_process() is responsible to perform some internal operations on
QUIC MUX after I/O completion. It is notably called on every qcc_io_cb()
tasklet handler.
The most intensive work on it is the purging of QCS instances after
transfer completion. This was implemented by looping on QCC streams tree
and inspecting the state of every QCS. The purpose of this commit is to
optimize this processing.
A new purg_list QCC member is defined. It is responsible to list every
QCS instances whose transfer has been completed. It is thus safe to
reuse <el_send> QCS list attach point. Stream purging will thus only
loop on purg_list instead of every known QCS.
This should be backported up to 3.1.
This commit is part of the current serie which aims to refactor and
improve overall performance of QUIC MUX I/O handler.
Define a recv_list element into qcc structure. This is used to
registered every instance of qcs which are currently blocked on
demuxing, which happen on no more space in <rx.appbuf>.
The purpose of this patch is to reduce qcc_io_recv() CPU usage. Now,
only recv_list iteration is performed, instead of the previous looping
over every qcs instances. This is useful as qcc_io_recv() is called each
time qcc_io_cb() is scheduled, even if only sending condition was the
wakeup origin.
A qcs is not inserted into recv_list immediately after blocking on demux
full buffer. Instead, this is only done after unblocking via stream
rcv_buf callback, which ensure that new buffer space is available.
This should be backported up to 3.1.
This commit refactors wait-for-handshake support from QUIC MUX. The flag
logic QC_CF_WAIT_HS is inverted : it is now positionned only if MUX is
instantiated before handshake completion. When the handshake is
completed, the flag is removed.
The flag is now set directly on initialization via qmux_init(). Removal
via qcc_wait_for_hs() is moved from qcc_io_process() to qcc_io_recv().
This is deemed more logical as QUIC MUX is scheduled on RECV to be
notify by the transport layer about handshake termination. Moreover,
qcc_wait_for_hs() is now called if recv subscription is still active.
This commit is the first of a serie which aims to refactor QUIC MUX I/O
handler and improves its overall performance. The ultimate objective is
to be able to stream qcc_io_cb() by removing pacing specific code path
via qcc_purge_sending().
This should be backported up to 3.1.
When the strict level is zero and BUG_ON() is not implemented, some
possible null-deref warnings are emitted again because some were
covering for these cases. Let's make it fall back to ASSUME() so that
the compiler continues to know that the tested expression never happens.
It also allows to further optimize certain functions by helping the
compiler eliminate certain tests for impossible values. However it
requires that the expression is really evaluated before passing the
result through ASSUME() otherwise it was shown that gcc-11 and above
will fail to evaluate its implications and will continue to emit the
null-deref warnings in case the expression is non-trivial (e.g. it
has multiple terms).
We don't do it for BUG_ON_HOT() however because the extra cost of
evaluating the condition is generally not welcome in fast paths,
particularly when that BUG_ON_HOT() was kept disabled for
performance reasons.
At plenty of places we have ALREADY_CHECKED() or DISGUISE() on a pointer
just to avoid "possibly null-deref" warnings. These ones have the side
effect of weakening optimizations by passing through an assembly step.
Using ASSUME_NONNULL() we can avoid that extra step. And when the
__builtin_unreachable() builtin is not present, we fall back to the old
method using assembly. The macro returns the input value so that it may
be used both as a declarative way to claim non-nullity or directly inside
an expression like DISGUISE().
Clang apparently has __builtin_assume() which does exactly the same
as our macro, since at least v3.8. Let's enable it, in case it may
even better detect assumptions vs unreachable code.
This macro takes an expression, tests it and calls an unreachable
statement if false. This allows the compiler to know that such a
combination does not happen, and totally eliminate tests that would
be related to this condition. When the statement is not available
in the compiler, we just perform a break from a do {} while loop
so that the expression remains evaluated if needed (e.g. function
call).
Due to __builtin_unreachable() only being associated to gcc 4.5 and
above, it turns out it was not enabled for clang. It's not used *that*
much but still a little bit, so let's enable it now. This reduces the
code size by 0.2% and makes it a bit more efficient.
We already have a __has_attribute() macro to detect when the compiler
supports a specific attribute, but we didn't have the equivalent for
builtins. clang-3 and gcc-10 have __has_builtin() for this. Let's just
bring it using the same mechanism as __has_attribute(), which will allow
us to simply define the macro's value for older compilers. It will save
us from keeping that many compiler-specific tests that are incomplete
(e.g. the __builtin_unreachable() test currently doesn't cover clang).
Let's encapsulate the code, which checks the applied nofile limit into
a separate helper check_nofile_lim_and_prealloc_fd(). Let's keep in this new
function scope the block, which tries to create a copy of FD with the highest
number, if prealloc-fd is set in the configuration.
In step_init_3() we try to apply provided or calculated earlier haproxy
maxsock and memmax limits.
Let's encapsulate these code blocks in dedicated functions:
apply_nofile_limit() and apply_memory_limit() and let's move them into
limits.c. Limits.c gathers now all the logic for calculating and setting
system limits in dependency of the provided configuration.
Let's encapsulate the code, which calculates global.maxconn and
global.maxsslconn into a dedicated function set_global_maxconn() and let's
move this function in limits.c. In limits.c we keep helpers to calculate and
check haproxy internal limits, based on the system nofile and memory limits.
->app_limited quic_drs struct member is not a boolean. This is
the index of the last transmitted packet marked as application-limited, or 0 if
the connection is not currently application-limited (see C.app_limited
definition in BBR v3 draft).
After these commits:
BUG/MINOR: quic: remove max_bw filter from delivery rate sampling
BUG/MINOR: quic: fix BBB max bandwidth oscillation issue
where some members were removed from bbr struct, the private data
size of QUIC cc algorithms may be reduced from 160 to 144 uint32_t.
Should be easily backported to 3.1 alonside the commits mentioned above.
This filter is no more needed after this commit:
BUG/MINOR: quic: fix BBB max bandwidth oscillation issue.
Indeed, one added this filter at delivery rate sampling level to filter
the BBR max bandwidth estimations and was inspired from ngtcp2 code source when
trying to fix the oscillation issue. But this BBR max bandwidth oscillation issue
was fixed by the aforementioned commit.
Furthermore this code tends to always increment the BBR max bandwidth. From my point
of view, this is not a good idea at all.
Must be backported to 3.1.
The windowed filters are used only the BBR implementation for QUIC to filter
the maximum bandwidth samples for its estimation over a virtual time interval
tracked by counting the cyclical progression through ProbeBW cycles. ngtcp2
and quiche use such windowed filters in their BBR implementation. But in a
slightly different way. When updating the 2nd or 3rd filter samples, this
is done based on their values in place of the time they have been sampled.
It seems more logical to rely on the sample timestamps even if this has no
implication because when a sample is updated using another sample because it
has the same value, they have both the same timestamps!
This patch modifies two statements which compare two consecutive filter samples
based on their values (smp[]->v) by statements which compare them based on the
virtual time they have been sampled (smp[]->t). This fully complies which the
code used by the Linux kernel in lib/win_minmax.c.
Alo take the opportunity of this patch to shorten some statements using <smp>
local variable value to update smp[2] sample in place of initializing its two
members with the <smp> member values.
This patch SHOULD be easily backported to 3.1 where BBR was first implemented.
Previous patch introduced stress mode to be able to easily test
alternative code paths.
The first point would be to force interruption of stats dump on every
line and check reentrant patchs, in particular while adding and removing
servers instances.
The purpose of this patch is to be able to use applet_putchk_stress()
during stats dump while not impacting other applets. To support this,
extract applet_putchk() into an internal _applet_putchk() which have a
new argument stress. Define two helpers applet_putchk() and
applet_putchk_stress(), the latter to set the stress argument to true.
For the moment, applet_putchk_stress() is not used. This will be the
subject of the next patch.
Define a new build mode DEBUG_STRESS. This will be used to stress some
code parts which cannot be reproduce easily with an alternative
suboptimal code.
First, a global <mode_stress> is set either to 1 or 0 depending on
DEBUG_STRESS compilation. A new global keyword "stress-level" is also
defined. It allows to specify a level from 0 to 9, to increase the
stress incurred on the code.
Helper macro STRESS_RUN* are defined for each stress level. This allows
to easily specify an instruction in default execution and a stress
counterpart if running on the corresponding stress level.
Since 9c91b30 ("MINOR: server: remove prev_deleted server list"), hlua
server pair iterator may use and return invalid (stale) server pointer
if multiple servers were deleted between two iterations.
Indeed, the server refcount mechanism (using srv_take()) is no longer
sufficient as the prev_deleted mitigation was removed.
To ensure server pointer consistency between two yields, the new watcher
mechanism must be used (as it already the case for stats dumping).
Thus in this patch we slightly change the server iteration logic:
hlua_server_list_iterator_context struct now stores the next valid server
pointer, and a watcher is added to ensure this pointer is never stale.
Then in hlua_listable_servers_pairs_iterator(), this next pointer is used
to create the Lua server object, and the next valid pointer is obtained by
leveraging watcher_next().
No backport needed unless 9c91b30 ("MINOR: server: remove prev_deleted
server list") is. Please note that dynamic servers were not supported in
Lua prior to 2.8, so it doesn't make sense to backport this patch further
than 2.8.
This patch is a direct follow-up to the previous one. Thanks to watcher
type, it is not safe to assume that servers manipulated via stats dump
were not targetted by a "delete server" CLI command. As such,
prev_deleted list server member is now unneeded. This patch thus removes
any reference to it.
If a server A is deleted while a stats dump is currently on it, deletion
is delayed thanks to reference counting. Server A is nonetheless removed
from the proxy list. However, this list is a single linked list. If the
next server B is deleted and freed immediately, server A would still
point to it. This problem has been solved by the prev_deleted list in
servers.
This model seems correct, but it is difficult to ensure completely its
validity. In particular, it implies when stats dump is resumed, server A
elements will be accessed despite the server being in a half-deleted
state.
Thus, it has been decided to completely ditch the refcount mechanism for
stats dump. Instead, use the watcher element to register every stats
dump currently tracking a server instance. Each time a server is deleted
on the CLI, each stats dump element which may points to it are updated
to access the next server instance, or NULL if this is the last server.
This ensures that a server which was deleted via CLI but not completely
freed is never accessed on stats dump resumption.
Currently, no race condition related to dynamic servers and stats dump
is known. However, as described above, the previous model is deemed too
fragile, as such this patch is labelled as bug-fix. It should be
backported up to 2.6, after a reasonable period of observation. It
relies on the following patch :
MINOR: list: define a watcher type
Define a new watcher type into list module. This type is similar to bref
and can be used to register an element which is currently tracking a
dynamic target. Contrary to bref, if the target is freed, every watcher
element are updated to point to a next valid entry or NULL.
This type will simplify handling of dynamic servers deletion, in
particular while stats dump are performed.
This patch is not a bug-fix. However, it is mandatory to fix a race
condition in dynamic servers. Thus, it should be backported along the
next commit up to 2.6.
Some master process' initialization steps are conditioned by receiving the
READY message from worker (pidfile creation, forwarding READY message to the
launching parent). So, master process can not do these initialization routines
before.
If the master process fails, while creating pid or forwarding the READY to the
parent in daemon mode, he exits with a proper alert message. In daemon mode we
no longer see such message, as process is already detached from the tty.
To fix this, as these alerts could be very useful, let's detach the master
process from the tty after his last initialization steps in _send_status.
Due to master-worker rework, daemonization fork happens now before parsing
and applying the configuration. This makes impossible to report correctly all
warnings and alerts to shell's stdout. Daemonzied process fails, while being
already in background, exit code reported by shell via '$?' equals to 0, as
it's the exit code of his parent.
To fix this, let's create a pipe between parent and daemonized child. The
child will send into this pipe a "READY" message, when it finishes his
initialization. The parent will wait on the "read" end of the pipe until
receiving something. If read() fails, parent obtains the status of the
exited child with waitpid(). So, the parent can correctly report the error to
the stdout and he can exit with child's exitcode.
This fix should be backported only in 3.1.
fddebug() is sometimes quite helpful, but annoying to use when following
a call path because it's a pain to always repeat the function name and
call place. Let's have it automatically prepend the function name, the
file name and the line number, and make its arguments optional, replacing
them by a simple LF when all absent. This way, simply placing:
fddebug();
is sufficient to emit a location follocing "[%s@%s:%d]\n". This function
must not be used in production (and even call places with it shouldn't be
committed) and it should only be used by developers, so the simplest the
better.
Commit 7f64bb79fd ("BUG/MINOR: debug: COUNT_IF() should return true/false")
allowed the COUNT_IF() macro to return the evaluated value. This is handy
to place it in "if ()" conditions and count them at the same time. When
glitches are disabled, the condition is just returned as-is, but most call
places do not use the result, making some compilers complain. In addition,
while reviewing this, it was noticed that when DEBUG_STRICT=0, the macro
would still be replaced by a "do { } while (0)" statement, which not only
does not evaluate the expression, but also cannot return anything. Ditto
for COUNT_IF_HOT().
Let's make sure both are always properly evaluated now.
The COUNT_IF() macro was initially meant to return true/false to be used
in if() conditions but had an extra do { } while(0) that prevents it from
doing so. Let's get rid of the do { } while(0) before the code generalizes
to too many places. There's no impact on existing code, but may have to be
backported if future fixes rely on it.
If master process can't open a pidfile, there is no sense to send SIGTTIN to
oldpids, as it will exit. So, old workers will terminate as well. It's better
to send the last alert to the log about unrecoverable error, because master is
already in its polling loop.
For the standalone mode we should keep the previous logic in this case: send
SIGTTIN to old process and unbind listeners for the new one. So, it's better
to put this error path in main(), as it's done when other configuration settings
can't be applied.
This patch should be backported only in 3.1.
Better late than never, commit 1f73d35 ("MINOR: stktable: implement
"recv-only" table option") implemented stktable flags and initial
definitions, but it lacks some comments plus the flag is stored as
16bits but the SKT_FL_ definition width allows for only 8bits so
it is a bit confusing, let's fix that
Thanks to previous commit stktable struct now have a "flags" struct member
Let's take this opportunity to remove the isolated "nopurge" attribute in
stktable struct and rely on a flag named STK_FL_NOPURGE instead.
This helps to better organize stktable struct members.
When "recv-only" keyword is added on a stick table declaration (in peers
or proxy section), haproxy considers that the table is only used for
data retrieval from a remote location and not used to perform local
updates. As such, it enables the retrieval of local-only values such
as conn_cur that are ignored by default. This can be useful in some
contexts where we want to know about local-values such are conn_cur
from a remote peer.
To do this, add stktable struct flags which default to NONE and enable
the RECV_ONLY flag on the table then "recv-only" keyword is found in the
table declaration. Then, when in peer_treat_updatemsg(), when handling
table updates, don't ignore data updates for local-only values if the flag
is set.
Now when tasklets are woken up via tasklet_wakeup(), tasklet_wakeup_on()
or tasklet_wakeup_after(), either the optional wakeup flags will be used,
or TASK_WOKEN_OTHER will be used.
This allows tasklet handlers waking up for any given cause to notice
whether or not they were also woken for another reason. For example, a
mux handler could skip heavy parts when seeing that TASK_WOKEN_OTHER is
absent, proving that no standard tasklet_wakeup() was done, for example
in response to a subscribe().
The benefit of the TASK_WOKEN_* flags is that they're purged during the
wakeup, and that they're easy to check for using TASK_WOKEN_ANY.
TASK_F_UEVT1 and TASK_F_UEVT2 are also usable for private use (e.g. wakeup
from a stream to a connection inside a mux).
Probably that in the future, code dealing with subscribe events should
start to place TASK_WOKEN_IO like is done for upper layers.
After master-worker mode refactoring, global.pidfile is only used in
handle_pidfile(), which opens the provided file and writes the PID into it. So,
it's more appropriate to perform the close(pidfd) and ha_free(&global.pidfile)
also in this function.
This commit prepares the fix of the pidfile creation, as it's created now very
early, when we are not sure, that process has successfully started. In
master-worker mode handle_pidfile() can be called in the master process context.
So, let's make it accessible from other compilation units via global.h.
This should be backported only in 3.1.
commit() method may be used to commit pending updates on the local patref
object:
hlua_patref flags were added:
HLUA_PATREF_FL_GEN means the patref object has been updated
and it is associated to a new revision (curr_gen) in order to prepare
and commit the pending updates.
upon commit, the pattern API is leveraged with curr_gen as revision to
commit new object items. Once commit is performed, previous (pending)
revisions that are older than the committed one are cleaned up (similar
to what's done with commit on the cli). Also, Patref function APIs now
take into account curr_gen to perform lookups.
pat_ref_may_commit() may be used to know if a given generation ID id still
valid, which means it may still be committed at some point. Else it means
that another pending generation ID older than the tested one was already
committed and thus other generations ID below this one are stale and must
be regenerated.
In order to extend the patref class features, let's wrap the pat_ref struct
into hlua_patref struct. This way we may add additional data alongside the
pat_ref pointer to store additional context required for pat_ref data
manipulation from lua.
Since the wrapper (hlua_patref) is an allocated object, we declare the _gc
metamethod for patref class in order to properly cleanup resources when
they are out of scope.
patref object may now leverage index and pair methamethods to list and
access patref elements at a specific index (=key)
Also, patref:is_map() method may be used to know if the patref stores acl
(key only) or map-style (key:value) patterns.
Implement patref class to expose pat_ref struct internal pattern struct
in lua. This is some prerequisite work needed to be able to manipulate
exisiting generic pattern object lists (acl/map) from Lua, because the Map
class can only be used to perform matching ops on Map files.
Now that PAT_REF events were defined in previous commit, let's actually
publish them from pattern API where relevant. Unlike server events,
pattern reference events are only published in the pat_ref subscriber's
list on purpose, because in some setups patref updates (updates performed
on a map for instance from action or cli) are very frequent, and we don't
want to impact pattern API performance just for that.
Moreover, as the main use case is to be able to subscribe to maps updates
from Lua, allowing a per-pattern reference registration is already enough.
No additional data is provided for such events (also for performance reason)
Care was taken not to publish events when the update doesn't affect the
live subset (the one targeted by curr_gen).
This is some prerequisite work for implementing PAT_REF events.
In this commit we define the PAT_REF event_hdl family (which gets family
slot id #2), with the following supported events:
- EVENT_HDL_SUB_PAT_REF_ADD: element was added to the current version of
the pattern ref
- EVENT_HDL_SUB_PAT_REF_DEL: element was deleted from the current
version of the pattern ref
- EVENT_HDL_SUB_PAT_REF_SET: element was modified in the current version
of the pattern ref
- EVENT_HDL_SUB_PAT_REF_COMMIT: pending element(s) was/were commited in
the current version of the pattern ref
- EVENT_HDL_SUB_PAT_REF_CLEAR: all elements were cleared from the
current version of the pattern ref
The goal is to be able to track a pat_ref struct in order to be notified
when it is updated. For performance reasons, events from this family won't
provide any additional info, and will only be published in the pat_ref
subscription list. Indeed, pat_ref may be updated at a relatively high
frequency (or worse, batch work), so we cannot afford doing expensive
treatment for each update.
This patch fixes the loss of information when computing the delivery rate
(quic_cc_drs.c) on links with very low latency due to usage of 32bits
variables with the millisecond as precision.
Initialize the quic_conn task with TASK_F_WANTS_TIME flag ask it to ask
the scheduler to update the call date of this task. This allows this task to get
a nanosecond resolution on the call date calling task_mono_time(). This is enabled
only for congestion control algorithms with delivery rate estimation support
(BBR only at this time).
Store the send date with nanosecond precision of each TX packet into
->time_sent_ns new quic_tx_packet struct member to store the date a packet was
sent in nanoseconds thanks to task_mono_time().
Make use of this new timestamp by the delivery rate estimation algorithm (quic_cc_drs.c).
Rename current ->time_sent member from quic_tx_packet struct to ->time_sent_ms to
distinguish the unit used by this variable (millisecond) and update the code which
uses this variable. The logic found in quic_loss.c is not modified at all.
Must be backported to 3.1.
The "421" status can now be specified on retry-on directives. PR_RE_* flags
were updated to remains sorted.
This patch should fix the issue #2794. It is quite simple so it may safely
be backported to 3.1 if necessary.
pat_ref_gen_delete(ref, gen_id, key) tries to delete all samples belonging
to <gen_id> and matching <key> under <ref>
The goal is to be able to target a single subset from <ref>
pat_ref_gen_find_elt(ref, gen_id, key) tries to find <elt> element
belonging to <gen_id> and matching <key> in <ref> reference.
The goal is to be able to target a single subset from <ref>
pat_ref_gen_set(ref, gen_id, value, err) modifies to <value> the sample
of all patterns matching <key> and belonging to <gen_id> (generation id)
under <ref>
The goal is to be able to target a single subset from <ref>
split pat_ref_set() function in 2 distinct functions. Indeed, since
0844bed7d3 ("MEDIUM: map/acl: Improve pat_ref_set() efficiency (for
"set-map", "add-acl" action perfs)"), pat_ref_set() prototype was updated
to include an extra <elt> argument. But the logic behind is not explicit
because the function will not only try to set <elt>, but also its
duplicate (unlike pat_ref_set_elt() which only tries to update <elt>).
Thus, to make it clearer and better distinguish between the key-based
lookup version and the elt-based one, restotre pat_ref_set() previous
prototype and add a dedicated pat_ref_set_elt_duplicate() that takes
<elt> as argument and tries to update <elt> and all duplicates.
A UDP datagram cannot be greater than 65535 bytes, as UDP length header
field is encoded on 2 bytes. As such, sendmsg() will reject a bigger
input with error EMSGSIZE. By default, this does not cause any issue as
QUIC datagrams are limited to 1.252 bytes and sent individually.
However, with GSO support, value bigger than 1.252 bytes are specified
on sendmsg(). If using a bufsize equal to or greater than 65535, syscall
could reject the input buffer with EMSGSIZE. As this value is not
expected, the connection is immediately closed by haproxy and the
transfer is interrupted.
This bug can easily reproduced by requesting a large object on loopback
interface and using a bufsize of 65535 bytes. In fact, the limit is
slightly less than 65535, as extra room is also needed for IP + UDP
headers.
Fix this by reducing the count of datagrams encoded in a single GSO
invokation via qc_prep_pkts(). Previously, it was set to 64 as specified
by man 7 udp. However, with 1252 datagrams, this is still too many.
Reduce it to a value of 52. Input to sendmsg will thus be restricted to
at most 65.104 bytes if last datagram is full.
If there is still data available for encoding in qc_prep_pkts(), they
will be written in a separate batch of datagrams. qc_send_ppkts() will
then loop over the whole QUIC Tx buffer and call sendmsg() for each
series of at most 52 datagrams.
This does not need to be backported.
Let's move mworker_reexec() and mworker_reload() in mworker.c. mworker_reload()
is called only within the functions, which are already in mworker.c. So, this
reorganization allows to declare mworker_reload() as a static.
mworker_run_master() is called only in master mode. mworker_loop() is static
and called only in mworker_run_master(). So let's move these both functions in
mworker.c.
We also need here to make run_thread_poll_loop() accessible from other units,
as it's used in mworker_loop().
mworker_prepare_master() performs some preparation routines for the new worker
process, which will be forked during the startup. It's called only in
master-worker mode, so let's move it in mworker.c.
mworker_on_new_child_failure() performs some routines for the worker process,
if it has failed the reload. As it's called only in mworker_catch_sigchld()
from mworker.c, let's move mworker_on_new_child_failure() in mworker.c as well.
Like this it could also be declared as a static.
This patch prepares the moving of on_new_child_failure definition into
mworker.c. So, let's rename it accordingly and let's also update its
description.
QUIC pacing was recently implemented to limit burst and improve overall
bandwidth. This is used only for MUX STREAM emission. Pacing requires
nanosecond resolution. As such, it used now_cpu_time() which relies on
clock_gettime() syscall.
The usage of clock_gettime() has several drawbacks :
* it is a syscall and thus requires a context-switch which may hurt
performance
* it is not be available on all systems
* timestamp is retrieved multiple times during a single task execution,
thus yielding different values which may tamper pacing calculation
Improve this by using task_mono_time() instead. This returns task call
time from the scheduler thread context. It requires the flag
TASK_F_WANTS_TIME on QUIC MUX tasklet to force the scheduler to update
call time with now_mono_time(). This solves every limitations listed
above :
* syscall invokation is only performed once before tasklet execution,
thus reducing context-switch impact
* on non compatible system, a millisecond timer is used as a fallback
which should ensure that pacing works decently for them
* timer value is now guaranteed to be fixed duing task execution
In the memprofile summary per DSO, we currently have to pay a high price
by calling dladdr() on each symbol when doing the summary per DSO at the
end, while we're not interested in these details, we just want the DSO
name which can be made cheaper to obtain, and easier to manipulate. So
let's create resolve_dso_name() to only extract minimal information from
an address. At the moment it still uses dladdr() though it avoids all the
extra expensive work, and will further be able to leverage the same
mechanism as "show libs" to instantly spot DSO from address ranges.
Some dependencies might very well rely on posix_memalign(), strndup()
or other less portable callsn making us miss them when chasing memory
leaks, resulting in negative global allocation counters. Let's provide
the handlers for the following functions:
strndup() // _POSIX_C_SOURCE >= 200809L || glibc >= 2.10
valloc() // _BSD_SOURCE || _XOPEN_SOURCE>=500 || glibc >= 2.12
aligned_alloc() // _ISOC11_SOURCE
posix_memalign() // _POSIX_C_SOURCE >= 200112L
memalign() // obsolete
pvalloc() // obsolete
This time we don't fail if they're not found, we just silently forward
the calls.
Some memory profiling outputs have showed negative counters, very likely
due to some libs calling strdup(). Let's add it to the list of monitored
activities.
Actually even haproxy itself uses some. Having "profiling.memory on" in
the config reveals 35 call places.
In "show profiling memory", we need to distinguish methods which really
free memory from those which do not so that we don't account for the
free value twice. However for now it's done using multiple tests, which
are going to complicate the addition of new methods. Let's switch to a
bit field defined as a mask in a single place instead, as we don't
intend to use more than 32/64 methods!
There's actually a problem with memprofiles: the pool pointer is stored
in ->info but some pools are replaced during startup, such as the trash
pool, leaving a dangling pointer there.
Let's complete the API with a new function memprof_remove_stale_info()
that will remove all stale references to this info pointer. It's also
present when USE_MEMORY_PROFILING is not set so as to ease the job on
callers.
Let's store progname in the global variable, as it is handy to use it in
different parts of code to format messages sent to stdout.
This reduces the number of arguments, which we should pass to some functions.
This commit prepares the usage of the global progname variable.
prepare_caps_from_permitted_set() use progname value in warning messages. So,
let's rename program_name argument to progname.
The ->app_limited member of the delivery rate struct (quic_cc_drs) aim is to
store the index of the last transmitted byte marked as application-limited
so that to track the application-limited phases. During these phases,
BBR must ignore delivery rate samples to properly estimate the delivery rate.
Without such a patch, the Startup phase could be exited very quickly with
a very low estimated bottleneck bandwidth. This had a very bad impact
on little objects with download times smaller than the expected Startup phase
duration. For such objects, with enough bandwith, BBR should stay in the Startup
state.
No need to be backported, as BBR is implemented in the current developement version.
Extend extra pacing support for newreno and nocc congestion algorithms,
as with cubic.
For better extensibility of cc algo definition, define a new flags field
in quic_cc_algo structure. For now, the only value is
QUIC_CC_ALGO_FL_OPT_PACING which is set if pacing support can be
optionally activated. Both cubic, newreno and nocc now supports this.
This new flag is then reused by QUIC config parser. If set, extra
quic-cc-algo burst parameter is taken into account. If positive, this
will activate pacing support on top of the congestion algorithm. As with
cubic previously, pacing is only supported if running under experimental
mode.
Only BBR is not flagged with this new value as pacing is directly
builtin in the algorithm and cannot be turn off. Furthermore, BBR
calculates automatically its value for maximum burst. As such, any
quic-cc-algo burst argument used with BBR is still ignored with a
warning.
This only concerns functions emitting warnings about misplaced tcp-request
rules. The direction is now specified in the functions name. For instance
"warnif_misplaced_tcp_conn" is replaced by "warnif_misplaced_tcp_req_conn".
In warnings about misplaced rules, only the first keyword is mentionned. It
works well for http-request or quic-initial rules for instance. But it is a
bit confusing for tcp-request rules, because the layer is missing (session
or content).
To make it a bit systematic (and genric), the second argument can now be
provided. It can be set to NULL if there is no layer or scope. But
otherwise, it may be specified and it will be reported in the warning.
So the following snippet:
tcp-request content reject if FALSE
tcp-request session reject if FALSE
tcp-request connection reject if FALSE
Will now emit the following warnings:
a 'tcp-request session' rule placed after a 'tcp-request content' rule will still be processed before.
a 'tcp-request connection' rule placed after a 'tcp-request session' rule will still be processed before.
This patch should fix the issue #2596.
qc_packet_loss_lookup() aim is to detect the packet losses. This is this function
which must called ->on_pkt_lost() BBR specific callback. It also set
<bytes_lost> passed parameter to the total number of bytes detected as lost upon
an ACK frame receipt for its caller.
Modify qc_release_lost_pkts() to call ->congestion_event() with the send time
from the newest packet detected as lost.
Modify qc_release_lost_pkts() to call ->slow_start() callback only if define
by the congestion control algorithm. This is not the case for BBR.
Add several callbacks to quic_cc_algo struct which are only called by BBR.
->get_drs() may be used to retrieve the delivery rate sampling information
from an congestion algorithm struct (quic_cc).
->on_transmit() must be called before sending any packet a QUIC sender.
->on_ack_rcvd() must be called after having received an ACK.
->on_pkt_lost() must be called after having detected a packet loss.
->congestion_event() must be called after any congestion event detection
Modify quic_cc.c to call ->event only if defined. This is not the case
for BBR.
Implement the version 3 of BBR for QUIC specified by the IETF in this draft:
https://datatracker.ietf.org/doc/draft-ietf-ccwg-bbr/
Here is an extract from the Abstract part to sum up the the capabilities of BBR:
BBR ("Bottleneck Bandwidth and Round-trip propagation time") uses recent
measurements of a transport connection's delivery rate, round-trip time, and
packet loss rate to build an explicit model of the network path. BBR then uses
this model to control both how fast it sends data and the maximum volume of data
it allows in flight in the network at any time. Relative to loss-based congestion
control algorithms such as Reno [RFC5681] or CUBIC [RFC9438], BBR offers
substantially higher throughput for bottlenecks with shallow buffers or random
losses, and substantially lower queueing delays for bottlenecks with deep buffers
(avoiding "bufferbloat"). BBR can be implemented in any transport protocol that
supports packet-delivery acknowledgment. Thus far, open source implementations
are available for TCP [RFC9293] and QUIC [RFC9000].
In haproxy, this implementation is considered as still experimental. It depends
on the newly implemented pacing feature.
BBR was asked in GH #2516 by @KazuyaKanemura, @osevan and @kennyZ96.
Implement the Kathleen Nichols' algorithm used by several congestion control
algorithm implementation (TCP/BBR in Linux kernel, QUIC/BBR in quiche) to track
the maximum value of a data type during a fixe time interval.
In this implementation, counters which are periodically reset are used in place
of timestamps.
Only the max part has been implemented.
(see lib/minmax.c implemenation for Linux kernel).
Add ->initial_wnd new member to quic_cc_path struct to keep the initial value
of the congestion window. This member is initialized as soon as a QUIC connection
is allocated. This modification is required for BBR congestion control algorithm.
Once in a while we find some makefiles ignoring some outdated arguments
and just emit a warning. What's annoying is that if users (say, distro
packagers), have purposely added ERR=1 to their build scripts to make
sure to fail on any warning, these ones will be ignored and the build
can continue with invalid or missing options.
William rightfully suggested that ERR=1 should also catch make's warnings
so this patch implements this, by creating a new "complain" variable that
points either to "error" or "warning" depending on $(ERR), and that is
used to send the messages using $(call $(complain),...). This does the
job right at little effort (tested from GNU make 3.82 to 4.3).
Note that for this purpose the ERR declaration was upped in the makefile
so that it appears before the new errors.mk file is included.
tasklet_wakeup_on() and its derivates (tasklet_wakeup_after() and
tasklet_wakeup()) do not support passing a wakeup cause like
task_wakeup(). This is essentially due to an API limitation cause by
the fact that for a very long time the only reason for waking up was
to process pending I/O. But with the growing complexity of mux tasks,
it is becoming important to be able to skip certain heavy processing
when not strictly needed.
One possibility is to permit the caller of tasklet_wakeup() to pass
flags like task_wakeup(). Instead of going with a complex naming scheme,
let's simply make the flags optional and be zero when not specified. This
means that tasklet_wakeup_on() now takes either 2 or 3 args, and that the
third one is the optional flags to be passed to the callee. Eligible flags
are essentially the non-persistent ones (TASK_F_UEVT* and TASK_WOKEN_*)
which are cleared when the tasklet is executed. This way the handler
will find them in its <state> argument and will be able to distinguish
various causes for the call.
Everything in the tasklet layer supports flags, except that they are
just not implemented in the wakeup functions, while they are in the
task_wakeup functions. Initially it was not considered useful to pass
wakeup causes because these were essentially I/O, but with the growing
number of I/O handlers having to deal with various types of operations
(typically cheap I/O notifications on subscribe vs heavy parsing on
application-level wakeups), it would be nice to start to make this
distinction possible.
This commit extends _tasklet_wakeup_on() and _tasklet_wakeup_after()
to pass a set of flags that continues to be set as zero. For now this
changes nothing, but new functions will come.
Currently tasks being profiled have th_ctx->sched_call_date set to the
current nanosecond in monotonic time. But there's no other way to have
this, despite the scheduler being capable of it. Let's just declare a
new task flag, TASK_F_WANTS_TIME, that makes the scheduler take the time
just before calling the handler. This way, a task that needs nanosecond
resolution on the call date will be able to be called with an up-to-date
date without having to abuse now_mono_time() if not needed. In addition,
if CLOCK_MONOTONIC is not supported (now_mono_time() always returns 0),
the date is set to the most recently known now_ns, which is guaranteed
to be atomic and is only updated once per poll loop.
This date can be more conveniently retrieved using task_mono_time().
This can be useful, e.g. for pacing. The code was slightly adjusted so
as to merge the common parts between the profiling case and this one.
We used to store it in 32-bits since we'd only use it for latency and CPU
usage calculation but usages will evolve so let's not truncate the value
anymore. Now we store the full 64 bits. Note that this doesn't even
increase the storage size due to alignment. The 3 usage places were
verified to still be valid (most were already cast to 32 bits anyway).
To improve debugging, extend "show quic" output to report if pacing is
activated on a connection. Two values will be displayed for pacing :
* a new counter paced_sent_ctr is defined in QCC structure. It will be
incremented each time an emission is interrupted due to pacing.
* pacing engine now saves the number of datagrams sent in the last paced
emission. This will be helpful to ensure burst parameter is valid.
Define a new QUIC congestion algorithm token 'cubic-pacing' for
quic-cc-algo bind keyword. This is identical to default cubic
implementation, except that pacing is used for STREAM frames emission.
This algorithm supports an extra argument to specify a burst size. This
is stored into a new bind_conf member named quic_pacing_burst which can
be reuse to initialize quic path.
Pacing support is still considered experimental. As such, 'cubic-pacing'
can only be used with expose-experimental-directives set.
Support pacing emission for STREAM frames at the QUIC MUX layer. This is
implemented by adding a quic_pacer engine into QCC structure.
The main changes have been written into qcc_io_send(). It now
differentiates cases when some frames have been rejected by transport
layer. This can occur as previously due to congestion or FD buffer full,
which requires subscribing on transport layer. The new case is when
emission has been interrupted due to pacing timing. In this case, QUIC
MUX I/O tasklet is rescheduled to run with the flag TASK_F_USR1.
On tasklet execution, if TASK_F_USR1 is set, all standard processing for
emission and reception is skipped. Instead, a new function
qcc_purge_sending() is called. Its purpose is to retry emission with the
saved STREAM frames list. Either all remaining frames can now be send,
subscribe is done on transport error or tasklet must be rescheduled for
pacing purging.
In the meantime, if tasklet is rescheduled due to other conditions,
TASK_F_USR1 is reset. This will trigger a full regeneration of STREAM
frames. In this case, pacing expiration must be check before calling
qcc_send_frames() to ensure emission is now allowed.
QUIC MUX will be responsible to drive emission with pacing. This will be
implemented via setting TASK_F_USR1 before I/O tasklet wakeup. To
prepare this, encapsulate each I/O tasklet wakeup into a new function
qcc_wakeup().
This commit is purely refactoring prior to pacing implementation into
QUIC MUX.
For STREAM emission, MUX QUIC previously used a local list defined under
qcc_io_send(). This was suitable as either all frames were sent, or
emission must be interrupted due to transport congestion or fatal error.
In the latter case, the list was emptied anyway and a new frame list was
built on future qcc_io_send() invokation.
For pacing, MUX QUIC may have to save the frame list if pacing should be
applied across emission. This is necessary to avoid to unnecessarily
rebuilt stream frame list between each paced emission. To support this,
STREAM list is now stored as a member of QCC structure.
Ensure frame list is always deleted, even on QCC release, using newly
defined utility function qcc_tx_frms_free().
qc_send_mux() has been extended previously to support pacing emission.
This will ensure that no more than one datagram will be emitted during
each invokation. However, to achieve better performance, it may be
necessary to emit a batch of several datagrams one one turn.
A so-called burst value can be specified by the user in the
configuration. However, some congestion control algos may defined their
owned dynamic value. As such, a new CC callback pacing_burst is defined.
quic_cc_default_pacing_burst() can be used for algo without pacing
interaction, such as cubic. It will returns a static value based on user
selected configuration.
Pacing will be implemented for STREAM frames emission. As such,
qc_send_mux() API has been extended to add an argument to a quic_pacer
engine.
If non NULL, engine will be used to pace emission. In short, no more
than one datagram will be emitted for each qc_send_mux() invokation.
Pacer is then notified about the emission and a timer for a future
emission is calculated. qc_send_mux() will return PACING error value, to
inform QUIC MUX layer that it will be responsible to retry emission
after some delay.
Extend quic_pacer engine to support pacing emission. Several functions
are defined.
* quic_pacing_sent_done() to notify engine about an emission of one or
several datagrams
* quic_pacing_expired() to check if emission should be delayed or can be
conducted immediately
This commit is part of a adjustment on QUIC transport send API to
support pacing. Here, qc_send_mux() return type has been changed to use
a new enum quic_tx_err.
This is useful to explain different failure causes of emission. For now,
only two values have been defined : NONE and FATAL. When pacing will be
implemented, a new value would be added to specify that emission was
interrupted on pacing. This won't be a fatal error as this allows to
retry emission but not immediately.
Extend QUIC transport emission function to support a maximum datagram
argument. The purpose is to ensure that qc_send() won't emit more than
the specified value, unless it is 0 which is considered as unlimited.
In qc_prep_pkts(), a counter of built datagram has been added to support
this. The packet building loop is interrupted if it reaches a specified
maximum value. Also, its return value has been changed to the number of
prepared datagrams. This is reused by qc_send() to interrupt its work if
a specified max datagram argument value is reached over one or several
iteration of prepared/sent datagrams.
This change is necessary to support pacing emission. Note that ideally,
the total length in bytes of emitted datagrams should be taken into
account instead of the raw number of datagrams. However, for a first
implementation, it was deemed easier to implement it with the latter.
Add QCC QC_CF_WAIT_FOR_HS and QCS QC_SF_TXBUB_OOB flags to their
respective show_flags to be able to decipher them via dev flags utility.
These values have been added in the current dev version, thus no need to
backport this patch.
When the request is too large to fit in a buffer a 414 or a 431 error
message is returned depending on the error state of the request parser. A
414 is returned if the URI is too long, otherwise a 431 is returned.
This patch should fix the issue #1309.
414-Uri-Too-Long and 431-Request-Header-Fields-Too-Large are now part of
supported status codes that can be define as error files. The hash table
defined in http_get_status_idx() was updated accordingly.
It is now possible to use a log-format string to define the "Set-Cookie"
header value of a response generated by a redirect rule. There is no special
check on the result format and it is not possible during the configuration
parsing. It is proably not a big deal because already existing "set-cookie"
and "clear-cookie" options don't perform any check.
Here is an example:
http-request redirect location https://someurl.com/ set-cookie haproxy="%[var(txn.var)]"
This patch should fix the issue #1784.
On prefix-based redirect, there is an option to drop the query-string of the
location. Here it is the opposite. an option is added to preserve the
query-string of the original URI for a localtion-based redirect.
By setting "keep-query" option, for a location-based redirect only, the
query-string of the original URI is appended to the location. If there is no
query-string, nothing is added (no empty '?'). If there is already a
non-empty query-string on the localtion, the original one is appended with
'&' separator.
This patch should fix issue #2728.
parse_size_err() currently is a function working only on an uint. It's
not convenient for certain elements such as rings on large machines.
This commit addresses this by having one function for uints and one
for ullong, and making parse_size_err() a macro that automatically
calls one or the other. It also has the benefit of automatically
supporting compatible types (long, size_t etc).
From time to time we face a configuration with very small timeouts which
look accidental because there could be expectations that they're expressed
in seconds and not milliseconds.
This commit adds a check for non-nul unitless values smaller than 100
and emits a warning suggesting to append an explicit unit if that was
the intent.
Only the common timeouts, the server check intervals and the resolvers
hold and timeout values were covered for now. All the code needs to be
manually reviewed to verify if it supports emitting warnings.
This may break some configs using "zero-warning", but greps in existing
configs indicate that these are extremely rare and solely intentionally
done during tests. At least even if a user leaves that after a test, it
will be more obvious when reading 10ms that something's probably not
correct.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "4k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, which
was verified to be OK.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, preventing from starting when set
e.g. to "64k". Let's make use of parse_size_err() on it so that units are
supported. This requires to turn it to uint as well, and to explicitly
limit its range to INT_MAX - 2*sizeof(void*), which was previously
partially handled as part of the sign check.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, and
since it's sometimes compared to an int, we limit its range to
0..INT_MAX.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, which
was verified to be OK.
Till now these values were parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on them so that
units are supported. This requires to turn them to uint as well, which
is OK.
Till now these values were parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on them so that
units are supported. This requires to turn them to uint as well, which
is OK.
The qcc_report_glitch() function is now replaced with a macro to support
enumerating counters for each individual glitch line. For now this adds
36 such counters. The macro supports an optional description, though that
is not being used for now.
As a reminder, this requires to build with -DDEBUG_GLITCHES=1.
proxy auth_uri struct was manually cleaned up during deinit, but the logic
behind was kind of akward because it was required to find out which ones
were shared or not. Instead, let's switch to a proper refcount mechanism
and free the auth_uri struct directly in proxy_free_common().
COUNT_GLITCH() will implement an unconditional counter on its declaration
line when DEBUG_GLITCHES is set, and do nothing otherwise. The output will
be reported as "GLT" and can be filtered as "glt" on the CLI. The purpose
is to help figure what's happening if some glitches counters start going
through the roof. The macro supports an optional string argument to
describe the cause of the glitch (e.g. "truncated header"), which is then
reported in the dump.
For now this is conditioned by DEBUG_GLITCHES but if it turns out to be
light enough, maybe we'll keep it enabled full time. In this case it
might have to be moved away from debug dev, or at least documented (or
done as debug counters maybe so that dev can remain undocumented and
updatable within a branch?).
In order to count new event types, we'll need to support empty conditions
so that we don't have to fake if (1) that would pollute the output. This
change checks if #cond is an empty string before concatenating it with
the optional var args, and avoids dumping the colon on the dump if the
whole description is empty.
After master-worker refactoring, master performs re-exec only once up to
receiving "reload" command or USR2 signal. There is no more the second
master's re-exec to free unused memory. Thus, there is no longer need to export
environment variable HAPROXY_LOAD_SUCCESS with worker process load status. This
status can be simply saved in a global variable load_status.
Since 3.0, it is possible to assign a GUID to proxies, listeners and
servers. These objects are stored in a global tree guid_tree.
Proxies and listeners are static. However, servers may be added or
deleted at runtime, which imply that guid_tree must be protected. Fix
this by declaring a read-write lock to protect tree access.
For now, only guid_insert() and guid_remove() are protected using a
write lock. Outside of these, GUID tree is not accessed at runtime. If
server CLI commands are extended to support GUID as server identifier,
lookup operation should be extended with a read lock protection.
Note that during stat-file preloading, GUID tree is accessed for lookup.
However, as it is performed on startup which is single threaded, there
is no need for lock here. A BUG_ON() has been added to ensure this
precondition remains true.
This bug could caused a segfault when using dynamic servers with GUID.
However, it was never reproduced for now.
This must be backported up to 3.0. To avoid a conflict issue, the
previous cleanup patch can be merged before it.
event_hdl_sub_list_empty() may be used to know if the subscription list
passed as argument is empty or not (ie: if there currently are any
subcribers or not). It can be useful to know if the subscription is empty
is order to avoid unecessary preparation work and skip event publishing to
save CPU time if we already know that no one is interested in tracking the
changes for a given subscription list.
In order to help users detect when threads are behaving abnormally, let's
try to emit a warning when one is no longer making any progress. This will
allow to catch faulty situations more accurately, instead of occasionally
triggering just after the long task. It will also let users know that there
is something wrong with their configuration, and inspect the call trace to
figure whether they're using excessively long rules or Lua for example (the
usual warnings about lua-load vs lua-load-per-thread are still reported).
The warning will only be emitted for threads not yet marked as stuck so
as not to interfere with panic dumps and avoid sending a warning just
before a panic. A tainted flag is set when this happens however (0x2000).
There's currently no way to just emit a warning informing that a thread
is stuck without crashing. This is a problem because sometimes users
would benefit from this info to clean up their configuration (e.g. abuse
of map_regm, lua-load etc).
This commit adds a new function ha_stuck_warning() that will emit a
warning indicating that the designated thread has been stuck for XX
milliseconds, with a number of streams blocked, and will make that
thread dump its own state. The warning will then be sent to stderr,
along with some reminders about the impacts of such situations to
encourage users to fix their configuration.
In order not to disrupt operations, a local 4kB buffer is allocated
in the stack. This should be quite sufficient.
For now the function is not used.
The comment asks to update the "metrics_info" array, which does not
exist, instead it's called stat_cols_info[] and is in stats.c. Let's
mention all that to save time searching for the needed info.
While no version seems to have ever known that "metrics_info", it's not
needed to backport this as it's only a comment.
A ClientHello may be splitted accross several different CRYPTO frames,
then mixed in a single QUIC packet. This is used notably by clients such
as chrome to render the first Initial packet opaque to middleboxes.
Each packet frame is handled sequentially. Out-of-order CRYPTO frames
are buffered in a ncbuf, until gaps are filled and data is transferred
to the SSL stack. If CRYPTO frames are heavily splitted with small
fragments, buffering may fail as ncbuf does not support small gaps. This
causes the whole packet to be rejected and unacknowledged. It could be
solved if the client reemits its ClientHello after remixing its CRYPTO
frames.
This patch is written to improve CRYPTO frame parsing. Each CRYPTO
frames which cannot be buffered due to ncbuf limitation are now stored
in a temporary list. Packet parsing is completed until all frames have
been handled. If temporary list is not empty, reparsing is done on the
stored frames. With the newly buffered CRYPTO frames, ncbuf insert
operation may this time succeeds if the frame now covers a whole gap.
Reparsing will loop until either no progress can be made or it has been
done at least 3 times, to prevent CPU utilization.
This patch should fix github issue #2776.
This should be backported up to 2.6, after a period of observation. Note
that it relies on the following refactor patches :
MINOR: quic: extend return value of CRYPTO parsing
MINOR: quic: use dynamically allocated frame on parsing
MINOR: quic: simplify qc_parse_pkt_frms() return path
qc_handle_crypto_frm() is the function used to handled a newly received
CRYPTO frame. Change its API to use a newly dedicated return type. This
allows to report if the frame was properly handled, ignored if already
parsed previously or rejected after a fatal error.
This commit does not have any functional changes. However, it allows to
simplify qc_handle_crypto_frm() API by removing <fast_retrans> as output
parameter. Also, this patch will be necessary to support multiple
iteration of packet parsing for CRYPTO frames.
As reported by Pierre Maoui in GH #2477, it's not possible to render
control chars from variables or expressions verbatim in the payload part
of http-return statements. That's a problem because this part should not
require to be encoded at all (we could even imagine building favicons on
the fly for example).
In fact it is the LOG_OPT_HTTP option when passed as default options on
parse_logformat_string() which tells the log encoder that the payload
should be http-encoded using lf_chunk() instead of being printed using the
per-type encoder.
This option was set when parsing logformat expressions for lf-string
expression under http-return statements, as well as logformat expressions
for set-map action. While it is true that those actions may only be
used under http context, the LOG_OPT_HTTP logformat option is not relevant
there, because the payload is expected to be used without being encoded.
So let's simply get rid of this option when parsing logformat expressions
for set-map action key/value and lf-string from http-request return
action, and add a note next to LOG_OPT_HTTP option to indicate that it is
used to tell the log encoder that the payload should be HTTP-encoded.
Thanks to Pierre for having reported the issue and Willy for the
analysis and patch proposal.
These functions return a symbolic error code such as ECONNRESET to keep
logs compact while making them human-readable. It's a good alternative
to the numeric code in that it's more expressive, and a good one to the
full message since it's shorter and more precise (some codes even match
errno names).
The doc was updated so that the symbolic names appear in the table. It
could be useful to backport this feature to help with troubleshooting
some issues, though backporting the doc might possibly be more annoying
in case users have local patches already, so maybe the table update does
not need to be backported in this case.
While we get reports of connection setup errors in fc_err/bc_err, we
don't have the equivalent for the recv/send/splice syscalls. Let's
add provisions for new codes that cover the common errno values that
recv/send/splice can return, i.e. ECONNREFUSED, ENOMEM, EBADF, EFAULT,
EINVAL, ENOTCONN, ENOTSOCK, ENOBUFS, EPIPE. We also add a special case
for when the poller reported the error itself. It's worth noting that
EBADF/EFAULT/EINVAL will generally indicate serious bugs in the code
and should not be reported.
The only thing is that it's quite hard to forcefully (and reliably)
trigger these errors in automated tests as the timing is critical.
Using iptables to manually reset established connections in the
middle of large transfers at least permits to see some ECONNRESET
and/or EPIPE, but the other ones are harder to trigger.
It was the only one prefixed with "CO_ERR_", making it harder to batch
process and to look up. It was added in 2.5 by commit 61944f7a73 ("MINOR:
ssl: Set connection error code in case of SSL read or write fatal failure")
so it can be backported as far as 2.6 if needed to help integrate other
patches.
We're using a few occurrences of __builtin_prefetch() but tcc doesn't
know about it so let's give it a dummy definition. Now the code builds
and works again with tcc without thread support.
TCC is often convenient to quickly test builds, run CI tests etc. It has
limited thread support (e.g. no thread-local stuff) but that is often
sufficient for testing. TCC lacks __atomic_exchange_n() but has the
exactly equivalent __atomic_exchange(), and doesn't have any barrier.
For this reason we force the atomic_exchange to use the stricter SEQ_CST
mem ordering that allows to ignore the barrier.
[wt: that's upstream commit ca8b865 ("BUILD: support building with TCC")]
This commit introduces the tune.renice.startup and tune.renice.runtime
global keywords that allows to change the priority with setpriority().
tune.renice.startup is parsed and applied in the worker or the standalone
process for configuration parsing. If this keyword is used alone, the
nice value is changed to the previous one after configuration parsing.
tune.renice.runtime is applied after configuration parsing, so in the
worker or a standalone process. Combined with tune.renice.startup it
allows to have a different nice value during configuration parsing and
during runtime.
The feature was discussed in github issue #1919.
Example:
global
tune.renice.startup 15
tune.renice.runtime 0
When http-buffer-request option is set on a proxy, the processing will be
paused to wait the full request payload or a full buffer. So it is an entity
that block the processing, just like a rule or a filter that yields. So now,
it is reported as a waiting entity if an error or a timeout occurred.
To do so, an stream entity type is added for this option. There is no
pointer. And "waiting_entity" sample fetch returns the option name.
When a rule or a filter yields because it waits for something to be able to
continue its processing, this entity is saved in the stream. If an error or
a timeout occurred, info on this entity may be retrieved via the
"waiting_entity" sample fetch, for instance to dump it in the logs. This
info may be useful to found root cause of some bugs because it is a way to
know the processing was temporarily stopped. This may explain timeouts for
instance.
The sample fetch is not documented yet.
It is very similar to the last evaluated rule. When a filter returns an
error that interrupts the processing, it is saved in the stream, in the
last_entity field, with the type 2. The pointer on filter config is
saved. This pointer never changes during runtime and is part of the proxy's
structure. It is an element of the filter_configs list in the proxy
structure.
"last_entity" sample fetch was update accordingly. The filter identifier is
returned, if defined. Otherwise the save pointer.
The last evaluated rule is now saved in a generic structure, named
last_entity, with a type to identify it. The idea is to be able to store
other kind of entity that may interrupt a specific processing.
The type of the last evaluated rule is set to 1. It will be replace later by
an enum to be more explicit. In addition, the pointer to the rule itself is
saved instead of its location.
The sample fetch "last_entity" was added to retrieve the information about
it. In this case, it is the rule localtion, the config file containing the
rule followed by the line where the rule is defined, separated by a
colon. This sample fetch is not documented yet.
When an abstract unix socket is bound by HAProxy (using "abns@" prefix),
NUL bytes are appended at the end of its path until sun_path is filled
(for a total of 108 characters).
Here we add an alternative to pass only the non-NUL length of that path
to connect/bind calls, such that the effective path of the socket's name
is as humanly written. This may be useful to interconnect with existing
softwares that implement abstract sockets with this logic instead of the
default haproxy one.
This is achieved by implementing the "abnsz" socket prefix (instead of
"abns"), which stands for "zero-terminated ABNS". "abnsz" prefix may be
used anywhere "abns" is. Internally, haproxy uses the custom socket
family (AF_CUST_ABNS vs AF_CUST_ABNSZ) to differentiate default abns
sockets from zero-terminated ones.
Documentation was updated and regtest was added.
Fixes GH issues #977 and #2479
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
Thanks to previous commit, we may now use dedicated addrcmp functions for
each UNIX address family. This allows to simplify sock_unix_addrcmp()
function and avoid useless checks in order to try to guess the socket
type.
In this patch we implement sock_abns_addrcmp() and sock_abnsz_addrcmp()
functions, which are respectively used for ABNS and ABNSZ custom families
sock_unix_addrcmp() now only holds regular UNIX socket comparing logic.
For now it's the same as abns. We'll need to modify sock_unix_addrcmp(),
and a few other ones to support effective path length when dealing with
the \0. Let's check with Tristan's patch for this (upcoming patch).
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
This is a pre-requisite to adding the abnsz socket address family:
in this patch we make use of protocol API rework started by 732913f
("MINOR: protocol: properly assign the sock_domain and sock_family") in
order to implement a dedicated address family for ABNS sockets (based on
UNIX parent family).
Thanks to this, it will become trivial to implement a new ABNSZ (for abns
zero) family which is essentially the same as ABNS but with a slight
difference when it comes to path handling (ABNS uses the whole sun_path
length, while ABNSZ's path is zero terminated and evaluation stops at 0)
It was verified that this patch doesn't break reg-tests and behaves
properly (tests performed on the CLI with show sess and show fd).
Anywhere relevant, AF_CUST_ABNS is handled alongside AF_UNIX. If no
distinction needs to be made, real_family() is used to fetch the proper
real family type to handle it properly.
Both stream and dgram were converted, so no functional change should be
expected for this "internal" rework, except that proto will be displayed
as "abns_{stream,dgram}" instead of "unix_{stream,dgram}".
Before ("show sess" output):
0x64c35528aab0: proto=unix_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 epoch=0 age=0s calls=1 rate=0 cpu=0 lat=0 rq[f=848000h,i=0,an=00h,ax=] rp[f=80008000h,i=0,an=00h,ax=] scf=[8,0h,fd=21,rex=10s,wex=] scb=[8,1h,fd=-1,rex=,wex=] exp=10s rc=0 c_exp=
After:
0x619da7ad74c0: proto=abns_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=00 epoch=0 age=0s calls=1 rate=0 cpu=0 lat=0 rq[f=848000h,i=0,an=00h,ax=] rp[f=80008000h,i=0,an=00h,ax=] scf=[8,0h,fd=22,rex=10s,wex=] scb=[8,1h,fd=-1,rex=,wex=] exp=10s rc=0 c_exp=
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
When using trace with -dt, the trace_parse_cmd() function is doing a
strtok which write \0 into the argv string.
When using the mworker mode, and reloading, argv was modified and the
trace won't work anymore because the first : is replaced by a '\0'.
This patch fixes the issue by allocating a temporary string so we don't
modify the source string directly. It also replace strtok by its
reentrant version strtok_r.
Must be backported as far as 2.9.
strnlen2() is functionally equivalent to strnlen(). Goal is to provide
an alternative to strnlen() which is not portable since it requires
_POSIX_C_SOURCE >= 200809L
There is no reason to disable the 0-copy data forwarding if an end-of-stream
was reported on the consumer side. Indeed, the consumer will send data in
this case. So there is no reason to check the read side here.
This patch may be backported as far as 2.9.
Each server is inserted in a global list named servers_list on
new_server(). This list is then only used to finalize servers
initialization after parsing.
On dynamic server creation, there is no issue as new_server() is under
thread isolation. However, when a server is deleted after its refcount
reached zero, srv_drop() removes it from servers_list without lock
protection. In the longterm, this can cause list corruption and crashes,
especially if multiple adjacent servers are removed in parallel.
To fix this, convert servers_list to a mt_list. This should not impact
performance as servers_list is not used during runtime outside of server
creation/deletion.
This should fix github issue #2733. Thanks to Chris Staite who first
found the issue here.
This must be backported up to 2.6.
There are two parts in mworker_cli_proxy_create(): allocating and setting up
MASTER proxy and allocating and setting up servers on ipc_fd[0] of the
sockpairs shared with workers.
So, let's split mworker_cli_proxy_create() into two functions respectively.
Each of them takes **errmsg as an argument to write an error message, which may
be triggered by some subcalls. The content of this errmsg will allow to extend
the final alert message shown to user, if these new functions will fail.
The main goals of this split is to allow to move these two parts independantly
in future and makes the code of haproxy initialization in haproxy.c more
transparent.
The idea here is to record how many times a filter is being called on a
stream. We're incrementing the same counter all along, regardless of the
type of event, since the purpose is essentially to detect one that might
be misbehaving. The number of calls is reported in "show sess all" next
to the filter name. It may also help detect suboptimal processing. For
example compressing 1GB shows 138k calls to the compression filter, which
is roughly two calls per buffer. Maybe we wake up with incomplete buffers
and compress less. That's left for a future analysis.
Process_stream() is a complex function and a few times some lopos were
either witnessed or suspected. Each time this happens it's extremely
difficult to figure why because it involves combinations of analysers,
filters, errors etc.
Let's at least maintain a set of 4 counters per stream that report the
number of times we've been through each of the 4 most important blocks
(stconn changes, request analysers, response analysers, and propagation
of changes down). These ones are stored in the stream and reported in
"show sess all", just like they will be reported in panic dumps.
This macro works exactly like BUG_ON() except that it never logs anything
nor crashes, it only implements an atomic counter that is incremented on
every call. This can be used to count a number of unlikely events that are
worth checking at run time on setups showing unusual and unreproducible
behaviors.
These macros do not always kill the process, and sometimes it would be
nice to know if some match or not, and how many times (especially for the
CHECK_IF one).
This commit adds a new section "dbg_cnt" made of structs that contain
function name, file name, line number, check type, condition and match
count. A newe macro __DBG_COUNT() adds one to the counter, and is placed
inside _BUG_ON() and _BUG_ON_ONCE(). It's worth noting that the exact
type of the check is not very precise but in practice we don't care,
as most checks will cause the process to die anyway unless they're of
type _BUG_ON_ONCE() (used by CHECK_IF by default).
All of this is limited to !defined(USE_OBSOLETE_LINKER) because we're
creating a section, thus we need a modern linker to be able to scan
this section later. Doing so adds ~50kB to the executable due to the
~1266 BUG_ON() and others placed there. That's not huge in comparison
to the visibility it can provide.
The BUG_ON() macros are made of two levels so as to resolve the condition
to a string. However this doesn't offer much flexibility for performing
other operations when the condition is validated, so let's adjust them so
that the condition is checked in the outer macro and the operations are
performed in the inner one.
One main problem with panic dumps is that they're filling the dumping
thread's trash, and that the global thread_dump_buffer is too small to
catch enough of them.
Here we're proceeding differently. When dumping threads for a panic, we're
passing the magic value 0x2 as the buffer, and it will instruct the target
thread to allocate its own buffer using get_trash_chunk() (which is signal
safe), so that each thread dumps into its own buffer. Then the thread will
wait for the buffer to be consumed, and will assign its own thread_dump_buffer
to it. This way we can simply dump all threads' buffers from gdb like this:
(gdb) set $t=0
while ($t < global.nbthread)
printf "%s\n", ha_thread_ctx[$t].thread_dump_buffer.area
set $t=$t+1
end
For now we make it wait forever since it's only called on panic and we
want to make sure the thread doesn't leave and continues to use that trash
buffer or do other nasty stuff. That way the dumping thread will make all
of them die.
This would be useful to backport to the most recent branches to help
troubleshooting. It backports well to 2.9, except for some trivial
context in tinfo-t.h for an updated comment. 2.8 and older would also
require TAINTED_PANIC. The following previous patches are required:
MINOR: debug: make mark_tainted() return the previous value
MINOR: chunk: drop the global thread_dump_buffer
MINOR: debug: split ha_thread_dump() in two parts
MINOR: debug: slightly change the thread_dump_pointer signification
MINOR: debug: make ha_thread_dump_done() take the pointer to be used
MINOR: debug: replace ha_thread_dump() with its two components
At the few places we were calling ha_thread_dump(), now we're
calling separately ha_thread_dump_fill() and ha_thread_dump_done()
once the data are consumed.
Since mark_tainted() uses atomic ops to update the tainted status, let's
make it return the prior value, which will allow the caller to detect
if it's the first one to set it or not.
As mentioned in previous commit, b_peek_ofs() performs a wrapping check
but is often called with ofs == 0 as a constant. We can detect this case
with __builtin_const_p() so it makes sense to use it. A test shows a size
reduction of about 320 bytes, which is not much, but it happens in hot code
paths, and each 16 bytes reduction indicates an eliminated conditional
branch.
Some clear winners are ci_getblk_nc() (-48 bytes), h2c_dec_hdrs (-141B),
h1_copy_msg_data (-124B), tcpcheck_spop_expect_hello (-80B),
h1_parse_msg_data (-44B). These ones will definitely benefit from doing
less conditional jumps.
Some large functions were moved to buf.c by commit ac66df4e2 ("REORG:
buffers: move some of the heavy functions from buf.h to buf.c"). However,
as found by Amaury, haring doesn't build anymore. Upon close inspection,
b_getblk_nc() isn't that big since it's very much inlinable, and a part
of its apparently large size comes from the BUG_ON_HOT() that were
implemented. Regarding b_peek_varint(), it doesn't have any dependency
and is used only at 4 places in the DNS code, so its loop will not have
big impacts, and the rest around can be optimised away by the compiler
so it remains relevant to keep it inlined. Also it can serve as a base
to deduplicate the code in b_get_varint().
No backport needed.
The ARGT_ID argument type may now be used to set a custom resolve
function in order to help resolve the argument string value. If the
custom resolve function is not set, the behavior is the same as of
type ARGT_STR.
This issue came with this commit:
f627b92 BUG/MEDIUM: quic: always validate sender address on 0-RTT
and could be easily reproduced with picoquic QUIC client with -Q option
which splits a big ClientHello TLS message into two Initial datagrams.
A second condition must be fulfilled to reprodue this issue: picoquic
must not send the token provided by haproxy (NEW_TOKEN). To do that,
haproxy must be patched to prevent it to send such tokens.
Under these conditions, if haproxy has enough time to reply to the first Initial
datagrams, when it receives the second Initial datagram it sends a Retry paquet.
Then the client ignores the Retry paquet as mentionned by RFC 9000:
17.2.5.2. Handling a Retry Packet
A client MUST accept and process at most one Retry packet for each connection
attempt. After the client has received and processed an Initial or Retry packet
from the server, it MUST discard any subsequent Retry packets that it receives.
On its side, haproxy has closed the connection. When it receives the second
Initial datagram, it open a new connection but with Initial packets it
cannot decrypt (wrong ODCID) leaving the client without response.
To fix this, as the aim of the token (NEW_TOKEN) sent by haproxy is to validate
the peer address, in place of closing the connection when no token was received
for a 0RTT connection, one leaves this validation to the handshake process.
Indeed, the peer adress is validated during the handshake when a valid handshake
packet is received by the listener. But as one does not want haproxy to process
0RTT data when no token was received, one does not accept the connection before
the successful handshake completion. In addition to this, the 0RTT packets
are not released after successful handshake completion when no token was received
to leave a chance to haproxy to process these 0RTT data in such case (see
quic_conn_io_cb()).
Must be backported as far as 2.9.
When a filter is registered on the data, it means it may change the payload
length by rewritting data. It means consumers of the message cannot trust the
expected length of payload as announced by the producer. The commit 8bd835b2d2
("MEDIUM: filters/htx: Don't rely on HTX extra field if payload is filtered")
was pushed to solve this issue. When the HTTP payload of a message is filtered,
the extra field is set to 0 to be sure it will never be used by error by any
consumer. However, it is not enough.
Indeed, the filters must be called before fowarding some data. They cannot be
by-passed. But if a consumer is unable to flush the HTX message, some outgoing
data can remain blocked in the channel's buffer. If some new data are then
pushed because there is some room in the channel's buffe, the producer will set
the HTX extra field. At this stage, if the consumer is unblocked and can send
again data, it is possible to call it to forward outgoing data blocked in the
channel's buffer before waking the stream up to filter new input data. It is the
purpose of the data fast-forwarding. In this case, the HTX extra field will be
seen by the consumer. It is unexpected and leads to undefined behavior.
One consequence of this bug is to perform a wrong chunking on compressed
messages, leading to processing errors at the end of the message, reported as
"ID--" in logs.
To fix the bug, a HTX flag is added to state the payload of the current HTX
message is altered. When this flag is set (HTX_FL_ALTERED_PAYLOAD), the HTX
extra field must not be trusted. And to keep things simple, when this flag is
set, the HTX extra field is automatically set to 0 when the HTX message is
loaded, in htxbuf() function.
It is probably the less intrusive way to fix the bug for now. But this part must
be reviewed to save meta-info of the HTX message outside of the message itself.
This commit should solve the issue #2741. It must be backported as far as 2.9.
In the new master-worker architecture, when a worker process is forked and
successfully initialized it needs somehow to communicate its "READY" state to
the master, in order to terminate the previous worker and workers, that might
exceeded max_reloads counter.
So, let's implement for this a new master CLI _send_status command. A new
worker can send its status string "READY" to the master, when it's about
entering to the run poll loop, thus it can start to receive data.
In _send_status() in the master context we update the status of the new worker:
PROC_O_INIT flag is withdrawn.
When TERM signal is sent to a worker, worker terminates and this triggers the
mworker_catch_sigchld() handler in master. This handler deletes the exiting
process entry from the processes list.
In _send_status() we loop over the processes list twice. At the first time, in
order to stop workers that exceeded the max_reloads counter. At the second time,
in order to stop the worker forked before the last reload. In the corner case,
when max_reloads=1, we avoid to send SIGTERM twice to the same worker by
setting sigterm_sent flag during the first loop.
Previously reexec_on_failure() was called in cases when the process has failed
after reload, while it was parsing its configuration or it was trying to apply
it. reexec_on_failure() has called mworker_reexec() and the master process has
been reexecuted.
With the new architecture in such cases there is no longer need to reexecute
the master process after its reload again. It simply keeps the previous worker,
forked before the reload, and it lets the new one to exit with an error. But we
still need the code, which increments the number of failed reloads and which
notifies systemd with new "Reload failed!" status. So, let's reuse and adapt
for this reexec_on_failure() and let's rename it to on_new_child_failure().
Here, to distinguish between the new worker and the previous one let's add a
new process state PROC_O_INIT and let's set it, when the memory is allocated
for the new worker in the processes list.
Let's rename mworker_cli_sockpair_new() to
mworker_cli_global_proxy_new_listener() to outline that this function creates
the GLOBAL proxy, allocates the listener with "master-socket" bind conf and
attaches this listener to this GLOBAL proxy. Listener is bound to ipc_fd[1] of
the sockpair inherited in master and in worker (master CLI sockpair).
This is the first commit in a series to add the support of 4 primary reload
use-cases for the new master-worker architecture:
1. Newly forked worker process dies before any reload, due to some errors in
the configuration. Newly forked worker process crashes before any reload
after sending its "READY" state to master.
2. Newly forked worker process dies due to some errors in the new
configuration. This happens after reload, when this new configuration was
supplied, so the previous worker process is still here.
3. Newly forked worker process crashes after sending its "READY" state to
master due to some bugs. This happens after reload, so the previous worker
process is still here.
4. Newly forked worker process has sent its "READY" state to master and starts
to receive traffic. This happens after reload, the old worker hasn't
terminated yet, as it is waiting on some idle connection and it crashes.
Let's rename in this commit mworker_cli_proxy_new_listener() to
mworker_cli_master_proxy_new_listener() to outline, that this function creates
"master-socket" bind conf and allocates a listener. This listener is attached
to the MASTER proxy and it's bound to the ipc_fd[0] of the sockpair,
inherited in master and in worker processes (master CLI sockpair).
This commit is a part of the series to add a support of discovery mode in the
configuration parser and in initialization sequence.
So, let's add here KWF_DISCOVERY flag to distinguish the keywords,
which should be parsed in "discovery" mode and which are needed for master
process, from all others. Keywords, that should be parsed in "discovery" mode
have its dedicated parser funtions. Let's tag these functions with
KWF_DISCOVERY flag in keywords list. Like this, only these keyword parsers
might be called during the first configuration read in discovery mode.
This is the first commit from a series to add a support of discovery mode
in the configuration parser and in initialization sequence.
Discovery mode is the mode, when we read the configuration at the first time
and we parse and set runtime modes: daemon, zero-warning, master-worker. In
this mode we also parse some parameters needed for the master process to start,
in case if we are in the master-worker mode. Like this the master process
doesn't allocate any additional resources, which it doesn't use and it quickly
finishes its initialization and enters to its polling loop. The worker process
after its fork reads the rest of the configuration.
So, let's add in this commit MODE_DISCOVERY flag to check it in
configuration parser functions.
MODE_MWORKER_WAIT becames redundant with MODE_MWORKER, due to moving
master-worker fork in init(). This change allows master no longer perform
reexec just after forking in order to free additional memory.
As after the fork in the master process we set 'master' variable, we can
replace now MODE_MWORKER_WAIT in some 'if' statements by simple check of this
'master' variable.
Let's also continue to get rid of HAPROXY_MWORKER_WAIT_ONLY environment
variable, as it's no longer needed as well.
In cfg_program_postparser(), which is used to check if cmdline is defined to
launch a program, we completely remove the check of mode for now, because
the master process does not parse the configuration for the moment. 'program'
section parsing will be reintroduced in master later in the next commits.
This is a one of the commits to prepare the removal of MODE_MWORKER_WAIT
support, as it became redundant with MODE_MWORKER due to moving master-worker
fork in init().
Pierre Bonnat reported that SRV-based server-template recently stopped
to work properly.
After reviewing the changes, it was found that the regression was caused
by a4d04c6 ("BUG/MINOR: server: make sure the HMAINT state is part of MAINT")
Indeed, HMAINT is not a regular maintenance flag. It was implemented in
b418c122a4d04c6 ("BUG/MINOR: server: make sure the HMAINT state is part
of MAINT"). This flag is only set (and never removed) when the server FQDN
is changed from its initial config-time value. This can happen with "set
server fqdn" command as well as SRV records updates from the DNS. This
flag should ideally belong to server flags.. but it was stored under
srv_admin enum because cur_admin is properly exported/imported via server
state-file while regular server's flags are not.
Due to a4d04c6, when a server FQDN changes, the server is considered in
maintenance, and since the HMAINT flag is never removed, the server is
stuck in maintenance.
To fix the issue, we partially revert a4d04c6. But this latter commit is
right on one point: HMAINT flag was way too confusing and mixed-up between
regular MAINT flags, thus there's nothing to blame about a4d04c6 as it was
error-prone anyway.. To prevent such kind of bugs from happening again,
let's rename HMAINT to something more explicit (SRV_ADMF_FQDN_CHANGED) and
make it stand out under srv_admin enum so we're not tempted to mix it with
regular maintenance flags anymore.
Since a4d04c6 was set to be backported in all versions, this patch must
be backported there as well.
wait-for-handshake http-request action was completely ineffective with
QUIC protocol. This commit implements its support for QUIC.
QUIC MUX layer is extended to support wait-for-handshake. A new function
qcc_handle_wait_for_hs() is executed during qcc_io_process(). It detects
if MUX processing occurs after underlying QUIC handshake completion. If
this is the case, it indicates that early data may be received. As such,
connection is flagged with CO_FL_EARLY_SSL_HS, which is necessary to
block stream processing on wait-for-handshake action.
After this, qcc subscribs on quic_conn layer for RECV notification. This
is used to detect QUIC handshake completion. Thus,
qcc_handle_wait_for_hs() can be reexecuted one last time, to remove
CO_FL_EARLY_SSL_HS and notify every streams flagged as
SE_FL_WAIT_FOR_HS.
This patch must be backported up to 2.6, after a mandatory period of
observation. Note that it relies on the backport of the two previous
patches :
- MINOR: quic: notify connection layer on handshake completion
- BUG/MINOR: stream: unblock stream on wait-for-handshake completion
For now it seems to work as before, and even when artificially inflating
the number of allocatable buffers per stream. The number of allocated
slots is always the same as the max number of streams, which guarantees
that each stream will find one buffer. we only grant one buffer per
stream at this point, since the goal was to replace the existing single
rxbuf.
A new demux blocking flag, H2_CF_DEM_RXBUF, was added to indicate
a failure to get an rxbuf slot from the connection. It was lightly
tested (by forcing bl_init() to a lower number of buffers). It is not
yet certain whether it's more useful to have a new flag or to reuse
the existing H2_CF_DEM_SFULL which indicates the rxbuf is full,
but at least the new flag more accurately translates the condition,
that may make a difference in the future. However, given that when
RXBUF is set, most of the time it results in a failure to find more
room to demux and it sets SFULL, for now we have to always clear
SFULL when clearing RXBUF as well. This means that most of the time
we'll see 3 combinations:
- none: everything's OK
- SFULL: the unique rx buffer is full
- RXBUF || (RXBUF|SFULL): cannot allocate more entries
Note that we need to be super careful in h2_frt_transfer_data() because
the htx_free_data_space() function doesn't guarantee that the room is
usable, so htx_add_data() may still fail despite an apparent room. For
this reason, h2_frt_transfer_data() maintains a "full" flag to indicate
that a transfer attempt failed and that a new buffer is required.
It's not convenient to have this flag in the middle of the demux flags,
it easily hides other ones that need to be added. Let's move it after
the other ones.
A stream is receiving data from after the HEADERS frame missing END_STREAM,
to the end of the stream or HREM (the presence of END_STREAM). We're now
adding a flag to the stream that indicates this state, as well as a counter
in the connection of streams currently receiving data. The purpose will be
to gauge at any instant the number of streams that might have to share the
available bandwidth and buffers count in order not to allocate too much flow
control to any single stream. For now the counter is kept up to date, and is
reported in "show fd".
The buffer ring is problematic in multiple aspects, one of which being
that it is only usable by one entity. With multiplexed protocols, we need
to have shared buffers used by many entities (streams and connection),
and the only way to use the buffer ring model in this case is to have
each entity store its own array, and keep a shared counter on allocated
entries. But even with the default 32 buf and 100 streams per HTTP/2
connection, we're speaking about 32*101*32 bytes = 103424 bytes per H2
connection, just to store up to 32 shared buffers, spread randomly in
these tables. Some users might want to achieve much higher than default
rates over high speed links (e.g. 30-50 MB/s at 100ms), which is 3 to 5
MB storage per connection, hence 180 to 300 buffers. There it starts to
cost a lot, up to 1 MB per connection, just to store buffer indexes.
Instead this patch introduces a variant which we call a buffer list.
That's basically just a free list encoded in an array. Each cell
contains a buffer structure, a next index, and a few flags. The index
could be reduced to 16 bits if needed, in order to make room for a new
struct member. The design permits initializing a whole freelist at once
using memset(0).
The list pointer is stored at a single location (e.g. the connection)
and all users (the streams) will just have indexes referencing their
first and last assigned entries (head and tail). This means that with
a single table we can now have all our buffers shared between multiple
streams, irrelevant to the number of potential streams which would want
to use them. Now the 180 to 300 entries array only costs 7.2 to 12 kB,
or 80 times less.
Two large functions (bl_deinit() & bl_get()) were implemented in buf.c.
A basic doc was added to explain how it works.
Over time, some of the buffer management functions grew quite a bit,
and were still forced to remain inlined since all defined in buf.h.
Let's create buf.c and move the heaviest ones there. All those moved
here were above 200 bytes.
sink_find_early() is a convenient function that can be used instead of
sink_find() during parsing time in order to try to find a matching
sink even if the sink is not defined yet.
Indeed, if the sink is not defined, sink_find_early() will try to create
it and mark it as forward-declared. It will also save informations from
the caller to better identify it in case of errors.
If the sink happens to be found in the config, it will transition from
forward-declared type to its final type. Else, it means that the sink
was not found in the config, in this case, during postresolve, we raise
an error to indicate that the sink was not found in the configuration.
It should help solve postresolving issue with rings, because for now only
log targets implement proper ring postresolving.. but rings may be used
at different places in the code, such as debug() converter or in "traces"
section.
Function may be used from places where per-context actions are usually
registered (tcp_act.c, http_act.c, quic_rules.c.. to name a few) in
order to expose the do_log() action.
do_log() is quite similar to sess_log() or strm_log(), excepts that it
may be called at any time during session handling in an opportunistic
way as long as the session exists (the stream may or may not exist).
Also, it will try to emit the log as INFO by default, unless set-log-level
is used on the stream, or error origin flag is set.
This commit is the last one of a serie whose objective is to restore
QUIC transfer throughput performance to the state prior to the recent
QUIC MUX buffer allocator rework.
This gain is obtained by reporting received out-of-order ACK data range
to the QUIC MUX which can then decount room in its txbuf window. This is
implemented in QUIC streamdesc layer by adding a new invokation of
notify_room callback. This is done into qc_stream_buf_store_ack() which
handle out-of-order ACK data range.
Previous commit has introduced merging of overlapping ACK data range. As
such, it's easy to only report the newly acknowledged data range.
As with in-order ACKs, this new notification is only performed on
released streambuf. As such, when a streambuf instance is released,
notify_room notification now also reports the total length of
out-of-order ACK data range currently stored. This value is stored in a
new streambuf member <room> to avoid unnecessary tree lookup.
This <room> member also serves on in-order ACK notification to reduce
the notified room. This prevents to report invalid values when overlap
ranges are treated first out-of-order and then in-order, which would
cause an invalid QUIC MUX txbuf window value.
After this change has been implemented, performance has been
significantly improved, both with ngtcp2-client rate usage and on
interop goodput test. These values are now similar to the rate observed
on older haproxy version before QUIC MUX buffer allocator rework.
QUIC streamdesc layer is responsible to handle reception of ACK for
streams. It removes stream data from the underlying buffers on ACK
reception.
Streamdesc layer treats ACK in order at the stream level. Out of order
ACKs are buffered in a tree until they can be handled on older data
acknowledgement reception. Previously, qf_stream instance which comes
from the quic_tx_packet was used as tree node to buffer such ranges.
Introduce a new type dedicated to represent out of order stream ack data
range. This type is named qc_stream_ack. It contains minimal infos only
relative to the acknowledged stream data range.
This allows to reduce size of frequently used quic_frame with the
removal of tree node from qf_stream. Another side effect of this change
is that now quic_frame are always released immediately on ACK reception,
both in-order and out-of-order. This allows to also release the
quic_tx_packet instance which should reduce memory consumption.
The drawback of this change is that qc_stream_ack instance must be
allocated on out-of-order ACK reception. As such, qc_stream_desc_ack()
may fail if an error happens on allocation. For the moment, such error
is silenly recovered up to qc_treat_rx_pkts() with the dropping of the
received packet containing the ACK frame. In the future, it may be
useful to close the connection as this error may only happens on low
memory usage.
It is no longer supported to declare debug traces, via 'trace' directive, in
a global section. A 'traces' directive must be used instead. The syntax of
the 'trace' directive in these sections remains the same. But it is no
longer experimental.
The main reason for this change is to avoid to have a ring section defined
before a global one. Indeed, for now, forward declarations of ring sections
are not supported. So to configure traces, you had to add a ring section
before the global one defining the traces. Most of time, that meant to have
two global sections :
global
[...] # global settings
ring <name>
[...]
global
[...] # trace config
In addition, it will be possible to easily extend the traces section by
adding some new directives.
qc_stream_desc_ack() is the entrypoint for streamdesc layer to handle a
new acknowledgement of previously emitted STREAM data.
Previously, it was only able to deal with in-order ACK offset. The
caller was responsible to buffer out-of-order ACKs. Change this by
dealing with the latter case directly in qc_stream_desc_ack(). This
notably simplify ACK handling in quic_rx module.
QUIC streamdesc layer is used to manage QUIC MUX stream txbuf data
storage until acknowledgment. Currently, it only supports in-order
acknowledgment at the stream level. This requires to be able to buffer
out-of-order ACKs until they can be handled.
Previously, these ACKs were stored in a tree to the streamdesc instance.
Move this indexed storage at the streambuf instance.
This commit is purely an architecture change. However, it will allow to
extend ACK management in future patches, such as the ability to merge
overlapping out-of-order ACKs.
qc_stream_desc layer is used by QUIC MUX to store emitted STREAM data
until their acknowledgement. Each stream with Tx capability can allocate
its own qc_stream_desc. In turn, each stream desc can have one or
multiple data buffers. This is useful when a MUX stream releases a
buffer and allocate a new one, to preserve bandwith without waiting to
receive all acknowledgement of the previous buffer.
Each buffer is encapsulated in a qc_stream_buf structure. Previously, it
was stored as a list into qc_stream_desc. Change this storage to use a
tree instead. Each buffer is indexed by their offset.
This commit does not introduce functional changes. However, this
rearchitecture will be necessary for future commit to extend ACK
management which require fetching individual buffer instance, not just
the first or last element of a streamdesc, by their offset.
qc_stream_desc_ack() is used to handle ACK received for STREAM frame. It
removes acknowledged data from their underlying buffer.
If all data were removed after ACK handling, qc_stream_desc instance
would automatically be freed at the end of qc_stream_desc_ack().
However, this renders the function complicated to use. Simplify this by
removing this automatic removal. Now, caller is responsible to check
after ACK handling if qc_stream_desc instance can be removed. This is
easily done using qc_stream_desc_done() helper.
qc_stream_desc is an intermediary layer between QUIC MUX and quic_conn.
It is a facility which permits to store data to emit and keep them for
retransmission until acknowledgment. This layer is responsible to notify
QUIC MUX each time a buffer is freed. This is necessary as MUX buffer
allocation is limited by the underlying congestion window size.
Refactor this to use a mechanism similar to send notification. A new
callback notify_room can now be registered to qc_stream_desc instance.
This is set by QUIC MUX to qmux_ctrl_room(). On MUX QUIC free, special
care is now taken to reset notify_room callback to NULL.
Thanks to this refactoring, further adjustment have been made to refine
the architecture. One of them is the removal of qc_stream_desc
QC_SD_FL_OOB_BUF, which is now converted to a MUX layer flag
QC_SF_TXBUF_OOB.
Previous commit implement a refactor of MUX send notification from
quic_conn layer. With this new architecture, a proper callback is
defined for each qc_stream_desc instance.
This architecture change allows to simplify notification from quic_conn
layer. First, ensure the MUX callback to properly ignore retransmission
of an already emitted frame. Luckily, this can be handled easily by
comparing offsets and FIN status. Also, each QCS instance can now be
unregistered from send notification just prior qc_stream_desc releasing.
This ensures a QCS is never manipulated from quic_conn after its
emission ending. Both these changes render the send notification more
robust. As a nice effect, flag QUIC_FL_CONN_TX_MUX_CONTEXT can be
removed as it is now unneeded.
For STREAM emission, MUX QUIC generates one or several frames and emit
them via qc_send_mux(). Lower layer may use them as-is, or split them to
lower chunk to fit in a QUIC packet. It is then responsible to notify
the MUX to report the amount of data sent.
Previously, this was done via a direct call from quic_conn to MUX using
qcc_streams_sent_done(). Modify this to have a better isolation accross
layers. Define a send callback handled by the qc_stream_desc instance.
This allows the MUX to register each QCS instance individually to the
renamved qmux_ctrl_send() which replaces qcc_streams_sent_done().
At quic_conn layer, qc_stream_desc_send() can be used now. This is a
wrapper to qc_stream_desc layer to invoke the send callback if
registered.
This mechanism of qc_stream_desc callback should be extended later to
implement other notifications accross the QUIC stack.
A shared counter is added in the thread context to track the total number of
streams created on the thread. This number is then reported in stats. It
will be a useful information to diagnose some bugs.
A shared counter is added in the thread context to track the current number
of streams. This number is then reported in stats. It will be a useful
information to diagnose some bugs.
Thanks to the previous patch, it is now possible to add an action to
dynamically change the maxumum number of connection retires for a stream.
"set-retries" action may now be used to do so, from a "tcp-request content"
or a "http-request" rule. This action accepts an expression or an integer
between 0 and 100. The integer value is checked during the configuration
parsing and leads to an error if it is not in the expected range. However,
for the expression, the value is retrieve at runtime. So, invalid value are
just ignored.
Too high value is forbidden to avoid any trouble. 100 retries seems already
be an amazingly hight value. In addition, the option is only available on
backend or listen sections.
Because the max retries is limited to 100 at most, it can be stored as a
unsigned short. This save some space in the stream structure.
Instead of directly relying on the backend parameter to limit the number of
connection retries, we now use a per-stream value. This value is by default
inherited from the backend value when it is set. So for now, there is no
change except the stream value is used instead of the backend value. But
thanks to this change, it will be possible to dynamically change this value.
This function was only used by TCP actions and was private to tcp_act.c
file. However, it make sense to make it public to be used by any action
relying on an int-or-expression argument.
The function is provided by glibc. Nothing prevents us from using our
own outside of glibc there (tested on aarch64 with musl). We still do
not enable it by default as we don't yet know if all archs work well,
but it's sufficient to pass USE_BACKTRACE=1 when building with musl to
verify it's OK.
No need to include this possibly non-existing file when using our own
backtrace() implementation, it's only needed for the libc-provided one.
Because of this it's currently not possible to build musl with backtrace
enabled.
In 1.8 when adding "set server fqdn" with commit b418c1228c ("MINOR:
server: cli: Add server FQDNs to server-state file and stats socket."),
the HMAINT flag was not made part of the MAINT ones, so technically
speaking when changing the FQDN, the server is not completely considered
as in maintenance mode.
In its defense, the code location around that was completely messy, with
the aggregator flag being hidden between other values and purposely but
discretely ignoring one of the flags, so the comments were updated to
make the intent clearer (particularly regarding CMAINT which looked like
it was also forgotten while it was on purpose).
This can be backported anywhere.
The solution found in commit b500e84e24 ("BUG/MINOR: server: shut down
streams under thread isolation") to deal with inter-thread stream
shutdown doesn't work fine because there exists code paths involving
a server lock which can then deadlock on thread_isolate(). A better
solution then consists in deferring the shutdown to the stream itself
and just wake it up for that.
The only thing is that TASK_WOKEN_OTHER is a bit too generic and we
need to pass at least 2 types of events (SF_ERR_DOWN and SF_ERR_KILLED),
so we're now leveraging the new TASK_F_UEVT1 and _UEVT2 flags on the
task's state to convey these info. The caller only needs to wake the
task up with these flags set, and the stream handler will then finish
the job locally using stream_shutdown_self().
This needs to be carefully backported to all branches affected by the
dequeuing issue and containing any of the 5541d4995d ("BUG/MEDIUM:
queue: deal with a rare TOCTOU in assign_server_and_queue()"), and/or
b11495652e ("BUG/MEDIUM: queue: implement a flag to check for the
dequeuing").
TASK_WOKEN_MSG only says "someone sent you a message" but doesn't convey
any info about the message. TASK_WOKEN_OTHER says "you're woken for another
reason" but doesn't tell which one. Most often they're used as-is by the
task handlers to report very specific situations.
For some important control notifications, having the ability to modulate
the message a little bit is useful, so let's define two user event types
UEVT1 and UEVT2 to be used in conjunction with TASK_WOKEN_MSG or _OTHER
so that the application can know that a specific condition was explicitly
requested. It will be used this way:
task_wakeup(s->task, TASK_WOKEN_MSG | TASK_F_UEVT1);
or:
task_wakeup(s->task, TASK_WOKEN_OTHER | TASK_F_UEVT2);
Since events are cumulative, keep in mind not to consider a 3rd value
as the combination of EVT1+EVT2; these really mean that the two events
appeared (though in unspecified order).
For now it is only available for proxies with frontend capability because
log-steps are only evaluated under sess_log() or strm_log() which
essentially focus on the frontend side when it comes to log settings so
it's better to keep it this way for better consistency, at least for now.
For now the setting does nothing (it is not considered during runtime),
it will be implemented and documented in upcoming commits.
add proxy->conf.log_steps eb32 root tree which will be used to store the
log origin identifiers that should result in haproxy emitting a log as
configured by the user using upcoming "log-steps" proxy keyword.
It was chosen to use eb32 tree instead of simple bitfield because despite
the slight overhead it is more future-proof given that we already
implemented the prerequisites for seamless custom log origins registration
that will also be usable from "log-steps" proxy keyword.
Following previous commits, let's improve log_orig_to_str() so that
extra log origins (registered through log_orig_register()) can be
translated to string from origin ID.
For that, it is required to add eb_32 tree node to log_origin struct in
order to enable quick integer lookup during runtime. Slow name lookup
using the list is acceptable for config parsing, but it is not the case
during runtime when log_orig_to_str() is expected to be used. Also, to
prevent duplicated info, get rid of ->id field and use ->tree.key instead
Thanks to previous commit, we can know check for log_orig optional flags
in functions taking struct log_orig as parameter. Let's take this
opportunity to add the LOG_ORIG_FL_ERROR flag and check this flag at a
few places to handle the log message differently because if the flag is
set then the caller expects the log to be handled as an error explicitly.
e.g.: in _process_send_log_override(), if the flag is set, use the error
log format instead of the dedicated one.
Rename 'enum log_orig' to 'enum log_orig_id', since this enum specifically
contains the log origin ids.
Add 'struct log_orig' which wraps 'enum log_orig' with optional flags
(no flags defined for now).
Add log_orig() helper func that takes id and flags as parameter and
returns log_orig struct initialized with input arguments.
Update functions taking log origin as parameter so they explicitly take
log orig id or log orig wrapper as argument depending on the level of
context expected by the function.
add a way to register additional log origins using log_origin_register()
that may be used as log profile steps from log profile sections.
For now this does nothing as no extra origins are registered and extra log
origins are not yet considered for runtime logging paths.
When specifying an extra logging step for on <step> under log-profile
section, the logging step is stored within a binary tree for efficient
lookup during runtime. No performance impact should be expected if extra
log origins are not being used, and slight performance impact if extra
log origins are used.
Don't forget to update the documentation when new log origins are added
(both %OG log alias and on <step> log-profile keyword are concerned.
The proxy lock state isn't passed down to relax_listener
through dequeue_proxy_listeners, which causes a deadlock
in relax_listener when it tries to get that lock.
Backporting: Older versions didn't have relax_listener and directly called
resume_listener in dequeue_proxy_listeners. lpx should just be passed directly
to resume_listener then.
The bug was introduced in commit 001328873c
[cf: This patch should fix the issue #2726. It must be backported as far as
2.4]
It has never been permitted to explicitly reference named defaults
sections for which there are duplicate names. This means that when
a duplicate defaults section is found, there's no point in keeping
it since it will never be used for lookups, so it can be dropped.
However, some such defaults sections might have some rules in them
that are implicitly referenced by proxies placed after them. In this
case they cannot be removed.
What is done here is that upon each new named section creation, if
another one is found with the same name, its config location is stored
into the new proxy's {prev_file,prev_line} pair, and the old section is
either destroyed if its refcount is null, or just unindexed. The dup
check when creating a new proxy now consists in checking the prev_line
instead of performing a dup lookup on the defaults section.
This will guarantee that we can't find duplicate defaults sections in
their tree anymore, while still keeping track of what's allocated and
releasing everything upon exit.
Beyond the consistency gain, there are nice savings for large configs
involving many defaults sections: a test with 300k sections saved
about 1.9 GB of RAM, and started 25% faster likely thanks to spending
less time allocating memory.
We'll soon delete unreferenced and duplicated named defaults sections
from the list of proxies. The problem with this is that this list (in
fact a name-based tree) is used to release all of them at the end. Let's
add a list of orphaned defaults sections, typically those containing
"http-check send" statements or various other rules, and that are
implicitly inherited by a proxy hence have a non-zero refcount while
also having a name. These now makes it possible to remove them from
the name index while still keeping their memory around for the lifetime
of the process, and cleaning it at the end.
Proxy file names are assigned a bit everywhere (resolvers, peers,
cli, logs, proxy). All these elements were enumerated and now use
copy_file_name(). The only ha_free() call was turned to drop_file_name().
As a bonus side effect, a 300k backend config saved 14 MB of RAM.
The file name used to point to the calling function's stack for stick
tables, which was OK during parsing but remained dangling afterwards.
At least it was already marked const so as not to accidentally free it.
Let's make it point to a file_name_node now.
In proxies, stick-tables, servers, etc... at plenty of places we store
a file name and a line number. Some file names are the result of strdup()
(e.g. in proxies), others not (e.g. stick-tables) and leave dangling
pointers at the end of parsing. The risk of double-free is not null
either.
In order to stop this, let's first add a simple tool that allows to
register short strings inside a global list, these strings happening
to be server names. The strings are either duplicated and stored upon
failure to find them, or just added to this storage. Since file names
are not expected to disappear before the end of the process, for now
we don't even implement refcounting, and we free them all at the end.
There's already a drop_file_name() function to reset the pointer like
ha_free() used to do, and even if not strictly needed it's a good
habit to get used to doing it.
The strings are returned as const so that they're stored as-is in
structs, and that nasty free() calls are easily caught. The pointer
points to the char[] storage inside the node itself. This way later
if we want to implement refcounting, it will be trivial to just look
up a string and change its associated node's refcount. If needed,
comparisons can also be made on pointers.
For now they're not used yet and are released on deinit().
gcc-12 and above report a wrong warning about a negative length being
passed to memcmp() on an impossible code path when built at -O0. The
pattern is the same at a few places, basically:
int foo(int op, const void *a, const void *b, size_t size, size_t arg)
{
if (op == 1) // arg is a strict multiple of size
return memcmp(a, b, arg - size);
return 0;
}
...
int bar()
{
return foo(0, a, b, sizeof(something), 0);
}
It *might* be possible to invent dummy values for the "len" argument
above in the real code, but that significantly complexifies it and as
usual can easily result in introducing undesired bugs.
Here we take a different approach consisting in shutting the
-Wstringop-overread warning on gcc>=12 at -O0 since that's the only
condition that triggers it. The issue was reported to and confirmed by
the gcc team here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114622
No backport needed, but this should be upstreamed into cebtree after
checking that all involved macros are available.
Now we collect this clock in clock_local_update_date(), the closest from
the poller, which is also used when busy-polling, and the values is set
into the thread's curr_mono_time which did not exist before. Later,
clock_leaving_poll() just sets the prev_mono_time value from the curr_
one instead of retrieving the time at this specific point. It also means
that the monotonic time will now also cover the time needed to update
the global time, which should be negligible. Note that we don't collect
the CPU time in the clock_local_update_date() function even though it's
tempting, because when doing busy-polling, it would be collected on each
round while being useless.
Doing so will make sure that the local time always knows the monotonic
time when it is available.
On a synchronous send from the stream to an applet, if some data were sent,
we must take care to wake the applet up. It is important because if
everything was sent at this stage, there is no other chance to wake the
applet up, mainly because SE_FL_WAIT_DATA flag is set on the applet's sedesc
in sc_update_tx() at the end of process_stream(). This flag prevent any
wakeup of the applet for a send event.
It is not necessary for a mux because the mux stream is called when a
syncrhonous send from the stream is performed. So it is reponsible to wake
the mux connection if necessary.
This patch must be backport to 3.0.
Given that the original list-based version was using a list head as the
root of the variables, while the tree is using a single pointer, it made
sense to reuse that space to place multiple roots, indexed on the lower
bits of the name hash. Two roots slightly increase the performance level,
but the best gain is obtained with 4 roots. The performance is now always
above that of the list, even with small counts, and with 100 vars, it's
21% higher than before, or 67% higher than with the list.
We keep the same lock (it could have made sense to use one lock per head),
because most of the variables in large configs are attached to a stream
or a session, hence are not shared between threads. Thus there's no point
in sharding the pointer.
Configs involving many variables can start to eat a lot of CPU in name
lookups. The reason is that the names themselves are dynamic in that
they are relative to dynamic objects (sessions, streams, etc), so
there's no fixed index for example. The current implementation relies
on a standard linked list, and in order to speed up lookups and avoid
comparing strings, only a 64-bit hash of the variable's name is stored
and compared everywhere.
But with just 100 variables and 1000 accesses in a config, it's clearly
visible that variable name lookup can reach 56% CPU with a config
generated this way:
for i in {0..100}; do
printf "\thttp-request set-var(txn.var%04d) int(%d)" $i $i;
for j in {1..10}; do [ $i -lt $j ] || printf ",add(txn.var%04d)" $((i-j)); done;
echo;
done
The performance and a 4-core skylake 4.4 GHz reaches 85k RPS with a perf
profile showing:
Samples: 170K of event 'cycles', Event count (approx.): 142378815419
Overhead Shared Object Symbol
56.39% haproxy [.] var_to_smp
6.65% haproxy [.] var_set.part.0
5.76% haproxy [.] sample_process_cnv
3.23% haproxy [.] sample_conv_var2smp
2.88% haproxy [.] sample_conv_arith_add
2.33% haproxy [.] __pool_alloc
2.19% haproxy [.] action_store
2.13% haproxy [.] vars_get_by_desc
1.87% haproxy [.] smp_dup
[above, var_to_smp() calls var_get() under the read lock].
By switching to a binary tree, the cost is significantly lower, the
performance reaches 117k RPS (+37%) with this profile:
Samples: 170K of event 'cycles', Event count (approx.): 142323631229
Overhead Shared Object Symbol
40.22% haproxy [.] cebu64_lookup
7.12% haproxy [.] sample_process_cnv
6.15% haproxy [.] var_to_smp
4.75% haproxy [.] cebu64_insert
3.79% haproxy [.] sample_conv_var2smp
3.40% haproxy [.] cebu64_delete
3.10% haproxy [.] sample_conv_arith_add
2.36% haproxy [.] action_store
2.32% haproxy [.] __pool_alloc
2.08% haproxy [.] vars_get_by_desc
1.96% haproxy [.] smp_dup
1.75% haproxy [.] var_set.part.0
1.74% haproxy [.] cebu64_first
1.07% [kernel] [k] aq_hw_read_reg
1.03% haproxy [.] pool_put_to_cache
1.00% haproxy [.] sample_process
The performance lowers a bit earlier than with the list however. What
can be seen is that the performance maintains a plateau till 25 vars,
starts degrading a little bit for the tree while it remains stable till
28 vars for the list. Then both cross at 42 vars and the list continues
to degrade doing a hyperbole while the tree resists better. The biggest
loss is at around 32 variables where the list stays 10% higher.
Regardless, given the extremely narrow band where the list is better, it
looks relevant to switch to this in order to preserve the almost linear
performance of large setups. For example at 1000 variables and 10k
lookups, the tree is 18 times faster than the list.
In addition this reduces the size of the struct vars by 8 bytes since
there's a single pointer, though it could make sense to re-invest them
into a secondary head for example.
This is an import of the compact elastic binary trees at commit
a9cd84a ("OPTIM: descent: better prefetch less and for writes when
deleting")
These will be used to replace certain lists (and possibly certain
tree nodes as well). They're as fast (or even faster) than ebtrees
for lookups, as fast for insertion and slower for deletion, and a
node only uses 2 pointers (like a list).
The only changes were cebtree.h where common/tools.h was replaced
with ebtree.h which we already have and already provides the needed
functions and macros, and the addition of a wrapper cebtree-prv.h in
src/ to redirect to import/cebtree-prv.h.
All callers of vars_prune_* currently check the list for emptiness.
Let's leave that to vars_prune() itself, it will ease some changes in
the code. Thanks to the previous inlining of the vars_prune() function,
there's no performance loss, and even a very tiny 0.1% gain.
As unveiled in GH issue #2711, commit 5541d4995d ("BUG/MEDIUM: queue:
deal with a rare TOCTOU in assign_server_and_queue()") does have some
side effects in that it can occasionally cause an endless loop.
As Christopher analysed it, the problem is that process_srv_queue(),
which uses a trylock in order to leave only one thread in charge of
the dequeueing process, can lose the lock race against pendconn_add().
If this happens on the last served request, then there's no more thread
to deal with the dequeuing, and assign_server_and_queue() will loop
forever on a condition that was initially exepected to be extremely
rare (and still is, except that now it can become sticky). Previously
what was happening is that such queued requests would just time out
and since that was very rare, nobody would notice.
The root of the problem really is that trylock. It was added so that
only one thread dequeues at a time but it doesn't offer only that
guarantee since it also prevents a thread from dequeuing if another
one is in the process of queuing. We need a different criterion.
What we're doing now is to set a flag "dequeuing" in the server, which
indicates that one thread is currently in the process of dequeuing
requests. This one is atomically tested, and only if no thread is in
this process, then the thread grabs the queue's lock and dequeues.
This way it will be serialized with pendconn_add() and no request
addition will be missed.
It is not certain whether the original race covered by the fix above
can still happen with this change, so better keep that fix for now.
Thanks to @Yenya (Jan Kasprzak) for the precise and complete report
allowing to spot the problem.
This patch should be backported wherever the patch above was backported.
Since c5959fd ("MEDIUM: pattern: merge same pattern"), UAF (leading to
crash) can be experienced if the same pattern file (and match method) is
used in two default sections and the first one is not referenced later in
the config. In this case, the first default section will be cleaned up.
However, due to an unhandled case in the above optimization, the original
expr which the second default section relies on is mistakenly freed.
This issue was discovered while trying to reproduce GH #2708. The issue
was particularly tricky to reproduce given the config and sequence
required to make the UAF happen. Hopefully, Github user @asmnek not only
provided useful informations, but since he was able to consistently
trigger the crash in his environment he was able to nail down the crash to
the use of pattern file involved with 2 named default sections. Big thanks
to him.
To fix the issue, let's push the logic from c5959fd a bit further. Instead
of relying on "do_free" variable to know if the expression should be freed
or not (which proved to be insufficient in our case), let's switch to a
simple refcounting logic. This way, no matter who owns the expression, the
last one attempting to free it will be responsible for freeing it.
Refcount is implemented using a 32bit value which fills a previous 4 bytes
structure gap:
int mflags; /* 80 4 */
/* XXX 4 bytes hole, try to pack */
long unsigned int lock; /* 88 8 */
(output from pahole)
Even though it was not reproduced in 2.6 or below by @asmnek (the bug was
revealed thanks to another bugfix), this issue theorically affects all
stable versions (up to c5959fd), thus it should be backported to all
stable versions.
Allow the user to set the "initial state" of a server.
Context:
Servers are always set in an UP status by default. In
some cases, further checks are required to determine if the server is
ready to receive client traffic.
This introduces the "init-state {up|down}" configuration parameter to
the server.
- when set to 'fully-up', the server is considered immediately available
and can turn to the DOWN sate when ALL health checks fail.
- when set to 'up' (the default), the server is considered immediately
available and will initiate a health check that can turn it to the DOWN
state immediately if it fails.
- when set to 'down', the server initially is considered unavailable and
will initiate a health check that can turn it to the UP state immediately
if it succeeds.
- when set to 'fully-down', the server is initially considered unavailable
and can turn to the UP state when ALL health checks succeed.
The server's init-state is considered when the HAProxy instance
is (re)started, a new server is detected (for example via service
discovery / DNS resolution), a server exits maintenance, etc.
Link: https://github.com/haproxy/haproxy/issues/51
Add a factor parameter to stick-tables, called "brates-factor", that is
applied to in/out bytes rates to work around the 32-bits limit of the
frequency counters. Thanks to this factor, it is possible to have bytes
rates beyond the 4GB. Instead of counting each bytes, we count blocks
of bytes. Among other things, it will be useful for the bwlim filter, to be
able to configure shared limit exceeding the 4GB/s.
For now, this parameter must be in the range ]0-1024].
Multipath TCP (MPTCP), standardized in RFC8684 [1], is a TCP extension
that enables a TCP connection to use different paths.
Multipath TCP has been used for several use cases. On smartphones, MPTCP
enables seamless handovers between cellular and Wi-Fi networks while
preserving established connections. This use-case is what pushed Apple
to use MPTCP since 2013 in multiple applications [2]. On dual-stack
hosts, Multipath TCP enables the TCP connection to automatically use the
best performing path, either IPv4 or IPv6. If one path fails, MPTCP
automatically uses the other path.
To benefit from MPTCP, both the client and the server have to support
it. Multipath TCP is a backward-compatible TCP extension that is enabled
by default on recent Linux distributions (Debian, Ubuntu, Redhat, ...).
Multipath TCP is included in the Linux kernel since version 5.6 [3]. To
use it on Linux, an application must explicitly enable it when creating
the socket. No need to change anything else in the application.
This attached patch adds MPTCP per address support, to be used with:
mptcp{,4,6}@<address>[:port1[-port2]]
MPTCP v4 and v6 protocols have been added: they are mainly a copy of the
TCP ones, with small differences: names, proto, and receivers lists.
These protocols are stored in __protocol_by_family, as an alternative to
TCP, similar to what has been done with QUIC. By doing that, the size of
__protocol_by_family has not been increased, and it behaves like TCP.
MPTCP is both supported for the frontend and backend sides.
Also added an example of configuration using mptcp along with a backend
allowing to experiment with it.
Note that this is a re-implementation of Bjrn's work from 3 years ago
[4], when haproxy's internals were probably less ready to deal with
this, causing his work to be left pending for a while.
Currently, the TCP_MAXSEG socket option doesn't seem to be supported
with MPTCP [5]. This results in a warning when trying to set the MSS of
sockets in proto_tcp:tcp_bind_listener.
This can be resolved by adding two new variables:
sock_inet(6)_mptcp_maxseg_default that will hold the default
value of the TCP_MAXSEG option. Note that for the moment, this
will always be -1 as the option isn't supported. However, in the
future, when the support for this option will be added, it should
contain the correct value for the MSS, allowing to correctly
set the TCP_MAXSEG option.
Link: https://www.rfc-editor.org/rfc/rfc8684.html [1]
Link: https://www.tessares.net/apples-mptcp-story-so-far/ [2]
Link: https://www.mptcp.dev [3]
Link: https://github.com/haproxy/haproxy/issues/1028 [4]
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/515 [5]
Co-authored-by: Dorian Craps <dorian.craps@student.vinci.be>
Co-authored-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Add a new field alt_proto to the server structures that
specify if an alternate protocol should be used for this server.
This field can be transparently passed to protocol_lookup to get
an appropriate protocol structure.
This change allows thus to create servers with different protocols,
and not only TCP anymore.
Add a new parameter "alt" that will store wether this configuration
use an alternate protocol.
This alt pointer will contain a value that can be transparently
passed to protocol_lookup to obtain an appropriate protocol structure.
This change is needed to allow for example the servers to know if it
need to use an alternate protocol or not.
It has been reported by Wedl Michael, a student at the University of Applied
Sciences St. Poelten, a potential vulnerability into haproxy as described below.
An attacker could have obtained a TLS session ticket after having established
a connection to an haproxy QUIC listener, using its real IP address. The
attacker has not even to send a application level request (HTTP3). Then
the attacker could open a 0-RTT session with a spoofed IP address
trusted by the QUIC listen to bypass IP allow/block list and send HTTP3 requests.
To mitigate this vulnerability, one decided to use a token which can be provided
to the client each time it successfully managed to connect to haproxy. These
tokens may be reused for future connections to validate the address/path of the
remote peer as this is done with the Retry token which is used for the current
connection, not the next one. Such tokens are transported by NEW_TOKEN frames
which was not used at this time by haproxy.
So, each time a client connect to an haproxy QUIC listener with 0-RTT
enabled, it is provided with such a token which can be reused for the
next 0-RTT session. If no such a token is presented by the client,
haproxy checks if the session is a 0-RTT one, so with early-data presented
by the client. Contrary to the Retry token, the decision to refuse the
connection is made only when the TLS stack has been provided with
enough early-data from the Initial ClientHello TLS message and when
these data have been accepted. Hopefully, this event arrives fast enough
to allow haproxy to kill the connection if some early-data have been accepted
without token presented by the client.
quic_build_post_handshake_frames() has been modified to build a NEW_TOKEN
frame with this newly implemented token to be transported inside.
quic_tls_derive_retry_token_secret() was renamed to quic_do_tls_derive_token_secre()
and modified to be reused and derive the secret for the new token implementation.
quic_token_validate() has been implemented to validate both the Retry and
the new token implemented by this patch. When this is a non-retry token
which could not be validated, the datagram received is marked as requiring
a Retry packet to be sent, and no connection is created.
When the Initial packet does not embed any non-retry token and if 0-RTT is enabled
the connection is marked with this new flag: QUIC_FL_CONN_NO_TOKEN_RCVD. As soon
as the TLS stack detects that some early-data have been provided and accepted by
the client, the connection is marked to be killed (QUIC_FL_CONN_TO_KILL) from
ha_quic_add_handshake_data(). This is done calling qc_ssl_eary_data_accepted()
new function. The secret TLS handshake is interrupted as soon as possible returnin
0 from ha_quic_add_handshake_data(). The connection is also marked as
requiring a Retry packet to be sent (QUIC_FL_CONN_SEND_RETRY) from
ha_quic_add_handshake_data(). The the handshake I/O handler (quic_conn_io_cb())
knows how to behave: kill the connection after having sent a Retry packet.
About TLS stack compatibility, this patch is supported by aws-lc. It is
disabled for wolfssl which does not support 0-RTT at this time thanks
to HAVE_SSL_0RTT_QUIC.
This patch depends on these commits:
MINOR: quic: Add trace for QUIC_EV_CONN_IO_CB event.
MINOR: quic: Implement qc_ssl_eary_data_accepted().
MINOR: quic: Modify NEW_TOKEN frame structure (qf_new_token struct)
BUG/MINOR: quic: Missing incrementation in NEW_TOKEN frame builder
MINOR: quic: Token for future connections implementation.
MINOR: quic: Implement quic_tls_derive_token_secret().
MINOR: tools: Implement ipaddrcpy().
Must be backported as far as 2.6.
This function is a wrapper around SSL_get_early_data_status() for
OpenSSL derived stack and SSL_early_data_accepted() boringSSL derived
stacks like AWS-LC. It returns true for a TLS server if it has
accepted the early data received from a client.
Also implement quic_ssl_early_data_status_str() which is dedicated to be used
for debugging purposes (traces). This function converts the enum returned
by the two function mentionned above to a human readable string.
Modify qf_new_token structure to use a static buffer with QUIC_TOKEN_LEN
as size as defined by the token for future connections (quic_token.c).
Modify consequently the NEW_TOKEN frame parser (see quic_parse_new_token_frame()).
Also add comments to denote that the NEW_TOKEN parser function is used only by
clients and that its builder is used only by servers.
There exist two sorts of token used by QUIC. They are both used to validate
the peer address (path validation). Retry are used for the current
connection the client want to open. This patch implement the other
sort of tokens which after having been received from a connection, may
be provided for the next connection from the same IP address to validate
it (or validate the network path between the client and the server).
The token generation is implemented by quic_generate_token(), and
the token validation by quic_token_chek(). The same method
is used as for Retry tokens to build such tokens to be reused for
future connections. The format is very simple: one byte for the format
identifier to distinguish these new tokens for the Retry token, followed
by a 32bits timestamps. As this part is ciphered with AEAD as cryptographic
algorithm, 16 bytes are needed for the AEAD tag. 16 more random bytes
are added to this token and a salt to derive the AEAD secret used
to cipher the token. In addition to this salt, this is the client IP address
which is used also as AAD to derive the AEAD secret. So, the length of
the token is fixed: 37 bytes.
This is function is similar to quic_tls_derive_retry_token_secret().
Its aim is to derive the secret used to cipher the token to be used
for future connections.
This patch renames quic_tls_derive_retry_token_secret() to a more
and reuses its code to produce a more generic one: quic_do_tls_derive_token_secret().
Two arguments are added to this latter to produce both quic_tls_derive_retry_token_secret()
and quic_tls_derive_token_secret() new function which calls
quic_do_tls_derive_token_secret().
There was a typo in the macro name, where LENGTH was incorrectly
written. This didn't cause any issue because the typo appeared in all
occurrences in the codebase.
Support for 429 was recently added to L7 retries (0d142e075 "MINOR: proxy:
Add support of 429-Too-Many-Requests in retry-on status"). But the
l7_status_match() function was not properly updated. The switch statement
must match the 429 status to be able to perform a L7 retry.
This patch must be backported if the commit above is backported. It is
related to #2687.
The "429" status can now be specified on retry-on directives. PR_RE_* flags
were updated to remains sorted.
This patch should fix the issue #2687. It is quite simple so it may safely
be backported to 3.0 if necessary.
Activate the capture of the TLS signature_algorithms extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
Activate the capture of the TLS supported_versions extension from the
Client Hello. This list is stored in the ssl_capture buffer when the
global option "tune.ssl.capture-cipherlist-size" is enabled.
This patch is the follow-up of 1811d2a6ba (MINOR: tools: add helpers to
backup/clean/restore env).
In order to avoid unexpected behaviour in master-worker mode during the process
reload with a new configuration, when the old one has contained '*env' keywords,
let's backup its initial environment before calling parse_cfg() and let's clean
and restore it in the context of master process, just before it enters in a wait
polling loop.
This will garantee that new workers will have a new updated environment and not
the previous one inherited from the master, which does not read the configuration,
when it's in a wait-mode.
'setenv', 'presetenv', 'unsetenv', 'resetenv' keywords in configuration could
modify the process runtime environment. In case of master-worker mode this
creates a problem, as the configuration is read only once before the forking a
worker and then the master process does the reexec without reading any config
files, just to free the memory. So, during the reload a new worker process will
be created, but it will inherited the previous unchanged environment from the
master in wait mode, thus it won't benefit the changes in configuration,
related to '*env' keywords. This may cause unexpected behavior or some parser
errors in master-worker mode.
So, let's add a helper to backup all process env variables just before it will
read its configuration. And let's also add helpers to clean up the current
runtime environment and to restore it to its initial state (as it was before
parsing the config).
For custom families, there's sometimes an underlying real address and
it would be nice to be able to directly use the real family in calls
to bind() and connect() without having to add explicit checks for
exceptions everywhere.
Let's add a .real_family field to struct proto_fam for this. For now
it's always equal to the family except for non-transferable ones such
as rhttp where it's equal to the custom one (anything else could fit).
At plenty of places we have access to an address family which may
include some custom addresses but we cannot simply convert them to
the real families without performing some random protocol lookups.
Let's simply add a proto_fam table like we have for the protocols.
The protocols could even be indexed there, but for now it's not worth
it.
When we finally split sock_domain from sock_family in 2.3, something
was not cleanly finished. The family is what should be stored in the
address while the domain is what is supposed to be passed to socket().
But for the custom addresses, we did the opposite, just because the
protocol_lookup() function was acting on the domain, not the family
(both of which are equal for non-custom addresses).
This is an API bug but there's no point backporting it since it does
not have visible effects. It was visible in the code since a few places
were using PF_UNIX while others were comparing the domain against AF_MAX
instead of comparing the family.
This patch clarifies this in the comments on top of proto_fam, addresses
the indexing issue and properly reconfigures the two custom families.
Tests performed between a 1 Gbps connected server and a 100 mbps client,
distant by 95ms showed that:
- we need 1.1 MB in flight to fill the link
- rare but inevitable losses are sufficient to make cubic's window
collapse fast and long to recover
- a 100 MB object takes 69s to download
- tolerance for 1 loss between two ACKs suffices to shrink the download
time to 20-22s
- 2 losses go to 17-20s
- 4 losses reach 14-17s
At 100 concurrent connections that fill the server's link:
- 0 loss tolerance shows 2-3% losses
- 1 loss tolerance shows 3-5% losses
- 2 loss tolerance shows 10-13% losses
- 4 loss tolerance shows 23-29% losses
As such while there can be a significant gain sometimes in setting this
tolerance above zero, it can also significantly waste bandwidth by sending
far more than can be received. While it's probably not a solution to real
world problems, it repeatedly proved to be a very effective troubleshooting
tool helping to figure different root causes of low transfer speeds. In
spirit it is comparable to the no-cc congestion algorithm, i.e. it must
not be used except for experimentation.
Upon loss detection, qc_release_lost_pkts() notifies congestion
controllers about the event and its final time. However it does not
pass the number of lost packets, that can provide useful hints for
some controllers. Let's just pass this option.
Previous commit switch to small buffers for HTTP/3 HEADERS emission.
This ensures that several parallel streams can allocate their own buffer
without hitting the connection buffer limit based now on the congestion
window size.
However, this prevents the transmission of responses with uncommonly
large headers. Indeed, if all headers cannot be encoded in a single
buffer, an error is reported which cause the whole connection closure.
Adjust this by implementing a realloc API exposed by QUIC MUX. This
allows application layer to switch from a small to a default buffer and
restart its processing. This guarantees that again headers not longer
than bufsize can be properly transferred.
This patch extends qc_stream_desc API to be able to allocate small
buffers. QUIC MUX API is similarly updated as ultimatly each application
protocol is responsible to choose between a default or a smaller buffer.
Internally, the type of allocated buffer is remembered via qc_stream_buf
instance. This is mandatory to ensure that the buffer is released in the
correct pool, in particular as small and standard buffers can be
configured with the same size.
This commit is purely an API change. For the moment, small buffers are
not used. This will changed in a dedicated patch.
Define a new buffer pool reserved to allocate smaller memory area. For
the moment, its usage will be restricted to QUIC, as such it is declared
in quic_stream module.
Add a new config option "tune.bufsize.small" to specify the size of the
allocated objects. A special check ensures that it is not greater than
the default bufsize to avoid unexpected effects.
QUIC MUX buffer allocation limit is now directly based on the underlying
congestion window size. previous static limit based on conn-tx-buffers
is now unused. As such, this commit adds a warning to users to prevent
that it is now obsolete.
Secondly, update max-window-size setting. It is now the main entrypoint
to limit both the maximum congestion window size and the number of QUIC
MUX allocated buffer on emission. Remove its special value '0' which was
used to automatically adjust it on now unused conn-tx-buffers.
Each QUIC MUX may allocate buffers for MUX stream emission. These
buffers are then shared with quic_conn to handle ACK reception and
retransmission. A limit on the number of concurrent buffers used per
connection has been defined statically and can be updated via a
configuration option. This commit replaces the limit to instead use the
current underlying congestion window size.
The purpose of this change is to remove the artificial static buffer
count limit, which may be difficult to choose. Indeed, if a connection
performs with minimal loss rate, the buffer count would limit severely
its throughput. It could be increase to fix this, but it also impacts
others connections, even with less optimal performance, causing too many
extra data buffering on the MUX layer. By using the dynamic congestion
window size, haproxy ensures that MUX buffering corresponds roughly to
the network conditions.
Using QCC <buf_in_flight>, a new buffer can be allocated if it is less
than the current window size. If not, QCS emission is interrupted and
haproxy stream layer will subscribe until a new buffer is ready.
One of the criticals parts is to ensure that MUX layer previously
blocked on buffer allocation is properly woken up when sending can be
retried. This occurs on two occasions :
* after an already used Tx buffer is cleared on ACK reception. This case
is already handled by qcc_notify_buf() via quic_stream layer.
* on congestion window increase. A new qcc_notify_buf() invokation is
added into qc_notify_send().
Finally, remove <avail_bufs> QCC field which is now unused.
This commit is labelled MAJOR as it may have unexpected effect and could
cause significant behavior change. For example, in previous
implementation QUIC MUX would be able to buffer more data even if the
congestion window is small. With this patch, data cannot be transferred
from the stream layer which may cause more streams to be shut down on
client timeout. Another effect may be more CPU consumption as the
connection limit would be hit more often, causing more streams to be
interrupted and woken up in cycle.
Define a new QCC counter named <buf_in_flight>. Its purpose is to
account the current sum of all allocated stream buffer size used on
emission.
For this moment, this counter is updated and buffer allocation and
deallocation. It will be used to replace <avail_bufs> once congestion
window is used as limit for buffer allocation in a future commit.
Define a new qc_stream_desc flag QC_SD_FL_OOB_BUF. This is to mark
streams which are not subject to the connection limit on allocated MUX
stream buffer.
The purpose is to simplify handling of QUIC MUX streams which do not
transfer data and as such are not driven by haproxy layer, for example
HTTP/3 control stream. These streams interacts synchronously with QUIC
MUX and cannot retry emission in case of temporary failure.
This commit will be useful once connection buffer allocation limit is
reimplemented to directly rely on the congestion window size. This will
probably cause the buffer limit to be reached more frequently, maybe
even on QUIC MUX initialization. As such, it will be possible to mark
control streams and prevent them to be subject to the buffer limit.
QUIC MUX expose a new function qcs_send_metadata(). It can be used by an
application protocol to specify which streams are used for control
exchanges. For the moment, no such stream use this mechanism.
A limit per connection is put on the number of buffers allocated by QUIC
MUX for emission accross all its streams. This ensures memory
consumption remains under control. This limit is simply explained as a
count of buffers which can be concurrently allocated for each
connection.
As such, quic_conn structure was used to account currently allocated
buffers. However, a quic_conn nevers allocates new stream buffers. This
is only done at QUIC MUX layer. As such, this commit moves buffer
accounting inside QCC structure. This simplifies the API, most notably
qc_stream_buf_alloc() usage.
Note that this commit inverts the accounting. Previously, it was
initially set to 0 and increment for each allocated buffer. Now, it is
set to the maximum value and decrement for each buf usage. This is
considered as clearer to use.
Define a new global keyword tune.quic.frontend.max-window-size. This
allows to set globally the maximum congestion window size for each QUIC
frontend connections.
The default value is 0. It is a special value which automatically derive
the size from the configured QUIC connection buffer limit. This is
similar to the previous "quic-cc-algo" behavior, which can be used to
override the maximum window size per bind line.
load_cfg_in_mem() can continuously reallocate memory in order to load an
extremely large input from /dev/stdin, until it fails with ENOMEM, which means
that process has consumed all available RAM. In case of containers and
virtualized environments it's not very good.
So, in order to prevent this, let's introduce MAX_CFG_SIZE as 10MB, which will
limit the size of input supplied via /dev/stdin.
Some systems require log formats in the CLF format and that meant that I
could not send my logs for proxies in mode tcp to those servers. This
implements a format that uses log variables that are compatble with TCP
mode frontends and replaces traditional HTTP values in the CLF format
to make them stand out. Instead of logging method and URI like this
"GET /example HTTP/1.1" it will log "TCP " and for a response code I
used "000" so it would be easy to separate from legitimate HTTP
traffic. Now your log servers that require a CLF format can see the
timings for TCP traffic as well as HTTP.
It is now possible to use "drop" keyword for "on" lines under a
log-profile section to specify that no log at all should be emitted for
the specified step (setting an empty format was not sufficient to do so
because only the log payload would be empty, not the log header, thus the
log would still be emitted).
It may be useful to selectively disable logging at specific steps for a
given log target (since the log profile may be set on log directives):
log-profile myprof
on request format "blabla" sd "custom sd"
on response drop
New testcase was added to reg-tests/log/log_profiles.vtc
ci_insert() is a function which allows to insert a string <str> of size
<len> at <pos> of the input buffer. This is the equivalent of
ci_insert_line2() but without inserting '\r\n'
As readcfgfile no longer opens configuration files and reads them with fgets,
but performs only the parsing of provided data, let's rename it to parse_cfg by
analogy with read_cfg in haproxy.c.
Let's call load_cfg_in_ram() helper for each configuration file to load it's
content in some area in memory. Adapt readcfgfile() parser function
respectively. In order to limit changes in its scope we give as an argument a
cfgfile structure, already filled in init_args() and in load_cfg_in_ram() with
file metadata and content.
Parser function (readcfgfile()) uses now fgets_from_mem() instead of standard
fgets from libc implementations.
SPOE filter parses its own configuration file, pointed by 'config' keyword in
the configuration already loaded in memory. So, let's allocate and fill for
this a supplementary cfgfile structure, which is not referenced in cfg_cfgfiles
list. This structure and the memory with content of SPOE filter configuration
are freed immediately in parse_spoe_flt(), when readcfgfile() returns.
HAProxy OpenTracing filter also uses its own configuration file. So, let's
follow the same logic as we do for SPOE filter.
Add fgets_from_mem() helper to read lines from configuration files, stored now
as memory chunks. In order to limit changes in the first-level parser code
(readcfgfile()), it is better to reimplement the standard fgets, i.e. to
have a fgets, which can read the serialized data line by line from some memory
area, instead of file stream, and can keep the same behaviour as libc
implementations fgets.
list_append_word() helper was used before only to chain configuration file names
in a list. As now we start to use cfgfile structure which represents entire file
in memory and its metadata, let's adapt this helper to use this structure and
let's rename it to list_append_cfgfile().
Adapt functions, which process configuration files and directories to use
cfgfile structure and list_append_cfgfile() instead of wordlist.
This and following commits serve to prepare loading configuration files in
memory, before parsing them, as we may need to parse some parts of
configuration in different moments of the startup sequence. This is a case of
the new master-worker initialization process. Here we need to read at first
only the global and the program sections and only after some steps
(forking worker, etc) the rest of the configuration.
Add a new structure cfgfile to keep configuration files metadata and content,
loaded somewhere in a memory. Instances of filled cfgfile structures could be
chained in a list, as the order in which they were loaded is important.
We now have a trace_ctx to hold the sess, conn, qc, stream and so on.
This will allow us to pass it across layers so that other helpers can
help fill them.
Ideally it should be passed as an argument to __trace_enabled() by
__trace() so that it can be passed back to the trace callback. But
it seems that trace callbacks are smart enough to figure all their
info when they need them.
With "follow" from one source to another, it becomes possible for a
source to automatically follow another source's tracked pointer. The
best example is the session:
- the "session" source is enabled and has a "lockon session"
-> its lockon_ptr is equal to the session when valid
- other sources (h1,h2,h3 etc) are configured for "follow session"
and will then automatically check if session's lockon_ptr matches
its own session, in which case tracing will be enabled for that
trace (no state change).
It's not necessary to start/pause/stop traces when using this, only
"follow" followed by a source with lockon enabled is needed. Some
combinations might work better than others. At the moment the session
is almost never known from the backend, but this may improve.
The meta-source "all" is supported for the follower so that all sources
will follow the tracked one.
Reuse newly defined tot_time structure to measure various values related
to a QCS lifetime.
First, a timer is used to comptabilize the total QCS lifetime. Then, two
other timers are used to account the total time during which Tx from
stream layer to MUX is blocked, either on lack of buffer or due to
flow-control.
These three timers are reported in qmux_dump_qcs_info(). Thus, they are
available in traces and for QUIC MUX debug string sample.
Define a new utility type tot_time. Its purpose is to be able to account
elapsed time accross multiple periods. Functions are defined to easily
start and stop measures, and return the current value.
Define a new xprt_ops callback named dump_info. This can be used to
extend MUX debug string with infos from the lower layer.
Implement dump_info for QUIC stack. For now, only minimal info are
reported : bytes in flight and size of the sending window. This should
allow to detect if the congestion controller is fine. These info are
reported via QUIC MUX debug string sample.
Extract trace code to dump QCC and QCS instances into dedicated
functions named qmux_dump_qc{c,s}_info(). This will allow to easily
print QCC/QCS infos outside of traces.
These are passed to the underlying mux to retrieve debug information
at the mux level (stream/connection) as a string that's meant to be
added to logs.
The API is quite complex just because we can't pass any info to the
bottom function. So we construct a union and pass the argument as an
int, and expect the callee to fill that with its buffer in return.
Most likely the mux->ctl and ->sctl API should be reworked before
the release to simplify this.
The functions take an optional argument that is a bit mask of the
layers to dump:
muxs=1
muxc=2
xprt=4
conn=8
sock=16
The default (0) logs everything available.
STREAM frames have dedicated handling on retransmission. A special check
is done to remove data already acked in case of duplicated frames, thus
only unacked data are retransmitted.
This handling is faulty in case of an empty STREAM frame with FIN set.
On retransmission, this frame does not cover any unacked range as it is
empty and is thus discarded. This may cause the transfer to freeze with
the client waiting indefinitely for the FIN notification.
To handle retransmission of empty FIN STREAM frame, qc_stream_desc layer
have been extended. A new flag QC_SD_FL_WAIT_FOR_FIN is set by MUX QUIC
when FIN has been transmitted. If set, it prevents qc_stream_desc to be
freed until FIN is acknowledged. On retransmission side,
qc_stream_frm_is_acked() has been updated. It now reports false if
FIN bit is set on the frame and qc_stream_desc has QC_SD_FL_WAIT_FOR_FIN
set.
This must be backported up to 2.6. However, this modifies heavily
critical section for ACK handling and retransmission. As such, it must
be backported only after a period of observation.
This issue can be reproduced by using the following socat command as
server to add delay between the response and connection closure :
$ socat TCP-LISTEN:<port>,fork,reuseaddr,crlf SYSTEM:'echo "HTTP/1.1 200 OK"; echo ""; sleep 1;'
On the client side, ngtcp2 can be used to simulate packet drop. Without
this patch, connection will be interrupted on QUIC idle timeout or
haproxy client timeout with ERR_DRAINING on ngtcp2 :
$ ngtcp2-client --exit-on-all-streams-close -r 0.3 <host> <port> "http://<host>:<port>/?s=32o"
Alternatively to ngtcp2 random loss, an extra haproxy patch can also be
used to force skipping the emission of the empty STREAM frame :
diff --git a/include/haproxy/quic_tx-t.h b/include/haproxy/quic_tx-t.h
index efbdfe687..1ff899acd 100644
--- a/include/haproxy/quic_tx-t.h
+++ b/include/haproxy/quic_tx-t.h
@@ -26,6 +26,8 @@ extern struct pool_head *pool_head_quic_cc_buf;
/* Flag a sent packet as being probing with old data */
#define QUIC_FL_TX_PACKET_PROBE_WITH_OLD_DATA (1UL << 5)
+#define QUIC_FL_TX_PACKET_SKIP_SENDTO (1UL << 6)
+
/* Structure to store enough information about TX QUIC packets. */
struct quic_tx_packet {
/* List entry point. */
diff --git a/src/quic_tx.c b/src/quic_tx.c
index 2f199ac3c..2702fc9b9 100644
--- a/src/quic_tx.c
+++ b/src/quic_tx.c
@@ -318,7 +318,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
tmpbuf.size = tmpbuf.data = dglen;
TRACE_PROTO("TX dgram", QUIC_EV_CONN_SPPKTS, qc);
- if (!skip_sendto) {
+ if (!skip_sendto && !(first_pkt->flags & QUIC_FL_TX_PACKET_SKIP_SENDTO)) {
int ret = qc_snd_buf(qc, &tmpbuf, tmpbuf.data, 0, gso);
if (ret < 0) {
if (gso && ret == -EIO) {
@@ -354,6 +354,7 @@ static int qc_send_ppkts(struct buffer *buf, struct ssl_sock_ctx *ctx)
qc->cntrs.sent_bytes_gso += ret;
}
}
+ first_pkt->flags &= ~QUIC_FL_TX_PACKET_SKIP_SENDTO;
b_del(buf, dglen + QUIC_DGRAM_HEADLEN);
qc->bytes.tx += tmpbuf.data;
@@ -2066,6 +2067,17 @@ static int qc_do_build_pkt(unsigned char *pos, const unsigned char *end,
continue;
}
+ switch (cf->type) {
+ case QUIC_FT_STREAM_8 ... QUIC_FT_STREAM_F:
+ if (!cf->stream.len && (qc->flags & QUIC_FL_CONN_TX_MUX_CONTEXT)) {
+ TRACE_USER("artificially drop packet with empty STREAM frame", QUIC_EV_CONN_TXPKT, qc);
+ pkt->flags |= QUIC_FL_TX_PACKET_SKIP_SENDTO;
+ }
+ break;
+ default:
+ break;
+ }
+
quic_tx_packet_refinc(pkt);
cf->pkt = pkt;
}
When a STREAM frame is retransmitted, a check is performed to remove
range of data already acked from it. This is useful when STREAM frames
are duplicated and splitted to cover different data ranges. The newly
retransmitted frame contains only unacked data.
This process is performed similarly in qc_dup_pkt_frms() and
qc_build_frms(). Refactor the code into a new function named
qc_stream_frm_is_acked(). It returns true if frame data are already
fully acked and retransmission can be avoided. If only a partial range
of data is acknowledged, frame content is updated to only cover the
unacked data.
This patch does not have any functional change. However, it simplifies
retransmission for STREAM frames. Also, it will be reused to fix
retransmission for empty STREAM frames with FIN set from the following
patch :
BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM
As such, it must be backported prior to it.
qc_stream_desc had a field <release> used as a boolean. Convert it with
a new <flags> field and QC_SD_FL_RELEASE value as equivalent.
The purpose of this patch is to be able to extend qc_stream_desc by
adding newer flags values. This patch is required for the following
patch
BUG/MEDIUM: quic: handle retransmit for standalone FIN STREAM
As such, it must be backported prior to it.
haproxy supports tunnel establishment through HTTP Upgrade mechanism.
Since the following commit, extended CONNECT is also supported for
HTTP/2 both on frontend and backend side.
commit 9bf957335e
MEDIUM: mux_h2: generate Extended CONNECT from htx upgrade
As specified by HTTP/2 rfc, "h2c" can be used by an HTTP/1.1 client to
request an upgrade to HTTP/2. In haproxy, this is not supported so it
silently ignores this. However, Connection and Upgrade headers are
forwarded as-is on the backend side.
If using HTTP/1 on the backend side and the server supports this upgrade
mechanism, haproxy won't be able to parse the HTTP response. If using
HTTP/2, mux backend tries to incorrectly convert the request to an
Extended CONNECT with h2c protocol, which may also prevent the response
to be transmitted.
To fix this, flag HTTP/1 request with "h2c" or "h2" token in an upgrade
header. On converting the header list to HTX, the upgrade header is
skipped if any of this token is present and the H1_MF_CONN_UPG flag is
removed.
This issue can easily be reproduced using curl --http2 argument to
connect to an HTTP/1 frontend.
This must be backported up to 2.4 after a period of observation.
Decode QUIC MUX connection and stream elements via qcc_show_flags() and
qcs_show_flags(). Flags definition have been moved outside of USE_QUIC
to ease compilation of flags binary.
Add ->get_info() new control layer callback definition to protocol struct to
retreive statiscal counters information at transport layer (TCPv4/TCPv6) identified by
an integer into a long long int.
Move the TCP specific code from get_tcp_info() to the tcp_get_info() control layer
function (src/proto_tcp.c) and define it as the ->get_info() callback for
TCPv4 and TCPv6.
Note that get_tcp_info() is called for several TCP sample fetches.
This patch is useful to support some of these sample fetches for QUIC and to
keep the code simple and easy to maintain.
Then reactivate HAVE_SSL_0RTT and HAVE_SSL_0RTT_QUIC for AWS-LC, which
were wrongly deactivated in f5353f2c ("MINOR: ssl: add HAVE_SSL_0RTT
constant").
Must be backported to 3.0.
There's a rare TOCTOU case that happens from time to time with maxconn 1
and multiple threads. Between the moment we see the queue full and the
moment we queue a request, it's possible that the last request on the
server or proxy ended and that no other one is left to offer it its place.
Given that all this code path is performance-critical and we cannot afford
to increase the lock duration, better recheck for the condition after
queueing. For this we need to be able to check for the condition and
cleanly dequeue a request. That's what this patch provides via the new
function pendconn_must_try_again(). It will catch more requests than
absolutely needed though it will catch them all. It may find that around
1/1000 of requests are at risk, though testing shows that in practice,
it's around 1 per million that really gets stuck (other ones benefit
from timing and finishing late requests). Maybe in the future some
conditions might be refined but it's harmless.
What happens to such requests is that they're dequeued and their pendconn
freed, so that the caller can decide to try to LB or queue them again. For
now the function is not used, it's just added separately for easier tracking.
Add ->state_cli() new callback to quic_cc_algo struct to define a
function called by the "show quic (cc|full)" commands to dump some information
about the congestion algorithm internal state currently in use by the QUIC
connections.
Implement this callback for CUBIC algorithm to dump its internal variables:
- K: (the time to reach the cubic curve inflexion point),
- last_w_max: the last maximum window value reached before intering
the last recovery period. This is also the window value at the
inflexion point of the cubic curve,
- wdiff: the difference between the current window value and last_w_max.
So negative before the inflexion point, and positive after.
In 2.5-dev9, commit 631c7e866 ("MEDIUM: h1: Force close mode for invalid
uses of T-E header") enforced a recently arrived new security rule in the
HTTP specification aiming at preventing a class of content-smuggling
attacks involving HTTP/1.0 agents. It consists in handling the very rare
T-E + C-L requests or responses in close mode.
It happens it does have an impact of a rare few and very old clients
(probably running insecure TLS stacks by the way) that continue to send
both with their POST requests. The impact is that for each and every
request they'll have to reconnect, possibly negotiating a full TLS
handshake that becomes harmful to the machine in terms of CPU computation.
This commit adds a new option "h1-do-not-close-on-insecure-transfer-encoding"
that does exactly what it says, it just asks not to close on such messages,
even though the message continues to be sanitized and C-L dropped. It means
that the risk is only between the sender and haproxy, which is limited, and
might be the only acceptable solution for such environments having to deal
with broken implementations.
The cases are so rare that it should not need to be backported, or in the
worst case, to the latest LTS if there is any demand.
Define a new quic-initial "send-retry" rule. This allows to force the
emission of a Retry packet on an initial without token instead of
instantiating a new QUIC connection.
Define a new quic-initial action named "reject". Contrary to dgram-drop,
the client is notified of the rejection by a CONNECTION_CLOSE with
CONNECTION_REFUSED error code.
To be able to emit the necessary CONNECTION_CLOSE frame, quic_conn is
instantiated, contrary to dgram-drop action. quic_set_connection_close()
is called immediatly after qc_new_conn() which prevents the handshake
startup.
To extend quic-initial rules, pass quic_dgram instance to argument for
the various actions. As such, quic_dgram is now supported as an obj_type
and can be used in session origin field.
Add ACL condition support for quic-initial rules. This requires the
extension of quic_parse_quic_initial() to parse an extra if/unless
block.
Only layer4 client samples are allowed to be used with quic-initial
rules. However, due to the early execution of quic-initial rules prior
to any connection instantiation, some samples are non supported.
To be able to use the 4 described samples, a dummy session is
instantiated before quic-initial rules execution. Its src and dst fields
are set from the received datagram values.
Implement a new set of rules labelled as quic-initial.
These rules as specific to QUIC. They are scheduled to be executed early
on Initial packet parsing, prior a new QUIC connection instantiation.
Contrary to tcp-request connection, this allows to reject traffic
earlier, most notably by avoiding unnecessary QUIC SSL handshake
processing.
A new module quic_rules is created. Its main function
quic_init_exec_rules() is called on Initial packet parsing in function
quic_rx_pkt_retrieve_conn().
For the moment, only "accept" and "dgram-drop" are valid actions. Both
are final. The latter drops silently the Initial packet instead of
allocating a new QUIC connection.
With AWS-LC, the aead part is covered by the EVP_AEAD API which
provides the correct EVP_aead_chacha20_poly1305(), however for header
protection it does not provides an EVP_CIPHER for chacha20.
This patch implements exceptions in the header protection code and use
EVP_CIPHER_CHACHA20 and EVP_CIPHER_CTX_CHACHA20 placeholders so we can
use the CRYPTO_chacha_20() primitive manually instead of the EVP_CIPHER
API.
This requires to check if we are using EVP_CIPHER_CTX_CHACHA20 when
doing EVP_CIPHER_CTX_free().
In order to prepare the code for using Chacha20 with the EVP_AEAD API,
both quic_tls_hp_decrypt() and quic_tls_hp_encrypt() need an extra key
argument.
Indeed Chacha20 does not exists as an EVP_CIPHER in AWS-LC, so the key
won't be embedded into the EVP_CIPHER_CTX, so we need an extra parameter
to use it.
Some of the crypto functions used for headers protection in QUIC are
named with an "aes" name even thought they are not used for AES
encryption only.
This patch renames these "aes" to "hp" so it is clearer.
The QUIC crypto is using the EVP_CIPHER API in order to achieve
authenticated encryption, this was the API which was used with OpenSSL.
With libraries that inspires from BoringSSL (libreSSL and AWS-LC), the
AEAD algorithms are implemented using the EVP_AEAD API.
This patch converts the call to the EVP_CIPHER API when called in the
contex of AEAD cryptography for QUIC.
The patch defines some QUIC_AEAD macros that can be either EVP_CIPHER or
EVP_AEAD depending on the library.
This was mainly done for AWS-LC but this could be useful for other
libraries. This should finally allow to use CHACHA20_POLY1305 with
AWS-LC.
This patch allows to use the following ciphers with the EVP_AEAD API:
- TLS1_3_CK_AES_128_GCM_SHA256
- TLS1_3_CK_AES_256_GCM_SHA384
AWS-LC does not implement TLS1_3_CK_AES_128_CCM_SHA256 and
TLS1_3_CK_CHACHA20_POLY1305_SHA256 requires some hack for headers
protection which will come in another patch.
Add a new struct member to sft structure named e_processed in order to
track the total number of events processed by sft applets.
sink_forward_oc_io_handler() and sink_forward_io_handler() now make use
of ring_dispatch_messages() optional value added in the previous commit
in order to increase the number of processed events.
ring_dispatch_messages() now takes an optional argument <processed> which
must point to a size_t counter when provided.
When provided, the value is updated to the number of messages processed
by the function.
spoe_check_vsn() function can now be used to check if a version, converted
to an integer, via spoe_str_to_vsn() for instance, is supported. To do so,
the list of all supported version is now exported.
Add session/stream scopes related to the parent. To do so, "psess", "ptxn",
"preq" or "pres" must be used instead of tranditionnal scopes (without the
first "p"). the "proc" scope is not concerned by this change because it is
not linked to a stream. When such scopes are used, a specific flags is added
on the variable description during the variable parsing.
For now, theses scopes are parsed and the variable description is updated
accordingly. But at the end, any operation on the variable value fails.
Now a variable description is retrieved when a variable is parsed, we can
use it to set or unset the variable value. It is mandatory to be able to
know the parent stream, if any, must be used, instead of the current one.
A variable description is now used to parse a variable and extract its name
and its scope. It is mandatory to be able to add some flags on the variable
when it is evaluated (set or get). Among other things, this will be used to
know the parent stream, if any, must be used, instead of the current one.
A pointer to a parent stream was added in the stream structure. For now,
this pointer is never set, but the idea is to have an access to a stream
environment from another one from the moment there is a parent/child
relationship betwee these streams.
Concretely, for now, there is nothing to formalize this relationship.
Fix build warning on NetBSD by reapplying f278eec37a ("BUILD: tree-wide:
cast arguments to tolower/toupper to unsigned char").
This should fix issue #2551.
It is more handy to use LIM2A in debug_parse_cli_show_dev(), as it allows to
show a custom string ("unlimited"), if a given limit value equals to 0.
normalize_rlim() handler is needed to convert properly RLIM_INFINITY to zero,
with the respect of type sizes, as rlim_t is always 4 bytes on 32bit and
64bit arch.
During tests, it's pretty visible that with many threads and a large
number of FDs, the process may take time to be ready. The reason for
this is that the full fdtab array is scanned by each and every thread
at boot in fd_reregister_all() in order to make each thread-local
poller adopt the FDs that are relevant to it. The problem is that
when dealing with 1-2M FDs and 64+ threads, it starts to represent
quite a number of loops, and usually the fdtab array doesn't entirely
fit in the CPU's L3 cache, causing extra memory accesses.
It's particularly visible when issuing debugging commands to the CLI
because usually the first one fails while the CPU is at 100% for half
a second (which also is socat's timeout). A quick test with this:
global
stats socket /tmp/sock1 level admin mode 666
stats timeout 1h
maxconn 2000000
And the following script started in another window:
while ! time socat -t5 - /tmp/sock1 <<< "show version";do date -Ins;done
shows that it takes 1.58s for the socat instance that succeeds on an
Ampere Altra with 80 cores, this requires to change the timeout (defaults
to half a second) otherwise it returns nothing. In addition it also means
that during reloads, some CPU spikes will be noticed.
Adding a prefetch of the current FD + 16 improves the startup time by 30%
but that's far from being sufficient.
In practice all of this is performed at boot time, a moment at which we
know that extremely few FDs are registered (basically just the listeners),
so FD numbers are usually very low and the rest of the table is scanned
for no benefit. Ideally, knowing upfront how many FDs we have should be
sufficient.
A first approach would consist in counting the entries on a single thread
before registering pollers. It's not necessarily efficient and would take
time anyway.
This patch takes a different approach. It consists in keeping a thread-local
max ("fd_highest") that is updated whenever fd_insert() is called with a
larger number. Of course this is not correct once all threads have started,
but it will remain valid during boot since the same value is used during
startup and is cloned for each thread, and no scheduling happens anywhere
during this period, so that all threads are aware of the highest FD they've
seen registered, even if it had been done in some init code, and this without
having to deal with a shared variable.
Here on the test platform, the script gets its response in 10ms vs 1580
before.
SPOE functions definitions were splitted on 2 or more lines, with the return
type alone on the first line. It is unusual in the HAProxy code.
The related issue is #2502.
It is the huge part of the series. The patch is not so huge, it removes
functions to produce or consume frames. The SPOE applet is pretty light
now. But since this patch, the SPOP multiplexer is now used. The SPOP mode
is now automatically ised for SPOP backends. So if there are bugs in the
SPOP multiplexer, they will be visible now.
The related issue is #2502.
The SPOP health-checks are now performed using the SPOP multiplexer. This
will be fixed later, but for now, it is considered as a L4 health-check and
no specific status code is reported. It means the corresponding vtest script
is marked as broken for now.
Functionnaly speaking, the same is performed. A connection is opened, a
HELLO frame is sent to the agent and we wait for the HELLO frame from the
agent in reply. But only L4OK, L4KO or L4TOUT will be reported.
The related issue is #2502.
It is no possible yet to use it. Idles connections and pipelining mode are
not supported for now. But it should be possible to open a SPOP connection,
perform the HELLO handshake, send a NOTIFY frame based on data produced by
the client side and receive the corresponding ACK frame to transfer its
content to the client side.
The related issue is #2502.
Structures describing the SPOE applet context, the SPOE filter configuration
and context and the SPOE messages and groups are moved in the C file. In
spoe-t.h file, it remains the structure describing an SPOE agent and flags
used by both sides.
In addition, the SPOE frontend, created for a given SPOE engine, is moved
from the SPOE filter configuration to the SPOE agent structure.
The related issue is #2502.