When the app_ops were removed, direct calls to the SC wake callback function
were replaced by tasklet wakeups. However, in conn_create_mux(), it was
replaced by a direct call to sc_conn_process(). However, sc_conn_process()
is only usable when the SC is attach to a stream. A backend mux can be
created for a healcheck. In this context, sc_conn_process() cannot be
called.
Because of this bug, crashes can be experienced when an error is triggered
during a SSL connection attempt from a healthcheck.
To fix the issue, the call to sc_conn_process() was replaced by a tasklet
wakeup.
This patch should fix the issue #3326. No backport needed.
When a peer sends a dictionary entry update with a value (the else
branch at line 2109), the entry id decoded from the wire was never
validated against dc->max_entries before being used as an array index
into dc->rx[].
A malicious peer can send id=N where N > 128 (PEER_STKT_CACHE_MAX_ENTRIES)
to:
- dc->rx[id-1].de at line 2123: OOB read followed by atomic decrement
and potential free of an attacker-controlled pointer via
dict_entry_unref()
- dc->rx[id-1].de = de at line 2124: OOB write of a heap pointer at
an attacker-controlled offset (16-byte stride, ~64 GiB range)
The bounds check was added to the key-only branch in commit f9e51beec
("BUG/MINOR: peers: Do not ignore a protocol error for dictionary
entries.") but was never added to the with-value branch. The bug has
been present since dictionary support was introduced in commit
8d78fa7def ("MINOR: peers: Make peers protocol support new
"server_name" data type.").
Reachable from any TCP client that knows the configured peer name
(no cryptographic authentication on the peers protocol). Requires a
stick-table with "store server_key" in the configuration.
Fix by hoisting the bounds check above the branch so it covers both
paths.
Must be backported as far as 2.6.
When the input chunk is already the large buffer (chk->size ==
large_trash_size), the <= comparison still matched and returned
another large buffer of the same size. Callers that retry on a
non-NULL return value (sample.c:4567 in json_query) loop forever.
The json_query infinite loop is trivially triggered: mjson_unescape()
returns -1 not only when the output buffer is too small but also for
any \uXXYY escape where XX != "00" (mjson.c:305) and for invalid
escapes like \q. The retry loop assumes -1 always means "grow the
buffer", so a 14-byte JSON body of {"k":"\u0100"} hangs the worker
thread permanently. Send N such requests to exhaust all worker
threads.
Use < instead of <= so a chunk that is already large yields NULL.
This also fixes the json converter overflow at sample.c:2869 where
no recheck happens after the "growth" returned a same-size buffer.
Introduced in commit ce912271db ("MEDIUM: chunk: Add support for
large chunks"). No backport needed.
A copy-paste error in alloc_trash_buffers_per_thread() passes
global.tune.bufsize_large to alloc_small_trash_buffers() instead of
global.tune.bufsize_small. This sets small_trash_size = bufsize_large.
When tune.bufsize.large is configured, get_larger_trash_chunk() then
incorrectly matches a large buffer against small_trash_size at line
169 and "grows" it to a regular (smaller) buffer. b_xfer() at line
179 attempts to copy the large buffer's contents into the smaller one:
- Default builds (DEBUG_STRICT=1): BUG_ON in __b_putblk() aborts
the process -> remote DoS
- DEBUG_STRICT=0 builds: BUG_ON becomes ASSUME() and the compiler
elides the check -> heap overflow with attacker-controlled bytes
Reachable via the json converter (sample.c:2862) when escaping
~bufsize_large/6 control characters in attacker-supplied data such
as a request header or body.
Introduced in commit 92a24a4e87 ("MEDIUM: chunk: Add support for
small chunks"). No backport needed.
hlua_error() is a printf-family function (calls vsnprintf), but
hlua_patref_set, hlua_patref_add, and _hlua_patref_add_bulk pass
errmsg directly as the format string. errmsg is built by pattern.c
helpers that embed the user-supplied key or value verbatim, e.g.
pat_ref_set_elt() generates "unable to parse '<value>'".
A Lua script calling:
ref:set("key", "%p.%p.%p.%p.%p.%p.%p.%p")
against a map with an integer output type (where the parse fails)
gets stack/register contents formatted into the (nil, err) return
value -> ASLR/canary leak. With %n and no _FORTIFY_SOURCE this
becomes an arbitrary write primitive.
This must be backported as far as the Patref Lua API exists.
hlua_httpclient_table_to_hdrs() declares a VLA of size
global.tune.max_http_hdr (default 101) on the stack but never checks
hdr_num against that bound. A Lua script that supplies a header table
with more than 101 values writes struct http_hdr entries (two ist =
two heap pointers + two lengths) past the end of the VLA, smashing
the stack frame.
Trigger from any Lua action/task/service:
local hc = core.httpclient()
local v = {}
for i = 1, 300 do v[i] = "x" end
hc:get{ url = "http://127.0.0.1/", headers = { ["X"] = v } }
Each out-of-bounds entry writes a heap pointer (controllable
allocation contents via istdup) plus an attacker-chosen length onto
the stack, overwriting the saved return address.
[wla: this is only reachable if the Lua script passes more than
max_http_hdr header values, which requires access to the script itself]
This must be backported as far as the httpclient Lua API exists.
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
hlua_httpclient_table_to_hdrs() declares a VLA of size
global.tune.max_http_hdr (default 101) on the stack but never checks
hdr_num against that bound. A Lua script that supplies a header table
with more than 101 values writes struct http_hdr entries (two ist =
two heap pointers + two lengths) past the end of the VLA, smashing
the stack frame.
Trigger from any Lua action/task/service:
local hc = core.httpclient()
local v = {}
for i = 1, 300 do v[i] = "x" end
hc:get{ url = "http://127.0.0.1/", headers = { ["X"] = v } }
Each out-of-bounds entry writes a heap pointer (controllable
allocation contents via istdup) plus an attacker-chosen length onto
the stack, overwriting the saved return address. With no stack
canary, this is direct RCE; with a canary, it requires a leak first.
Reachable from any deployment that loads Lua scripts. While Lua
scripts are nominally trusted, this turns "can edit Lua" into "can
execute arbitrary native code", which is a meaningful boundary in
many setups (Lua sandbox escape).
This must be backported as far as the httpclient Lua API exists.
When the secret argument to jwt_decrypt_secret is a variable
(ARGT_VAR) rather than a literal string, alloc_trash_chunk() is
called to hold the base64-decoded secret but the buffer is never
released. The end: label frees input, decrypted_cek, out, and the
decoded_items array but not secret.
Each request leaks one trash chunk (~tune.bufsize, default 16KB).
At ~65000 requests per GiB this allows slow memory exhaustion DoS
against any config of the form:
http-request set-var(txn.x) req.hdr(...),jwt_decrypt_secret(txn.key)
This must be backported as far as JWE support exists.
convert_ecdsa_sig() calls i2d_ECDSA_SIG(ecdsa_sig, &p) where p
points into signature->area, a trash chunk of tune.bufsize bytes
(default 16384). i2d writes with no output bound.
The raw R||S input can be up to bufsize bytes (filled by
base64urldec at jwt.c:520-527), giving bignum_len up to 8192. The
DER encoding adds a SEQUENCE header (2-4 bytes), two INTEGER headers
(2-4 bytes each), and up to two leading-zero sign-padding bytes when
the bignum high bit is set. With two 8192-byte bignums having the
high bit set, the encoding is ~16398 bytes, overflowing the 16384-
byte buffer by ~14 bytes.
Triggered by any JWT with alg=ES256/384/512 and a ~21830-character
base64url signature. The signature does not need to verify
successfully; the overflow happens before verification. Reachable
from any config using jwt_verify with an EC algorithm.
Also fixes the existing wrong check: i2d returns -1 on error which
became SIZE_MAX in the size_t signature->data, defeating the
"== 0" test.
This must be backported as far as JWT support exists.
In sample_conv_jwt_decrypt_secret(), when a JWE token has an empty
encrypted-key section but the algorithm is not "dir" (e.g. A128KW),
neither branch initializes decrypted_cek. The NULL pointer is then
passed to decrypt_ciphertext() which dereferences it:
- For GCM encodings: aes_process() calls b_orig(NULL) -> SIGSEGV
- For CBC encodings: b_data(NULL) at jwe.c:463 -> SIGSEGV
A single HTTP request with a crafted Authorization header crashes the
worker process. Trigger token (JOSE header {"alg":"A128KW","enc":"A128GCM"},
empty CEK section between the two dots):
eyJhbGciOiJBMTI4S1ciLCJlbmMiOiJBMTI4R0NNIn0..AAAAAAAAAAAAAAAA.AA.AA
Reachable in any configuration using the jwt_decrypt_secret converter.
The other two decrypt converters (jwt_decrypt_jwk, jwt_decrypt_cert)
already have the check.
This must be backported as far as JWE support exists.
The 16-bit name_len field is read directly from the ClientHello and
stored as the sample length without any validation against srv_len,
ext_len, or the channel buffer size. A 65-byte ClientHello with
name_len=0xffff produces a sample claiming 65535 bytes of data when
only ~4 bytes are actually present in the buffer.
Downstream consumers then read tens of kilobytes past the channel
buffer:
- pattern.c:741 XXH3() hashes 65535 bytes -> ~50KB OOB heap read
- sample.c smp_dup memcpy if large trash configured
- log-format %[req.ssl_sni] leaks heap contents to logs/headers
Reachable pre-authentication on any TCP frontend using req.ssl_sni
(req_ssl_sni), which is the documented way to do SNI-based content
switching in TCP mode. No SSL handshake is required; the parser
runs on raw buffer contents in tcp-request content rules.
Bug introduced in commit d4c33c8889 (2013). The ALPN parser in
the same file at line 1044 has the equivalent check; SNI never did.
This must be backported to all supported versions.
When the healthcheck section support was added, the tcpcheck type was moved
into the tcpcheck ruleset. However, conn_install_mux_chk() function was not
updated accordingly. So the TCP mode was always returned.
No backport needed. This patch is related to #3324 but it is not the root
cause of the issue.
As reported by GH @phihos on GH #3320, using the shm-stats-file feature
with objects exceeding 127 chars would result in object name being
unexpectedly truncated, while GUID API supports up to 128 chars.
Indeed, with the config below, and shm-stats-file enabled:
server s1 127.0.0.1:1 guid srv:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:SRV_1 disabled
server s10 127.0.0.1:1 guid srv:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:SRV_10 disabled
haproxy would store the second server object with the same id as the first
one, but upon reload, only the first one would be restored, which would
eventually cause shm-stats-file slot exhaustion with repetitive reloads.
@phihos, found out the underlying issue, in counters.c we used snprintf()
with sizeof(shm_obj->guid) - 1 as <size> parameter, while we should have
use sizeof(shm_obj->guid) instead since shm_obj->guid already takes the
terminating NULL byte into account.
So we simply apply the fix suggested by @phihos, and hopefully this should
solve the shm-stats-file slot leak that was observed.
Unfortunately, for now, we cannot warn the user that a duplicate
shm-stats-file object was found, because we accept duplicate objects
by design for 2 reasons. The first one is for a new process to be able
to change the object type for a previously known GUID while allowing
previous processes to use the old object as long as they are alive.
The second reason is that upon startup we cannot afford to scan the
whole object list, as soon as we find a match (type + GUID), we bind
the object, and this way we avoid unnecessary lookup time.
Perhaps we have room for improvement in the future, but for now let's
keep it this way.
It should be backported to 3.3
Big thanks to @phihos for the bug description, analysis and
suggestions.
The parsing of the "healthcheck" parameter for dynamic servers was not
finished. The post-config was missing, leading to a crash because the
ruleset pointer was NULL.
To fix the issue, check_server_tcpcheck() function is called in
cli_parse_add_server().
No backport needed.
When an early response is sent to the client and the H1 connection is
switched to the draining state, we must take care to disable the 0-copy data
forwarding because the backend side is no longer here. It is an issue
because this prevent any regular receive to be performed.
This patch should fix the issue #3316. It must be backported as far as 3.0.
Functions used to initialize haterm (the splicing and the response buffers)
were defined and registered in haterm.c. The problem is that this file in
compiled with haproxy. So it may be an issue. And for the splicing part,
warnings may be emitted when haproxy is started.
To avoid any issue during haproxy startup and to avoid to initialize some
part of haterm, all init functions were moved into haterm_init.c file.
No backport needed.
This is another pre-requisite work for upcoming decompression filter.
In this patch we implement the "filter-sequence" directive which can be
used in proxy section (frontend,backend,listen) and takes 2 parameters
The first one is the direction (request or response), the second one
is a comma separated list of filter names previously declared on the
proxy using the "filter" keyword.
The main goal of this directive is to be able to instruct haproxy in which
order the filters should be executed on request and response paths,
especially if the ordering between request and response handling must
differ, and without relying on the filter declaration ordering (within
the proxy) which is used by default by haproxy.
Another benefit of this feature is that it becomes possible to "ignore"
a previously declared filter on the proxy. Indeed, when filter-sequence
is defined for a given direction (request/response), then it will be used
over the implicit filter ordering, but if a filter which was previously
declared is not specified in the related filter-sequence, it will not be
executed on purpose. This can be used as a way to temporarily disable a
filter without completely removing its configuration.
Documentation was updated (check examples for more info)
flt_conf struct stores the filter id, which is used internally to check
match the filter against static pointer identifier, and also used as
descriptive text to describe the filter. But the id is not consistent
with the public name as used in the configuration (for instance when
selecting filter through the 'filter' directive).
What we do in this patch is that we add flt_conf->name member, which
stores the real filter name as seen in the configuration. This will
allow to select filters by their name from other directives in the
configuration.
The ssl_crtname_index was registered with SSL_get_ex_new_index() but the
certificate name is stored on a SSL_CTX object via SSL_CTX_set_ex_data().
The free callback is only invoked for the object type matching the index
registration, so the strdup'd name was never freed when the SSL_CTX was
released.
Fix this by using SSL_CTX_get_ex_new_index() instead, which ensures the
free callback fires when the SSL_CTX is destroyed.
No backport needed.
Following request options are now handled as flags:
- ?k=1 => flag HS_ST_OPT_CHUNK_RES is set
- ?c=0 => flag HS_ST_OPT_NO_CACHE is set
- ?R=1 => flag HS_ST_OPT_RANDOM_RES is set
- ?A=A => flag HS_ST_OPT_REQ_AFTER_RES is set.
By default, none is set.
The support for the splicing was added and enabled by default, if
supported. The command line option '-dS' was also added to disable the
feature.
When the splicing can be used and the front multiplexer agrees to proceed,
tee() is used to "copy" data from the master pipe to the client pipe.
Now the zero-copy data forwarding is supported, we will add the splicing
support. To do so, we first create a master pipe with vmsplice() during
haterm startup. It is only performed if the splicing is supported. And its
size can be configured by setting "tune.pipesize" global parameter.
This master pipe will be used to fill the pipe with the client.
The support for the zero-copy data forwarding was added and enabled by
default. The command line option '-dZ' was also added to disable the
feature.
Concretely, when haterm pushes the response payload, if the zero-copy
forwarding is supported, a dedicated function is used to do so.
hstream_ff_snd() will rely on se_nego_ff() to know how many data can send
and at the end, on se_done_ff() to really send data.
hstream_add_ff_data() function was added to perform the raw copy of the
payload in the sedesc I/O buffer.
hstream_add_data() function is renamed to hstream_add_htx_data() because
there will be a similar function to add data in zero-copy forwarding
mode. The function was also adapted to take the data length to add in
parameter and to return the number of written bytes.
This new sample fetch returns the name of the certificate selected for
an incoming SSL/TLS connection, as it would appear in "show ssl cert".
It may be a filename with its relative or absolute path, or an alias,
depending on how the certificate was declared in the configuration.
The certificate name is stored as ex_data on the SSL_CTX at load time
in ckch_inst_new_load_store(), and freed via a dedicated free callback.
The "feature" predicate takes an argument name. Not passing one will
cause strstr() to always find something, including at the end of the
string, and to read past end that ASAN detects. We need to check that
we didn't reach end before proceeding.
This bug was reported by OSS Fuzz here:
https://issues.oss-fuzz.com/issues/499133314
The issue is present since 2.4 with commit 58ca706e16 ("MINOR: config:
add predicate "feature" to detect certain built-in features") so this
fix must be backported to all stable versions.
Using awslc_api_before() with an invalid argument results in "(null)"
appearing in the error message due to -1 being returned without the
error message being filled. Let's always fill the error message on error.
This was introduced in 3.3 with commit 3d15c07ed0 ("MINOR: cfgcond: add
"awslc_api_atleast" and "awslc_api_before""), and this fix must be
backported to 3.3.
Using openssl_version_before() with an invalid argument results in "(null)"
appearing in the error message due to -1 being returned without the error
message being filled. Let's always fill the error message on error.
This was introduced in 2.5 with commit 3aeb3f9347 ("MINOR: cfgcond:
implements openssl_version_atleast and openssl_version_before"), and
this fix must be backported to 2.6.
cfg_eval_condition() says that the <errptr> pointer will be set upon
error. However, cfg_eval_cond_expr() can fail (e.g. failure to handle
a dynamic argument) but would branch to "done" and leave errptr unset.
Let's check for this case as well.
This bug was reported by OSS Fuzz here:
https://issues.oss-fuzz.com/issues/499135825
The bug was introduced in 2.5 around commit ca81887599 ("MINOR:
cfgcond: insert an expression between the condition and the term") so
the fix must be backported as far as 2.6.
The previous ACME_RSLV_WAIT state served a dual role: it applied the
initial dns-delay before the first DNS probe and also handled the
delay between retries. There was no way to simply wait a fixed delay
before submitting the challenge without also triggering DNS pre-checks.
Replace ACME_RSLV_WAIT with two distinct states:
- ACME_INITIAL_DELAY: an optional initial wait before proceeding,
only applied when "challenge-ready" includes the new "delay" keyword
- ACME_RSLV_RETRY_DELAY: the delay between resolution retries, always
applied when DNS pre-checks are in progress
The new "delay" keyword in "challenge-ready" can be used standalone
(wait then submit the challenge directly) or combined with "dns" (wait
then start the DNS pre-checks). When "delay" is not set, the first DNS
probe fires immediately.
Update the documentation accordingly.
The TASK_WOKEN_TIMER check that previously handled the case where
RSLV_TRIGGER was reached directly from the CLI command is therefore dead
code and can be removed.
Fix the following build warning from obsolete compilers for <orig_frm>
variable in qcc_qstrm_send_frames() function :
src/mux_quic_qstrm.c:266:17: warning: 'orig_frm' may be used
uninitialized in this function [-Wmaybe-uninitialized]
The variable is now explicitely initialized to NULL on each loop, which
should prevent this warning. Note that for code clarity, the variable is
renamed <next_frm>.
No need to backport.
Previously the dns timeout timer was initialized in ACME_RSLV_WAIT,
before the initial dns-delay expires. This meant the countdown started
before any DNS request was actually sent, so the effective timeout was
shorter than expected by one dns-delay period.
Move the initialization to ACME_RSLV_TRIGGER so the timer starts only
when the first DNS resolution attempt is triggered. Update the
documentation to clarify this behaviour.
Defines an API for xprt_qstrm so that the QMux transport parameters can
be retrieved by the MUX layer on its initialization. This concerns both
local and remote parameters.
Functions xprt_qstrm_lparams/rparams() are defined and exported for
this. They are both used in qmux_init() if QMux protocol is active.
On SSL handshake completion, MUX layer can be initialized if not already
the case. However, for QMux protocol, it is necessary first to perform
transport parameters exchange, via the new xprt_qstrm layer. This patch
ensures this is performed if any flag CO_FL_QSTRM_* is set on the
connection.
Also, SSL layer registers itself via add_xprt. This ensures that it can
be used by xprt_qstrm for the emission/reception of the necessary
frames.
This patch implements QMux emission of transport parameters via
xprt_qstrm. Similarly to receive, this is performed in conn_send_qstrm()
which uses lower xprt snd_buf operation. The connection must first be
flagged with CO_FL_QSTRM_SEND to trigger this step.
Extend xprt_qstrm to implement the reception of QMux transport
parameters. This is performed via conn_recv_qstrm() which relies on the
lower xprt rcv_buf operation. Once received, parameters are kept in
xprt_qstrm context, so that the MUX can retrieve them on init.
For the reception of parameters to be active, the connection must first
be flagged with CO_FL_QSTRM_RECV.
Add get_alpn operation support for xprt_qstrm. This simply acts as a
passthrough method to the underlying XPRT layer.
This function is necessary for QMux when running above SSL, as mux-quic
will access ALPN during its initialization in order to instantiate the
proper application protocol layer.
Define a new XPRT layer for the new QMux protocol. Its role will be to
perform the initial exchange of transport parameters.
On completion, contrary to XPRT handshake, xprt_qstrm will first init
the MUX and then removes itself. This will be necessary so that the
parameters can be retrieved by the MUX during its initialization.
This patch only declares the new xprt_qstrm along with basic operations.
Future commits will implement the proper reception/emission steps.
Similarly to reception, a new buffer is defined in QCC connection to
handle emission for QMux protocol. This replaces the trash buffer usage
in qcc_qstrm_send_frames().
This buffer is necessary to handle partial emission. On retry, the
buffer must be completely emitted before starting to send new frames.
Each time a QUIC frame is emitted, mux-quic layer is notified via a
callback to update the underlying QCS. For QUIC, this is performed via
qc_stream_desc element.
In QMux protocol, this can be simplified as there is no
qc_stream_desc/quic_conn layer interaction. Instead, each time snd_buf
is called, QCS can be updated immediately using its return value. This
is performed via a new function qstrm_ctrl_send().
Its work is similar to the QUIC equivalent but in a simpler mode. In
particular, sent data can be immediately removed from the Tx buffer as
there is no need for retransmission when running above TCP.
This patchs implement mux-quic reception for the new QMux protocol. This
is performed via the new function qcc_qstrm_send_frames(). Its interface
is similar to the QUIC equivalent : it takes a list of frames and
encodes them in a buffer before sending it via snd_buf.
Contrary to QUIC, a check on CO_FL_ERROR flag is performed prior to
every qcc_qstrm_send_frames() invokation to interrupt emission. This is
necessary as the transport layer may set it during snd_buf. This is not
the case currently for quic_conn layer, but maybe a similar mechanism
should be implemented as well for QUIC in the future.
The previous patch defines a new QCC buffer member to implement QMux
reception. This patch completes this by perfoming realign on it during
qcc_qstrm_recv(). This is necessary when there is not enough contiguous
data to read a whole frame.
When QMux is used, mux-quic must actively performed reception of new
content. This has been implemented by the previous patch.
The current patch extends this by defining a buffer on QCC dedicated to
this operation. This replaces the usage of the trash buffer. This is
necessary to deal with incomplete reads.
Implements parsing of frames related to flow-control for mux-quic
running on the new QMux protocol. This simply calls qcc_recv_*() MUX
functions already used by QUIC.
This patch implements a new function qcc_qstrm_recv() dedicated to the
new QMux protocol. It is responsible to perform data reception via
rcv_buf() callback. This is defined in a new mux_quic_strm module.
Read data are parsed in frames. Each frame is handled via standard
mux-quic functions. Currently, only STREAM and RESET_STREAM types are
implemented.
One major difference between QUIC and QMux is that mux-quic is passive
on the reception side on the former protocol. For the new one, mux-quic
becomes active. Thus, a new call to qcc_qstrm_recv() is performed via
qcc_io_recv().
STREAM frame will also be used by the new QMux protocol. This requires
some adaptation in the qf_stream structure. Reference to qc_stream_desc
object is replaced by a generic void* pointer.
This change is necessary as QMux protocol will not use any
qc_stream_desc elements for emission.
Ensure mux-quic traces will be compatible with the new QMux protocol.
This is necessary as the quic_conn element is accessed to display some
transport information. Use conn_is_quic() to protect these accesses.
Ensure mux-quic operations related to initialization and shutdown will
be compatible with the new QMux protocol. This requires to use
conn_is_quic() before any access to the quic_conn element, in
qmux_init(), qcc_shutdown() and qcc_release().
Adapts mux-quic functions related to emission for future QMux protocol
support.
In short, QCS will not used a qc_stream_desc object but instead a plain
buffer. This is inserted as a union in QCS structure. Every access to
QUIC qc_stream_desc is protected by a prior conn_is_quic() check. Also,
pacing is useless for QMux and thus is disabled for such protocol.
Move <stream> field from qcs type into the inner structure 'tx'. This
change is only a minor refactoring without any impact. It is cleaner as
Rx buffer elements are already present in 'rx' inner structure.
This reorganization is performed before introducing of a new Tx buffer
field used for QMux protocol.
Implement parse/build methods for QX_TRANSPORT_PARAMETER frame. Both
functions may fail due to buffer space too small (encoding) or truncated
frame (parsing).
Define a new frame type for QMux transport parameter exchange. Frame
type is 0x3f5153300d0a0d0a and is declared as an extra frame, outside of
quic_frame_parsers / quic_frame_builders.
The next patch will implement parsing/encoding of this frame payload.
The previous patch refactored QUIC transport parameters decoding and
validity checks. These two operation are now performed in two distinct
functions. This renders quic_tp_dec_err type useless. Thus, this patch
removes it. Function returns are converted to a simple integer value.
Function quic_transport_params_decode() is used for decoding received
parameters. Prior to this patch, it also contained validity checks on
some of the parameters. Finally, it also tested that mandatory
parameters were indeed found.
This patch separates this two parts. Params validity is now tested in a
new function quic_transport_params_check(), which can be called just
after decode operation.
This patch will be useful for QMux protocol, as this allows to reuse
decode operation without executing checks which are tied to the QUIC
specification, in particular for mandatory parameters.
The documentation for functions related to transport parameters decoding
is unclear or sometimes completely wrong on the meaning of the <server>
argument. It must be set to reflect the origin of the parameters,
contrary to what was implied in function comments.
Fix this by rewriting comments related to this <server> argument. This
should prevent to make any mistake in the future.
This is purely a documentation fix. However, it could be useful to
backport it up to 2.6.
This patch is a direct follow-up of the previous one. This time,
refactoring is performed on qc_build_frm() which is used for frame
encoding.
Function prototype has changed as now packet argument is removed. To be
able to check frame validity with a packet, one can use the new parent
function qc_build_frm_pkt() which relies on qc_build_frm().
As with the previous patch, there is no function change expected. The
objective is to facilitate a future QMux implementation.
This patch refactors parsing in QUIC frame module. Function
qc_parse_frm() has been splitted in three :
* qc_parse_frm_type()
* qc_parse_frm_pkt()
* qc_parse_frm_payload()
No functional change. The main objective of this patch is to facilitate
a QMux implementation. One of the gain is the ability to manipulate QUIC
frames without any reference to a QUIC packet as it is irrelevant for
QMux. Also, quic_set_connection_close() calls are extracted as this
relies on qc type. The caller is now responsible to set the required
error code.
Set the default dns-delay to 30s so it can be more efficient with fast
DNS providers. The dns-timeout is set to 600s by default so this does
not have a big impact, it will only do more check and allow the
challenge to be started more quickly.
When using the dns-01 challenge method with "challenge-ready dns", HAProxy
retries DNS resolution indefinitely at the interval set by "dns-delay". This
adds a "dns-timeout" keyword to set a maximum duration for the DNS check phase
(default: 600s). If the next resolution attempt would be scheduled beyond that
deadline, the renewal is aborted with an explicit error message.
A new "dnsstarttime" field is stored in the acme_ctx to record when DNS
resolution began, used to evaluate the timeout on each retry.
Thanks to this patch, it is now possible to specify an healthcheck section
on the server line. In that case, the server will use the tcpcheck as
defined in the correspoding healthcheck section instead of the proxy's one.
tcpcheck_ruleset struct was extended to host a config part that will be used
for healthcheck sections. This config part is mainly used to store element
for the server's tcpcheck part.
When a healthcheck section is parsed, a ruleset is created with its name
(which must be unique). "*healthcheck-{NAME}" is used for these ruleset. So
it is not possible to mix them with regular rulesets.
For now, in a healthcheck section, the type must be defined, based on the
options name (tcp-check, httpchk, redis-check...). In addition, several
"tcp-check" or "http-check" rules can be specified, depending on the
healthcheck type.
Functions used to parse directives related to tcpchecks were split to have a
first step testing the proxy and creating the tcpcheck ruleset if necessary,
and a second step filling the ruleset. The aim of this patch is to preapre
the parsing of healthcheck sections. In this context, only the second steip
will be used.
When log-format stirngs were parsed in context of a tcpcheck, ARGC_SRV
context was used instead of ARGC_TCK. This context is used to report
accurrate errors.
This patch could be backported to all stable versions.
The proxy flag PR_O_TCPCHK_SSL is replaced by a flag on the tcpcheck
itself. When TCPCHK_FL_USE_SSL flag is set, it means the healthcheck will
use an SSL connection and the SSL xprt must be prepared for the server.
In tcpchecks context, when HTTP sample expressions are parsed, there is no
reason to set the proxy's http_needed value to 1. This value is only used
for streams to allocate an HTTP txn.
This patch could be backported to all stable versions.
disable-on-404 and send-state options, configured on an HTTP healtcheck,
were handled as proxy options. Now, these options are handled in the
tcp-check itself. So the corresponding PR_O and PR_02 flags are removed.
The tcpcheck_rules structure is replaced by the tcpcheck structure. The main
difference is that the ruleset is now referenced in the tcpcheck structure,
instead of the rules list. The flags about the ruleset type are moved into
the ruleset structure and flags to track unused rules remains on the
tcpcheck structure. So it should be easier to track unused rulesets. But it
should be possible to configure a set of tcpcheck rules outside of the proxy
scope.
The main idea of these changes is to prepare the parsing of a new
healthcheck section. So this patch is quite huge, but it is mainly about
renaming some fields.
When parsing httpchck option, a wrong flag (TCPCHK_SND_HTTP_FROM_OPT) was
set on the rules, while it is in fact a flag for a send rule. Let's remove
it. There is no issue here because there is no corresponding flag for
tcpcheck rules.
This patch must be backported to all stable versions.
These actions were added recently and it appeared the way binary headers
were retrieved could be simplified.
First, there is no reason to retrieve a base64 encoded string. It is
possible to rely on the binary string directly. "b64dec" converter can be
used to perform a base64 decoding if necessary.
Then, using a log-format string is quite overkill and probably
conterintuitive. Most of time, the headers will be retrieved from a
variable. So a sample expression is easier to use. Thanks to the previous
patch, it is quite easy to achieve.
This patch relies on the commit "MINOR: action: Add a sample expression
field in arguments used by HTTP actions". The documentation was updated
accordingly.
An error is erroneously triggered if a if/unless statement is found after
set-headers-bin and add-headers-bin actions. To make it works, during
parsing of these actions, we should leave when an unknown argument is found
to let the rule parser the opportunity to parse an if/unless statement.
No backport needed.
Add keylog_format_fc and keylog_format_bc global variables containing
the SSLKEYLOGFILE log-format strings for the frontend (client-facing)
and backend (server-facing) TLS connections respectively. These produce
output compatible with the SSLKEYLOGFILE format described at:
https://tlswg.org/sslkeylogfile/draft-ietf-tls-keylogfile.html
Both formats are also exported as environment variables at startup:
HAPROXY_KEYLOG_FC_LOG_FMT
HAPROXY_KEYLOG_BC_LOG_FMT
These variables contains \n so they might not be compatible with syslog
servers, using them with stderr or a sink might be required.
These can be referenced directly in "log-format" directives to produce
SSLKEYLOGFILE-compatible output, usable by network analyzers such as
Wireshark to decrypt captured TLS traffic.
Reverse the default, to hide the version from stats by default, and add
a new keyword, "stats show-version", to enable them, as we don't want to
disclose the version by default, especially on public websites.
When binary headers are decoded, return value of decode_varint() function is
not properly handled. On error, it can return -1. However, the result is
inconditionnaly added to an unsigned offset.
Now, a temporary variable is used to be abl to test decode_varint() return
value. It is added to the offset on success only.
No backport needed.
When h1_snd_buf() inherits the CO_SFL_MSG_MORE flag from the upper layer, it
unconditionally propagates it to H1C_F_CO_MSG_MORE, which eventually sets
MSG_MORE on the sendmsg() call. For bodyless responses (HEAD, 204, 304), this
causes the kernel to cork the TCP connection for ~200ms waiting for body data
that will never be sent.
With an H1 frontend and H2 backend, this adds ~200ms of latency to many or
all bodyless responses. The 200ms corresponds to the kernel's tcp_cork_time
default. H1 backends are less affected because h1_postparse_res_hdrs() sets
HTX_FL_EOM during header parsing for bodyless responses, but H2 backends
frequently deliver the end-of-stream signal in a separate scheduling round,
leaving htx_expect_more() returning TRUE when headers are first forwarded.
The fix guards H1C_F_CO_MSG_MORE so it is only set when the connection is a
backend (H1C_F_IS_BACK) or the response is not bodyless
(!H1S_F_BODYLESS_RESP). This ensures bodyless responses on the front
connection are sent immediately without corking.
This should be backported to all stable branches.
Co-developed-by: Billy Campoli <bcampoli@meta.com>
Co-developed-by: Chandan Avdhut <cavdhut@meta.com>
Co-developed-by: Neel Raja <neelraja@meta.com
These actions allow setting, adding and deleting multiple headers from
the same action, without having to know the header names during parsing.
This is useful when doing things with SPOE.
The CLI commands (get|add|del|clear|commit|set) | (acl|map) does not
contain a permission check on admin level.
Must be backported to 3.3. This can be a breaking change for some users.
Initially reported by Cameron Brown.
'set ssl ocsp-response', 'update ssl ocsp-response', 'show ssl
ocsp-response', 'show ssl ocsp-updates' are lacking permissions checks
on admin level.
Must be backported in 3.3. This can be a breaking change for some users.
Initially reported by Cameron Brown.
Both 'set ssl tls-key' and 'show tls-keys' command are missing the
permission checks so the commands can be used only in admin mode.
Must be backported to 3.3. This can be a breaking change for some users.
Initially reported by Cameron Brown.
This commit adds an ha_warning() when map/acl commands are accessed
without admin level. This is to warn users that these commands will be
restricted to admin only in HAProxy 3.3.
Must be backported in every stable branches.
Initially reported by Cameron Brown.
This commit adds an ha_warning() when OCSP commands are accessed without
admin level. This is to warn users that these commands will be
restricted to admin only in HAProxy 3.3.
Must be backported in every stable branches.
Initially reported by Cameron Brown.
This commit adds an ha_warning() when 'show tls-keys' or 'set ssl
tls-key' are accessed without admin level. This is to warn users that
these commands will be restricted to admin only in HAProxy 3.3.
Must be backported in every stable branches.
Initially reported by Cameron Brown.
The previous patch implemented the 'dns-check' option. This one replaces
it by a more generic 'challenge-ready' option, which allows the user to
chose the condition to validate the readiness of a challenge. It could
be 'cli', 'dns' or both.
When in dns-01 mode it's by default to 'cli' so the external tool used to
configure the TXT record can validate itself. If the tool does not
validate the TXT record, you can use 'cli,dns' so a DNS check would be
done after the CLI validated with 'challenge_ready'.
For an automated validation of the challenge, it should be set to 'dns',
this would check that the TXT record is right by itself.
When using the dns-01 challenge type, TXT record propagation across
DNS servers can take time. If the ACME server verifies the challenge
before the record is visible, the challenge fails and it's not possible
to trigger it again.
This patch introduces an optional DNS pre-check mechanism controlled
by two new configuration directives in the "acme" section:
- "dns-check on|off": enable DNS propagation verification before
notifying the ACME server (default: off)
- "dns-delay <time>": delay before querying DNS (default: 300s)
When enabled, three new states are inserted in the state machine
between AUTH and CHALLENGE:
- ACME_RSLV_WAIT: waits dns-delay seconds before starting
- ACME_RSLV_TRIGGER: starts an async TXT resolution for each
pending authorization using HAProxy's resolver infrastructure
- ACME_RSLV_READY: compares the resolved TXT record against the
expected token; retries from ACME_RSLV_WAIT if any record is
missing or does not match
The "acme_rslv" structure is implemented in acme_resolvers.c, it holds
the resolution for each domain. The "auth" structure which contains each
challenge to resolve contains an "acme_rslv" structure. Once
ACME_RSLV_TRIGGER leaves, the DNS tasks run on the same thread, and the
last DNS task which finishes will wake up acme_process().
Note that the resolution goes through the configured resolvers, not
through the authoritative name servers of the domain. The result may
therefore still be affected by DNS caching at the resolver level.
This patch adds support for TXT records. It allows to get the first
string of a TXT-record which is limited to 255 characters.
The rest of the record is ignored.
Latest commit a336c467a0 ("BUG/MINOR: net_helper: fix length controls
on ip.fp tcp options parsing") was malformed and broke the build. This
should be backported wherever the fix above is backported.
If opt len is truncated by tcplen we may read 1 Byte after the
tcp header.
There is also missing controls parsing MSS and WS we may compute
invalid values on fingerprint reading after the tcp header in
case of truncated options.
This patch should be backported on versions including ip.fp
We leverage the SE_FL_APP_STARTED flag to detect whether the application
layer had a chance to run or not when an RST_STREAM is received. This
allows us to triage RST_STREAM between regular ones and harmful ones,
and to count glitches for them. It reveals extremely effective at
detecting fast HEADERS+RST pairs.
It could be useful to backport it to 3.2, though it depends on these
two previous patches to be backported first (the first one was already
planned and the second one is harmless, though will require to drop
the haterm changes):
BUG/MINOR: stconn: Always declare the SC created from healthchecks as a back SC
MINOR: stconn: flag the stream endpoint descriptor when the app has started
In order to improve our ability to distinguish operations that had
already started from others under high loads, it would be nice to know
if an application layer (stream) has started to work with an endpoint
or not. The use case typically is a frontend mux instantiating a stream
to instantly cancel it. Currently this info will take some time to be
detected and processed if the applcation's task takes time to wake up.
By flagging the sedesc with SE_FL_APP_STARTED the first time a the app
layer starts, the lower layers can know whether they're cancelling a
stream that has started to work or not, and act accordingly. For now
this is done unconditionally on the backend, and performed early in the
only two app layers that can be reached by a frontend: process_stream()
and process_hstream() (for haterm).
The SC created from a healthcheck is always a back SC. But SC_FL_ISBACK
flags was missing. Instead of passing it when sc_new_from_check() is called,
the function was simplified to set SC_FL_ISBACK flag systematically when a
SC is created from a healthcheck.
This patch should be backported as far as 2.6.
RFC 9000 lists each supported frames and the type of packets in which it
can be present.
Prior to this patch, a packet with an incompatible frame is dropped.
However, QUIC specification mandates that the connection is immediately
closed with PROTOCOL_VIOLATION error code. This patch completes
qc_parse_frm() to add such connection closure.
This must be backported up to 2.6.
When an htx DATA block is partially transfer, we must take care to remove
exactly the copied size. To do so, we must save the size of the last block
value copied and not rely on the last data block after the copy. Indeed,
data can be merged with an existing DATA block, so the last block size can
be larger than the last part copied.
Because of this issue, it is possible to remove more data than
expected. Worse, this could lead to a crash by performing an integer
overflow on the block size.
No backport needed.
Fix a leak of the task object in acme_start_task() when one of the
condition in the function failed.
Fix issue #3308.
Must be backported to 3.2 and later.
There are two settings to control idle connection sharing across
threads.
tune.idle-pool.shared, that enables or disables it, and then
tune.takeover-other-tg-connections, which lets you or not get idle
connections from other thread groups.
Add a new keyword for tune.idle-pool.shared, "full", that lets you get
connections from other thread groups (equivalent to "full" keyword for
tune.takeover-other-tg-connections). The "on" keyword now will be
equivalent to the "restrict" one, which allowed getting connection from
other thread groups only when not doing it would result in a connection
failure (when reverse-http or when strict-macxonn are used).
tune.takeover-other-tg-connections will be deprecated.
If server returns an auth with status valid it seems that client
needs to always skip it, CA can recycle authorizations, without
this change haproxy fails to obtain certificates in that case.
It is also something that is explicitly allowed and stated
in the dns-persist-01 draft RFC.
Note that it would be better to change how haproxy does status polling,
and implements the state machine, but that will take some thought
and time, this patch is a quick fix of the problem.
See:
https://github.com/letsencrypt/boulder/issues/2125https://github.com/letsencrypt/pebble/issues/133
This must be backported to 3.2 and later.
When abortonclose option is enabled (by default since 3.3), the HTTP rules
can no longer yield if the client aborts. However, stream aborts were also
considered. So it was possible to interrupt yielding rules, especially on
the response processing, while the client was still waiting for the
response.
So now, when abortonclose option is enabled, we now take care to only
consider client aborts to prevent HTTP rules to yield.
Many thanks to @DirkyJerky for his detailed analysis.
This patch should fix the issue #3306. It should be backported as far as
2.8.
warnif_misplaced_* functions return 1 when a warning is reported and 0
otherwise. So the caller must properly handle the return value.
When parsing a proxy, ERR_WARN code must be added to the error code instead
of the return value. When a warning was reported, ERR_RETRYABLE (1) was
added instead of ERR_WARN.
And when tcp rules were parsed, warnings were ignored. Message were emitted
but the return values were ignored.
This patch should be backported to all stable versions.
When warnif_cond_conflicts() is called, we must take care to emit a warning
only when a conflict is reported. We cannot rely on the err_code variable
because some warnings may have been already reported. We now rely on the
errmsg variable. If it contains something, a warning is emitted. It is good
enough becasue warnif_cond_conflicts() only reports warnings.
This patch should fix the issue #3305. It is a 3.4-dev specific issue. No
backport needed.
When picking a mux, pay attention to its MX_FL_FRAMED. If it is set,
then it means we explicitely want QUIC, so don't use that mux for any
protocol that is not QUIC.
When parsing the check address, store the associated proto too.
That way we can use the notation like quic4@address, and the right
protocol will be used. It is possible for checks to use a different
protocol than the server, ie we can have a QUIC server but want to run
TCP checks, so we can't just reuse whatever the server uses.
WIP: store the protocol in checks
Don't assume the check will reuse the server's xprt. It may not be true
if some settings such as the ALPN has been set, and it differs from the
server's one. If the server is QUIC, and we want to use TCP for checks,
we certainly don't want to reuse its XPRT.
Permission checks on the CLI for ACME are missing.
This patch adds a check on the ACME commands
so they can only be run in admin mode.
ACME is stil a feature in experimental-mode.
Initial report by Cameron Brown.
Must be backported to 3.2 and later.
Permission checks on the CLI for ECH are missing.
This patch adds a check for "(add|set|del|show) ssl ech" commands
so they can only be run in admin mode.
ECH is stil a feature in experimental-mode and is not compiled by
default.
Initial report by Cameron Brown.
Must be backported to 3.3.
This patch fixes a warning that can be reproduced with gcc-8.5 on RHEL8
(gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28)).
This should fix issue #3303.
Must be backported everywhere 917e82f283 ("MINOR: debug: copy debug
symbols from /usr/lib/debug when present") was backported, which is
to branch 3.2 for now.
Fix the check or arguments of the 'acme challenge_ready' command which
was checking if all arguments are NULL instead of one of the argument.
Must be backported to 3.2 and later.
Replace atol() by _strl2uic() in cases the input are ISTs when parsing
the retry-after header. There's no risk of an error since it will stop
at the first non-digit.
Must be backported to 3.2 and later.
In acme_req_finalize() the data buffer is only freed when a2base64url
succeed. This patch moves the allocation so it free() the DER buffer in
every cases.
Must be backported to 3.2 and later.
Thanks to previous commits, it is possible to use small buffers at different
places: to store the request when a connection is queued or when L7 retries
are enabled, or for health-checks requests. However, there was no
configuration parameter to fine tune small buffer use.
It is now possible, thanks to the proxy option "use-small-buffers".
Documentation was updated accordingly.
If support for small buffers is enabled, we now try to use them for
healthcheck requests. First, we take care the tcpcheck ruleset may use small
buffers. Send rules using LF strings or too large data are excluded. The
ability to use small buffers or not are set on the ruleset. All send rules
of the ruleset must be compatible. This info is then transfer to server's
healthchecks relying on this ruleset.
Then, when a healthcheck is running, when a send rule is evaluated, if
possible, we try to use small buffers. On error, the ability to use small
buffers is removed and we retry with a regular buffer. It means on the first
error, the support is disabled for the healthcheck and all other runs will
use regular buffers.
In h2_rcv_buf(), HTX flags are transfer with data when htx_xfer() is
called. There is no reason to continue to deal with them in the H2 mux. In
addition, there is no reason to set SE_FL_EOI flag when a parsing error was
reported. This part was added before the stconn era. Nowadays, when an HTX
parsing error is reported, an error on the sedesc should also be reported.
This reverts commit 44932b6c41.
The patch above was only necessary to handle partial headers or trailers
parsing. There was nothing to prevent the H2 multiplexer to start to add
headers or trailers in an HTX message and to stop the processing on error,
leaving the HTX message with no EOH/EOT block.
From the HTX API point of view, it is unexepected. And this was fixed thanks
to the commit ba7dc46a9 ("BUG/MINOR: h2/h3: Never insert partial
headers/trailers in an HTX message").
So this patch can be reverted. It is important to not report a parsign error
too early, when there are still data to transfer to the upper layer.
This patch must be backport where 44932b6c4 was backported but only after
backporting ba7dc46a9 first.
When a HTX stream is queued, if the request is small enough, it is moved
into a small buffer. This should save memory on instances intensively using
queues.
Applet and connection receive function were update to block receive when a
small buffer is in use.
In the same way support for large chunks was added to properly work with
large buffers, we are now adding supports for small chunks because it is
possible to process small buffers.
So a dedicated memory pool is added to allocate small
chunks. alloc_small_trash_chunk() must be used to allocate a small
chunk. alloc_trash_chunk_sz() and free_trash_chunk() were uppdated to
support small chunks.
In addition, small trash buffers are also created, using the same mechanism
than for regular trash buffers. So three thread-local trash buffers are
created. get_small_trash_chunk() must be used to get a small trash buffer.
And get_trash_chunk_sz() was updated to also deal with small buffers.
htx_move_to_small_buffer()/htx_move_to_large_buffer() and
htx_copy_to_small_buffer()/htx_copy_to_large_buffer() functions can now be
used to move or copy blocks from a default buffer to a small or large
buffer. The destination buffer is allocated and then each blocks are
transferred into it.
These funtions relies in htx_xfer() function.
htx_xfer() function should replace htx_xfer_blks(). It will be a bit easier to
maintain and to use. The behavior of htx_xfer() can be changed by calling it
with specific flags:
* HTX_XFER_KEEP_SRC_BLKS: Blocks from the source message are just copied
* HTX_XFER_PARTIAL_HDRS_COPY: It is allowed to partially xfer headers or trailers
* HTX_XFER_HDRS_ONLY: only headers are xferred
By default (HTX_XFER_DEFAULT or 0), all blocks from the source message are moved
into to the destination mesage. So copied in the destination messageand removed
from the source message.
The caller must still define the maximum amount of data (including meta-data)
that can be xferred.
It is no longer necessary to specify a block type to stop the copy. Most of
time, with htx_xfer_blks(), this parameter was set to HTX_BLK_UNUSED. And
otherwise it was only specified to transfer headers.
It is important to not that the caller is responsible to verify the original
HTX message is well-formated. Especially, it must be sure headers part and
trailers part are complete (finished by EOH/EOT block).
For now, htx_xfer_blks() is not removed for compatiblity reason. But it is
deprecated.
When small buffer size was greater than the default buffer size, an error
was triggered. We now do the same than for large buffer. A warning is
emitted and the small buffer size is set to 0 do disable small buffer
allocation.
Because small buffers were only used by QUIC streams, the pool used to alloc
these buffers was located in the quic code. However, their usage will be
extended to other parts. So, the small buffers pool was moved into the
dynbuf part.
http-errors parsing has been refactored in a recent serie of patches.
However, a null deref was introduced by the following patch in case a
non-existent http-errors section is referenced by an "errorfiles"
directive.
commit 2ca7601c2d
MINOR/OPTIM: http_htx: lookup once http_errors section on check/init
Fix this by delaying ha_free() so that it is called after ha_alert().
No need to backport.
The cfg_parse_acme() function checks if an 'acme' section is already
existing in the configuration with cur_acme->linenum > 0. But the wrong
filename and line number are displayed in the commit message.
Must be backported to 3.2 and later.
This patch fixes a leak of the ext_san structure when
sk_X509_EXTENSION_push() failed. sk_X509_EXTENSION_pop_free() is already
suppose to free it, so ext_san must be set to NULL upon success to avoid
a double-free.
Must be backported to 3.2 and later.
Use proxy_check_http_errors() on defaults proxy instances. This will
emit alert messages for errorfiles directives referencing a non-existing
http-errors section, or a warning if an explicitely listed status code
is not present in the target section.
This is a small behavior changes, as previouly this was only performed
for regular proxies. Thus, errorfile/errorfiles directives in an unused
defaults were never checked.
This may prevent startup of haproxy with a configuration file previously
considered as valid. However, this change is considered as necessary to
be able to use http-errors with dynamic backends. Any invalid defaults
will be detected on startup, rather than having to discover it at
runtime via "add backend" invokation.
Thus, any restriction on http-errors usage is now lifted for the
creation of dynamic backends.
The previous patch has splitted the original proxy_check_errors()
function in two, so that check and init steps are performed separately.
However, this renders the code inefficient for "errorfiles" directive as
tree lookup on http-errors section is performed twice.
Optimize this by adding a reference to the section in conf_errors
structure. This is resolved during proxy_check_http_errors() and
proxy_finalize_http_errors() can reuse it.
No need to backport.
Function proxy_check_errors() is used when configuration parsing is
over. This patch splits it in two newly named ones.
The first function is named proxy_check_http_errors(). It is responsible
to check for the validity of any "errorfiles" directive which could
reference non-existent http-errors section or code not defined in such
section. This function is now called via proxy_finalize().
The second function is named proxy_finalize_http_errors(). It converts
each conf_errors type used during parsing in a proper http_reply type
for runtime usage. This function is still called via post-proxy-check,
after proxy_finalize().
This patch does not bring any functional change. However, it will become
necessary to ensure http-errors can be used as expected with dynamic
backends.
This patch is the second part of the refactoring for http-errors
parsing. It renames some fields in <conf_errors> structure to clarify
their usage. In particular, union variants are renamed "inl"/"section",
which better highlight the link with the newly defined enum
http_err_directive.
In conf_errors struct, arbitrary integer values were used for both
<type> field and <status> array. This renders the code difficult to
follow.
Replaces these values with proper enums type. Two new types are defined
for each of these fields. The first one represents the directive type,
derived from the keyword used (errorfile vs errorfiles). This directly
represents which part of <info> union should be manipulated.
The second enum is used for errorfiles directive with a reference on a
http-errors section. It indicates whether or not if a status code should
be imported from this section, and if this import is explicit or
implicit.
Several resources were leaked on both success and error paths:
- X509_NAME *nm was never freed. X509_REQ_set_subject_name() makes
an internal copy, so nm must be freed separately by the caller.
- str_san allocated via my_strndup() was never freed on either path.
- On error paths after allocation, x (X509_REQ) and exts
(STACK_OF(X509_EXTENSION)) were also leaked.
Fix this by adding proper cleanup of all allocated resources in both
the success and error paths. Also move sk_X509_EXTENSION_pop_free()
after X509_REQ_sign() so it is not skipped when sign fails, and
initialize nm to NULL to make early error paths safe.
Must be backported as far as 3.2.
There was a leftover of "activity[tid].ctr1++" in commit 7d40b3134
("MEDIUM: sched: do not run a same task multiple times in series")
that unfortunately only builds in development mode :-(
This introduces 3 new settings: tune.h2.be.max-frames-at-once and
tune.h2.fe.max-frames-at-once, which limit the number of frames that
will be processed at once for backend and frontend side respectively,
and tune.h2.fe.max-rst-at-once which limits the number of RST_STREAM
frames processed at once on the frontend.
We can now yield when reading too many frames at once, which allows to
limit the latency caused by processing too many frames in large buffers.
However if we stop due to the RST budget being depleted, it's most likely
the sign of a protocol abuse, so we make the tasklet go to BULK since
the goal is to punish it.
By limiting the number of RST per loop to 1, the SSL response time drops
from 95ms to 1.6ms during an H2 RST flood attack, and the maximum SSL
connection rate drops from 35.5k to 28.0k instead of 11.8k. A moderate
SSL load that shows 1ms response time and 23kcps increases to 2ms with
15kcps versus 95ms and 800cps before. The average loop time goes down
from 270-280us to 160us, while still doubling the attack absorption
rate with the same CPU capacity.
This patch may usefully be backported to 3.3 and 3.2. Note that to be
effective, this relies on the following patches:
MEDIUM: sched: do not run a same task multiple times in series
MINOR: sched: do not requeue a tasklet into the current queue
MINOR: sched: do not punish self-waking tasklets anymore
MEDIUM: sched: do not punish self-waking tasklets if TASK_WOKEN_ANY
MEDIUM: sched: change scheduler budgets to lower TL_BULK
Having less yielding tasks in TL_BULK and more in TL_NORMAL, we need
to rebalance these queues' priorities. Tests have shown that raising
TL_NORMAL to 40% and lowering TL_BULK to 3% seems to give about the
best tradeoffs.
Self-waking tasklets are currently punished and go to the BULK list.
However it's a problem with muxes or the stick-table purge that just
yield and wake themselves up to limit the latency they cause to the
rest of the process, because by doing so to help others, they punish
themselves. Let's check if any TASK_WOKEN_ANY flag is present on
the tasklet and stop sending tasks presenting such a flag to TL_BULK.
Since tasklet_wakeup() by default passes TASK_WOKEN_OTHER, it means
that such tasklets will no longer be punished. However, tasks which
only want a best-effort wakeup can simply pass 0.
It's worth noting that a comparison was made between going into
TL_BULK at all and only setting the TASK_SELF_WAKING flag, and
it shows that the average latencies are ~10% better when entirely
avoiding TL_BULK in this case.
Nowadays due to yield etc, it's counter-productive to permanently
punish self-waking tasklets, let's abandon this principle as it prevent
finer task priority handling.
We continue to check for the TASK_SELF_WAKING flag to place a task
into TL_BULK in case some code wants to make use of it in the future
(similarly to TASK_HEAVY), but no code sets it anymore. It could
possible make sense in the future to replace this flag with a one-shot
variant requesting low-priority.
As found by Christopher, the concept of waking a tasklet up into the
current queue is totally flawed, because if a task is in TL_BULK or
TL_HEAVY, all the tasklets it will wake up will end up in the same
queue. Not only this will clobber such queues, but it will also
reduce their quality of service, and this can contaminate other
tasklets due to the numerous wakeups there are now with the subsribe
mechanism between layers.
There's always a risk that some tasks run multiple times if they wake
each other up. Now we include the loop counter in the task struct and
stop processing the queue it's in when meeting a task that has already
run. We only pick 16 bits since that's only what remains free in the
task common part, so from time to time (once every 65536) it will be
possible to wrongly match a task as having already run and stop evaluating
its queue, but it's rare enough that we don't care, because this will
be OK on the next iteration.
This patch improves the robustness of the QPACK varint decoder and fixes
potential 1-byte out-of-bounds reads in qpack_decode_fs().
In qpack_decode_fs(), two 1-byte OOB reads were possible on truncated
streams between two varint decoding. These occurred when trying to read
the byte containing the Huffman bit <h> and the Value Length prefix
immediately following an Index or a Name Length.
Note that these OOB are limited to a single byte because
qpack_get_varint() already ensures that its input length is non-zero
before consuming any data.
The fixes in qpack_decode_fs() are:
- When decoding an index, we now verify that at least one byte remains
to safely access the following <h> bit and value length.
- When decoding a literal, we now check len < name_len + 1 to ensure
the byte starting the header value is reachable.
In qpack_get_varint(), the maximum value is now strictly capped at 2^62-1
as per RFC. This is enforced using a budget-based check:
(v & 127) > (limit - ret) >> shift
This prevents values from overflowing into the 63rd or 64th bits, which
would otherwise break subsequent signed comparisons (e.g., if (len < name_len))
by interpreting the length as a negative value, leading to false positive
tests.
Thank you to @jming912 for having reported this issue in GH #3302.
Must be backported as far as 2.6
In the ENFILE and ENOMEM cases, when accept() fails, an irrelevant
global.maxsock value was printed that doesn't reflect system limits.
Now the actconn is printed that gives a hint about the failure reasons.
Should be backported in all stable branches.
We anticipated that the do-log action should be expanded with optional
arguments at some point. Now that we heard of multiple use-cases
that could be achieved with do-log action, but that are limitated by the
fact that all do-log statements inherit from the implicit log-profile
defined on the logger, we need to provide a way for the user to specify
that custom log-profile that could be used per do-log actions individually
This is what we try to achieve in this commit, by leveraging the
prerequisite work performed by the last 2 commits.
In process_send_log(), now also consider the ctx if ctx->profile != NULL
In that case, we do as if logger->prof was set, but we consider
ctx->profile in priority over the logger one. What this means is that
it will become possible to pass ctx.profile to a profile that will be
used no matter what to generate the log payload.
This is a pre-requisite to implement optional "profile" argument for
do-log action
do_log() is just a wrapper to use do_log_ctx() with pre-filled ctx, but
we now have the low-level do_log_ctx() variant which can be used to
pass specific ctx parameters instead.
Since version 3.1, the display order of old workers in 'show proc' was
accidentally reversed. The oldest worker was shown first and the newest
last, which was not the intended behavior. This regression was introduced
during the master-worker rework.
Fix this by sorting the list during deserialization in
mworker_env_to_proc_list().
An alternative fix would have been to iterate the list in reverse order
in the show proc function, but that approach risks introducing
inconsistencies when backporting to older versions.
Must be backported to 3.1 and later.
When using rq-load on tune.h2.fe.max-concurrent-streams, it's easy to
reach a situation where only one stream is allowed. There's nothing
wrong with this but it turns out that slightly higher values do not
necessarily cause significantly higher loads and will improve the user
experience. For this reason the keyword now also supports "min" to
specify a value. Experimentation shows that values from 5 to 15 remain
very effective at protecting the run queue while allowing a great level
of parallelism that keeps a site fluid.
Global setting tune.h2.fe.max-concurrent-streams now supports an optional
"rq-load" option to pass either a target load, or a keyword among "auto"
and "ignore". These are used to quadratically reduce the advertised streams
limit when the thread's run queue size goes beyong the configured value,
and automatically reduce the load on the process from new connections.
With "auto", instead of taking an explicit value, it uses as a target the
"tune.runqueue-depth" setting (which might be automatic). Tests have shown
that values between 50 and 100 are already very effective at reducing the
loads during attacks from 100000 to around 1500. By default, "ignore"
is in effect, which means that the dynamic tuning is not enabled.
The hard limit on the number of concurrent streams is currently
determined only by configuration and returned by
h2c_max_concurrent_streams(). However this doesn't permit to
change such settings on the fly without risking to break connections,
and it doesn't allow a connection to pick a different value, which
could be desirable for example to try to slow abuse down.
Let's store a copy of h2c_max_concurrent_streams() at connection
creation time into the h2c as streams_hard_limit. This inflates
the h2c size from 1324 to 1328 (0.3%) which is acceptable for the
expected benefits.
The new field th_ctx->rq_tot_peak contains the computed peak run queue
length averaged over the last 512 calls. This is computed when entering
process_runnable_tasks. It will not take into account new tasks that are
created or woken up during this round nor those which are evicted, which
is the reason why we're using a peak measurement to increase chances to
observe transient high values. Tests have shown that 512 samples are good
to provide a relatively smooth average measurement while still fading
away in a matter of milliseconds at high loads. Since this value is
only updated once per round, it cannot be used as a statistic and
shouldn't be exposed, it's only for internal use (self-regulation).
After commit 594408cd61 ("BUG/MINOR: mworker/cli: fix show proc
pagination using reload counter"), the old-workers pagination stores
ctx->next_reload = child->reloads on flush failure, then skips entries
with child->reloads >= ctx->next_reload on resume.
The >= comparison is direction-dependent: it assumes the list is in
descending reload order (newest first). On current master, proc_list
is in ascending order (oldest first) because mworker_env_to_proc_list()
appends deserialized entries before mworker_prepare_master() appends
the new worker. This means the skip logic is inverted and can miss
entries or loop incorrectly depending on the version.
We fix this by renaming the context field to resume_reload and changing its
semantics: it now tracks the reload count of the last *successfully
flushed* row rather than the failed one. On flush failure, resume_reload
is left unchanged so the failed row is replayed on the next call. On
resume, entries are skipped by walking the list until the marker entry is
found (exact == match), which works regardless of list direction.
Additionally, we have to handle the unlikely case where the marker entry
is deleted from proc_list between handler calls (e.g. the process exits and
SIGCHLD processing removes it). Detect this by tracking the previous
LEAVING entry's reload count during the skip phase: if two consecutive
entries straddle the skip value (one > skip, the other < skip), the
deleted entry's former position has been crossed, so skipping stops and
the current entry is emitted.
This should be backported to all stable branches. On branches where
proc_list is in descending order (2.9, 3.0), the fix applies the
same way since the skip logic is now direction-agnostic.
HTTP/3 parser cannot deal with unaligned frames, except for DATA. As it
was expected that such case would not occur, a simple BUG_ON() was
written to protect HEADERS parsing.
First, this BUG_ON() was incorrectly written due an incorrect operator
'>=' vs '>' when checking if data wraps. Thus this patch correct it.
However this correction is not sufficient as it still possible to handle
a large unaligned HEADERS frame, which would trigger this BUG_ON(). This
is very unlikely as HEADERS is the first received frame on a request
stream, but not completely impossible. As HTTP/3 frame header (type +
length) is parsed first and removed, this leaves a small gap at the
buffer beginning. If this small gap is then filled with the remaining
frame payload, it would result in unaligned data. Also, trailers are
also sensitive here as in this case a HEADERS frame is handled after
other frames.
The objective of this patch is to ensure that an unaligned frame is now
handled in a safe way. This is extend to all HTTP/3 frames (except DATA)
and not only to HEADERS type. Parsing is interrupted if frame payload is
wrapping in the buffer. This should never happen except maybe with some
weird clients, so the connection is closed with H3_EXCESSIVE_LOAD error.
This approach is considered the safest one, in particular for backport
purpose. In the future, realign operation via copy may be implemented
instead if considered as useful.
This must be backported up to 2.6.
In QUIC, a STREAM frame may be received with no data but with FIN bit
set. This situation is tedious to handle and haproxy parsing code has
changed several times to deal with this situation. Now, H3 and H09
layers parsing code are skipped in favor of the shared function
qcs_http_handle_standalone_fin() used to handle the HTX EOM emission.
However, this shortcut bypasses an important HTTP/3 validation check on
the received body size vs the announced content-length header. Under
some conditions, this could cause a desynchronization with the backend
server which could be exploited for request smuggling.
Fix HTTP/3 parsing code by adding a call to h3_check_body_size() prior
to qcs_http_handle_standalone_fin() if content-length header has been
found. If the body size is incorrect, the stream is immediately resetted
with H3_MESSAGE_ERROR code and the error is forwarded to the stream
layer.
Thanks to Martino Spagnuolo for his detailed report on this issue and
for having contacting us about it via the security mailing list.
This must be backported up to 2.6.
hstream_build_http_resp() currently uses snprintf() to build the
status code and the generated X-req/X-rsp header values.
These strings are short and are fully derived from already parsed request
state, so they can be assembled directly in the HAProxy trash buffer using
`chunk_strcat()` and `ultoa_o()`.
This keeps the generated output unchanged while removing the remaining
`snprintf()` calls from the response-building path.
No functional change is expected.
Signed-off-by: Aleksandar Lazic <al-haproxy@none.at>
The window size increments are 31 bits and the topmost bit is reserved
and should be ignored, however it was not masked, so a peer sending it
set would emit a negative value which could actually reduce the current
window instead of increasing it. Note that the window cannot reach zero
as there's already a test for this, but transfers could slow down to
the same speed as if an initial window of just a few bytes had been
advertised. Let's just mask the reserved bit before processing.
This should be backported to all stable versions.
The stream ID indicated in GOAWAY frames must have its bit 31 (R) ignored
and this wasn't the case. The effect is that if this bit was present, the
GOAWAY frame would mark the last acceptable stream as negative, which is
the default situation (unlimited), thus would basically result in this
GOAWAY frame to be ignored since it would replace a negative last_sid
with another negative one. The impact is thus basically that if a peer
would emit anything non-zero in the R bit, the GOAWAY frame would be
ignored and new streams would still be initiated on the backend, before
being rejected by the server.
Thanks to Haruto Kimura (Stella) for finding and reporting this bug.
This fix needs to be backported to all stable versions.
The key type received over the peers protocol is not checked for
validity and as a result can crash the process when passed through
peer_int_key_type[] in peer_treat_definemsg(). The risk remains
very low since only trusted peers may exchange tables, however it
represents a risk the day haproxy supports new key types, because
mixing old and new versions could then cause the old ones to crash.
Let's add the required check in peer_treat_definemsg().
It is also worth noting that in this function a few protocol identifiers
of type int read directly from a var_int via intdecode() and that some
protocol aliasing may occur (e.g. table_id, table_id_len etc). This is
not supposed to be a problem but it could hide implementation bugs and
cause interoperability issues once fixed, so these should be addressed
in a future commit that will not be marked for backporting.
Thanks to Haruto Kimura (Stella) for finding and reporting this bug.
This fix needs to be backported to all stable versions.
In pcli_prefix_to_pid(), when resolving a worker by absolute pid
(@!<pid>) or by relative pid (@1), a worker that still has PROC_O_INIT
set (i.e. not yet ready, still initializing) could be returned as a
valid target.
During a reload, if a client connects to the master CLI and sends a
command targeting a worker (e.g. @@1 or @@!<pid>), the master resolves
the target pid and attempts to forward the command by transferring a fd
over the worker's sockpair. If the worker is still initializing and has
not yet sent its READY signal, its end of the sockpair is not usable,
causing send_fd_uxst() to fail with EPIPE. This results in the
following alert being repeated in a loop:
[ALERT] (550032) : socketpair: Cannot transfer the fd 13 over sockpair@5. Giving up.
The situation is even worse if the initializing worker has already
exited (e.g. due to a bind failure) but has not yet been removed from
the process list: in that case the sockpair's remote end is already
closed, making the failure immediate and unrecoverable until the dead
worker is cleaned up.
This was not possible before 3.1 because the master's polling loop only
started once all workers were fully ready, making it impossible to
receive CLI connections while a worker was still initializing.
Fix this by skipping workers with PROC_O_INIT set in both the absolute
and relative pid resolution paths of pcli_prefix_to_pid(), so that
only fully initialized workers can be targeted.
Must be backported to 3.1 and later.
When loading libs into the core dump, let's also try to load
libthread_db.so.1 that gdb usually requires. It can significantly help
decoding the threads for systems which require it, and the file is quite
small. It can appear at a few different locations and is generally next
to libpthread.so, or alternately libc, so we first look where we found
them, and fall back to a few other common places. The file is really
small, a few tens of kB usually.
When set-dumpable=libs, let's also pick the debug symbols for the libs
we're loading. For now we only try /usr/lib/debug/<path>, which is quite
common and easy to guess. Build IDs could also be used but are more
complex to deal with, so let's stay simple for now.
When "set-dumpable" is set to "libs", in addition to marking the process
dumpable, haproxy also reads the binary and shared objects into memory as
a tar archive in a page-aligned location so that these files are easily
extractable from a future core dump. The goal here is to always have
access to the exact same binary and libs as those which caused the core
to happen. It's indeed very frequent to miss some of these, or to get
mismatching files due to a local update that didn't experience a reload,
or to get those of a host system instead of the container.
The in-memory tar file presents everything under a directory called
"core-%d" where %d corresponds to the PID of the worker process. In
order to ease the finding of these data in the core dump, the memory
area is contiguous and surrounded by PROT_NONE pages so that it appears
in its own segment in the core file. The total size used by this is a
few tens of MB, which is not a problem on large systems.
The global "set-dumpable" keyword currently is only positional. Let's
extend its syntax to support arguments. For now we support both "on"
and "off" to explicitly enable or disable it.
New function load_file_into_tar() concatenates a file into an in-memory
tar archive and grows its size. Only the base name and a provided prefix
are used to name the faile. If the file cannot be loaded, it's added as
size zero and permissions 0 to show that it failed to load. This will
be used to load post-mortem information so it needs to remain simple.
The purpose here is to create a tar file header in memory from a known
file name, prefix, size and mode. It will be used to prepare archives
of libs in use for improved debugging, but may probably be useful for
other purposes due to its simplicity.
Since 7a1382da7 ("BUG/MINOR: spoe: Fix condition to abort processing on
client abort"), the chn variable is no longer used in
spoe_process_event(). Let's remove it
This patch must be backported with the commit above, as far as 3.1.
The test to detect client aborts in the SPOE, introduced by commit b3be3b94a
("BUG/MEDIUM: spoe: Properly abort processing on client abort"), was no
correct. Producer flags must not be tested. Only the frontend SC must be
tested when the abortonclose option is set.
Because of this bug, when a client aborted, the SPOE processing was aborted
too, regardless the abortonclose option.
This patch must be backpoeted with the commit above, so as far as 3.1.
haproxy_sticktable_local_updates corresponds to the table->localupdate
counter, which is used internally by the peers protocol to identify
update messages in order to send and ack them among peers.
Here we decide to expose this information, as it is already the case in
"show peers" output, because it turns out that this value, which is
cumulative and grows in sync with the number of updates triggered on the
table due to changes initiated by the current process, can be used to
compute the update rate of the table. Computing the update rate of the
table (from the process point of view, ie: updates sent by the process and
not those received by the process), can be a great load indicator in order
to properly scale the infrastructure that is intended to handle the
table updates.
Note that there is a pitfall, which is that the value will eventually
wrap since it is stored using unsigned 32bits integer. Scripts or system
making use of this value must take wrapping into account between two
readings to properly compute the effective number of updates that were
performed between two readings. Also, they must ensure that the "polling"
rate between readings is small enough so that the value cannot wrap behind
their back.
We no longer rely on now_offset stored in the shm-stats-file. Instead
haproxy automatically computes the now_offset relative to the monotonic
clock and the shared global clock.
Indeed, the previous model based on static now_offset when monotonic
clock is available proved to be insufficient when used in
combination with shm-stats-file (that is when monotonic clock is shared
between multiple co-processes). In ideal situation co-processes would
correctly apply the offset to their local monotonic clock and end up
with consistent now_ns. But when restarting from an existing
shm-stats-file from a previous session (ie: prior to reboot), then the
local monotonic clock would no longer be consistent with the one used
to update the file previously, so applying a static offset would fail
to restore clock consistency.
For this specific issue, a workaround was brought by 09bf116
("BUG/MEDIUM: stats-file: detect and fix inconsistent shared clock when resuming from shm-stats-file")
but the solution implemented there was deemed too fragile, because there
is a 60sec window where the fix would fail to detect inconsistent clock
and would leave haproxy with a broken clock ranging from 0 to 60 seconds,
which can be huge..
By simply recomputing the now_offset each time we learn from another
process (through the shared map by reading global_now_ns), we simply
recompute our local offset (difference between OUR monotonic clock
and the SHARED one). Also, in clock_update_global_date(), we make
sure we always recompute the now_offset as now_ms may have been
updated from shared clock if shared clock was ahead of us.
Thanks to that new logic, interrupted processes, resumed processes,
processed started with shm-stats-file from previous session now
correctly recover from those various situations and multiple
co-processes with diverting clocks on startup end up converging to
the same values.
Since it is no longer relevant to save now_offset in the map, it was
removed but to prevent shm-stats-file incompatibility with previous
versions, 8-byte hole was forced, and we didn't bump the shm-stats-file
version on purpose.
This patch may be backported in 3.3 after a solid period of observation
to ensure we didn't break things.
mystrtod() was not length-aware and relied on null-termination or a
non-numeric character to stop. The fix adds a length parameter as a
strict upper bound for all pointer accesses.
The practical impact in haproxy is essentially null: all callers embed
the JSON payload inside a large haproxy buffer, so the speculative read
past the last digit lands on memory that is still within the same
allocation. ASAN cannot detect it in a normal haproxy run for the same
reason — the overread never escapes the enclosing buffer. Triggering a
detectable fault requires placing the JSON payload at the exact end of
an allocation.
Note: the 'path' buffer was using a null-terminated string so the result
of strlen is passed to it, this part was not at risk.
Thanks to Kamil Frankowicz for the original bug report.
This patch must be backported to all maintained versions.
The commit 9f1e9ee0e ("DEBUG: stream: Display the currently running rule in
stream dump") revealed a bug. When a stream is dumped, if it is blocked on a
rule, we must take care the rule has a keyword to display its name.
Indeed, some action parsings are inlined with the rule parser. In that case,
there is no keyword attached to the rule.
Because of this bug, crashes can be experienced when a stream is
dumped. Now, when there is no keyword, "?" is display instead.
This patch must be backported as far as 2.6.
When a L7 retry is performed, we should not rely on b_xfer() to swap the L7
buffer with the request buffer. When it is performed the request buffer is
not allocated. b_xfer() must not be called with an unallocated destination
buffer. The swap remains an optim. For instance, It is not performed on
buffers of different size. So the caller is responsible to provide an
allocated destination buffer with enough free space to transfer data.
However, when a L7 retry is performed, we cannot allocate a request buffer,
because we cannot yield. An error was reported, if we wait for a buffer, the
error will be handled by process_stream(). But we can swap the buffers by
hand. At this stage, we know there is no request buffer, so we can easily
swap it with the L7 buffer.
Note there is no real bug for now.
This patch could be backported to all stable versions.
In HTX, headers and trailers parts must always be complete. It is unexpected
to found header blocks without the EOH block or trailer blocks without the
EOT block. So, during H2/H3 message parsing, we must take care to remove any
HEADER/TRAILER block inserted when an error is encountered. It is mandatory
to be sure to properly report parsing error to upper layer.x
It is now performed by calling htx_truncat_blk() function on the error
path. The tail block is saved before converting any HEADERS/TRAILERS frame
to HTX. It is used to remove all inserted block on error.
This patch rely on the following one:
"MINOR: htx: Add function to truncate all blocks after a specific block"
It should be backported with the commit above to all stable versions for
the H2 part and as far as 2.8 for h3 one.
When H2 or H3 trailers are inserted in an HTX message, we must take care to
not exceed the maximum number of trailers allowed in a message (same than
the maximum number of headers, i.e tune.http.maxhdr). However, all HTX
blocks in the HTX message were considered. Only TRAILERS HTX blocks must be
considered.
To fix the issue, in h2_make_htx_trailers(), we rely on the "idx" variable
at the end of the for loop. In h3_trailers_to_htx(), we rely on the
"hdr_idx" variable.
This patch must be backported to all stables versions for the H2 part and as
far as 2.8 for the H3 one.
pouet
L7 retries are buggy when a large buffer is used on the request channel. A
memcpy is used to copy data from the request buffer into the L7 buffer. The
L7 buffer is for now always a standard buffer. So if a larger buffer is
used, this leads to a buffer overflow and crash the process.
The Best way to fix the issue is to disable L7 retries when a large buffer
was allocated for the request channel. In that case, we don't want to
allocate an extra large buffer.
No backport needed.
When a large buffer is used on a channel, once we've started to send data to
the opposite side, receives are blocked temporarily to be sure to flush the
large buffer ASAP to be able to fall back on regular buffers. This was
performed by skipping call to the endpoint (connection or applet). Howerver,
doing so, this broken the abortonclose and more generally this masked any
shut or error events reported by the lower layer.
To fix the issue, instead of skipping receives, we now try a receive but
with a requested size set to 0.
No backport needed
Client abort when abortonclose is configured was ignored when messges were
sent on event while it works properly when messages are sent via an
"send-spoe-group" action.
To fix the issue, when the SPOE filter is waiting for the SPOE applet
response, it must check if a client abort was reported and if so, must
interrupt its processing.
This patch should be backported as far as 3.1.
When the SPOE applet is created, the SPOE filter is set in SENDING_MSGS
state. When the applet has transferred data, it should switch the filter to
WAITING_ACK state. Concretly, there is no bug. At best, it could save some
useless applet wakeups.
This patch should be backported as far as 3.1
When SC's shudown callback functions were merged, a regression was
introduced. The applet was no longer woken up. Because of this bug, an
applet could remain blocked, waiting for an I/O event or a timeout.
This patch should fix the issue #3301.
No backport needed.
FDs received through recv_fd_uxst() do not have FD_CLOEXEC set.
The equivalent sock_accept_conn() already handles this correctly:
any FD accepted or received in the master must be marked close-on-exec
to avoid leaking it across the execvp() performed on soft-reload.
This is currently triggering a leak in the master since 3.1: the worker
sends a socketpair fd to the master to issue the _send_status CLI
command, and recv_fd_uxst() receive it without setting FD_CLOEXEC. If a
re-exec is emitted before the master had the chance to close that fd, it
survives execvp() and appears as an untracked unnamed AF_UNIX socket in
the new master generation.
This must be backported to all maintained branches.
Add a NULL guard for the version field. This has no functional impact
since the master process never uses this field for its own mworker_proc
element, and should be the only one impacted. This avoid seeing "(null)"
in the version field when debugging.
Must be backported to 3.1 and later.
During a soft reload, a starting worker sends sock_pair[0] to the master
via send_fd_uxst(), then reads on sock_pair[1] waiting for the master to
acknowledge receipt. Because of a documented macOS sendmsg(2) bug, the
worker must keep sock_pair[0] open until the master confirms the fd was
received by the CLI applet. This means the read() on sock_pair[1] will
never return 0 (EOF), since the worker itself still holds a reference to
sock_pair[0]. The worker can only unblock when the master actively sends
a byte back. If the master crashes before doing so, the worker blocks
indefinitely in read().
Fix this by setting a 2-second SO_RCVTIMEO on sock_pair[1] before the
read(), so the worker can unblock and continue regardless of the master's
state.
This was introduced by d7f6819161 ("BUG/MEDIUM: mworker: fix startup
and reload on macOS").
This should be backported to 3.1 and later.
In mworker_proc_list_to_env(), a typo used '&=' instead of '&' when
checking PROC_O_TYPE_WORKER in child->options. This would corrupt the
options field by clearing all bits except PROC_O_TYPE_WORKER, but since
the function is called right before the master re-execs itself during a
reload, the corruption has no actual effect: the in-memory proc_list is
discarded by the exec, and the options field is not serialized to the
environment anyway.
This should be backported to all maintained versions.
We defer processing of the "-dt" options until after the configuration
file has been read. This will be useful if we ever allow trace sources
to be registered later, for instance with LUA.
No backport needed.
In master-worker mode, when a freshly forked worker looks up its own
entry in proc_list to send its "READY" status to the master, the loop
was breaking on the first process with pid == -1 regardless of its
type. If a non-worker process (e.g. a master or program) also had
pid == -1, the wrong entry could be selected, causing send_fd_uxst()
to use an invalid ipc_fd.
Fix this by adding a PROC_O_TYPE_WORKER check to the loop condition,
and add a BUG_ON() assertion to catch any case where the loop exits
without finding a valid worker entry.
Must be backported to 3.1.
"show profiling" supports "aggr" for tasks but it was ignored for
memory. Now that we're having many more entries, it makes sense to
have it to ignore the call path and merge similar operations.
Keywords registered out of an initcall will have a TH_EX_CTX_CLI_KWL
execution context pointing to the keyword list. The report will indicate
the 5 first words of the first command of the list, e.g.:
exec_ctx: cli kwl starting with 'debug counters '
This should also work for CLI keywords registered in Lua.
Now CLI keywords registered via an initcall will be tracked during
execution, by keeping a link to their initcall location. "show threads"
now shows "exec_ctx: kw registered at @debug.c:3093" which indeed
corresponds to the initcall for the debugging commands.
Till now the CLI didn't know what keyword was being processed after it
was parsed. In order to report the execution context, we'll need to
store it. And this may even help for post-mortem analysis to know the
exact keyword being processed, so let's store the pointer in the cli_ctx
part of the appctx.
It allows to know when a thread is currnetly running inside an applet.
For example now "show threads" will show "applet '<CLI>'" for the thread
issuing this command.
It now appears almost everywhere due to callbacks (e.g. ssl_sock_io_cb).
Muxes also become visible now on memory profiling. A small test on h1+ssl
yields 838 lines of statistics. The number of buckets should definitely
be increased, and more grouping criteria should be added.
A performance test was conducted to observe the possible effect of
setting the execution context on each task switch, and it didn't change
at all, remaining at about 1.01 billion ctxsw/s on a 128-thread EPYC.
Most calls to mux ops were instrumented with a CALL_MUX_WITH_RET() or
CALL_MUX_NO_RET() macro in order to make the current thread's context
point to the called mux and be able to track its allocations. Only
a bunch of harmless mux_ctl() and ->subscribe/unsubscribe calls were
left untouched since useless. But destroy/detach/shut/init/snd_buf
and rcv_buf are now tracked.
It will not show allocations performed in IO callback via tasklet
wakeups however.
In order to ease reading of the output, cmp_memprof_ctx() knows about
muxes and sorts based on the .subscribe function address instead of
the mux_ops address so as to keep various callers grouped.
In order to be able to track memory allocation performed from message
callbacks, let's set the thread execution context to a generic function
pointing to them during their call. This allows for example to observe
the share of SSL allocations caused by ssl_sock_parse_clienthello() when
SSL captures are enabled.
The release calls are automatic from the SSL library for these, and are
registered directly via SSL_get_ex_new_index(). Maybe we should improve
the internal API to wrap that function and systematically track free
calls as well. In this case, maybe even registering the message callback
registration could take both the callback and the release function.
There are few such users however, essentially capture and keylog.
Doing this allows to report the allocations/releases performed by filters
when running with memory profiling enabled. The flt_conf pointer is kept
and the report shows the filter name.
A bit similar to what was done for sample fetch functions and converters,
we now store with each action keyword the location of the initcall when
they're registered this way. Since there are many functions only calling
a LIST_APPEND() (one per ruleset), we now implement a dedicated function
to store the context in all keywords before doing the append.
However that's not sufficient, because keywords are not mandatory for
actions, so we cannot safely rely on rule->kw. Thus we then set the
exec_ctx per rule when they are all scanned in check_action_rules(),
based on the keyword if it exists, otherwise we make a context from
the action_ptr function if it is set (it should).
Finally at all call points we now check rule->exec_ctx.
The purpose here is to be able to spot certain callbacks, such as the
SSL message callbacks, which are difficult to associate to anything.
Thus we introduce a new context type, TH_EX_CTX_FUNC, for which the
context is just the function pointed to by the void *pointer. One
difficulty with callbacks is that the allocation and release contexts
will likely be different, so the code should be properly structured
to allow proper tracking, either by instrumenting all calls, or by
making sure that the free calls are easy to spot in a report.
With the two new context types TH_EX_CTX_SMPF/CONV, we can now also
report contexts corresponding to direct calls to sample_register_fetches()
and sample_register_convs(). In this case, the first word of the keyword
list is reported.
Now keywords are registered with an exec_ctx and this one is passed
when calling ->process. The ctx is of type INITCALL when passed via
an initcall where we know the file name and line number.
This was tested with and extra "malloc(15)" added in smp_fetch_path()
which shows that it works:
$ socat /tmp/sock1 - <<< "show profiling memory"|grep via
Calls | Tot Bytes | Caller and method [via]
1893399 0 60592592 0| 0x78b2ec task_run_applet+0x3339c malloc(32) [via initcall @http_fetch.c:2416]
When the execution context is set to TH_EX_CTX_INITCALL, the pointer
points to a valid initcall, and the decoder will show "kw registered
at %s:%d" with file and line number of the initcall declaration. It's
up to the caller to make the initcall pointer point to the one that was
set during the initcall. The purpose here is to be able to preserve and
pass that knowledge of an initcall down the chain so that future calls
to functions registered via the initcall are still assigned to it.
The INITCALL macros will now store the file and line number where they
are declared into the initcall struct, and RUN_INITCALLS() will assign
them to the global caller_file and caller_line variables, and will even
set caller_initcall to the current initall so that at any instant such
functions know where their caller declared them. This will help with
error messages and traces where a bit of context will be welcome.
Now we have one extra line saying "exec_ctx: something" in thread dumps
when it's known. It may help with warnings and panics to figure what
is ongoing.
The new function chunk_append_thread_ctx() appends to a buffer the given
execution context based on its type and pointer. The goal is to easily
use it in profiling output and thread dumps. For now it only handles
TH_EX_CTX_NONE (which prints nothing) and TH_EX_CTX_OTHER (which indicates
"other ctx" followed by the pointer). It will be extended by new types as
they arrive.
By passing "byctx" to "show profiling memory", it's possible to sort by
the calling context first, which could help group certain calls by
subsystem and ease the interpretation of the output.
This now allows to report the same function in multiple bins based on the
th_ctx's exec_ctx discriminant. It's also worth noting that the context is
not atomically committed, but this shouldn't be a problem since a single
entry can get it. In the worst case, a second thread trying to create the
same context in parallel would create a different bin just for this call,
which is harmless. The same situation already exists with the caller
pointer.
When two pointer hash to the same memprofile bin, we currently try again
with the same bin until we find a spare one or we reach the limit of 16.
Olivier suggested to try with a different step for different pointers so
as to limit the number of bins to visit in such a case, so let's split
the pointer hash calculation so that we keep the raw hash before reduction
and use its lowest bits as the retry step. We force lowest bit to 1 to
avoid integral multiples that would oscillate between only a few positions.
Quick tests with h1+h2 requests show that for ~744 distinct entries, we
used to have 1.17 retries per lookup before and 0.6 now so we're halving
the cost of hash collisions. A heavier workload that used to produce 920
entries with 2.01 retries per lookup now reaches 966 entries (94.3% usage
vs 89.8% before) with only 1.44 retries per lookup.
This should be safe to backport, but depends on this previous commit:
MINOR: tools: extend the pointer hashing code to ease manipulations
Historically, the data manipulated by "show profiling" were copied
onto the stack for sorting and aggregating, but not only this limits
the number of entries we can keep, but it also has an impact on CPU
usage (having to redo the whole copy+sort upon each resume) and the
output accuracy (if sorting changes lines, resume may happen from an
incorrect one).
Instead, let's dynamically allocate the work buffer and place it into
the service context. We only allocate it immediately before needing it
and release it immediately afterwards so that it doesn't stay long. It
also requires a release handler to release those allocates by interrupted
dumps, but that's all. The overall result is now much cleaner, more
accurate, faster and safer.
This patch may be backported to older LTS releases.
In check_config_validity() and proxy_finalize() we check the consistency
of all rule sets, but the quic_initial rules were not placed there. This
currently has little to no impact, however we're going to use that to
also finalize certain debugging info so better call the function. This
can be backported to 3.1 (proxy_finalize is 3.4-only).
In 3.1, per-DSO statistics were added to the memprofile output by
commit 401fb0e87a ("MINOR: activity/memprofile: show per-DSO stats").
However an strdup() is performed there on the .info field, that is
never freed when leaving the function. Let's do it each time we leave
it. Ironically, this was found thanks to "show profiling" showing
itself as an unbalanced caller of strdup().
This needs to be backported to 3.0 since that commit was backported
there.
To read early data with AWS-LC (and BoringSSL), we have to use
SSL_read(). But SSL_read() will also try to do the handshake if it
hasn't been done yet, and at some point will do the handshake and will
return data that are actually not early data. So use SSL_in_early_data()
to make sure that the data we received are actually early data, and only
if so add the CO_FL_EARLY_DATA flag. Otherwise any data first received will be
considered early, and a Early-data header will be added.
As this bug was introduced by 76ba026548,
it should be backported with it.
Upon _send_status, always stop the listener from which the request
was received, rather than looking it up from the proc_list entry via
fdtab[proc->ipc_fd[0]].owner.
A BUG_ON is added to verify that the listener which received the
request is the one expected for the reported PID.
This means it is no longer possible to send "_send_status READY XXX"
manually through the master CLI for testing, as that would trigger
the BUG_ON.
Must be backported as far as 3.1.
The API for early data is a bit different with BoringSSL and AWS-LC than
it is for OpenSSL. As it was implemented, early data would be accepted,
but would not be processed until the handshake is done. Change that by
doing something similar to what OpenSSL does, and, if 0RTT has been
enabled on the listener, use SSL_read() to try to get early data before
starting the handshake, and if there's any, provide them to the mux the
same way it is done for OpenSSL.
That replaces a bunch of #ifdef SSL_READ_EARLY_DATA_SUCCESS by
something specific to OpenSSL has to be done.
This should be backported to 3.3.
EVP_MD_CTX is allocated using EVP_MD_CTX_new() but was never freed.
ctx should be initialized to NULL otherwise EVP_MD_CTX_free(ctx) could
segfault.
Must be backported as far as 3.2.
With a config:
backend bk_app
http-check expect status 200 string "status: ok"
This now correctly emits the error:
config : parsing [./patch.cfg:2] : 'http-check expect' : only one pattern expected.
This line containing the typo is unchanged since at least HAProxy 2.2, the
patch should be backported into all supported branches.
Starting with OpenSSL 4.0, X509_get_subject_name(), X509_get_issuer_name(),
and X509_CRL_get_issuer() return a const-qualified X509_NAME pointer.
Similarly, X509_NAME_get_entry() returns a const X509_NAME_ENTRY *, and
X509_NAME_ENTRY_get_data() returns a const ASN1_STRING *.
Introduce the __X509_NAME_CONST__ macro (defined to 'const' for OpenSSL
>= 4.0.0, empty for WolfSSL and older OpenSSL version which lacks const
on these APIs) and use it to qualify X509_NAME * variables and the
parameters of the three DN helper functions ssl_sock_get_dn_entry(),
ssl_sock_get_dn_formatted(), and ssl_sock_get_dn_oneline(). This avoids
both const-qualifier warnings on OpenSSL 4.0 and discarded-qualifier
warnings on WolfSSL, without needing explicit casts at call sites.
In ssl_sock.c (ssl_get_client_ca_file) and ssl_gencert.c
(ssl_sock_do_create_cert), a __X509_NAME_CONST__ X509_NAME * variable was
being reused to store the result of X509_NAME_dup() and then passed to
mutating functions (X509_NAME_add_entry_by_txt, X509_NAME_free). Introduce
separate X509_NAME * variables (xn_dup, subject) to hold the mutable
duplicate.
Original patch from Alexandr Nedvedicky <sashan@openssl.org>:
https://www.mail-archive.com/haproxy@formilux.org/msg46696.html
In OpenSSL 4.0, the ASN1_STRING struct was made opaque and direct access
to its members (->data, ->length, ->type) no longer compiles. Replace
these accesses in ssl_sock_get_serial(), ssl_sock_get_time(), and
asn1_generalizedtime_to_epoch() with the proper accessor functions
ASN1_STRING_get0_data(), ASN1_STRING_length(), and ASN1_STRING_type().
The old direct access is preserved under USE_OPENSSL_WOLFSSL since
WolfSSL does not provide these accessor functions.
Original patch from Alexandr Nedvedicky <sashan@openssl.org>:
https://www.mail-archive.com/haproxy@formilux.org/msg46696.html
When a master process is reloading, the HAPROXY_PROCESSES variable is
deserialized. In older version of the master-worker (< 1.9), no master
element was existing in this variable.
This is not suppose to happen anymore, and could have provoked problems
in the master anyway.
This patch changes the behavior by exiting the master with an alert if
mp master element was found in this variable.
When no endpoint is attached to a SC, it is unexpected to have I/O (receive
or send). But we honestly don't know if it happens or not. So a CHECK_IF()
is added to be able to track such calls.
Calls to se_shutdown were no the same between applets and mux endpoints.
Only the SHUTW flag was not the same. However, on the multiplexers are
sensitive to the true SHUTW flag. The applets handle all of them the same
way. So calls to se_shutdown() from sc_abort() and sc_shutdown() can be
merged to always use the multiplexer version.