Commit e67e36c9eb introduced
tune.h2.log-errors, that would let you pick if you wanted to know about
stream errors, connection errors, or no error.
However, a logic error made it so no error will be picked for any value
except for "none", in which case connection would be picked. Fix that by
just checking the strcmp() return value correctly.
This should be backported wherever e67e36c9eb
has been backported.
A malformed tcp option with an option length set to 0 can cause
an infinite loop on ip.fp converter.
The patch also forces the computation to use an unsigned char to
avoid a shift back during the parsing.
This fix should be backported on all versions including the ip.fp
converter.
The proxy error counter was not updated in h2c_frt_handle_headers() in
case of failure to decode a HEADERS frame. Make sure to keep it updated.
This can be backported to all stable versions.
Commit aab1a60977 ("BUG/MEDIUM: h2/htx: always fail on too large trailers")
explicitly returned an RST_STREAM on failure to decode some trailers, and
used the code H2_ERR_INTERNAL_ERROR. However there are multiple possible
causes for this failure to happen, and it turns out that it's much more
likely to be related to a protocol error than a decompression error. So
let's change this to PROTOCOL_ERROR, and count a protocol error on the
proxy and in the session.
This can be backported to all stable versions (with adjustments related
to these versions, maybe focusing on 3.2 max is reasonable).
A new type of transaction was introduced for master-cli streams. So
SF_TXN_PCLI flag and functions to allocate and destroy PCLI transactions
were added.
In the stream structure, all pcli_* members were moved in the pcli
transaction and the txn union was updated accordingly.
When it was ambiguous, a test on the transaction type was performed. For
instance to destroy the transaciton.
To be able to deal with different types of transaction for a stream, new
stream flags was added to know the transaction type when allocated. For now
only HTTP transactions can be allocated, so only SF_TXN_HTTP was
introduced. The mask SF_TXN_MASK must be used to get the transaction type.
The transaction type is set when it is allocated and removed when it is
destroyed.
The HTTP transaction is moved in an union. For now, it is the only possible
transaction that can be allocated. But that will change. Thanks to this
commit and the next one, it will be possible to deal with different kind of
transactions for a stream.
This patch looks quite huge, but it is more or less a renaming of all
accesses to "txn" field by "txn.http".
The maximum size allowed for the payload pattern was increase up to 64 bytes
(65 bytes because of the trailing \0), to be able to use a sha256 of random
data for instance. It could be useful to prevent any data smuggling on the
payload.
Note that on the CLI, it could be possible to have only the buffer size as a
limit, because the command line is only consumed once all commands are
executed. The payload pattern is only a pointer in the buffer where the
command line was copied. However, for the master CLI, the data are streamed
to the worker, so we must keep a copy of he payload pattern. This is why we
must limit its size.
It is now possible to deal with too big payload to fit in a buffer, without
changing the buffer size. By default, a payload up to 128 KB can be
dynamically allocated. "tune.cli.max-payload-size" global parameter can be
used to change this value, with some caution for huge values.
For CLI command handler functions, there is no change at all. A pointer on
the payload is still passed as parameter. Internally, an area is allocated
for the payload only if it is too big.
The payload pattern used to detect the end of the payload is part from the
allocated area.
The payload is now saved as a buffer in the CLI context instead of a simple
pointer. It is mandatory to be able to reallocate the payload if it is too
big.
Instead of copying the payload pattern in the CLI context, we now only save
a pointer on this pattern. It is possible because the command line is copied
in the CLI context. Arguments are already handled this way when the command
is processed.
There was a test below the "release" label on conn->owner to decide
whether to kill the connection or not. But this test is not needed,
because:
- for frontends, it's always set so the test never matches
- for backends, it was NULL on the second stream once a request
was being reused from an idle pool, so it couldn't be used to
discriminate between connections. In practice, the goal was to
try to detect certain dead connections but all cases leading to
such connections are either already handled in the tests before
(which don't reach this label), or are handled by the other
conditions.
Thus, let's remove this confusing test.
Some places use conn->owner to retrieve the session. It's valid because
each time it is done, it's on the frontend, though it's not always 100%
obvious and sometimes requires deep code analysis. Let's clarify these
points and even rely on an intermediary variable to make it clearer. One
case where the owner couldn't differ from the session without being NULL
was also eliminated.
When installing a mux on the backend, unless we have a good reason for
keeping the session set in conn->owner, we must reset it. Having the
session there just hides potential bugs and prevents certain tests from
being properly done.
Now it is much clearer: conn->owner remains set to the session on
frontend connections, is set to the session when the connection is
private or assimilated private and belongs to the session list, or
is NULL.
When an idle connection is private or considered private, session_add_conn()
is called to add it to the list of connections owned by the session. But
in case of allocation failure, the session is not set, which results in
a long list of possible situations that are all corner cases which are
difficult to test (and debug).
This commit relies on the fact that it is already permitted to have
conn->owner pointing to a session even if the connection couldn't be
added to the session's list, as this was already the case in
conn_backend_get() when dealing with HOL_RISK. Also as seen in commit
3aab17bd56 added in 2.4, it is already possible to have conn->owner
set with the connection not being in a list, and only the list element
is checked for this.
This commit modifies session_add_conn() to always set conn->onwer, even
if the list element couldn't be allocated. This way it's possible to
always refer to conn->owner to find the session owning a private conn
even in case of failure to allocate an entry. This requires to change
the checks on conn->owner to a check of the list element to see if the
connection belongs to a session, the pre-assignment of sess to
conn->owner in conn_backend_get() is no longer needed, same for the
pre-assignment in http_wait_for_response(), and that's all.
The H1 mux remained unchanged because since it cannot multiplex, in
case it fails to allocate a pconn, it instantly kills the connection.
The bytes_in, bytes_out, {req,res}.bytes_{in,out} sample fetch functions
are marked as internal dependencies only. But that's not exact, they are
statistics. Request traffic (bytes_in, req.bytes*) is usable starting
from the request, while response traffic (bytes_out, res.bytes*) is usable
as soon as a response begins to be received, and all are valid till the
end of the transaction.
The impact is that the log-format below:
log-format "req.bytes_in=%[req.bytes_in] req.bytes_out=%[req.bytes_out] res.bytes_in=%[res.bytes_in] res.bytes_out=%[res.bytes_out]"
is emitted too early and only logs zeroes when uploading 1MB and
downloading 1MB:
req.bytes_in=0 req.bytes_out=0 res.bytes_in=15288 res.bytes_out=0
This patch marks the request stats RQFIN and the response stats RSFIN,
so that they're valid at any moment and the logs backend knows it must
wait for the latest moment to emit such a line. With this change, the
line above now correctly produces:
req.bytes_in=1000157 req.bytes_out=1000157 res.bytes_in=1048629 res.bytes_out=1048629
This should be backported as far as the latest LTS probably, along with
these 2 previous patches:
BUG/MINOR: log: consider format expression dependencies to decide when to log
MINOR: sample: make RQ/RS stats available everywhere
Sample fetch functions working on the request/response stats were marked
as being only compatible with the log phase. This is a mistake because
by definitions, stats can be consulted anywhere from the moment they
start to appear. It's only that they are valid as far as the logs. At the
moment, no sample fetch function depends on RQFIN, and only res.timer.data
depends on RSFIN. But this will be needed to relax certain sample fetch
functions (and will need to be backported along with a few other patches).
Log-format properly takes into account the LW_* flags set by the log
aliases, however its consideration for the sample fetch expressions is
very minimalistic (HTTP y/n). It poses a problem because logging some
statistics doesn't work unless some log aliases are involved to force
the log to wait till the end.
Before this change, the following log-format:
log-format "res.timer.data=%[res.timer.data]"
would log "res.timer.data=0" regardless of the time taken to transfer
data, and the log would be emitted instantly. However, this line:
log-format "res.timer.data=%[res.timer.data] %B"
would properly log the time taken to transfer the data because %B which
carries the log flag LW_BYTES forces the log to wait till the end.
This patch makes sure that anything requiring response (headers or body)
waits for at least the response, and that anything requiring response body
or end of transfer (req/res) waits till the end (LW_BYTES). Thanks to
this, the log above is now correct even without the "%B" hack.
This should be backported at least till the latest LTS.
The ACME Profiles extension (draft-ietf-acme-profiles) allows a client
to request a specific certificate profile by including a "profile" field
in the newOrder request. This lets the CA select the appropriate
certificate issuance policy (e.g. "classic", "shortlived") for a given
order.
A new "profile" keyword is added to the acme section. When set, its
value is included in the newOrder JSON payload sent to the CA.
The target address type has been added to checks in commit
d759e60a32, but as part of that address
type is the "alt_proto" field, that was not properly set for dynamic
servers, That could lead to checks not working for any protocol that use
a non-zero alt_proto, such as QUIC. So set it properly.
gcc flags aead_tag_trash as potentially NULL at the chunk_memcpy call
inside the (!dec && gcm) block, because it cannot correlate the
condition with the allocation that only happens in that same branch. Add
an explicit NULL check to silence the warning.
This was caught by cross-zoo.yml:
In file included from include/haproxy/connection.h:28,
from src/ssl_sample.c:27:
In function ‘b_orig’,
inlined from ‘sample_conv_aes’ at src/ssl_sample.c:540:23:
include/haproxy/buf.h:80:17: error: potential null pointer dereference [-Werror=null-dereference]
80 | return b->area;
| ~^~~~~~
In function ‘b_data’,
inlined from ‘sample_conv_aes’ at src/ssl_sample.c:540:3:
include/haproxy/buf.h💯17: error: potential null pointer dereference [-Werror=null-dereference]
100 | return b->data;
| ~^~~~~~
When trying to read QMux transport parameters frame, the record length
is checked to ensure it is not bigger than the buffer size. The
objective is to detect as soon as possible when receiving data that
cannot be handled and to close the connection.
In fact, this check is not accurate, as it did not take into account the
size of the Record length field itself. This patch fixes the comparison
by substracting with the size of the decoded varint.
No need to backport.
QMux record lengths are encoded as a QUIC varint. Thus in theory, it
requires a 64bits integer to be able to read the whole value. In
practice, if the record is bigger than bufsize, read operation cannot be
completed and an error must be reported.
This patch fixes record length decoding both in xprt_qstrm layer, which
is now performed in two steps. The value is first read in a 64bits
integer instead of a size_t whose size is dependent on the architecture.
Result is then checked against bufsize and if inferior stored in the
previously used variable (xprt ctx rxrlen member).
This should partially fix build issue reported on github #3334.
No need to backport.
An idle backend connection is useless if a HTTP/3 GOAWAY frame has been
received. Indeed, it is forbid to open new stream on such connection.
Thus, this patch ensures such connections are removed as soon as
possible. This is performed via a new check in qcc_is_dead() on
QC_CF_CONN_SHUT flag for backend connections. This ensures that a shut
connection is released instead of being inserted in idle list on detach
operation.
This commits also completes qcc_recv() with a new call to qcc_is_dead()
on its ending. This is necessary if GOAWAY is received on an idle
connection. For now, this is only checked for backend connections as a
GOAWAY is without any real effect for frontend connections. Thus, this
extra protection ensures that we do not break by incident QUIC frontend
support.
qcc_io_recv() also performs qcc_decode_qcs(). However, an extra
qcc_is_dead() is not necessary in this case as the following
qcc_io_process() already performs it.
Implement the reception of a HTTP/3 GOAWAY frame. This is performed via
the new function h3_parse_goaway_frm(). The advertised ID is stored in
new <id_shut_r> h3c member. It serves to ensure that a bigger ID is not
advertised when receiving multiple GOAWAY frames.
GOAWAY frame reception is only really useful on the backend side for
haproxy. When this occurs, h3c is now flagged with H3_CF_GOAWAY_RECV.
Also, QCC is also updated with new flag QC_CF_CONN_SHUT. This flag
indicates that no new stream may be opened on the connection. Callback
avail_streams() is thus edited to report 0 in this case.
Rework GOAWAY emission handling at the HTTP/3 layer. Previously, h3c
member <id_goaway> were updated during the connection on each new
streams attach. This ID was finally reused when a GOAWAY was emitted.
However, this is unnecessary to keep an updated ID during the connection
lifetime. Indeed, <largest_bidi_r> QCC member can be used for the same
purpose. Note that this is only useful for the frontend side. For a
client connection, GOAWAY contains a PUSH ID, thus 0 can be used for
now.
Thus, <id_goaway> in h3c is renamed <id_shut_l>. Now it is only sent
when the GOAWAY is emitted. This allows to reject any streams with a
greater ID. This approach is considered simpler.
Note that <largest_bidi_r> is not strictly similar to the obsolete
<id_goaway>. Indeed, if an error occurs before the corresponding stream
layer allocation, the former would still be incremented. However,
this is not a real issue as GOAWAY specification is clear that lower IDs
are not guaranteed to being handled well, until either the stream is
closed or resetted, or the whole connection is teared down.
QUIC streams ID are encoded as 62-bit integer and cannot reuse an ID
within a connection. This is necessary to take into account this
limitation for backend connections.
This patch implements this via qmux_avail_streams() callback. In the
case where the connection is approaching the encoding limit, reduce the
advertised value until the limit is reached. Note that this is very
unlikely to happen as the value is pretty high.
This should be backported up to 3.3.
add comp_ prefix to all compression related functions, in anticipation
of decompression functions that will be integrated in the same file, so
we don't get mixed up between the two.
No change of behavior expected.
Originally, valid backend connections always used to have conn->owner
pointing to the owner session. In 1.9, commit 93c885 enforced this when
implementing backend H2 support by making sure that no orphaned connection
was left on its own with no remaining stream able to handle it.
Later, idle connections were reworked so that they were no longer
necessarily attached to a stream, but could be directly in the server,
accessed via a hash, so it started to become possible to have conn->owner
left to NULL when picking such a connection. It in fact happens for
http-reuse always, when the second stream picks the connection because
its owner is NULL and it's not changed.
More recently, a case was identified where it could be theoretically
possible to reinsert a dead connection into an idle list, and commit
59c599f3f0 ("BUG/MEDIUM: mux-h2: make sure not to move a dead
connection to idle") addressed that possibility in 3.3 by adding the
h2c_is_dead() test in h2_detach() before deciding to reinsert a
connection into the idle list.
Unfortunately, the combination of changes above results in the following
sequence being possible:
- a stream requires a connection, connect_server() creates one, sets
conn->owner to the session, then when the session is being set up,
the SSL stack calls conn_create_mux() which gets the session from
conn->owner, passes it to mux->init() (h2_init), which in turn
creates the backend stream and assigns it this session.
- when the stream ends, it detaches (h2_detach), and the call to
h2c_is_dead() returns false because h2c->conn->owner is set. The
connection is thus added into the server's idle list.
- a new stream comes, it finds the connection in the server's list,
which doesn't require to set conn->owner, the stream is added via
h2_attach() which passes the stream's session, and that one is
properly set on h2s again, but never on conn->owner.
- the stream finishes, detaches, and this time the call to h2c_is_dead()
sees the owner is NULL, thus indicates that the connection seems dead
so it's not added again to the idle list, and it's destroyed.
Note that this most only happens at low loads (at most one active stream
per connection, so typically at most than one active stream per thread),
where the H2 reuse ratio on a server configured with http-reuse always
or http-reuse aggressive is close to 50%. At high loads, this is much more
rare, though looking at the reuse stats for a server, it's visible that a
sustained load still shows around 1% of the connections being periodically
renewed.
Interestingly, for RHTTP the impact is more important because there
was already a work around for this test in h2c_is_dead() but it uses
conn_is_reverse(), which is never correct in this case (it should be
called conn_to_reverse() because it says the conn must be reversed
and has not yet been), so this extra test doesn't protect against the
NULL check, and connections are closed after each stream is terminated
(if there is no other stream left).
After a long analysis with Amaury and Olivier, it was concluded that:
- the h2c_is_dead() addition is finally not the best solution and
could be refined, however in the current state it's a bit tricky.
- the conn->owner test in h2c_is_dead() is no longer relevant,
probably since 2.4 when connections were stored using hash_nodes
in the servers and would no longer depend on a session, so that
test should be removed.
- the test conn_is_reverse() on the same line, that was added to
ignore the former for RHTTP, and which doesn't properly work either
should be removed as well.
Some further cleanups should be performed to clarify this situation.
This patch implements the points above, and it should be backported
wherever commit 59c599f3f0 was backported.
A lot of our subsystems start to be shared by thread groups now
(listeners, queues, stick-tables, stats, idle connections, LB algos).
This has allowed to recover the performance that used to be out of
reach on losely shared platforms (typically AMD EPYC systems), but in
parallel other large unified systems (Xeon and large Arm in general)
still suffer from the remaining contention when placing too many
threads in a group.
A first test running on a 64-core Neoverse-N1 processor with a single
backend with one server and no LB algo specifiied shows 1.58 Mrps with
64 threads per group, and 1.71 Mrps with 16 threads per group. The
difference is essentially spent updating stats counters everywhere.
Another test is the connection:close mode, delivering 85 kcps with
64 threads per group, and 172 kcps (202%) with 16 threads per group.
In this case it's mostly the more numerous listeners which improve
the situation as the change is mostly in the kernel:
max-threads-per-group 64:
# perf top
Samples: 244K of event 'cycles', 4000 Hz, Event count (approx.): 61065854708 los
Overhead Shared Object Symbol
10.41% [kernel] [k] queued_spin_lock_slowpath
10.36% [kernel] [k] _raw_spin_unlock_irqrestore
2.54% [kernel] [k] _raw_spin_lock
2.24% [kernel] [k] handle_softirqs
1.49% haproxy [.] process_stream
1.22% [kernel] [k] _raw_spin_lock_bh
# h1load
time conns tot_conn tot_req tot_bytes err cps rps bps ttfb
1 1024 84560 83536 4761666 0 84k5 83k5 38M0 11.91m
2 1024 168736 167713 9559698 0 84k0 84k0 38M3 11.98m
3 1024 253865 252841 14412165 0 85k0 85k0 38M7 11.84m
4 1024 339143 338119 19272783 0 85k1 85k1 38M8 11.80m
5 1024 424204 423180 24121374 0 84k9 84k9 38M7 11.86m
max-threads-per-group 16:
# perf top
Samples: 1M of event 'cycles', 4000 Hz, Event count (approx.): 375998622679 lost
Overhead Shared Object Symbol
15.20% [kernel] [k] queued_spin_lock_slowpath
4.31% [kernel] [k] _raw_spin_unlock_irqrestore
3.33% [kernel] [k] handle_softirqs
2.54% [kernel] [k] _raw_spin_lock
1.46% haproxy [.] process_stream
1.12% [kernel] [k] _raw_spin_lock_bh
# h1load
time conns tot_conn tot_req tot_bytes err cps rps bps ttfb
1 1020 172230 171211 9759255 0 172k 171k 78M0 5.817m
2 1024 343482 342460 19520277 0 171k 171k 78M0 5.875m
3 1021 515947 514926 29350953 0 172k 172k 78M5 5.841m
4 1024 689972 688949 39270207 0 173k 173k 79M2 5.783m
5 1024 863904 862881 49184274 0 173k 173k 79M2 5.795m
So let's change the default value to 16. It also happens to match what's
used by default on EPYC systems these days.
This change was marked MEDIUM as it will increase the number of listening
sockets on some systems, to match their counter parts from other vendors,
which is easier for capacity planning.
When nbthread is set, the CPU policies are not used and do not set
nbthread nor nbtgroups. When back into thread_detect_count(), these
are set respectively to thr_max and 1. The problem which becomes very
visible with max-threads-per-group, is that setting this one in
combination with nbthreads results in only one group with the calculated
number of threads per group. And there's not even a warning. So basically
a configuration having:
global
nbthread 64
max-threads-per-group 8
would only start 8 threads.
In this case, grp_min remains valid and should be used, so let's just
change the assignment so that the number of groups is always correct.
A few ifdefs had to move because the calculations were only made for
the USE_CPU_AFFINITY case. Now these parts have been refined so that
all the logic continues to apply even without USE_CPU_AFFINITY.
One visible side effect is that setting nbthread above 64 will
automatically create the associated number of groups even when
USE_CPU_AFFINITY is not set. Previously it was silently changed
to match the per-group limit.
Ideally this should be backported to 3.2 where the issue was
introduced, though it may change the behavior of configs that were
silently being ignored (e.g. "nbthread 128"), so the backport should
be considered with care. At least 3.3 should have it because it uses
cpu-policy by default so it's only for failing cases that it would be
involved.
The nextreq label already implement setting http_state to ACME_HTTP_REQ
and setting ctx->state to st. It is only needed to set the st variable
before jumping to nextreq.
When the opportunistic initial DNS check (ACME_INITIAL_RSLV_READY) fails,
the state machine was incorrectly transitioning to ACME_RSLV_RETRY_DELAY
instead of ACME_CLI_WAIT. This caused the challenge to enter the DNS retry
loop rather than falling back to the normal cond_ready flow that waits for
the CLI signal.
Also reorder ACME_CLI_WAIT in the state enum and trace switch to reflect
the actual execution order introduced in the previous commit: it comes after
ACME_INITIAL_RSLV_READY, not before ACME_INITIAL_RSLV_TRIGGER.
No backport needed.
For dns-persist-01, the "_validation-persist.<domain>" TXT record is set once
and never changes between renewals. Add an initial opportunistic DNS check
(ACME_INITIAL_RSLV_TRIGGER / ACME_INITIAL_RSLV_READY states) that runs before
the challenge-ready conditions are evaluated. If all domains already have the
TXT record, the challenge is submitted immediately without going through the
cli/delay/dns challenge-ready steps, making renewals faster once the record is
in place.
The new ACME_RDY_INITIAL_DNS flag is automatically set for
dns-persist-01 in cond_ready.
Till now, threads were all started one at a time from thread 1. This
will soon cause us limitations once we want to reduce shared stuff
between thread groups.
Let's slightly change the startup sequence so that the first thread
starts one initial thread for each group, and that each of these
threads then starts all other threads from their group before switching
to the final task. Since it requires an intermediary step, we need to
store that threads' start function to access it from the group, so it
was put into the tgroup_info which still has plenty of room available.
It could also theoretically speed up the boot sequence, though in
practice it doesn't change anything because each thead's initialization
is made one at a time to avoid races during the early boot. However
ther is now a function in charge of starting all extra threads of a
group, and whih is called from this group.
This commit completes the previous one which implements a new setting to
limit the number of streams usable by a client on a QUIC connection.
When the connection becomes idle after reaching this limit, it is
immediately closed. This is implemented by extending checks in
qcc_is_dead(). This results in a CONNECTION_CLOSE emission, which is
useful to free resources as soon as possible.
Implement a new setting to limit the total number of bidirectional
streams that the client may use on a single connection. By default, it
is set to 0 which means it is not limited at all.
If a positive value is configured, the client can only open a fixed
number of request streams per QUIC connection. Internally, this is
implemented in two steps :
* First, MAX_STREAMS_BIDI flow control advertizing will be reduced when
approaching the limit before being completely turned off when reaching
it. This guarantees that the client cannot exceed the limit without
violating the flow control.
* Second, when attaching the latest stream with ID matching max-total
setting, connection graceful shutdown is initiated. In HTTP/3, this
results in a GOAWAY emission. This allows the remaining streams to be
completed before the connection becomes completely idle.
Adds a qcc_app_init() call in qcc_app_shutdown(). This is necessary if
shutdown is performed early, before any invokation of qcc_io_send().
Currently, this should never occur in practice. However, this will
become necessary with the new settings tune.quic.fe.stream.max-total.
Indeed, when using a very small value, app-ops layer may be closed early
in the connection lifetime.
Refactor code related to app-layer init/shutdown operations. In short,
qcc_shutdown() is renamed to qcc_app_shutdown(). It is also moved next
to qcc_app_init() to better reflect their link.
Complete function doc for qcs_attach_sc() by using the proper
terminology related to stream/stconn/sedesc. The purpose of this
function should be clearer now.
stksess_new has set the entry expire to the table expire delay,
if it is a new entry, set_entry inserts at that position in the expire
tree. There was a touch_remote updating the expire setting but the
tree's re-ordering is not designed to set back in the past resulting
to an entry that will be trashed only after a full table's expire delay
regardless the expire set on the stktsess.
This patch sets the newts expire before the call of 'set_entry'.
This way a new inserted entry is set directly at the right position
in the tree to trash the entry in time.
This patch should be backported on all supported branches and at
least v2.8
Some features can automatically turn on or off depending on CPU usage,
but it's not easy to measure it. Let's provide 3 new sample fetch functions
reporting the CPU usage as measured inside haproxy during the previous
polling loop, and reported in "idle" stats header / "show info", or used
by tune.glitches.kill.cpu-usage, or maxcompcpuusage:
- cpu_usage_thr: CPU usage between 0 and 100 of the current thread, used
by functions above
- cpu_usage_grp: CPU usage between 0 and 100, averaged over all threads of
the same group as the current one.
- cpu_usage_proc: CPU usage between 0 and 100, averaged over all threads
of the current process
Note that the value will fluctuate since it only covers a few tens to
hundreds of requests of the last polling loop, but it reports what is
being used to take decisions.
It could also be used to disable some non-essential debugging/processing
under too high loads for example.
Just like we have a sample fetch function that returns the number of the
current thread, let's have the same with the thread group number. This
can be useful for troubleshooting, given that certain things are currently
per thread-group (e.g. idle backend connections, certain LB algos etc).
The comment says "between 1 and nbthread" while it's in fact between 0 and
nbthread-1 and this is also documented like this in the config manual. No
backport needed though it cannot hurt.
Since thread groups were enabled by default in 3.3, it has become an
important element of diagnostic that we're missing in "show info". Let's
add it under "NbThreadGroups".
Add the CLIENT_RANDOM line for TLS1.2 in HAPROXY_KEYLOG_FC_LOG_FMT and
HAPROXY_KEY_LOG_BC_FMT. These are useful to produce a keylog file
compatible with both TLS1.3 and TLS1.2.
A regression was introduced by the commit a8887e55a ("BUG/MEDIUM: htx: Fix
function used to change part of a block value when defrag").
When a block value was replaced and a defragmentation was performed, the
delta between the old value and the new one was counted twice. htx_defrag()
already is responsible to set the new size for the HTX message. So it must
not be performed in htx_replace_blk_value().
This patch must be backported with the commit above. So theorically to all
stable versions.
A regression was introcuded by the commit 0c6f2207f ("MEDIUM: htx: Refactor
htx defragmentation to merge data blocks").
When a defragmentation is performed, it is possible to alter a block
size. The main usage is to prepare a block value replacement. However, since
the commit above, the change is no longer handled. The block info are
changed but the size of the message is not modified accordingly.
This patch depends on the commit "MINOR: htx: Add helper function to get
type and size from the block info field"
No backport needed.
The lack of mjson_next() prevents to iterate easily and need to hack by
iterating on a loop of snprintf + $.field[XXX] combined with
mjson_find().
This reintroduce mjson_next() so we could iterate without having to
build the string.
The patch does not reintroduce MJSON_ENABLE_NEXT so it could be used
without having to define it.
The ACME_INITIAL_DELAY state displays a message about 'dns-01', but this
state is also used for 'dns-persist-01'.
This patch displays the challenge that was configured instead of dns-01
This allows to use the `unique-id` fetch within `tcp-check` or `http-check`
ruleset. The format is taken from the checked server's backend (which is
naturally inherited from the corresponding `defaults` section).
This is particularly useful with
http-check send ... hdr request-id %[unique-id]
to ensure all requests sent by HAProxy have a unique ID header attached.
This resolves GitHub Issue #3307.
Reviewed-by: Volker Dusch <github@wallbash.com>
This implementation is directly modeled after `stream_generate_unique_id()` and
the corresponding `unique_id` field on `struct stream`.
It will be used in a future commit to enable the use of the `%[unique-id]`
fetch in check rules.
Use the return value of `stream_generate_unique_id()` instead of relying on the
`unique_id` field of `struct stream` when handling the `%ID` log placeholder.
This also allowed to unify the "stream available" and "stream not available"
paths.
Reviewed-by: Volker Dusch <github@wallbash.com>
With the introduction of the `generate_unique_id()` helper, the actual
complicated logic is sitting in a different file. Allow inlining of
`stream_generate_unique_id()`, so that callers can benefit from an abstraction
without hiding away the access of `strm->unique_id` behind a function call.
This new function will handle the actual generation of the unique ID according
to a format. The caller is responsible to check that no unique ID is stored
yet.
Commit 6d16b11022 ("BUG/MINOR: haterm: preserve the pipe size margin
for splicing") solved the issue of pipe size being sufficient for the
vmsplice() call, but as Christopher pointed out, the ratio was applied
to the default size of 64k, so now it's applied twice, giving 100k
instead of 80k. Let's drop it from there.
No backport needed.
Printing a "(null)" when NULL passed with the %s format specifier is a
GNU extension, so it must be avoided for portability reasons.
Must be backported as far as 3.2
The wildcard field was declared and used when building the dns-persist-01
TXT record value (policy=wildcard suffix), but was never populated from
the server's authorization response. Add the missing mjson_get_bool() call
to read $.wildcard before saving auth->dns.
Add challenge_type parameter to acme_rslv_start() to select the correct
DNS lookup prefix: _validation-persist.<domain> for dns-persist-01 and
_acme-challenge.<domain> for dns-01.
Default cond_ready to ACME_RDY_DNS|ACME_RDY_DELAY for dns-persist-01.
Extend ACME_CLI_WAIT to cover dns-persist-01 alongside dns-01.
In ACME_RSLV_READY, check only TXT record existence for dns-persist-01
since the resolver cannot parse multiple strings within a single TXT entry.
Implements draft DNS-PERSIST-01 challenge based on
https://datatracker.ietf.org/doc/html/draft-ietf-acme-dns-persist
Blog post: https://letsencrypt.org/2026/02/18/dns-persist-01
This challenge is designed to use preprovisioned DNS records,
unlike DNS-01 challenge it doesn't need per provider API integration.
In short instead of validating order by crafting a custom response
based on input recieved from ACME server, like other challenges do
in particular DNS-01, HTTP-01, TLS-ALPN-01, in this challenge you
authorize domain statically, ACME account key functions similar to
a private key and accounturi in the record functions like a public key,
ACME server verifies that account uri matches account key and authorizes
based on that. You only need to write DNS record one time,
accounturi binds to an account key, and will only change if new account
key is created, although it is possible to rotate account key without
changing account uri.
Main benefits of this challenge in contrast to DNS-01:
1. Security, no need to give reverse proxy write access to the DNS.
2. Simplicity, no complex per provider integrations like Lego needed.
3. Robustness, no worrying about DNS record cache each renewal.
It would be used like this:
1. generate an account key ahead of time
2. add required DNS record manually or automatically using IaC tools
3. start HAProxy with the same account key used
Intended way to use this challenge is with a code that will print
and maybe sets DNS records ahead of time. For example that could
be integrated into the IaC provisioning step. This challenge type
is extremely recent though, so those integrations are yet to be written.
It is possible to do this challenge without extra tools too,
with pebble / challtestsrv steps would be as following:
After starting HAProxy it will print required records in the logs.
With challtestsrv you can then set those records like this:
curl -d '{
"host":"_validation-persist.localhost.",
"value": "pebble.letsencrypt.org; accounturi=...; policy=wildcard"}
' http://localhost:8055/set-txt
After setting the records run renew with the name of the certificate:
echo "acme renew @cert/localhost.pem" \
| socat stdio tcp4-connect:127.0.0.1:9999
Or just restart HAProxy.
Unlike with DNS-01 you don't have to worry about DNS records changing,
if there is any problem with DNS records you can just retry.
Originally in httpterm we used to allocate 5/4 of the size of a pipe to
permit to use vmsplice because there's some fragmentation or overhead
internally that requires to use a bit of margin. While this was initially
applied to haterm as well, it was accidentally lost with commit fb82dece47
("BUG/MEDIUM: haterm: Properly initialize the splicing support for haterm"),
resulting in errors about vmsplice() whenever tune.pipesize is set. Let's
enforce the ratio again.
No backport is needed.
I noticed some strange checks for presence of errmsg. Called functions
generate non-empty error message in case of failure, so a non-NULL address
of the error message is enough.
No backport needed.
When command line is parsed, when the payload was too big the error was not
properly handled. Instead of leaving the parsing function to print the
error, we looped infinitly trying to parse remaining data.
When the command line is too big, we must exit the parsing function in
CLI_ST_PRINT_ERR state. Instead of exiting the function, we only left the
while loop, setting this way the cli applet in CLI_ST_PROMPT state.
This patch must be backported as far as 3.2.
Instead of relying on the implementation detail that
`stream_generate_unique_id()` will store the unique ID in `strm->unique_id` we
should use the returned value, especially since that one is already checked in
the `isttest()`.
Reviewed-by: Volker Dusch <github@wallbash.com>
The return value of the `if()` and `else` branch is identical. We can just move
it out of conditional paths.
Reviewed-by: Volker Dusch <github@wallbash.com>
`sess_build_logline_orig()` takes a `size_t maxsize` as input and accordingly
should also return `size_t` instead of `int` as the resulting length. In
practice most of the callers already stored the result in a `size_t` anyways.
The few places that used an `int` were adjusted.
This Coccinelle patch was used to check for completeness:
@@
type T != size_t;
T var;
@@
(
* var = build_logline(...)
|
* var = build_logline_orig(...)
|
* var = sess_build_logline(...)
|
* var = sess_build_logline_orig(...)
)
Reviewed-by: Volker Dusch <github@wallbash.com>
The following configuration:
defaults
unique-id-format TEST-%[srv_name]
frontend fe_http
mode http
bind :::8080 v4v6
Emitted the following error:
[ALERT] (219835) : Parsing [./patch.cfg:2]: failed to parse unique-id : sample fetch <srv_name]> may not be reliably used here because it needs 'server' which is not available here.
The `]` in the name of the sample fetch should not be there.
This bug exists since at least HAProxy 2.4, which is the oldest supported
version. The fix should be backported there.
Reviewed-by: Volker Dusch <github@wallbash.com>
Reuse QUIC transport parameters value set in xprt_qstrm layer in frame
builder function. Prior to this patch, mux_quic would use different
values from the advertised ones.
No need to backport.
When QMux was first implemented, values used for emitted transport
parameters in xprt_qstrm and local flow control in mux_quic were
initialized separately. This is error prone in particular if a value is
change in one layer but not the other.
This patch fixes this by using xprt_qstrm_lparams() in QMux init
function. Mux flow control is then loaded with these values. Thus all
values are now initialized in a single place which is xprt_qstrm_init().
The OpenTelemetry (OTel) filter enables distributed tracing of requests
across service boundaries, export of metrics such as request rates,
latencies and error counts, and structured logging tied to trace context,
giving operators a unified view of HAProxy traffic through any
OpenTelemetry-compatible backend.
The OTel filter is implemented using the standard HAProxy stream filter
API. Stream filters attach to proxies and intercept traffic at each stage
of processing: they receive callbacks on stream creation and destruction,
channel analyzer events, HTTP header and payload processing, and TCP data
forwarding. This allows the filter to collect telemetry data at every
stage of the request/response lifecycle without modifying the core proxy
logic.
This commit added the minimum set of files required for the filter to
compile: the addon Makefile with pkg-config-based detection of the
opentelemetry-c-wrapper library, header files with configuration
constants, utility macros and type definitions, and the source files
containing stub filter operation callbacks registered through
flt_otel_ops and the "opentelemetry" keyword parser entry point.
The filter uses the opentelemetry-c-wrapper library from HAProxy
Technologies, which provides a C interface to the OpenTelemetry C++ SDK.
This wrapper allows HAProxy, a C codebase, to leverage the full
OpenTelemetry observability pipeline without direct C++ dependencies
in the HAProxy source tree.
https://github.com/haproxytech/opentelemetry-c-wrapperhttps://github.com/open-telemetry/opentelemetry-cpp
Build options:
USE_OTEL - enable the OpenTelemetry filter
OTEL_DEBUG - compile the filter in debug mode
OTEL_INC - force the include path to the C wrapper
OTEL_LIB - force the library path to the C wrapper
OTEL_RUNPATH - add the C wrapper RUNPATH to the executable
Example build with OTel and debug enabled:
make -j8 USE_OTEL=1 OTEL_DEBUG=1 TARGET=linux-glibc
conn_recv_qstrm() may be called several times per connection if the read
data is too short and a truncated record is received.
Previously, record length was parsed every time the function is invoked.
However, this must only be performed if record length varint is
incomplete. Once read and parsed, data are removed from the buffer via
b_quic_dec_int(). Thus, next conn_recv_qstrm() run will reread an
invalid record length this time.
This patch fixes this by only parsing record length if <rxrlen> member
is null. Prior to it, parsing of QMux transport parameters would fail in
case of a first truncated read, which would prevent the connection
initialization.
No need to backport.
A QCC connection may be flagged with QC_CF_ERRL to trigger a
CONNECTION_CLOSE emission. However, for now error reporting is not
functional with QMux, as it relies on quic_conn layer access.
To prevent a crash in qcc_io_send() when using QMux, add a
conn_is_quic() check when QC_CF_ERRL is set to ensure no access will be
performed on quic_conn layer. In the future, this should be extended so
that QMux is also able to emit CONNECTION_CLOSE for connection closure.
No need to backport.
First, we must not emit any warning if splicing is not configured and the
global maxpipes value is 0. Then we must not remove GTUNE_USE_SPLICE flag
when we fail to allocate the haterm master pipe. Instead, we test it when we
negociate with the opposite side, to properly exclude the splicing if it is
not usable.
No backport needed.
This reverts commit 8056117e98.
Moving haterm init from haproxy is not the right way to fix the issue
because it should be possible to use a haterm configuration in haproxy.
So let's revert the commit above.
Add QUIC BLOCKED frames in the list of supported types in
qstrm_parse_frm(). Nothing is really implemented for them as for QUIC,
but this prevents a crash when receiving one of them via QMux.
No need to backport.
This patch implements emission of the new Record layer for QMux frames.
This handles mux-quic and xprt_qstrm layers as this is performed
similarly in both cases.
Currently, the simplest approach has been prefered : each frame is
encoded in its own record. This is not the most efficient in size but it
is extremely simple to implement for a first interop testing.
This patch implements the new QMux record layer parsing for xprt_qstrm.
This is mostly similar to the MUX code from the previous patch.
Along with this change, a new xprt_qstrm layer accessor exposes the
possible remaining record length after Transport parameters parsing.
This can only occur when xprt_qstrm Rx buffer is not completely emptied
due to other following frames. If stored in the same record, MUX layer
has to know the remaining record length.
Thus, xprt_qstrm_rxrlen() is now used in qmux_init() to preinitialize
<rx.rlen> QCC field.
This is the first patch of a serie which aims to support the new Record
layer defined by the draft 01 of QMux protocol.
https://www.ietf.org/archive/id/draft-ietf-quic-qmux-01.html#name-qmux-records
This patch deals with QMux reception at the MUX layer. The function
qcc_qstrm_recv() is adapted to read record headers before frame parsing.
This requires to keep the last record length read in a new QCC field
named <rx.rlen>.
Frames are only parsed once a full record is received. One of the
advantage of the record layer is that it can only contains whole frame
without truncation.
This patch implements proper connection error handling for xprt_qstrm
layer. Basically, processing is interrupted if CO_FL_ERROR is
encountered after either rcv_buf or snd_buf operations. Connectionn
error is set to the newly defined value CO_ER_QSTRM.
This commit adds buffering on transmission for xprt_qstrm layer. This is
necessary in the rare case where send syscall only emits partial data.
A new <txbuf> member is defined in xprt_qstrm context. On first send
invokation, buffer is allocated and then the QMux transport parameters
frame is encoded. Then emission is performed via snd_buf and each time
the send function is invoked.
Layer xprt_qstrm is responsible to read the initial QMux transport
parameters frame. However, it could receive more data if some other
frames follow it. This extra content can only be handled by the MUX
layer once initialized.
Theorically, it could have been implemented via MSG_PEEK. However, this
flag is currently ignored by SSL layer. Besides, it is tedious to
implement safely. A new approach has been prefered where the MUX layer
is responsible to retrieve remaining data via xprt_qstrm_rxbuf()
accessor function during its initialization.
Thus, qmux_init() now may retrieve the buffer from xprt_qstrm layer.
This is performed via b_xfer() which will result in a zero copy
transfer. If this happens, tasklet is immediately scheduled to start
demuxing.
Implement buffering for reception on xprt_qstrm layer. This is necessary
to handle reception of a truncated QMux transport parameters frame.
This is performed via a new dedicated <rxbuf> member in xprt_qstrm
context. Read is performed by reusing the buffer until a whole frame can
be read.
QUIC frame parsers functions take a <pos> pointer as input argument for
the data to be parsed. If parsing is successful, <pos> must be
incremented to point to the next data.
Increment was not performed when parsing QMux transport parameters
frame. This commit fixes this. Note that for now there is no real issue
as xprt_qstrm does not check the QMux frame length.
No need to backport.
In qcc_release(), <conn> may be NULL. Thus every access on it must be
tested.
With recent QMux introduction, a call to conn_is_quic() has been added
prior to registration of the stream rejection callback. It could lead to
NULL deref as <conn> is not tested there. Fix this by adding an extra
check on the pointer validity.
No need to backport.
hlua_applet_http_status() stored the result of luaL_optlstring()
directly in http_ctx->reason. The pointer references Lua-managed
string storage which is only guaranteed valid until the C function
returns to Lua. If the GC runs between applet:set_status(200, str)
and applet:start_response(), the pointer dangles.
hlua_applet_http_send_response() then calls ist(http_ctx->reason)
which does strlen() on freed memory, followed by memcpy into the
HTX status line. The freed-and-reallocated chunk contents are sent
verbatim to the HTTP client.
Trigger:
applet:set_status(200, table.concat({"Reason ", str:rep(50)}))
collectgarbage("collect"); collectgarbage("collect")
applet:start_response()
With heap grooming, adjacent allocation contents (session data, TLS
material from the same thread) leak into the response status line.
Anchor the Lua string in the registry keyed by the http_ctx field
address so it survives until the applet is done with it. The
registry entry is overwritten on each call (handles repeated
set_status) and naturally cleaned up when the lua_State is closed.
This patch should be backported to all stable versions.
FCGI content_length is a 16-bit field but fcgi_set_record_size()
is called with size_t/uint32_t arguments. With tune.bufsize >= 65544
(legal; cfgparse-global.c only enforces <= INT_MAX-16), a single
HTX DATA block or accumulated outbuf can exceed 65535 bytes. The
implicit conversion to uint16_t silently truncates the length field
while b_add(mbuf, outbuf.data) writes the full body.
A client posting ~99000 bytes can craft the body so that bytes
after the truncated length are parsed by PHP-FPM as fresh FCGI
records on the connection: a smuggled BEGIN_REQUEST + PARAMS with
arbitrary SCRIPT_FILENAME / PHP_VALUE bypasses all haproxy ACLs.
Fix the zero-copy path by refusing it when the block exceeds 65535
bytes (falls through to copy). Fix the copy path by capping
outbuf.size to 65535 + header so the data-fill loop naturally
stops at the FCGI maximum and emits the rest in a subsequent record.
The PARAMS path at line 2084 is similarly affected but harder to
trigger (requires combined header+param size > 65535) and is
covered by the same outbuf.size cap pattern if applied there.
This patch must be backported to all stable versions.
exp_replace() returns int and returns -1 when the back-reference
expansion overflows the output buffer (regex.c:51). output->data is
size_t, so -1 becomes SIZE_MAX. There was no error check.
The subsequent comparisons interpret SIZE_MAX as a huge length:
"output->data > b_room(trash)" tries to grow trash, then
"max > output->data" is false so max stays at trash->size, and
memcpy(trash, output->area, trash->size) copies the full chunk.
output->area is a pool_alloc()'d chunk that is NOT zeroed; the
bytes after the partial exp_replace output are stale data from a
prior pool user (request headers, response bodies from the same
worker thread).
Trigger with a backreference whose expansion exceeds bufsize:
http-request set-header X %[req.hdr(In),regsub('(.+)','\1\1')]
and a request with In: of ~9000 bytes. The X header sent to the
backend then contains ~9KB of stale heap data.
With tune.bufsize.large set, get_larger_trash_chunk() upgrades trash
and the memcpy reads up to ~50KB past the (smaller) output->area
allocation.
http_ana.c:2728 and http_act.c:551 already check exp_replace() for
-1; this call site was missed when backreferences were added.
This patch must be backported to all stable versions.
Samples of type SMP_T_METH were not properly handled in smp_dup(),
smp_is_safe() and smp_is_rw(). For "other" methods, for instance PATCH, a
fallback was performed on the SMP_T_STR type. Only the buffer considered
changed. "smp->data.u.meth.str" should be used for the SMP_T_METH samples
while smp->data.u.str should be used for SMP_T_STR samples. However, in
smp_dup(), the result was stored in wrong buffer, the string one instead of
the method one. In smp_is_safe() and smp_is_rw(), the method buffer was not
used at all.
We now take care to use the right buffer.
This patch must be backported to all stable versions.
When "Expect" header was found in request headers, "HTTP/1.1 100-continue"
was returned instead of "HTTP/1.1 100 continue". Let's fix it.
No backport needed.
http_action_set_headers_bin() decodes varint name and value lengths
from a binary sample but never validates that the decoded length
fits in the remaining sample data before constructing the ist.
If the value's varint decodes to a large number with only a few
bytes following, v.len exceeds the buffer and http_add_header()
memcpys past the sample, copying adjacent heap data into a header
sent to the backend (or client, with http-response).
The intended source for this action is the hdrs_bin sample fetch
which produces well-formed output, but nothing prevents an admin
from feeding it req.body or another untrusted source. With:
http-request set-var(txn.h) req.body
http-request add-headers-bin var(txn.h)
a POST body of [05]"X-Foo"[c8]"AB" produces v = {ptr="AB", len=200}
and 198 bytes of adjacent heap data go into X-Foo.
http_action_del_headers_bin() was fixed too.
Compare spoe_decode_buffer() which has the equivalent check.
Validate both name and value lengths against remaining data.
No backport needed.
Commit c84c15d393 ("BUG/MINOR: resolvers: Apply dns-accept-family
setting on additional records") converted a switch statement to an
if/else chain but left the break; in the AAAA branch. In the new
form, break exits the surrounding for loop instead of a switch case.
For every AAAA additional record in an SRV response:
- answer_record allocated at line 1460 is never freed and never
inserted into answer_tree -> ~580 bytes leaked per response
- all subsequent additional records in the response are silently
discarded
A DNS server controlling SRV responses for haproxy service discovery
can leak memory at MB/min rates given default resolution intervals.
Also breaks IPv6 SRV target resolution outright since the AAAA record
is leaked rather than attached to its SRV entry.
HAProxy has always called luaL_openlibs() unconditionally, which opens
all standard Lua libraries including io, os, package and debug. This
makes it impossible to prevent Lua scripts from executing binaries
(os.execute, io.popen), loading native C modules (package/require), or
bypassing any Lua-level sandbox via the debug library.
Add a new global directive tune.lua.openlibs that accepts a comma-separated
list of library names to load:
tune.lua.openlibs none # only base + coroutine
tune.lua.openlibs string,math,table,utf8 # safe libs only
tune.lua.openlibs all # default, same as before
The base and coroutine libraries are always loaded regardless: base provides
core Lua functions that HAProxy relies on, and coroutine is required because
HAProxy overrides coroutine.create() with its own safe implementation.
When all libraries are enabled (the default), the fast path still calls
luaL_openlibs() directly with no overhead. A parse error is returned if
the directive appears after lua-load or lua-load-per-thread (the Lua state
is already initialised at that point), or if 'none' is combined with other
library names. Note that fork() and new thread creation are already blocked
by default regardless of this setting (see "insecure-fork-wanted").
Literals are sent in two ways:
- in EOB state, unencoded and prefixed with their length
- in FIXED state, huffman-encoded
And references are only sent in FIXED state.
The API promises that the amount of data will not grow by more than
5 bytes every 65535 input bytes (the comment was adjusted to remind
this last point). This is guaranteed by the literal encoding in EOB
state (BT, LEN, NLEN + bytes), which is supposed to be the worst
case by design.
However, as reported by Greg KH, this is currently not true: the test
that decides whether or not to switch to FIXED state to send references
doesn't properly account for the number of bytes needed to roll back
to the *exact* same state in EOB, which means sending EOB, BT,
alignment, LEN and NLEN in addition to the referenced bytes, versus
sending the encoding for the reference. By not taking into account the
cost of returning to the initial state (BT+LEN+NLEN), it was possible
to stay too long in the FIXED state and to consume the extra bytes that
are needed to return to the EOB state, resulting in producing much more
data in case of multiple switchovers (up to 6.25% increase was measured
in tests, or 1/16, which matches worst case estimates based on the code).
And this check is only valid when starting from EOB (in order to restore
the same state that offers this guarantee). When already in FIXED state,
the encoded reference is always smaller than or same size as the data.
The smallest match length we support is 4 bytes, and when encoded this
is no more than 28 bits, so it is safe to stay in FIXED state as long
as needed while checking the possibility of switching back to EOB.
This very slightly reduces the compression ratio (-0.17% on a linux
kernel source) but makes sure we respect the API promise of no more
than 5 extra bytes per 65535 of input. A side effect of the slightly
simpler check is an ~7.5% performance increase in compression speed.
Many thanks to Greg for the detailed report allowing to reproduce
the issue.
This is libslz upstream commit 002e838935bf298d967f670036efa95822b6c84e.
Note: in haproxy's default configuration (tune.bufsize 16384,
tune.maxrewrite 1024), this problem cannot be triggered, because the
reserve limits input to 15360 bytes, and the overflow is maximum
960 bytes resulting in 16320 bytes total, which still fits into the
buffer. However, reducing tune.maxrewrite below 964, or tune.bufsize
above 17408 can result in overflows for specially crafted patterns.
A workaround for larger buffers consists in always setting tune.bufsize
to at least 1/16 of tune.bufsize.
Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://www.mail-archive.com/haproxy@formilux.org/msg46837.html
Storing the protocol directly into the check was not a good idea,
because the protocol may not be determined until after a DNS resolution
on the server, and may even change at runtime, if the DNS changes.
What we can, however, figure out at start up, is the net_addr_type,
which will contain all that we need to find out which protocol to use
later.
Also revert the changes made by commit 07edaed191
that would not reuse the server xprt if a different alpn is set for
checks. The alpn is just a string, and should not influence the choice
of the xprt.
We'll now make sure to use the server xprt, unless an address is
provided, in which case we'll use whatever xprt matches that address, or
a port, in which case we'll assume we want TCP, and use check_ssl to
know whetver we want the SSL xprt or not.
Now that the check contains all that is needed to know which protocol to
look up, always just use that when creating a new check connection if it
is the default check connection, and for now, always use TCP when a
tcp-check or http-check connect rule is used (which means those can't be
used for QUIC so far).
This should hopefully fix github issue #3324.
This reverts commit a03120e228.
A WIP version of the patch was applied before the actual patch by
accident. The correct patch is 2db801c ("BUG/MINOR: hlua: fix stack
overflow in httpclient headers conversion")
When the app_ops were removed, direct calls to the SC wake callback function
were replaced by tasklet wakeups. However, in conn_create_mux(), it was
replaced by a direct call to sc_conn_process(). However, sc_conn_process()
is only usable when the SC is attach to a stream. A backend mux can be
created for a healcheck. In this context, sc_conn_process() cannot be
called.
Because of this bug, crashes can be experienced when an error is triggered
during a SSL connection attempt from a healthcheck.
To fix the issue, the call to sc_conn_process() was replaced by a tasklet
wakeup.
This patch should fix the issue #3326. No backport needed.
When a peer sends a dictionary entry update with a value (the else
branch at line 2109), the entry id decoded from the wire was never
validated against dc->max_entries before being used as an array index
into dc->rx[].
A malicious peer can send id=N where N > 128 (PEER_STKT_CACHE_MAX_ENTRIES)
to:
- dc->rx[id-1].de at line 2123: OOB read followed by atomic decrement
and potential free of an attacker-controlled pointer via
dict_entry_unref()
- dc->rx[id-1].de = de at line 2124: OOB write of a heap pointer at
an attacker-controlled offset (16-byte stride, ~64 GiB range)
The bounds check was added to the key-only branch in commit f9e51beec
("BUG/MINOR: peers: Do not ignore a protocol error for dictionary
entries.") but was never added to the with-value branch. The bug has
been present since dictionary support was introduced in commit
8d78fa7def ("MINOR: peers: Make peers protocol support new
"server_name" data type.").
Reachable from any TCP client that knows the configured peer name
(no cryptographic authentication on the peers protocol). Requires a
stick-table with "store server_key" in the configuration.
Fix by hoisting the bounds check above the branch so it covers both
paths.
Must be backported as far as 2.6.
When the input chunk is already the large buffer (chk->size ==
large_trash_size), the <= comparison still matched and returned
another large buffer of the same size. Callers that retry on a
non-NULL return value (sample.c:4567 in json_query) loop forever.
The json_query infinite loop is trivially triggered: mjson_unescape()
returns -1 not only when the output buffer is too small but also for
any \uXXYY escape where XX != "00" (mjson.c:305) and for invalid
escapes like \q. The retry loop assumes -1 always means "grow the
buffer", so a 14-byte JSON body of {"k":"\u0100"} hangs the worker
thread permanently. Send N such requests to exhaust all worker
threads.
Use < instead of <= so a chunk that is already large yields NULL.
This also fixes the json converter overflow at sample.c:2869 where
no recheck happens after the "growth" returned a same-size buffer.
Introduced in commit ce912271db ("MEDIUM: chunk: Add support for
large chunks"). No backport needed.
A copy-paste error in alloc_trash_buffers_per_thread() passes
global.tune.bufsize_large to alloc_small_trash_buffers() instead of
global.tune.bufsize_small. This sets small_trash_size = bufsize_large.
When tune.bufsize.large is configured, get_larger_trash_chunk() then
incorrectly matches a large buffer against small_trash_size at line
169 and "grows" it to a regular (smaller) buffer. b_xfer() at line
179 attempts to copy the large buffer's contents into the smaller one:
- Default builds (DEBUG_STRICT=1): BUG_ON in __b_putblk() aborts
the process -> remote DoS
- DEBUG_STRICT=0 builds: BUG_ON becomes ASSUME() and the compiler
elides the check -> heap overflow with attacker-controlled bytes
Reachable via the json converter (sample.c:2862) when escaping
~bufsize_large/6 control characters in attacker-supplied data such
as a request header or body.
Introduced in commit 92a24a4e87 ("MEDIUM: chunk: Add support for
small chunks"). No backport needed.
hlua_error() is a printf-family function (calls vsnprintf), but
hlua_patref_set, hlua_patref_add, and _hlua_patref_add_bulk pass
errmsg directly as the format string. errmsg is built by pattern.c
helpers that embed the user-supplied key or value verbatim, e.g.
pat_ref_set_elt() generates "unable to parse '<value>'".
A Lua script calling:
ref:set("key", "%p.%p.%p.%p.%p.%p.%p.%p")
against a map with an integer output type (where the parse fails)
gets stack/register contents formatted into the (nil, err) return
value -> ASLR/canary leak. With %n and no _FORTIFY_SOURCE this
becomes an arbitrary write primitive.
This must be backported as far as the Patref Lua API exists.
hlua_httpclient_table_to_hdrs() declares a VLA of size
global.tune.max_http_hdr (default 101) on the stack but never checks
hdr_num against that bound. A Lua script that supplies a header table
with more than 101 values writes struct http_hdr entries (two ist =
two heap pointers + two lengths) past the end of the VLA, smashing
the stack frame.
Trigger from any Lua action/task/service:
local hc = core.httpclient()
local v = {}
for i = 1, 300 do v[i] = "x" end
hc:get{ url = "http://127.0.0.1/", headers = { ["X"] = v } }
Each out-of-bounds entry writes a heap pointer (controllable
allocation contents via istdup) plus an attacker-chosen length onto
the stack, overwriting the saved return address.
[wla: this is only reachable if the Lua script passes more than
max_http_hdr header values, which requires access to the script itself]
This must be backported as far as the httpclient Lua API exists.
Signed-off-by: William Lallemand <wlallemand@haproxy.com>
hlua_httpclient_table_to_hdrs() declares a VLA of size
global.tune.max_http_hdr (default 101) on the stack but never checks
hdr_num against that bound. A Lua script that supplies a header table
with more than 101 values writes struct http_hdr entries (two ist =
two heap pointers + two lengths) past the end of the VLA, smashing
the stack frame.
Trigger from any Lua action/task/service:
local hc = core.httpclient()
local v = {}
for i = 1, 300 do v[i] = "x" end
hc:get{ url = "http://127.0.0.1/", headers = { ["X"] = v } }
Each out-of-bounds entry writes a heap pointer (controllable
allocation contents via istdup) plus an attacker-chosen length onto
the stack, overwriting the saved return address. With no stack
canary, this is direct RCE; with a canary, it requires a leak first.
Reachable from any deployment that loads Lua scripts. While Lua
scripts are nominally trusted, this turns "can edit Lua" into "can
execute arbitrary native code", which is a meaningful boundary in
many setups (Lua sandbox escape).
This must be backported as far as the httpclient Lua API exists.
When the secret argument to jwt_decrypt_secret is a variable
(ARGT_VAR) rather than a literal string, alloc_trash_chunk() is
called to hold the base64-decoded secret but the buffer is never
released. The end: label frees input, decrypted_cek, out, and the
decoded_items array but not secret.
Each request leaks one trash chunk (~tune.bufsize, default 16KB).
At ~65000 requests per GiB this allows slow memory exhaustion DoS
against any config of the form:
http-request set-var(txn.x) req.hdr(...),jwt_decrypt_secret(txn.key)
This must be backported as far as JWE support exists.
convert_ecdsa_sig() calls i2d_ECDSA_SIG(ecdsa_sig, &p) where p
points into signature->area, a trash chunk of tune.bufsize bytes
(default 16384). i2d writes with no output bound.
The raw R||S input can be up to bufsize bytes (filled by
base64urldec at jwt.c:520-527), giving bignum_len up to 8192. The
DER encoding adds a SEQUENCE header (2-4 bytes), two INTEGER headers
(2-4 bytes each), and up to two leading-zero sign-padding bytes when
the bignum high bit is set. With two 8192-byte bignums having the
high bit set, the encoding is ~16398 bytes, overflowing the 16384-
byte buffer by ~14 bytes.
Triggered by any JWT with alg=ES256/384/512 and a ~21830-character
base64url signature. The signature does not need to verify
successfully; the overflow happens before verification. Reachable
from any config using jwt_verify with an EC algorithm.
Also fixes the existing wrong check: i2d returns -1 on error which
became SIZE_MAX in the size_t signature->data, defeating the
"== 0" test.
This must be backported as far as JWT support exists.
In sample_conv_jwt_decrypt_secret(), when a JWE token has an empty
encrypted-key section but the algorithm is not "dir" (e.g. A128KW),
neither branch initializes decrypted_cek. The NULL pointer is then
passed to decrypt_ciphertext() which dereferences it:
- For GCM encodings: aes_process() calls b_orig(NULL) -> SIGSEGV
- For CBC encodings: b_data(NULL) at jwe.c:463 -> SIGSEGV
A single HTTP request with a crafted Authorization header crashes the
worker process. Trigger token (JOSE header {"alg":"A128KW","enc":"A128GCM"},
empty CEK section between the two dots):
eyJhbGciOiJBMTI4S1ciLCJlbmMiOiJBMTI4R0NNIn0..AAAAAAAAAAAAAAAA.AA.AA
Reachable in any configuration using the jwt_decrypt_secret converter.
The other two decrypt converters (jwt_decrypt_jwk, jwt_decrypt_cert)
already have the check.
This must be backported as far as JWE support exists.
The 16-bit name_len field is read directly from the ClientHello and
stored as the sample length without any validation against srv_len,
ext_len, or the channel buffer size. A 65-byte ClientHello with
name_len=0xffff produces a sample claiming 65535 bytes of data when
only ~4 bytes are actually present in the buffer.
Downstream consumers then read tens of kilobytes past the channel
buffer:
- pattern.c:741 XXH3() hashes 65535 bytes -> ~50KB OOB heap read
- sample.c smp_dup memcpy if large trash configured
- log-format %[req.ssl_sni] leaks heap contents to logs/headers
Reachable pre-authentication on any TCP frontend using req.ssl_sni
(req_ssl_sni), which is the documented way to do SNI-based content
switching in TCP mode. No SSL handshake is required; the parser
runs on raw buffer contents in tcp-request content rules.
Bug introduced in commit d4c33c8889 (2013). The ALPN parser in
the same file at line 1044 has the equivalent check; SNI never did.
This must be backported to all supported versions.
When the healthcheck section support was added, the tcpcheck type was moved
into the tcpcheck ruleset. However, conn_install_mux_chk() function was not
updated accordingly. So the TCP mode was always returned.
No backport needed. This patch is related to #3324 but it is not the root
cause of the issue.
As reported by GH @phihos on GH #3320, using the shm-stats-file feature
with objects exceeding 127 chars would result in object name being
unexpectedly truncated, while GUID API supports up to 128 chars.
Indeed, with the config below, and shm-stats-file enabled:
server s1 127.0.0.1:1 guid srv:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:SRV_1 disabled
server s10 127.0.0.1:1 guid srv:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:SRV_10 disabled
haproxy would store the second server object with the same id as the first
one, but upon reload, only the first one would be restored, which would
eventually cause shm-stats-file slot exhaustion with repetitive reloads.
@phihos, found out the underlying issue, in counters.c we used snprintf()
with sizeof(shm_obj->guid) - 1 as <size> parameter, while we should have
use sizeof(shm_obj->guid) instead since shm_obj->guid already takes the
terminating NULL byte into account.
So we simply apply the fix suggested by @phihos, and hopefully this should
solve the shm-stats-file slot leak that was observed.
Unfortunately, for now, we cannot warn the user that a duplicate
shm-stats-file object was found, because we accept duplicate objects
by design for 2 reasons. The first one is for a new process to be able
to change the object type for a previously known GUID while allowing
previous processes to use the old object as long as they are alive.
The second reason is that upon startup we cannot afford to scan the
whole object list, as soon as we find a match (type + GUID), we bind
the object, and this way we avoid unnecessary lookup time.
Perhaps we have room for improvement in the future, but for now let's
keep it this way.
It should be backported to 3.3
Big thanks to @phihos for the bug description, analysis and
suggestions.
The parsing of the "healthcheck" parameter for dynamic servers was not
finished. The post-config was missing, leading to a crash because the
ruleset pointer was NULL.
To fix the issue, check_server_tcpcheck() function is called in
cli_parse_add_server().
No backport needed.
When an early response is sent to the client and the H1 connection is
switched to the draining state, we must take care to disable the 0-copy data
forwarding because the backend side is no longer here. It is an issue
because this prevent any regular receive to be performed.
This patch should fix the issue #3316. It must be backported as far as 3.0.
Functions used to initialize haterm (the splicing and the response buffers)
were defined and registered in haterm.c. The problem is that this file in
compiled with haproxy. So it may be an issue. And for the splicing part,
warnings may be emitted when haproxy is started.
To avoid any issue during haproxy startup and to avoid to initialize some
part of haterm, all init functions were moved into haterm_init.c file.
No backport needed.
This is another pre-requisite work for upcoming decompression filter.
In this patch we implement the "filter-sequence" directive which can be
used in proxy section (frontend,backend,listen) and takes 2 parameters
The first one is the direction (request or response), the second one
is a comma separated list of filter names previously declared on the
proxy using the "filter" keyword.
The main goal of this directive is to be able to instruct haproxy in which
order the filters should be executed on request and response paths,
especially if the ordering between request and response handling must
differ, and without relying on the filter declaration ordering (within
the proxy) which is used by default by haproxy.
Another benefit of this feature is that it becomes possible to "ignore"
a previously declared filter on the proxy. Indeed, when filter-sequence
is defined for a given direction (request/response), then it will be used
over the implicit filter ordering, but if a filter which was previously
declared is not specified in the related filter-sequence, it will not be
executed on purpose. This can be used as a way to temporarily disable a
filter without completely removing its configuration.
Documentation was updated (check examples for more info)
flt_conf struct stores the filter id, which is used internally to check
match the filter against static pointer identifier, and also used as
descriptive text to describe the filter. But the id is not consistent
with the public name as used in the configuration (for instance when
selecting filter through the 'filter' directive).
What we do in this patch is that we add flt_conf->name member, which
stores the real filter name as seen in the configuration. This will
allow to select filters by their name from other directives in the
configuration.
The ssl_crtname_index was registered with SSL_get_ex_new_index() but the
certificate name is stored on a SSL_CTX object via SSL_CTX_set_ex_data().
The free callback is only invoked for the object type matching the index
registration, so the strdup'd name was never freed when the SSL_CTX was
released.
Fix this by using SSL_CTX_get_ex_new_index() instead, which ensures the
free callback fires when the SSL_CTX is destroyed.
No backport needed.
Following request options are now handled as flags:
- ?k=1 => flag HS_ST_OPT_CHUNK_RES is set
- ?c=0 => flag HS_ST_OPT_NO_CACHE is set
- ?R=1 => flag HS_ST_OPT_RANDOM_RES is set
- ?A=A => flag HS_ST_OPT_REQ_AFTER_RES is set.
By default, none is set.
The support for the splicing was added and enabled by default, if
supported. The command line option '-dS' was also added to disable the
feature.
When the splicing can be used and the front multiplexer agrees to proceed,
tee() is used to "copy" data from the master pipe to the client pipe.
Now the zero-copy data forwarding is supported, we will add the splicing
support. To do so, we first create a master pipe with vmsplice() during
haterm startup. It is only performed if the splicing is supported. And its
size can be configured by setting "tune.pipesize" global parameter.
This master pipe will be used to fill the pipe with the client.
The support for the zero-copy data forwarding was added and enabled by
default. The command line option '-dZ' was also added to disable the
feature.
Concretely, when haterm pushes the response payload, if the zero-copy
forwarding is supported, a dedicated function is used to do so.
hstream_ff_snd() will rely on se_nego_ff() to know how many data can send
and at the end, on se_done_ff() to really send data.
hstream_add_ff_data() function was added to perform the raw copy of the
payload in the sedesc I/O buffer.
hstream_add_data() function is renamed to hstream_add_htx_data() because
there will be a similar function to add data in zero-copy forwarding
mode. The function was also adapted to take the data length to add in
parameter and to return the number of written bytes.
This new sample fetch returns the name of the certificate selected for
an incoming SSL/TLS connection, as it would appear in "show ssl cert".
It may be a filename with its relative or absolute path, or an alias,
depending on how the certificate was declared in the configuration.
The certificate name is stored as ex_data on the SSL_CTX at load time
in ckch_inst_new_load_store(), and freed via a dedicated free callback.
The "feature" predicate takes an argument name. Not passing one will
cause strstr() to always find something, including at the end of the
string, and to read past end that ASAN detects. We need to check that
we didn't reach end before proceeding.
This bug was reported by OSS Fuzz here:
https://issues.oss-fuzz.com/issues/499133314
The issue is present since 2.4 with commit 58ca706e16 ("MINOR: config:
add predicate "feature" to detect certain built-in features") so this
fix must be backported to all stable versions.
Using awslc_api_before() with an invalid argument results in "(null)"
appearing in the error message due to -1 being returned without the
error message being filled. Let's always fill the error message on error.
This was introduced in 3.3 with commit 3d15c07ed0 ("MINOR: cfgcond: add
"awslc_api_atleast" and "awslc_api_before""), and this fix must be
backported to 3.3.
Using openssl_version_before() with an invalid argument results in "(null)"
appearing in the error message due to -1 being returned without the error
message being filled. Let's always fill the error message on error.
This was introduced in 2.5 with commit 3aeb3f9347 ("MINOR: cfgcond:
implements openssl_version_atleast and openssl_version_before"), and
this fix must be backported to 2.6.
cfg_eval_condition() says that the <errptr> pointer will be set upon
error. However, cfg_eval_cond_expr() can fail (e.g. failure to handle
a dynamic argument) but would branch to "done" and leave errptr unset.
Let's check for this case as well.
This bug was reported by OSS Fuzz here:
https://issues.oss-fuzz.com/issues/499135825
The bug was introduced in 2.5 around commit ca81887599 ("MINOR:
cfgcond: insert an expression between the condition and the term") so
the fix must be backported as far as 2.6.
The previous ACME_RSLV_WAIT state served a dual role: it applied the
initial dns-delay before the first DNS probe and also handled the
delay between retries. There was no way to simply wait a fixed delay
before submitting the challenge without also triggering DNS pre-checks.
Replace ACME_RSLV_WAIT with two distinct states:
- ACME_INITIAL_DELAY: an optional initial wait before proceeding,
only applied when "challenge-ready" includes the new "delay" keyword
- ACME_RSLV_RETRY_DELAY: the delay between resolution retries, always
applied when DNS pre-checks are in progress
The new "delay" keyword in "challenge-ready" can be used standalone
(wait then submit the challenge directly) or combined with "dns" (wait
then start the DNS pre-checks). When "delay" is not set, the first DNS
probe fires immediately.
Update the documentation accordingly.
The TASK_WOKEN_TIMER check that previously handled the case where
RSLV_TRIGGER was reached directly from the CLI command is therefore dead
code and can be removed.
Fix the following build warning from obsolete compilers for <orig_frm>
variable in qcc_qstrm_send_frames() function :
src/mux_quic_qstrm.c:266:17: warning: 'orig_frm' may be used
uninitialized in this function [-Wmaybe-uninitialized]
The variable is now explicitely initialized to NULL on each loop, which
should prevent this warning. Note that for code clarity, the variable is
renamed <next_frm>.
No need to backport.
Previously the dns timeout timer was initialized in ACME_RSLV_WAIT,
before the initial dns-delay expires. This meant the countdown started
before any DNS request was actually sent, so the effective timeout was
shorter than expected by one dns-delay period.
Move the initialization to ACME_RSLV_TRIGGER so the timer starts only
when the first DNS resolution attempt is triggered. Update the
documentation to clarify this behaviour.
Defines an API for xprt_qstrm so that the QMux transport parameters can
be retrieved by the MUX layer on its initialization. This concerns both
local and remote parameters.
Functions xprt_qstrm_lparams/rparams() are defined and exported for
this. They are both used in qmux_init() if QMux protocol is active.
On SSL handshake completion, MUX layer can be initialized if not already
the case. However, for QMux protocol, it is necessary first to perform
transport parameters exchange, via the new xprt_qstrm layer. This patch
ensures this is performed if any flag CO_FL_QSTRM_* is set on the
connection.
Also, SSL layer registers itself via add_xprt. This ensures that it can
be used by xprt_qstrm for the emission/reception of the necessary
frames.
This patch implements QMux emission of transport parameters via
xprt_qstrm. Similarly to receive, this is performed in conn_send_qstrm()
which uses lower xprt snd_buf operation. The connection must first be
flagged with CO_FL_QSTRM_SEND to trigger this step.
Extend xprt_qstrm to implement the reception of QMux transport
parameters. This is performed via conn_recv_qstrm() which relies on the
lower xprt rcv_buf operation. Once received, parameters are kept in
xprt_qstrm context, so that the MUX can retrieve them on init.
For the reception of parameters to be active, the connection must first
be flagged with CO_FL_QSTRM_RECV.
Add get_alpn operation support for xprt_qstrm. This simply acts as a
passthrough method to the underlying XPRT layer.
This function is necessary for QMux when running above SSL, as mux-quic
will access ALPN during its initialization in order to instantiate the
proper application protocol layer.
Define a new XPRT layer for the new QMux protocol. Its role will be to
perform the initial exchange of transport parameters.
On completion, contrary to XPRT handshake, xprt_qstrm will first init
the MUX and then removes itself. This will be necessary so that the
parameters can be retrieved by the MUX during its initialization.
This patch only declares the new xprt_qstrm along with basic operations.
Future commits will implement the proper reception/emission steps.
Similarly to reception, a new buffer is defined in QCC connection to
handle emission for QMux protocol. This replaces the trash buffer usage
in qcc_qstrm_send_frames().
This buffer is necessary to handle partial emission. On retry, the
buffer must be completely emitted before starting to send new frames.
Each time a QUIC frame is emitted, mux-quic layer is notified via a
callback to update the underlying QCS. For QUIC, this is performed via
qc_stream_desc element.
In QMux protocol, this can be simplified as there is no
qc_stream_desc/quic_conn layer interaction. Instead, each time snd_buf
is called, QCS can be updated immediately using its return value. This
is performed via a new function qstrm_ctrl_send().
Its work is similar to the QUIC equivalent but in a simpler mode. In
particular, sent data can be immediately removed from the Tx buffer as
there is no need for retransmission when running above TCP.
This patchs implement mux-quic reception for the new QMux protocol. This
is performed via the new function qcc_qstrm_send_frames(). Its interface
is similar to the QUIC equivalent : it takes a list of frames and
encodes them in a buffer before sending it via snd_buf.
Contrary to QUIC, a check on CO_FL_ERROR flag is performed prior to
every qcc_qstrm_send_frames() invokation to interrupt emission. This is
necessary as the transport layer may set it during snd_buf. This is not
the case currently for quic_conn layer, but maybe a similar mechanism
should be implemented as well for QUIC in the future.