Commit graph

88 commits

Author SHA1 Message Date
Willy Tarreau
b3a84800b4 MINOR: task: maintain a per-thread indicator of the peak run-queue size
The new field th_ctx->rq_tot_peak contains the computed peak run queue
length averaged over the last 512 calls. This is computed when entering
process_runnable_tasks. It will not take into account new tasks that are
created or woken up during this round nor those which are evicted, which
is the reason why we're using a peak measurement to increase chances to
observe transient high values. Tests have shown that 512 samples are good
to provide a relatively smooth average measurement while still fading
away in a matter of milliseconds at high loads. Since this value is
only updated once per round, it cannot be used as a statistic and
shouldn't be exposed, it's only for internal use (self-regulation).
2026-03-19 16:24:31 +01:00
Willy Tarreau
17cbec485a MINOR: cli: implement execution context for manually registered keywords
Keywords registered out of an initcall will have a TH_EX_CTX_CLI_KWL
execution context pointing to the keyword list. The report will indicate
the 5 first words of the first command of the list, e.g.:

     exec_ctx: cli kwl starting with 'debug counters   '

This should also work for CLI keywords registered in Lua.
2026-03-12 18:06:38 +01:00
Willy Tarreau
9cb11d0859 MINOR: applet: set execution context on applet calls
It allows to know when a thread is currnetly running inside an applet.
For example now "show threads" will show "applet '<CLI>'" for the thread
issuing this command.
2026-03-12 18:06:38 +01:00
Willy Tarreau
c0bf395cde MINOR: task: set execution context on task/tasklet calls
It now appears almost everywhere due to callbacks (e.g. ssl_sock_io_cb).
Muxes also become visible now on memory profiling. A small test on h1+ssl
yields 838 lines of statistics. The number of buckets should definitely
be increased, and more grouping criteria should be added.

A performance test was conducted to observe the possible effect of
setting the execution context on each task switch, and it didn't change
at all, remaining at about 1.01 billion ctxsw/s on a 128-thread EPYC.
2026-03-12 18:06:38 +01:00
Willy Tarreau
ec7b07b650 MINOR: connection: track mux calls to report their allocation context
Most calls to mux ops were instrumented with a CALL_MUX_WITH_RET() or
CALL_MUX_NO_RET() macro in order to make the current thread's context
point to the called mux and be able to track its allocations. Only
a bunch of harmless mux_ctl() and ->subscribe/unsubscribe calls were
left untouched since useless. But destroy/detach/shut/init/snd_buf
and rcv_buf are now tracked.

It will not show allocations performed in IO callback via tasklet
wakeups however.

In order to ease reading of the output, cmp_memprof_ctx() knows about
muxes and sorts based on the .subscribe function address instead of
the mux_ops address so as to keep various callers grouped.
2026-03-12 18:06:38 +01:00
Willy Tarreau
3fb8659d04 MINOR: filters: set the exec context to the current filter config
Doing this allows to report the allocations/releases performed by filters
when running with memory profiling enabled. The flt_conf pointer is kept
and the report shows the filter name.
2026-03-12 18:06:38 +01:00
Willy Tarreau
43b56c22c7 MINOR: actions: also report execution contexts registered directly
This now reports directly registered actions using new type
TH_EX_CTX_ACTION which will report the first keyword of the
list.
2026-03-12 18:06:38 +01:00
Willy Tarreau
261cae3b6d MINOR: tools: support an execution context that is just a function
The purpose here is to be able to spot certain callbacks, such as the
SSL message callbacks, which are difficult to associate to anything.
Thus we introduce a new context type, TH_EX_CTX_FUNC, for which the
context is just the function pointed to by the void *pointer. One
difficulty with callbacks is that the allocation and release contexts
will likely be different, so the code should be properly structured
to allow proper tracking, either by instrumenting all calls, or by
making sure that the free calls are easy to spot in a report.
2026-03-12 18:06:38 +01:00
Willy Tarreau
aa4d5dd217 MINOR: sample: also report contexts registered directly
With the two new context types TH_EX_CTX_SMPF/CONV, we can now also
report contexts corresponding to direct calls to sample_register_fetches()
and sample_register_convs(). In this case, the first word of the keyword
list is reported.
2026-03-12 18:06:38 +01:00
Willy Tarreau
2cd0cd84c6 MINOR: tools: support decoding ha_caller type exec context
The TH_EX_CTX_CALLER type takes an ha_caller pointer which allows a
caller to mark its caller's location using MK_CALLER().
2026-03-12 18:06:38 +01:00
Willy Tarreau
6e75da7a91 MINOR: tools: decode execution context TH_EX_CTX_INITCALL
When the execution context is set to TH_EX_CTX_INITCALL, the pointer
points to a valid initcall, and the decoder will show "kw registered
at %s:%d" with file and line number of the initcall declaration. It's
up to the caller to make the initcall pointer point to the one that was
set during the initcall. The purpose here is to be able to preserve and
pass that knowledge of an initcall down the chain so that future calls
to functions registered via the initcall are still assigned to it.
2026-03-12 18:06:38 +01:00
Willy Tarreau
b7c8fab507 MINOR: tinfo: start to add basic thread_exec_ctx
We have the struct made of a type and a pointer in the th_ctx and a
function to switch it for the current thread. Two macros are provided
to enclose a callee within a temporary context. For now only type OTHER
is supported (only a generic pointer).
2026-03-12 18:06:37 +01:00
Willy Tarreau
e2631ee5f7 MEDIUM: activity: apply and use new finegrained task profiling settings
In continuity of previous patch, this one makes use of the new profiling
flags. For this, based on the global "profiling" setting, when switching
profiling on, we set or clear two flags on the thread context,
TH_FL_TASK_PROFILING_L and TH_FL_TASK_PROFILING_M to indicate whether
lock profiling and/or malloc profiling are desired when profiling is
enabled. These flags are checked along with TH_FL_TASK_PROFILING to
decide when to collect time around a lock or a malloc. And by default
we're back to the behavior of 3.2 in that neither lock nor malloc times
are collected anymore.

This is sufficient to see the CPU usage spent in the VDSO to significantly
drop from 22% to 2.2% on a highly loaded system.

This should be backported to 3.3 along with the previous patch.
2026-02-10 17:52:59 +01:00
Olivier Houchard
a209c35f30 MEDIUM: thread: Turn the group mask in thread set into a group counter
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
If we want to be able to have more than 64 thread groups, we can no
longer use thread group masks as long.
One remaining place where it is done is in struct thread_set. However,
it is not really used as a mask anywhere, all we want is a thread group
counter, so convert that mask to a counter.
2026-01-15 05:24:53 +01:00
Maxime Henrion
6eedd0d485 CLEANUP: more conversions and cleanups for alignment
Some checks are pending
Contrib / build (push) Waiting to run
alpine/musl / gcc (push) Waiting to run
VTest / Generate Build Matrix (push) Waiting to run
VTest / (push) Blocked by required conditions
Windows / Windows, gcc, all features (push) Waiting to run
- Convert additional cases to use the automatic alignment feature for
  the THREAD_ALIGN(ED) macros. This includes some cases that are less
  obviously correct where it seems we wanted to align only in the
  USE_THREAD case but were not using the thread specific macros.
- Also move some alignment requirements to the structure definition
  instead of having it on variable declaration.
2025-12-09 17:40:58 +01:00
Amaury Denoyelle
49e6fca51b MINOR: quic: use separate global quic_conns FE/BE lists
Each quic_conn instance is stored in a global list. Its purpose is to be
able to loop over all known connections during "show quic".

Split this into two separate lists for frontend and backend usage.
Another change is that closing backend connections do not move into
quic_conns_clo list. They remain instead in their original list. The
objective of this patch is to reduce the contention between the two
sides.

Note that this prevents backend connections to be listed in "show quic"
now. This will be adjusted in a future patch.
2025-11-25 14:30:18 +01:00
Willy Tarreau
9d8c2a888b MINOR: activity: collect CPU time spent on memory allocations for each task
When task profiling is enabled, the pool alloc/free code will measure the
time it takes to perform memory allocation after a cache miss or memory
freeing to the shared cache or OS. The time taken with the thread-local
cache is never measured as measuring that time is very expensive compared
to the pool access time. Here doing so costs around 2% performance at 2M
req/s, only when task profiling is enabled, so this remains reasonable.
The scheduler takes care of collecting that time and updating the
sched_activity entry corresponding to the current task when task profiling
is enabled.

The goal clearly is to track places that are wasting CPU time allocating
and releasing too often, or causing large evictions. This appears like
this in "show profiling tasks aggr":

  Tasks activity over 11.428 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   mem_avg   lat_avg
    process_stream             44183891   16.47m    22.36us   491.0ns   1.154us   1.000ns   101.1us
    h1_io_cb                   57386064   4.011m    4.193us   20.00ns   16.00ns      -      29.47us
    sc_conn_io_cb              42088024   49.04s    1.165us      -         -         -      54.67us
    h1_timeout_task              438171   196.5ms   448.0ns      -         -         -      100.1us
    srv_cleanup_toremove_conns       65   1.468ms   22.58us   184.0ns   87.00ns      -      101.3us
    task_process_applet               3   508.0us   169.3us      -      107.0us   1.847us   29.67us
    srv_cleanup_idle_conns            6   225.3us   37.55us   15.74us   36.84us      -      49.47us
    accept_queue_process              2   45.62us   22.81us      -         -      4.949us   54.33us
2025-09-11 16:32:34 +02:00
Willy Tarreau
98cc815e3e MINOR: activity: collect time spent with a lock held for each task
When DEBUG_THREAD > 0 and task profiling enabled, we'll now measure the
time spent with at least one lock held for each task. The time is
collected by locking operations when locks are taken raising the level
to one, or released resetting the level. An accumulator is updated in
the thread_ctx struct that is collected by the scheduler when the task
returns, and updated in the sched_activity entry of the related task.

This allows to observe figures like this one:

  Tasks activity over 259.516 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   lat_avg
    h1_io_cb                   15466589   2.574m    9.984us      -         -      33.45us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
    sc_conn_io_cb               8047994   8.325s    1.034us      -         -      870.1us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
    process_stream              7734689   4.356m    33.79us   1.990us   1.641us   1.554ms <- sc_notify@src/stconn.c:1206 task_wakeup
    process_stream              7734292   46.74m    362.6us   278.3us   132.2us   972.0us <- stream_new@src/stream.c:585 task_wakeup
    sc_conn_io_cb               7733158   46.88s    6.061us      -         -      68.78us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
    task_process_applet         6603593   4.484m    40.74us   16.69us   34.00us   96.47us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
    task_process_applet         4761796   3.420m    43.09us   18.79us   39.28us   138.2us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
    process_table_expire        4710662   4.880m    62.16us   9.648us   53.95us   158.6us <- run_tasks_from_lists@src/task.c:671 task_queue
    stktable_add_pend_updates   4171868   6.786s    1.626us      -      1.487us   47.94us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
    h1_io_cb                    2871683   1.198s    417.0ns   70.00ns   69.00ns   1.005ms <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup
    process_peer_sync           2304957   5.368s    2.328us      -      1.156us   68.54us <- stktable_add_pend_updates@src/stick_table.c:873 task_wakeup
    process_peer_sync           1388141   3.174s    2.286us      -      1.130us   52.31us <- run_tasks_from_lists@src/task.c:671 task_queue
    stktable_add_pend_updates    463488   3.530s    7.615us   2.000ns   7.134us   771.2us <- stktable_touch_with_exp@src/stick_table.c:654 tasklet_wakeup

Here we see that almost the entirety of stktable_add_pend_updates() is
spent under a lock, that 1/3 of the execution time of process_stream()
was performed under a lock and that 2/3 of it was spent waiting for a
lock (this is related to the 10 track-sc present in this config), and
that the locking time in process_peer_sync() has now significantly
reduced. This is more visible with "show profiling tasks aggr":

  Tasks activity over 475.354 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lkd_avg   lat_avg
    h1_io_cb                   25742539   3.699m    8.622us   11.00ns   10.00ns   188.0us
    sc_conn_io_cb              22565666   1.475m    3.920us      -         -      473.9us
    process_stream             21665212   1.195h    198.6us   140.6us   67.08us   1.266ms
    task_process_applet        16352495   11.31m    41.51us   17.98us   36.55us   112.3us
    process_peer_sync           7831923   17.15s    2.189us      -      1.107us   41.27us
    process_table_expire        6878569   6.866m    59.89us   9.359us   51.91us   151.8us
    stktable_add_pend_updates   6602502   14.77s    2.236us      -      2.060us   119.8us
    h1_timeout_task                 801   703.4us   878.0ns      -         -      185.7us
    srv_cleanup_toremove_conns      347   12.43ms   35.82us   240.0ns   70.00ns   1.924ms
    accept_queue_process            142   1.384ms   9.743us      -         -      340.6us
    srv_cleanup_idle_conns           74   475.0us   6.418us   896.0ns   5.667us   114.6us
2025-09-11 16:32:34 +02:00
Willy Tarreau
4b23b2ed32 MINOR: thread: add a lock level information in the thread_ctx
The new lock_level field indicates the number of cumulated locks that
are held by the current thread. It's fed as soon as DEBUG_THREAD is at
least 1. In addition, thread_isolate() adds 128, so that it's even
possible to check for combinations of both. The value is also reported
in thread dumps (warnings and panics).
2025-09-11 16:32:34 +02:00
Willy Tarreau
503084643f MINOR: activity: collect time spent waiting on a lock for each task
When DEBUG_THREAD > 0, and if task profiling is enabled, then each
locking attempt will measure the time it takes to obtain the lock, then
add that time to a thread_ctx accumulator that the scheduler will then
retrieve to update the current task's sched_activity entry. The value
will then appear avearaged over the number of calls in the lkw_avg column
of "show profiling tasks", such as below:

  Tasks activity over 48.298 sec till 0.000 sec ago:
    function                      calls   cpu_tot   cpu_avg   lkw_avg   lat_avg
    h1_io_cb                    3200170   26.81s    8.377us      -      32.73us <- sock_conn_iocb@src/sock.c:1099 tasklet_wakeup
    sc_conn_io_cb               1657841   1.645s    992.0ns      -      853.0us <- sc_app_chk_rcv_conn@src/stconn.c:844 tasklet_wakeup
    process_stream              1600450   49.16s    30.71us   1.936us   1.392ms <- sc_notify@src/stconn.c:1206 task_wakeup
    process_stream              1600321   7.770m    291.3us   209.1us   901.6us <- stream_new@src/stream.c:585 task_wakeup
    sc_conn_io_cb               1599928   7.975s    4.984us      -      65.77us <- h1_wake_stream_for_recv@src/mux_h1.c:3633 tasklet_wakeup
    task_process_applet          997609   46.37s    46.48us   16.80us   113.0us <- sc_app_chk_snd_applet@src/stconn.c:1043 appctx_wakeup
    process_table_expire         922074   48.79s    52.92us   7.275us   181.1us <- run_tasks_from_lists@src/task.c:670 task_queue
    stktable_add_pend_updates    705423   1.511s    2.142us      -      56.81us <- stktable_add_pend_updates@src/stick_table.c:869 tasklet_wakeup
    task_process_applet          683511   34.75s    50.84us   18.37us   153.3us <- __process_running_peer_sync@src/peers.c:3579 appctx_wakeup
    h1_io_cb                     535395   198.1ms   370.0ns   72.00ns   930.4us <- h1_takeover@src/mux_h1.c:5659 tasklet_wakeup

It now makes it pretty obvious which tasks (hence call chains) spend their
time waiting on a lock and for what share of their execution time.
2025-09-11 16:32:34 +02:00
Willy Tarreau
b8a1c2380b MEDIUM: threads: keep history of taken locks with DEBUG_THREAD > 0
by only storing a word in each thread context, we can keep the history
of all taken/dropped locks by label. This is expected to be very cheap
and to permit to store up to 8 consecutive lock operations in 64 bits.
That should significantly help detect recursive locks as well as figure
what thread was likely to hinder another one waiting for a lock.

For now we only store the final state of the lock, we don't store the
attempt to get it. It's just a matter of space since we already need
4 ops (rd,sk,wr,un) which take 2 bits, leaving max 64 labels. We're
already around 45. We could also multiply by 5 and still keep 8 bits
total per lock, that would limit us to 51 locks max. It seems that
most of the time if we get a watchdog panic, anyway the victim thread
will be perfectly located so that we don't need a specific value for
this. Another benefit is that we perform a single memory write per
lock.
2025-04-28 16:50:34 +02:00
Willy Tarreau
874ba2afed CLEANUP: debug: no longer set nor use TH_FL_DUMPING_OTHERS
TH_FL_DUMPING_OTHERS was being used to try to perform exclusion between
threads running "show threads" and those producing warnings. Now that it
is much more cleanly handled, we don't need that type of protection
anymore, which was adding to the complexity of the solution. Let's just
get rid of it.
2025-04-17 16:25:47 +02:00
Willy Tarreau
6d8a523d14 MINOR: tinfo: keep a copy of the pointer to the thread dump buffer
Instead of using the thread dump buffer for post-mortem analysis, we'll
keep a copy of the assigned pointer whenever it's used, even for warnings
or "show threads". This will offer more opportunities to figure from a
core what happened, and will give us more freedom regarding the value of
the thread_dump_buffer itself. For example, even at the end of the dump
when the pointer is reset, the last used buffer is now preserved.
2025-04-17 16:25:47 +02:00
Willy Tarreau
fb7874c286 MINOR: tinfo: split the signal handler report flags into 3
While signals are not recursive, one signal (e.g. wdt) may interrupt
another one (e.g. debug). The problem this causes is that when leaving
the inner handler, it removes the outer's flag, hence the protection
that comes with it. Let's just have 3 distinct flags for regular signals,
debug signal and watchdog signal. We add a 4th definition which is an
aggregate of the 3 to ease testing.
2025-02-24 13:37:52 +01:00
Willy Tarreau
ddd173355c MINOR: tinfo: add a new thread flag to indicate a call from a sig handler
Signal handlers must absolutely not change anything, but some long and
complex call chains may look innocuous at first glance, yet result in
some subtle write accesses (e.g. pools) that can conflict with a running
thread being interrupted.

Let's add a new thread flag TH_FL_IN_SIG_HANDLER that is only set when
entering a signal handler and cleared when leaving them. Note, we're
speaking about real signal handlers (synchronous ones), not deferred
ones. This will allow some sensitive call places to act differently
when detecting such a condition, and possibly even to place a few new
BUG_ON().
2025-02-21 17:41:38 +01:00
Willy Tarreau
7ddcdff33f BUG/MEDIUM: debug: close a possible race between thread dump and panic()
The rework of the thread dumping mechanism in 2.8 with commit 9a6ecbd590
("MEDIUM: debug: simplify the thread dump mechanism") opened a small
race, which is that a thread in the process of dumping other ones may
block the other one from panicing while it's looping at the end of
ha_thread_dump_fill(), or any other sequence involving the currently
dumped one.

This was emphasized in 3.1 with commit 148eb5875f ("DEBUG: wdt: better
detect apparently locked up threads and warn about them") that allowed
to emit warnings about long-stuck threads, because in this case, what
happens is that sometimes a thread starts to emit a warning (or a set
of warnings), and while the warning is being awaited for, a panic
finally happens and interrupts either the dumping thread, which never
finishes and waits for the target's pointer to become NULL which will
never happen since it was supposed to do it itself, or the currently
dumped thread which could wait for the dumping thread to become ready
while this one has not released the former.

In order to address this, first we now make sure never to dump a thread
that is already in the process of dumping another one. We're adding a
new thread flag to know this situation, that is set in ha_thread_dump_fill()
and cleared in ha_thread_dump_done(). And similarly, we don't trigger
the watchdog on a thread waiting for another one to finish its dump,
as it's likely a case of warning (and maybe even a panic) that makes
them wait for each other and we don't want such cases to be reentrant.
Finally, we check in the main polling loop that the flag never accidentally
leaked (e.g. wrong flag manipulation) as this would be difficult to spot
with bad consequences.

This should be backported at least to 2.8, and should resolve github
issue #2860. Thanks to Chris Staite for the very informative backtrace
that exhibited the problem.
2025-02-10 18:34:26 +01:00
Amaury Denoyelle
9d155ca706 MINOR: trace: implement tracing disabling API
Define a set of functions to temporarily disable/reactivate tracing for
the current thread. This could be useful when wanting to quickly remove
tracing output for some code parts.

The API relies on a disable/resume set of functions, with a thread-local
counter. This counter is tested under __trace_enabled(). It is a
cumulative value so that the same count of resume must be issued after
several disable usage. There is also the possibility to force reset the
counter to 0 before restoring the old value.

This should be backported up to 3.1.
2024-12-18 09:52:06 +01:00
Willy Tarreau
12969c1b17 MINOR: tinfo/clock: turn sched_call_date to 64-bits
We used to store it in 32-bits since we'd only use it for latency and CPU
usage calculation but usages will evolve so let's not truncate the value
anymore. Now we store the full 64 bits. Note that this doesn't even
increase the storage size due to alignment. The 3 usage places were
verified to still be valid (most were already cast to 32 bits anyway).
2024-11-19 20:13:41 +01:00
Willy Tarreau
973c81ceec CLEANUP: tinfo: move sched_*_date/*_mono_time to the thread-local area
These ones are never atomically accessed, they have nothing to do in
the atomic ops cache line, let's move them to the thread-local area.
2024-11-19 20:13:41 +01:00
Willy Tarreau
278b9613a3 MEDIUM: debug: on panic, make the target thread automatically allocate its buf
One main problem with panic dumps is that they're filling the dumping
thread's trash, and that the global thread_dump_buffer is too small to
catch enough of them.

Here we're proceeding differently. When dumping threads for a panic, we're
passing the magic value 0x2 as the buffer, and it will instruct the target
thread to allocate its own buffer using get_trash_chunk() (which is signal
safe), so that each thread dumps into its own buffer. Then the thread will
wait for the buffer to be consumed, and will assign its own thread_dump_buffer
to it. This way we can simply dump all threads' buffers from gdb like this:

  (gdb) set $t=0
        while ($t < global.nbthread)
          printf "%s\n", ha_thread_ctx[$t].thread_dump_buffer.area
          set $t=$t+1
        end

For now we make it wait forever since it's only called on panic and we
want to make sure the thread doesn't leave and continues to use that trash
buffer or do other nasty stuff. That way the dumping thread will make all
of them die.

This would be useful to backport to the most recent branches to help
troubleshooting. It backports well to 2.9, except for some trivial
context in tinfo-t.h for an updated comment. 2.8 and older would also
require TAINTED_PANIC. The following previous patches are required:

   MINOR: debug: make mark_tainted() return the previous value
   MINOR: chunk: drop the global thread_dump_buffer
   MINOR: debug: split ha_thread_dump() in two parts
   MINOR: debug: slightly change the thread_dump_pointer signification
   MINOR: debug: make ha_thread_dump_done() take the pointer to be used
   MINOR: debug: replace ha_thread_dump() with its two components
2024-10-19 16:01:52 +02:00
Christopher Faulet
273d322b6f MINOR: stream/stats: Expose the total number of streams ever created in stats
A shared counter is added in the thread context to track the total number of
streams created on the thread. This number is then reported in stats. It
will be a useful information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Christopher Faulet
18ee22ff76 MINOR: stream/stats: Expose the current number of streams in stats
A shared counter is added in the thread context to track the current number
of streams. This number is then reported in stats. It will be a useful
information to diagnose some bugs.
2024-09-30 16:55:53 +02:00
Willy Tarreau
f793845f4a MEDIUM: clock: collect the monotonic time in clock_local_update_date()
Now we collect this clock in clock_local_update_date(), the closest from
the poller, which is also used when busy-polling, and the values is set
into the thread's curr_mono_time which did not exist before. Later,
clock_leaving_poll() just sets the prev_mono_time value from the curr_
one instead of retrieving the time at this specific point. It also means
that the monotonic time will now also cover the time needed to update
the global time, which should be negligible. Note that we don't collect
the CPU time in the clock_local_update_date() function even though it's
tempting, because when doing busy-polling, it would be collected on each
round while being useless.

Doing so will make sure that the local time always knows the monotonic
time when it is available.
2024-09-17 09:08:10 +02:00
Willy Tarreau
0ce51dc93b MEDIUM: dynbuf: implement emergency buffers
The buffer reserve set by tune.buffers.reserve has long been unused, and
in order to deal gracefully with failed memory allocations we'll need to
resort to a few emergency buffers that are pre-allocated per thread.

These buffers are only for emergency use, so every time their count is
below the configured number a b_free() will refill them. For this reason
their count can remain pretty low. We changed the default number from 2
to 4 per thread, and the minimum value is now zero (e.g. for low-memory
systems). The tune.buffers.limit setting has always been a problem when
trying to deal with the reserve but now we could simplify it by simply
pushing the limit (if set) to match the reserve. That was already done in
the past with a static value, but now with threads it was a bit trickier,
which is why the per-thread allocators increment the limit on the fly
before allocating their own buffers. This also means that the configured
limit is saner and now corresponds to the regular buffers that can be
allocated on top of emergency buffers.

At the moment these emergency buffers are not used upon allocation
failure. The only reason is to ease bisecting later if needed, since
this commit only has to deal with resource management.
2024-05-10 17:18:13 +02:00
Willy Tarreau
53461e4d94 CLEANUP: tinfo: better align fields in thread_ctx
The introduction of buffer_wq[] in thread_ctx pushed a few fields around
and the cache line alignment is less satisfying. And more importantly, even
before this, all the lists in the local parts were 8-aligned, with the first
one split across two cache lines.

We can do better:
  - sched_profile_entry is not atomic at all, the data it points to is
    atomic so it doesn't need to be in the atomic-only region, and it can
    fill the 8-hole before the lists
  - the align(2*void) that was only before tasklets[] moves before all
    lists (and it's a nop for now)

This now makes the lists and buffer_wq[] start on a cache line boundary,
leaves 48 bytes after the lists before the atomic-only cache line, and
leaves a full cache line at the end for 128-alignment. This way we still
have plenty of room in both parts with better aligned fields.
2024-05-10 17:18:13 +02:00
Willy Tarreau
a5d6a79986 MEDIUM: dynbuf: make the buffer_wq an array of list heads
Let's turn the buffer_wq into an array of 4 list heads. These are chosen
by criticality. The DB_CRIT_TO_QUEUE() macro maps each criticality level
into one of these 4 queues. The goal here clearly is to make it possible
to wake up the most critical queues in priority in order to let some tasks
finish their job and release buffers that others can use.

In order to avoid having to look up all queues, a bit map indicates which
queues are in use, which also allows to avoid looping in the most common
case where queues are empty..
2024-05-10 17:18:13 +02:00
Willy Tarreau
6c1b29d06f MINOR: ring: make the number of queues configurable
Now the rings have one wait queue per group. This should limit the
contention on systems such as EPYC CPUs where the performance drops
dramatically when using more than one CCX.

Tests were run with different numbers and it was showed that value
6 outperforms all other ones at 12, 24, 48, 64 and 80 threads on an
EPYC, a Xeon and an Ampere CPU. Value 7 sometimes comes close and
anything around these values degrades quickly. The value has been
left tunable in the global section.

This commit only introduces everything needed to set up the queue count
so that it's easier to adjust it in the forthcoming patches, but it was
initially added after the series, making it harder to compare.

It was also shown that trying to group the threads in queues by their
thread groups is counter-productive and that it was more efficient to
do that by applying a modulo on the thread number. As surprising as it
seems, it does have the benefit of well balancing any number of threads.
2024-03-25 17:34:19 +00:00
Amaury Denoyelle
a3187fe06c MINOR: rhttp: add count of active conns per thread
Add a new member <nb_rhttp_conns> in thread_ctx structure. Its purpose
is to count the current number of opened reverse HTTP connections
regarding from their listeners membership.

This patch will be useful to support multi-thread for active reverse
HTTP, in order to select the less loaded thread.

Note that despite access to <nb_rhttp_conns> are only done by the
current thread, atomic operations are used. This is because once
multi-thread support will be added, external threads will also retrieve
values from others.
2023-11-23 17:43:01 +01:00
Willy Tarreau
2268f10dd6 DEBUG: tinfo: store the pthread ID and the stack pointer in tinfo
When debugging a core, it's difficult to match a given gdb thread number
against an internal thread. Let's just store the pthread ID and the stack
pointer in each tinfo. This could help in the future by allowing to just
glance over them and pick the right one depending what info is found
first.
2023-11-23 14:32:55 +01:00
Willy Tarreau
844a3bc25b MEDIUM: checks: implement a queue in order to limit concurrent checks
The progressive adoption of OpenSSL 3 and its abysmal handshake
performance has started to reveal situations where it simply isn't
possible anymore to succesfully run health checks on many servers,
because between the moment all the checks are started and the moment
the handshake finally completes, the timeout has expired!

This also has consequences on production traffic which gets
significantly delayed as well, all that for lots of checks. While it's
possible to increase the check delays, it doesn't solve everything as
checks still take a huge amount of time to converge in such conditions.

Here we take a different approach by permitting to enforce the maximum
concurrent checks per thread limitation and implementing an ordered
queue. Thanks to this, if a thread about to start a check has reached
its limit, it will add the check at the end of a queue and it will be
processed once another check is finished. This proves to be extremely
efficient, with all checks completing in a reasonable amount of time
and not being disturbed by the rest of the traffic from other checks.
They're just cycling slower, but at the speed the machine can handle.

One must understand however that if some complex checks perform multiple
exchanges, they will take a check slot for all the required duration.
This is why the limit is not enforced by default.

Tests on SSL show that a limit of 5-50 checks per thread on local
servers gives excellent results already, so that could be a good starting
point.
2023-09-01 14:00:04 +02:00
Willy Tarreau
00de9e0804 MINOR: checks: maintain counters of active checks per thread
Let's keep two check counters per thread:
  - one for "active" checks, i.e. checks that are no more sleeping
    and are assigned to the thread. These include sleeping and
    running checks ;

  - one for "running" checks, i.e. those which are currently
    executing on the thread.

By doing so, we'll be able to spread the health checks load a bit better
and refrain from sending too many at once per thread. The counters are
atomic since a migration increments the target thread's active counter.
These numbers are reported in "show activity", which allows to check
per thread and globally how many checks are currently pending and running
on the system.

Ideally, we should only consider checks in the process of establishing
a connection since that's really the expensive part (particularly with
OpenSSL 3.0). But the inner layers are really not suitable to doing
this. However knowing the number of active checks is already a good
enough hint.
2023-09-01 08:26:06 +02:00
Willy Tarreau
9a6ecbd590 MEDIUM: debug: simplify the thread dump mechanism
The thread dump mechanism that is used by "show threads" and by the
panic dump is overly complicated due to an initial misdesign. It
firsts wakes all threads, then serializes their dumps, then releases
them, while taking extreme care not to face colliding dumps. In fact
this is not what we need and it reached a limit where big machines
cannot dump all their threads anymore due to buffer size limitations.

What is needed instead is to be able to dump *one* thread, and to let
the requester iterate on all threads.

That's what this patch does. It adds the thread_dump_buffer to the
struct thread_ctx so that the requester offers the buffer to the
thread that is about to be dumped. This buffer also serves as a lock.
A thread at rest has a NULL, a valid pointer indicates the thread is
using it, and 0x1 (NULL+1) is used by the dumped thread to tell the
requester it's done. This makes sure that a given thread is dumped
once at a time. In addition to this, the calling thread decides
whether it accesses the thread by itself or via the debug signal
handler, in order to get a backtrace. This is much saner because the
calling thread is free to do whatever it wants with the buffer after
each thread is dumped, and there is no dependency between threads,
once they've dumped, they're free to continue (and possibly to dump
for another requester if needed). Finally, when the THREAD_DUMP
feature is disabled and the debug signal is not used, the requester
accesses the thread by itself like before.

For now we still have the buffer size limitation but it will be
addressed in future patches.
2023-05-04 19:15:44 +02:00
Willy Tarreau
97da942ba6 MINOR: thread: keep a bitmask of enabled groups in thread_set
We're only checking for 0, 1, or >1 groups enabled there, and we'll soon
need to be more precise and know quickly which groups are non-empty.
Let's just replace the count with a mask of enabled groups. This will
allow to quickly spot the presence of any such group in a set.
2023-04-13 16:57:51 +02:00
Amaury Denoyelle
efed86c973 MINOR: quic: create a global list dedicated for closing QUIC conns
When a CONNECTION_CLOSE is emitted or received, a QUIC connection enters
respectively in draining or closing state. These states are a loose
equivalent of TCP TIME_WAIT. No data can be exchanged anymore but the
connection is maintained during a certain timer to handle packet
reordering or loss.

A new global list has been defined for QUIC connections in
closing/draining state inside thread_ctx structure. Each time a
connection enters in one of this state, it will be moved from the
default global list to the new closing list.

The objective of this patch is to quickly filter connections on
closing/draining. Most notably, this will be used to wake up these
connections and avoid that haproxy process stopping is delayed by them.

A dedicated function qc_detach_th_ctx_list() has been implemented to
transfer a quic-conn from one list instance to the other. This takes
care of back-references attach to a quic-conn instance in case of a
running "show quic".

This should be backported up to 2.7.
2023-03-08 14:39:48 +01:00
Willy Tarreau
3e820a1056 MINOR: threads: add flags to know if a thread is started and/or running
Several times during debugging it has been difficult to find a way to
reliably indicate if a thread had been started and if it was still
running. It's really not easy because the elements we look at are not
necessarily reliable (e.g. harmless bit or idle bit might not reflect
what we think during a signal). And such notions can be subjective
anyway.

Here we define two thread flags, TH_FL_STARTED which is set as soon as
a thread enters run_thread_poll_loop() and drops the idle bit, and
another one, TH_FL_IN_LOOP, which is set when entering run_poll_loop()
and cleared when leaving it. This should help init/deinit code know
whether it's called from a non-initialized thread (i.e. tid must not
be trusted), or shared functions know if they're being called from a
running thread or from init/deinit code outside of the polling loop.
2023-02-17 16:01:34 +01:00
Amaury Denoyelle
15c74702d5 MINOR: quic: implement a basic "show quic" CLI handler
Implement a basic "show quic" CLI handler. This command will be useful
to display various information on all the active QUIC frontend
connections.

This work is heavily inspired by "show sess". Most notably, a global
list of quic_conn has been introduced to be able to loop over them. This
list is stored per thread in ha_thread_ctx.

Also add three CLI handlers for "show quic" in order to allocate and
free the command context. The dump handler runs on thread isolation.
Each quic_conn is referenced using a back-ref to handle deletion during
handler yielding.

For the moment, only a list of raw quic_conn pointers is displayed. The
handler will be completed over time with more information as needed.

This should be backported up to 2.7.
2023-02-09 18:11:00 +01:00
Willy Tarreau
bef43dfa60 MINOR: thread: add a simple thread_set API
The purpose is to be able to store large thread sets, defined by ranges
that may cross group boundaries, as well as define lists of groups and
masks. The thread_set struct implements the storage, and the parser is
in parse_thread_set(), with a focus on "bind" lines, but not only.
2023-02-03 18:00:21 +01:00
Willy Tarreau
6be8d09a61 OPTIM: global: move byte counts out of global and per-thread
During multiple tests we've already noticed that shared stats counters
have become a real bottleneck under large thread counts. With QUIC it's
pretty visible, with qc_snd_buf() taking 2.5% of the CPU on a 48-thread
machine at only 25 Gbps, and this CPU is entirely spent in the atomic
increment of the byte count and byte rate. It's also visible in H1/H2
but slightly less since we're working with larger buffers, hence less
frequent updates. These counters are exclusively used to report the
byte count in "show info" and the byte rate in the stats.

Let's move them to the thread_ctx struct and make the stats reader
just collect each thread's stats when requested. That's way more
efficient than competing on a single cache line.

After this, qc_snd_buf has totally disappeared from the perf profile
and tests made in h1 show roughly 1% performance increase on small
objects.
2023-01-12 16:37:45 +01:00
Willy Tarreau
fc50b9dd14 BUG/MAJOR: sched: protect task during removal from wait queue
The issue addressed by commit fbb934da9 ("BUG/MEDIUM: stick-table: fix
a race condition when updating the expiration task") is still present
when thread groups are enabled, but this time it lies in the scheduler.

What happens is that a task configured to run anywhere might already
have been queued into one group's wait queue. When updating a stick
table entry, sometimes the task will have to be dequeued and requeued.

For this a lock is taken on the current thread group's wait queue lock,
but while this is necessary for the queuing, it's not sufficient for
dequeuing since another thread might be in the process of expiring this
task under its own group's lock which is different. This is easy to test
using 3 stick tables with 1ms expiration, 3 track-sc rules and 4 thread
groups. The process crashes almost instantly under heavy traffic.

One approach could consist in storing the group number the task was
queued under in its descriptor (we don't need 32 bits to store the
thread id, it's possible to use one short for the tid and another
one for the tgrp). Sadly, no safe way to do this was figured, because
the race remains at the moment the thread group number is checked, as
it might be in the process of being changed by another thread. It seems
that a working approach could consist in always having it associated
with one group, and only allowing to change it under this group's lock,
so that any code trying to change it would have to iterately read it
and lock its group until the value matches, confirming it really holds
the correct lock. But this seems a bit complicated, particularly with
wait_expired_tasks() which already uses upgradable locks to switch from
read state to a write state.

Given that the shared tasks are not that common (stick-table expirations,
rate-limited listeners, maybe resolvers), it doesn't seem worth the extra
complexity for now. This patch takes a simpler and safer approach
consisting in switching back to a single wq_lock, but still keeping
separate wait queues. Given that shared wait queues are almost always
empty and that otherwise they're scanned under a read lock, the
contention remains manageable and most of the time the lock doesn't
even need to be taken since such tasks are not present in a group's
queue. In essence, this patch reverts half of the aforementionned
patch. This was tested and confirmed to work fine, without observing
any performance degradation under any workload. The performance with
8 groups on an EPYC 74F3 and 3 tables remains twice the one of a
single group, with the contention remaining on the table's lock first.

No backport is needed.
2022-11-22 09:10:08 +01:00
Willy Tarreau
1efddfa6bf MINOR: sched: store the current profile entry in the thread context
The profile entry that corresponds to the current task/tasklet being
profiled is now stored into the thread's context. This will allow it
to be accessed from the tasks themselves. This is needed for an upcoming
fix.
2022-09-08 14:19:15 +02:00