haproxy/include
Willy Tarreau 4a9ec66fd8 MEDIUM: tools: switch the main PRNG to a thread-local xoshiro256**
The current PRNG is xoroshiro128**, it was introduced in 2.2 with
commit 52bf83939 ("BUG/MEDIUM: random: implement a thread-safe and
process-safe PRNG").  It features a 2^128 sequence and can perform
2^64 or 2^96 jumps, though only the 2^96 jump is implemented. It
was initially designed to support both processes and threads, and
implements a shared state between threads instead of allocating
distinct sequences based on PID and thread numbers.

Since then, the PRNG's usage grew and processes have disappeared,
but the lock or the DWCAS are still there due to its shared nature,
and it's possible to trigger watchdog warnings by issuing 100 UUIDs
in a single log-format string.

Also, UUID and QUIC retry tokens now consume 128 bits from the PRNG
in two 64-bit calls, and used to weaken the PRNG by rapidly disclosing
its internal state on reasonably idle systems. This indicates that
most of the time we now need 128 bits.

This patch modernizes the internal generator by switching to xoshiro256**,
which has comparable properties (it's even faster), and features even
longer 2^256 periods, still returning 64 bits per call. It can be
initialized with 2^128 and 2^192 jumps. More details here:

   https://prng.di.unimi.it/
   https://prng.di.unimi.it/xoshiro256starstar.c

Here we implement a thread-local state instead of the old shared one,
so there is no more need for synchronization. The state is seeded at
boot, and each thread performs as many 2^192 jumps as their TID is
large. The master process performs a 2^128 jump where it used to
perform a 2^96 jump so that it doesn't overlap with any worker thread.
However a cleaner approach could be to perform a 2^128 jump for each
fork() (here the worker) and 2^192 for each thread. This might be for
a future improvement.

ha_random64_internal() is now the new PRNG, so that everything else
remains totally transparent. _ha_random64_pair_hashed() continues to
hash the first 128 bits of the state.

A simple config generating 100 UUID on 20 threads jumps from 135k to
1.25M req/s, which translates to a bump from 13.5M to 125M UUID/s,
or 9 times faster. And there is no more DWCAS can be seen anymore
in perf top:

Before: 13.5M/s
Overhead  Shared Object            Symbol
  99.04%  haproxy       [.] ha_random64_internal
   0.66%  haproxy       [.] _ha_random64_pair_hashed
   0.03%  libc-2.42.so  [.] __printf_buffer
   0.02%  [kernel]      [k] _raw_spin_lock
   0.01%  libc-2.42.so  [.] __strchrnul_avx2
   0.01%  [kernel]      [k] ktime_get
   0.01%  [kernel]      [k] lapic_next_deadline
   0.01%  haproxy       [.] sample_process
   0.01%  haproxy       [.] chunk_printf
   0.01%  libc-2.42.so  [.] __printf_buffer_write
   0.01%  [kernel]      [k] hrtimer_active
   0.01%  libc-2.42.so  [.] __memmove_avx_unaligned_erms
   0.01%  libc-2.42.so  [.] _itoa_word

After: 125M/s
  18.84%  libc-2.42.so      [.] __printf_buffer
   9.84%  haproxy           [.] sample_process
   8.33%  libc-2.42.so      [.] __strchrnul_avx2
   6.61%  libc-2.42.so      [.] __memmove_avx_unaligned_erms
   6.06%  libc-2.42.so      [.] __printf_buffer_write
   4.43%  haproxy           [.] strlcpy2
   4.09%  libc-2.42.so      [.] _itoa_word
   2.62%  haproxy           [.] sess_build_logline_orig
   2.12%  haproxy           [.] _ha_random64_pair_hashed
   1.28%  haproxy           [.] pool_put_to_cache
   1.06%  haproxy           [.] __pool_alloc
   1.00%  haproxy           [.] smp_fetch_uuid
   0.93%  haproxy           [.] lf_text_len
   0.82%  haproxy           [.] ha_generate_uuid_v4
2026-05-26 13:13:24 +02:00
..
haproxy MEDIUM: tools: switch the main PRNG to a thread-local xoshiro256** 2026-05-26 13:13:24 +02:00
import MINOR: mjson: reintroduce mjson_next() 2026-04-14 10:57:21 +02:00
make BUILD: makefile: add a qinfo macro to pass info in quiet mode 2025-01-08 11:26:05 +01:00