The fix in 4a9e3e102e ("BUG/MINOR: haproxy: only tid 0 must not sleep
if got signal") had the nasty side effect of breaking the graceful
reload operations: threads whose id is non-zero could quit too early and
not process incoming traffic, which is visible with broken connections
during reloads. They just need to ignore the the stopping condition
until the signal queue is empty. In any case, it's the thread in charge
of the signal queue which will notify them once it receives the signal.
It was verified that connections are no longer broken with this fix,
and that the issue that required it (#2537, looping threads on reload)
does not re-appear with the reproducer, while it still did without the
fix above. Since the fix above was backported to every stable version,
this one will also have to.
Fix read return value unused result.
src/haproxy.c: In function ‘main’:
src/haproxy.c:3630:17: error: ignoring return value of ‘read’ declared with attribute ‘warn_unused_result’ [-Werror=unused-result]
3630 | read(sock_pair[1], &c, 1);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Must be backported where d7f6819 is backported.
Since the mworker rework in haproxy 3.1, the worker need to tell the
master that it is ready. This is done using the sockpair protocol by
sending a _send_status message to the master.
It seems that the sockpair protocol is buggy on macOS because of a known
issue around fd transfer documented in sendmsg(2):
https://man.freebsd.org/cgi/man.cgi?sendmsg(2) BUGS section
Because sendmsg() does not necessarily block until the data has been
transferred, it is possible to transfer an open file descriptor across
an AF_UNIX domain socket (see recv(2)), then close() it before it has
actually been sent, the result being that the receiver gets a closed
file descriptor. It is left to the application to implement an
acknowledgment mechanism to prevent this from happening.
Indeed the recv side of the sockpair is closed on the send side just
after the send_fd_uxst(), which does not implement an acknowledgment
mechanism. So the master might never recv the _send_status message.
In order to implement an acknowledgment mechanism, a blocking read() is
done before closing the recv fd on the sending side, so we are sure that
the message was read on the other side.
This was only reproduced on macOS, meaning the master CLI is also
impacted on macOS. But no solution was found on macOS for it.
Implementing an acknowledgment mechanism would complexify too much the
protocol in non-blocking mode.
The problem was reported in ticket #3045, reproduced and analyzed by
@cognet.
Must be backported as far as 3.1.
When pre-check and post-check postparsing hooks= are evaluated in
step_init_2() potential fatal errors are ignored during the iteration
and are only taken into account at the end of the loop. This is not ideal
because some errors (ie: memory errors) could cause multiple alert
messages in a row, which could make troubleshooting harder for the user.
Let's stop as soon as a fatal error is encountered for post parsing
hooks, as we use to do everywhere else.
Add a new global option, "noktls", as well as a command line option,
"-dT", to totally disable ktls usage, even if it is activated on servers
or binds in the configuration.
That makes it easier to quickly figure out if a problem is related to
ktls or not.
The random seed used in ha_random functions needs to be first
initialized by calling ha_random_boot. This function was called rather
late in the init process, after the init functions (INITCALLS) are
called and after the configuration parsing for instance which means that
any ha_random call in an init function would return 0. This was the case
in 'vars_init' and 'cache_init' which tried to build seeds for specific
hash calculations but ended up not being seeded.
This patch can be backported on all stable branches.
Between 3.2 and 3.3-dev we noticed a noticeable performance regression
due to stats handling. After bisecting, Willy found out that recent
work to split stats computing accross multiple thread groups (stats
sharding) was responsible for that performance regression. We're looking
at roughly 20% performance loss.
More precisely, it is the added indirections, multiplied by the number
of statistics that are updated for each request, which in the end causes
a significant amount of time being spent resolving pointers.
We noticed that the fe_counters_shared and be_counters_shared structures
which are currently allocated in dedicated memory since a0dcab5c
("MAJOR: counters: add shared counters base infrastructure")
are no longer huge since 16eb0fab31 ("MAJOR: counters: dispatch counters
over thread groups") because they now essentially hold flags plus the
per-thread group id pointer mapping, not the counters themselves.
As such we decided to try merging fe_counters_shared and
be_counters_shared in their parent structures. The cost is slight memory
overhead for the parent structure, but it allows to get rid of one
pointer indirection. This patch alone yields visible performance gains
and almost restores 3.2 stats performance.
counters_fe_shared_get() was renamed to counters_fe_shared_prepare() and
now returns either failure or success instead of a pointer because we
don't need to retrieve a shared pointer anymore, the function takes care
of initializing existing pointer.
This patch removes completely the support for the program section, the
parsing of the section as well as the internals in the mworker does not
support it anymore.
The program section was considered dysfonctional and not fully
compatible with the "mworker V3" model. Users that want to run an
external program must use their init system.
The documentation is cleaned up in another patch.
Most fe and be counters are good candidates for being shared between
processes. They are now grouped inside "shared" struct sub member under
be_counters and fe_counters.
Now they are properly identified, they would greatly benefit from being
shared over thread groups to reduce the cost of atomic operations when
updating them. For this, we take the current tgid into account so each
thread group only updates its own counters. For this to work, it is
mandatory that the "shared" member from {fe,be}_counters is initialized
AFTER global.nbtgroups is known, because each shared counter causes the stat
to be allocated lobal.nbtgroups times. When updating a counter without
concurrency, the first counter from the array may be updated.
To consult the shared counters (which requires aggregation of per-tgid
individual counters), some helper functions were added to counter.h to
ease code maintenance and avoid computing errors.
Shareable counters are not tagged as shared counters and are dynamically
allocated in separate memory area as a prerequisite for being stored
in shared memory area. For now, GUID and threads groups are not taken into
account, this is only a first step.
also we ensure all counters are now manipulated using atomic operations,
namely, "last_change" counter is now read from and written to using atomic
ops.
Despite the numerous changes caused by the counters being moved away from
counters struct, no change of behavior should be expected.
REGISTER_POST_PROXY_CHECK() used to iterate over "main" proxies to run
registered callbacks. This means hidden proxies (and their servers) did
not get a chance to get post-checked and could cause issues if some post-
checks are expected to be executed on all proxies no matter their type.
Instead we now rely on the global proxies list. Another side effect is that
the REGISTER_POST_SERVER_CHECK() now runs as well for servers from proxies
that are not part of the main proxies list.
It was mentioned during the development of glitches that it would be
nice to support not killing misbehaving connections below a certain
CPU usage so that poor implementations that routinely misbehave without
impact are not killed. This is now possible by setting a CPU usage
threshold under which we don't kill them via this parameter. It defaults
to zero so that we continue to kill them by default.
When thread support is disabled ("USE_THREAD=" or "USE_THREAD=0" when
building), soft-stop doesn't work as haproxy never ends after stopping
the proxies.
This used to work fine in the past but suddenly stopped working with
ef422ced91 ("MEDIUM: thread: make stopping_threads per-group and add
stopping_tgroups") because the "break;" instruction under the stopping
condition is never executed when support for multithreading is disabled.
To fix the issue, let's add an "else" block to run the "break;"
instruction when USE_THREAD is not defined.
It should be backported up to 2.8
Define a new settings tune.quic.frontend.max-tot-window. It contains a
size argument which can be used to set a limit on the sum of all QUIC
connections congestion window. This is applied both on
quic_cc_path_set() and quic_cc_path_inc().
Note that this limitation cannot reduce a congestion window more than
the minimal limit which is set to 2 datagrams.
In order to ease troubleshooting and testing, the new "-4" command line
argument enforces queries and processing of "A" DNS records only, i.e.
those representing IPv4 addresses. This can be useful when a host lack
end-to-end dual-stack connectivity. This overrides the global
"dns-accept-family" directive and is equivalent to value "ipv4".
TH_FL_DUMPING_OTHERS was being used to try to perform exclusion between
threads running "show threads" and those producing warnings. Now that it
is much more cleanly handled, we don't need that type of protection
anymore, which was adding to the complexity of the solution. Let's just
get rid of it.
Define a new global configuration tune.quic.frontend.max-data. This
allows users to explicitely set the value for the corresponding QUIC TP
initial-max-data, with direct impact on haproxy memory consumption.
A new structure quic_tune has recently been defined. Its purpose is to
store global options related to QUIC. Previously, only the tunable to
toggle pacing was stored in it.
This commit moves several QUIC related tunable from global to quic_tune
structure. This better centralizes QUIC configuration option and gives
room for future generic options.
It's important that we don't leave unassigned IDs in the topology,
because the selection mechanism is based on index-based masks, so an
unassigned ID will never be kept. This is particularly visible on
systems where we cannot access the CPU topology, the package id, node id
and even thread id are set to -1, and all CPUs are evicted due to -1 not
being set in the "only-cpu" sets.
Here in new function "cpu_fixup_topology()", we assign them with the
smallest unassigned value. This function will be used to assign IDs
where missing in general.
Due to the previous commit we can end up with cores not assigned
any cluster ID. For this, at the end we sort the CPUs by topology
and assign cluster IDs to remaining CPUs based on pkg/node/llc.
For example an 14900 now shows 5 clusters, one for the 8 p-cores,
and 4 of 4 e-cores each.
The local cluster numbers are per (node,pkg) ID so that any rule could
easily be applied on them, but we also keep the global numbers that
will help with thread group assignment.
We still need to force to assign distinct cluster IDs to cores
running on a different L3. For example the EPYC 74F3 is reported
as having 8 different L3s (which is true) and only one cluster.
Here we introduce a new function "cpu_compose_clusters()" that is called
from the main init code just after cpu_detect_topology() so that it's
not OS-dependent. It deals with this renumbering of all clusters in
topology order, taking care of considering any distinct LLC as being
on a distinct cluster.
By mutually refining the thread count and group count, we can try
to detect the most suitable setup for the current machine. Taskset
is implicitly handled correctly. tgroups automatically adapt to the
configured number of threads. cpu-map manages to limit tgroups to
the smallest supported value.
The thread-limit is enforced. Just like in cfgparse, if the thread
count was forced to a higher value, it's reduced and a warning is
emitted. But if it was not set, the thr_max value is bound to this
limit so that further calculations respect it.
We continue to default to the max number of available threads and 1
tgroup by default, with the limit. This normally allows to get rid
of that test in check_config_validity().
During development, everything related to CPU binding and the CPU topology
is debugged using state dumps at various places, but it does make sense to
have a real command line option so that this remains usable in production
to help users figure why some CPUs are not used by default. Let's add
"-dc" for this. Since the list of global.tune.options values is almost
full and does not 100% match this option, let's add a new "tune.debug"
field for this.
This uses the publicly available information from /sys to figure the cache
and package arrangements between logical CPUs and fill ha_cpu_topo[], as
well as their SMT capabilities and relative capacity for those which expose
this. The functions clearly have to be OS-specific.
Now before trying to resolve the thread assignment to groups, we detect
which CPUs are not bound at boot so that we can mark them with
HA_CPU_F_EXCLUDED. This will be useful to better know on which CPUs we
can count later. Note that we purposely ignore cpu-map here as we
don't know how threads and groups will map to cpu-map entries, hence
which CPUs will really be used.
It's important to proceed this way so that when we have no info we
assume they're all available.
CAP_SYS_ADMIN support was added, in order to access sockets in namespaces. So
let's adjust the alert at startup, where we check preserved capabilities from
global.last_checks. Let's mention here cap_sys_admin as well.
In check_config_validity() we evaluate some sample fetch expressions
(log-format, server rules, etc). These expressions may use external files like
maps.
If some particular 'default-path' was set in the global section before, it's no
longer applied to resolve file pathes in check_config_validity(). parse_cfg()
at the end of config parsing switches back to the initial cwd.
This fixes the issue #2886.
This patch should be backported in all stable versions since 2.4.0, including
2.4.0.
Doing unit tests with haproxy was always a bit difficult, some of the
function you want to test would depend on the buffer or trash buffer
initialisation of HAProxy, so building a separate main() for them is
quite hard.
This patch adds a way to register a function that can be called with the
"-U" parameter on the command line, will be executed just after
step_init_1() and will exit the process with its return value as an exit
code.
When using the -U option, every keywords after this option is passed to
the callback and could be used as a parameter, letting the capability to
handle complex arguments if required by the test.
HAProxy need to be built with DEBUG_UNIT to activate this feature.
In patch 2fe4cbd8e ("MINOR: startup: allow hap_register_feature() to
enable a feature in the list"), the ability to overwrite a '-' in the
feature list was added. However the code was not tokenizing correctly
the string, and partial feature name found in the name could result in
having the same feature name multiple time.
This patch rewrites the lookup of the string by tokenizing it correctly.
This patch allows hap_register_feature() to enable a feature in the list
which was already registered and marked disabled.
This way we could enable automatically some features under certain
condition without the need of the USE argument with make and correctly
report its activation.
The rework of the thread dumping mechanism in 2.8 with commit 9a6ecbd590
("MEDIUM: debug: simplify the thread dump mechanism") opened a small
race, which is that a thread in the process of dumping other ones may
block the other one from panicing while it's looping at the end of
ha_thread_dump_fill(), or any other sequence involving the currently
dumped one.
This was emphasized in 3.1 with commit 148eb5875f ("DEBUG: wdt: better
detect apparently locked up threads and warn about them") that allowed
to emit warnings about long-stuck threads, because in this case, what
happens is that sometimes a thread starts to emit a warning (or a set
of warnings), and while the warning is being awaited for, a panic
finally happens and interrupts either the dumping thread, which never
finishes and waits for the target's pointer to become NULL which will
never happen since it was supposed to do it itself, or the currently
dumped thread which could wait for the dumping thread to become ready
while this one has not released the former.
In order to address this, first we now make sure never to dump a thread
that is already in the process of dumping another one. We're adding a
new thread flag to know this situation, that is set in ha_thread_dump_fill()
and cleared in ha_thread_dump_done(). And similarly, we don't trigger
the watchdog on a thread waiting for another one to finish its dump,
as it's likely a case of warning (and maybe even a panic) that makes
them wait for each other and we don't want such cases to be reentrant.
Finally, we check in the main polling loop that the flag never accidentally
leaked (e.g. wrong flag manipulation) as this would be difficult to spot
with bad consequences.
This should be backported at least to 2.8, and should resolve github
issue #2860. Thanks to Chris Staite for the very informative backtrace
that exhibited the problem.
It is not rare to see configurations with a large number of "tcp-request
content" or "http-request" rules for instance. A large number of rules
combined with cpu-demanding actions (e.g.: actions that work on content)
may create thread contention as all the rules from a given ruleset are
evaluated under the same polling loop if the evaluation is not interrupted
Thus, in this patch we add extra logic around "tcp-request content",
"tcp-response content", "http-request" and "http-response" rulesets, so
that when a certain number of rules are evaluated under the single polling
loop, we force the evaluating function to yield. As such, the rule which
was about to be evaluated is saved, and the function starts evaluating
rules from the save pointer when it returns (in the next polling loop).
We use task_wakeup(task, TASK_WOKEN_MSG) to explicitly wake the task so
that no time is wasted and the processing is resumed ASAP. TASK_WOKEN_MSG
is mandatory here because process_stream() expects TASK_WOKEN_MSG for
explicit analyzers re-evaluation.
rules_bcount stream's attribute was added to count how manu rules were
evaluated since last interruption (yield). Also, SF_RULE_FYIELD flag
was added to know that the s->current_rule was assigned due to forced
yield and not regular yield.
By default haproxy will enforce a yield every 50 rules, this behavior
can be configured using the "tune.max-rules-at-once" global keyword.
There is a limitation though: for now, if the ACT_OPT_FINAL flag is set
on act_opts, we consider it is not safe to yield (as it is already the
case for automatic yield). In this case instead of yielding an taking
the risk of not being called back, we skip the yield and hope it will
not create contention. This is something we should ideally try to
improve in order to yield in all conditions.
Remove pacing experimental status, so it's not required anymore to use
expose-experimental-directives to enable it.
Along this change, pacing is now activated by default. As such, pacing
configuration is transformed into its final form. The global on/off
setting is turned into a disable setting without argument.
Pacing support was previously activated on each bind line individually,
via an optional argument of quic-cc-algo keyword. Remove this optional
argument and introduce a global setting to enable/disable pacing. Pacing
activation is still flagged as experimental.
One important change is that previously BBR usage automatically
activated pacing support. This is not the case anymore, so users should
now always explicitely activate pacing if BBR is selected. A new warning
message will be displayed if this is not the case.
Another consequence of this change is that now pacing_inter callback is
always defined for every quic_cc_algo types. As such, QUIC MUX uses
global.tune.options to determine if pacing is required.
This should be backported up to 3.1, after a period of observation.
For both proxies and servers, properly calculates queueslength, which is
the total number of element in each queues (as they currently are only
using one queue, it is equivalent to the number of element of that
queue), and use it instead of the queue's length.
version.c tries to centralize all variables conveying version information,
but there's still an issue with the BUILD_* variables which are only
passed to haproxy.o and are only updated when that one is rebuilt. This
is not very logical given that we can end up with values there which
contradict info from version.c.
Better move all of these to version.c which is systematically rebuilt.
Most of these variables only end up as string concatenation at the
moment. Some of them are even duplicated. In version.c we now have one
variable (or constant) for each of them and haproxy.c references them
in messages. This is much more logical and easier to maintain in a
consistent state.
The patch looks a bit large but it really only moves the ifdefed string
assignment from one file to another, placing them into variables.
This environment variable was added by commit d4c0be6b20 ("MINOR: startup:
HAPROXY_STARTUP_VERSION contains the version used to start"). However, it's
set from the macro that is passed during the build process instead of being
set from the variable that's kept up to date in version.c. The difference
is visible only during debugging/bisecting because only changed files and
version.o are rebuilt, but not necessarily haproxy.o, which is where the
environment variable is set. This means that the version exposed in the
environment is not necessarily the same as the one presented in
"haproxy -v" during such debugging sessions.
This should be backported to 2.8. It has no impact at all on regularly
built binaries.
Traces can be activated on startup via -dt command line argument. To
facilitate its usage, display a usage description and examples when
"help" is specified.
Released version 3.2-dev3 with the following main changes :
- DOC: config: add missing "track-sc0" in action keywords matrix
- BUG/MINOR: stktable: invalid use of stkctr_set_entry() with mixed table types
- BUG/MAJOR: mux-quic: fix BUG_ON on empty STREAM emission
- BUG/MEDIUM: mux-h2: Count copied data when looping on RX bufs in h2_rcv_buf()
- Revert "BUG/MAJOR: mux-quic: fix BUG_ON on empty STREAM emission"
- BUG/MAJOR: mux-quic: properly fix BUG_ON on empty STREAM emission
- MINOR: mux-quic: add traces on sd attach
- BUG/MEDIUM: mux-quic: do not attach on already closed stream
- BUG/MINOR: compression: handle a possible strdup() failure
- BUG/MINOR: pool: handle a possible strdup() failure
- BUG/MINOR: cfgparse-tcp: handle a possible strdup() failure
- BUG/MINOR: log: Allow to use if/unless conditionnals for do-log action
- MINOR: config: Alert about extra arguments for errorfile and errorloc
- BUG/MINOR: mux-quic: fix wakeup on qcc_set_error()
- MINOR: mux-quic: change return value of qcs_attach_sc()
- BUG/MINOR: mux-quic: handle closure of uni-stream
- BUG/MEDIUM: promex/resolvers: Don't dump metrics if no nameserver is defined
- BUG/MAJOR: ssl/ocsp: fix NULL conn object dereferencing to access QUIC TLS counters
- MEDIUM: errors: get rid of shm_open()
- BUILD: makefile: do not clean standalone binaries on a simple "make clean"
- BUILD: makefile: add a qinfo macro to pass info in quiet mode
- DEV: ncpu: add a simple utility to help with NUMA development
- DEV: ncpu: implement a wrapper mode
- DEV: ncpu: make the wrapper work both as a lib and executable
- BUG/MEDIUM: h1-htx: Properly handle bodyless messages
- MINOR: tools: add a few functions to simply check for a file's existence
Let's encapsulate the code, which checks the applied nofile limit into
a separate helper check_nofile_lim_and_prealloc_fd(). Let's keep in this new
function scope the block, which tries to create a copy of FD with the highest
number, if prealloc-fd is set in the configuration.
In step_init_3() we try to apply provided or calculated earlier haproxy
maxsock and memmax limits.
Let's encapsulate these code blocks in dedicated functions:
apply_nofile_limit() and apply_memory_limit() and let's move them into
limits.c. Limits.c gathers now all the logic for calculating and setting
system limits in dependency of the provided configuration.
Let's encapsulate the code, which calculates global.maxconn and
global.maxsslconn into a dedicated function set_global_maxconn() and let's
move this function in limits.c. In limits.c we keep helpers to calculate and
check haproxy internal limits, based on the system nofile and memory limits.
Define a new build mode DEBUG_STRESS. This will be used to stress some
code parts which cannot be reproduce easily with an alternative
suboptimal code.
First, a global <mode_stress> is set either to 1 or 0 depending on
DEBUG_STRESS compilation. A new global keyword "stress-level" is also
defined. It allows to specify a level from 0 to 9, to increase the
stress incurred on the code.
Helper macro STRESS_RUN* are defined for each stress level. This allows
to easily specify an instruction in default execution and a stress
counterpart if running on the corresponding stress level.
Some master process' initialization steps are conditioned by receiving the
READY message from worker (pidfile creation, forwarding READY message to the
launching parent). So, master process can not do these initialization routines
before.
If the master process fails, while creating pid or forwarding the READY to the
parent in daemon mode, he exits with a proper alert message. In daemon mode we
no longer see such message, as process is already detached from the tty.
To fix this, as these alerts could be very useful, let's detach the master
process from the tty after his last initialization steps in _send_status.
Due to master-worker rework, daemonization fork happens now before parsing
and applying the configuration. This makes impossible to report correctly all
warnings and alerts to shell's stdout. Daemonzied process fails, while being
already in background, exit code reported by shell via '$?' equals to 0, as
it's the exit code of his parent.
To fix this, let's create a pipe between parent and daemonized child. The
child will send into this pipe a "READY" message, when it finishes his
initialization. The parent will wait on the "read" end of the pipe until
receiving something. If read() fails, parent obtains the status of the
exited child with waitpid(). So, the parent can correctly report the error to
the stdout and he can exit with child's exitcode.
This fix should be backported only in 3.1.
Due to master-worker refactoring, daemonization fork happens now very early,
before parsing and verifying the configuration. For the moment there is no
any specific syntax, which needs for the daemon mode to be really applied in
order to perform the tests.
So, it's better not to do the daemonization fork, if 'daemon' keyword is
presented in the config (or -D option), when we started with -c (MODE_CHECK).
Like this, during the config verification, the process will always stay in
foreground and all warning or errors will be delivered to the stdout.
This fix should be backported only in 3.1.
Update comment and condition. nb_oldpids it's not a pointer, but a signed int,
which keeps the max number of elements in oldpids array. So, it's a good
practice to check, if it's strictly positive here.
If master process can't open a pidfile, there is no sense to send SIGTTIN to
oldpids, as it will exit. So, old workers will terminate as well. It's better
to send the last alert to the log about unrecoverable error, because master is
already in its polling loop.
For the standalone mode we should keep the previous logic in this case: send
SIGTTIN to old process and unbind listeners for the new one. So, it's better
to put this error path in main(), as it's done when other configuration settings
can't be applied.
This patch should be backported only in 3.1.
In 3.1-dev10, commit 8dd4efe42f ("MAJOR: mworker: move master-worker
fork in init()") made the fork_poller() code unconditional, while it
is only desirable for processes that have been forked from a parent
(standalone daemon mode) or from a master (master-worker mode). The
call can be expensive in some cases as it will create a new poller,
scan and try to migrate to it all existing FDs till the highest known
one. With very high numbers of FDs, this can take several seconds to
start.
This should be backported to 3.1.
Commit 8dd4efe42f ("MAJOR: mworker: move master-worker fork in init()")
introduced some sensitive changes to the startup code (which was
expected), and one sensitive change is that the second call to setsid()
was accidentally made unconditional. As such it even applies to foreground
processes, resulting in foreground processes being detached from the
terminal and no longer responding to Ctrl-C nor Ctrl-Z. An example of
this simply consists in start haproxy -db under sudo. Then a new shell
is required to stop it.
This patch removes this second setsid(), as it is already done in
apply_daemon_mode().
This must be backported to 3.1.
Pidfile should be created at the latest initialization stage, when we are
sure, that process is able to start successfully, otherwise PID value, written
in this file is no longer valid.
So, for the standalone mode, let's move the block, which opens the pidfile and
let's put it just before applying "chroot". In master-worker mode, master
doesn't perform chroot. So it creates the pidfile, only when the "READY"
message from the newly forked worker is received.
This should be backported only in 3.1
After master-worker mode refactoring, global.pidfile is only used in
handle_pidfile(), which opens the provided file and writes the PID into it. So,
it's more appropriate to perform the close(pidfd) and ha_free(&global.pidfile)
also in this function.
This commit prepares the fix of the pidfile creation, as it's created now very
early, when we are not sure, that process has successfully started. In
master-worker mode handle_pidfile() can be called in the master process context.
So, let's make it accessible from other compilation units via global.h.
This should be backported only in 3.1.
Let's move mworker_reexec() and mworker_reload() in mworker.c. mworker_reload()
is called only within the functions, which are already in mworker.c. So, this
reorganization allows to declare mworker_reload() as a static.
mworker_run_master() is called only in master mode. mworker_loop() is static
and called only in mworker_run_master(). So let's move these both functions in
mworker.c.
We also need here to make run_thread_poll_loop() accessible from other units,
as it's used in mworker_loop().
This commit prepares the move of mworker_run_master() in mworker.c.
Let's remove from it's definition the code, which adjusts verbosity in
dependency of other global run time modes (daemon or foreground). This part
should stay in main(), where all verbosity modes are handeled for
different mode combinations.
mworker_prepare_master() performs some preparation routines for the new worker
process, which will be forked during the startup. It's called only in
master-worker mode, so let's move it in mworker.c.
mworker_on_new_child_failure() performs some routines for the worker process,
if it has failed the reload. As it's called only in mworker_catch_sigchld()
from mworker.c, let's move mworker_on_new_child_failure() in mworker.c as well.
Like this it could also be declared as a static.
This patch prepares the moving of on_new_child_failure definition into
mworker.c. So, let's rename it accordingly and let's also update its
description.
In 3.1-dev10, commit 8dd4efe42f ("MAJOR: mworker: move master-worker
fork in init()"), the FD associated to /dev/null was made CLOEXEC
using O_CLOEXEC. Unfortunately this is not portable on older OSes,
doesn't build on Solaris for example, and was even reported as breaking
moderately old Linux OSes for other projects. Better not use it unless
absolutely certain it will work (currently we only use it for Linux
namespaces, which are optional), and use the conventional FD_CLOEXEC
instead.
No backport is needed.
This fixes the commit d6ccd1738b
("MINOR: startup: set HAPROXY_LOCALPEER only once").
Comment "/* preset some environment variables */" is now useless here as
HAPROXY_LOCALPEER is set later during the initialization stage and only once.
This should not be backported, as related to the latest master-worker
refactoring.
This fixes the commit d6ccd1738b
("MINOR: startup: set HAPROXY_LOCALPEER only once"). HAPROXY_LOCALPEER could
be checked in the configuration to set some servers settings or listeners. So,
we need to set it just before we read the configuration at the second time.
Let's mark HAPROXY_LOCALPEER as "usable" in the configuration in the related
documentation chapter.
This should not be backported, as related to the latest master-worker
refactoring.
Let's store progname in the global variable, as it is handy to use it in
different parts of code to format messages sent to stdout.
This reduces the number of arguments, which we should pass to some functions.
In the init_early() global.log_tag is initialized to the string from progname
pointer and global.log_tag.area points to this pointer.
If log-tag keyword is provided in the configuration, its parser at first frees
global.log_tag.area and then it does a new memory allocation to copy
there the argument of log-tag. So, progname no longer points to the valid
memory.
To fix this, let's always keep progname and global.log_tag.area at separate
memory areas. If log_tag will be redefined in the configuration, its parser will
free the memory allocated for the default value in chunk_destroy(). Memory
allocated for progname will be freed in deinit().
This should not be backported as related to the latest master-worker
refactoring.
Before this patch HAPROXY_LOCALPEER variable could be set in init_early(),
in init_args() and in cfg_parse_global(). In master-worker mode, if localpeer
keyword set in the global section, HAPROXY_LOCALPEER in the worker
environment is set to this keyword's value, but in the master environment it
still keeps the default, a localhost name. This is confusing.
To fix it, let's set HAPROXY_LOCALPEER only once, when a worker or process in a
standalone mode has finished to parse its configuration. And let's set this
variable only for the worker process or for the process in a standalone mode,
because the master doesn't need it.
HAPROXY_LOCALPEER takes the value saved in localpeer global variable, which is
always set by default in init_early() to the local hostname. Then, localpeer
could be reset in init_args (-L option) and in cfg_parse_global() (while
parsing "localpeer" keyword).
Since sd_notify() is now implemented in src/systemd.c, there is no need
anymore to build its support conditionnally with USE_SYSTEMD.
This patch add supports for -Ws for every build and removes the
USE_SYSTEMD build option. It also remove every reference to USE_SYSTEMD
in the documentation and the CI.
This also allows to run the reg-tests in -Ws with the new VTest support.
Before this patch HAPROXY_BRANCH was unset just after configuration parsing.
Let's keep it, as it could be used in conditional blocks and some
configuration directives and it's handy to check its runtime value via "show
env".
In master-worker mode, this variable is set to the same value for both
processes.
proxy auth_uri struct was manually cleaned up during deinit, but the logic
behind was kind of akward because it was required to find out which ones
were shared or not. Instead, let's switch to a proper refcount mechanism
and free the auth_uri struct directly in proxy_free_common().
When uri_auth admin rules were implemented in 474be415
("[MEDIUM] stats: add an admin level") no attempt was made to free the
list of allocated rules, which makes valgrind unhappy upon deinit when
"stats admin" is used in the config.
To fix the issue, let's cleanup the admin rules list upon deinit where
uri_auth freeing is already handled.
While this could be backported to every stable versions, given how minor
this is and has no impact on the dying process, it is probably not worth
the effort.
load_cfg() is called only once before the first reading of the configuration
(we parse here only the global section). Then, before reading the rest of the
sections (second call of read_cfg()), we call clean_env(). As
HAPROXY_CFGFILES is set in load_cfg(), which is called only once, clean_env()
erases it. Thus, it's not longer shown in "show env" output.
To fix this, let's set HAPROXY_CFGFILES in read_cfg(). Like this in
master-worker mode it is set for master and for worker processes, as it was
before the refactoring.
This fix doesn't need to be backported as related to the latest master-worker
architecture change.
After master-worker refactoring, master performs re-exec only once up to
receiving "reload" command or USR2 signal. There is no more the second
master's re-exec to free unused memory. Thus, there is no longer need to export
environment variable HAPROXY_LOAD_SUCCESS with worker process load status. This
status can be simply saved in a global variable load_status.
cfg_program_postparser() contains 2 parts:
- check the combination of MODE_MWORKER and "program" section. if
"program" section was parsed, MODE_MWORKER is mandatory;
- check "command" keyword, which is mandatory for this section as
well.
This is more appropriate now, after the master-worker refactoring, do the
first part in read_cfg_in_discovery_mode, where we already check the
combination of MODE_MWORKER and -S option.
We need to do the second part just below, in read_cfg_in_discovery_mode() as
well, because it's only the master process, who parses now program section and
programs are forked before running postparser functions in step_init_2.
Otherwise, mworker_ext_launch_all() will emit a log message, that program is
started, but actually nothing has been launched, if 'command' keyword is
absent.
This not needs to be backported, as related to the master-worker refactoring.
This commit introduces the tune.renice.startup and tune.renice.runtime
global keywords that allows to change the priority with setpriority().
tune.renice.startup is parsed and applied in the worker or the standalone
process for configuration parsing. If this keyword is used alone, the
nice value is changed to the previous one after configuration parsing.
tune.renice.runtime is applied after configuration parsing, so in the
worker or a standalone process. Combined with tune.renice.startup it
allows to have a different nice value during configuration parsing and
during runtime.
The feature was discussed in github issue #1919.
Example:
global
tune.renice.startup 15
tune.renice.runtime 0
As master-worker fork happens now before step_init_2(), when pollers are
initialized and polling settings and dumped then in verbose and in debug modes
to stdout, it turns out that master and worker dump its same polling
settings separately. This creates long and messy output in these modes.
Polling settings are the same for master and for worker process for the moment.
Even if they would diverge in future we are interested here in worker's
settings. So, when started in the master-worker mode let's dump it only in the
worker context.
This doesn't need to be backported as related to the latest master-worker
refactoring.
If haproxy was started with -W -dK*, after master-worker refactoring, we dump
registered keywords to stdout twice in master and in worker processes. This
information is redundant and output has no longer the right format. So, as the
keyword registration happens very early before the fork, let's dump keywords
only in the worker context, if haproxy was launched with -W.
This does not need to be backported, as related to the latest master-worker
refactoring.
If haproxy was started with -W -dL, after master-worker refactoring we dump
libs to stdout twice in master and in worker processes. This is information is
redundant. So let's show linked libraries only in the worker context, if
haproxy was started also with -W.
This does not need to be backported, as related to the latest master-worker
rework.
Don't do master-worker fork if MODE_CHECK is detected from the command line along
with the master-worker mode. We should exit in MODE_CHECK, after the
configuration parsing and validation. So, with the new master-worker architecture
it's better to align this mode with the standalone.
This patch does not need to be backported, as related to the latest
master-worker rework.
Flag MODE_STARTING should be unset for master just before freeing the startup
logs ring, as it triggers the copy of process logs to this ring, see the code
of print_message().
Moreover with this flag set, if startup logs ring pointer is NULL, any
print_message() triggered just before the execvp in mworker_reexec() will call
startup_logs_init(). So ring will be allocated again "discretely" and after
execvp we will lost its address, as in step_init_1() we will call again
startup_logs_init().
No need to backport this fix as it's related to the latest master-worker
refactoring.
During re-execution master keeps always opened "reload" sockpair FDs and
shared sockpair ipc_fd[0], the latter is using to transfert listeners sockets
from the previously forked worker to the new one. So, these master's FDs are
inherited in the newly forked worker and must be closed in its context.
"reload" sockpair inherited FDs and shared sockpair FD (ipc_fd[0]) are closed
separately, becase master doesn't recreate "reload" sockpair each time after
its re-exec. It always keeps the same FDs for this "reload" sockpair. So in
worker context it can be closed immediately after the fork.
At contrast, shared sockpair is created each time after reload, when the new
worker will be forked. So, if N previous workers are still exist at this moment,
the new worker will inherit N ipc_fd[0] from master. So, it's more save to
close all these FDs after get_listeners_fd() and bind_listeners() calls.
Otherwise, early closed FDs in the worker context will be immediately bound to
listeners and we could potentially have some bugs.
There are two parts in mworker_cli_proxy_create(): allocating and setting up
MASTER proxy and allocating and setting up servers on ipc_fd[0] of the
sockpairs shared with workers.
So, let's split mworker_cli_proxy_create() into two functions respectively.
Each of them takes **errmsg as an argument to write an error message, which may
be triggered by some subcalls. The content of this errmsg will allow to extend
the final alert message shown to user, if these new functions will fail.
The main goals of this split is to allow to move these two parts independantly
in future and makes the code of haproxy initialization in haproxy.c more
transparent.
Before refactoring master-worker architecture, resources to setup master CLI
for the new worker process (shared sockpair, entry in proc_list) were created
in init() before parsing the configuration and binding listening sockets. So,
master during its re-exec has had to cleanup the new worker's ressources in
a case, when it fails at some initialization step before the fork.
Now fork happens very early and worker parses its configuration by itself. If
it fails during the initialization stage, all clean ups (deleting the fds of
the shared sockpair, proc_list cleanup) are performed in SIGCHLD handler up to
catching the SIGCHLD corresponded to this new worker. So, there is no longer
need to call mworker_cleanup_proc() in mworker_reexec().
As for mworker_cleanlisteners(), there is no longer need to call this function.
Master parses now only "global" and "program" sections, so it allocates only
MASTER proxy, which is stopped in mworker_reexec() by mworker_cli_proxy_stop().
Let's keep the definitions of mworker_cleanlisteners() and
mworker_cleanup_proc() in mworker.c for the moment. We may reuse parts of its
code later.
As master-worker fork happens now at early init stage and worker then parses
its configuration and performs all initialization steps, let's duplicate
startup logs ring for it, just before the moment when it enters in its pollong
loop. Startup logs ring content is shown as an output of the "reload" master
CLI command and we should be able to dump here worker initialization logs.
Log messages are written in startup logs ring only, when mode MODE_STARTING is
set (see print_message()). So, to be able to keep in startup logs the last
worker alerts, let's withdraw MODE_STARTING and let's reset user messages
context respectively just before entering in polling loop.
This fix does not need to be backported as it is a part of previous patches
from this version, which refactor master-worker architecture.
This patch simplifies the code of startup_logs_init_shm(). We no longer re-exec
master process twice after each reload to free its unused memory, which it had
to allocate, because it has parsed all configuration sections. So, there is no
longer need to keep SHM fd opened between the first and the next reloads. We
can completely remove HAPROXY_STARTUPLOGS_FD.
In step_init_1() we continue to call startup_logs_init_shm() to open SHM and to
allocate startup logs ring area within it. In master-worker mode, worker
duplicates initial startup logs ring after sending its READY state to master.
Sharing the same ring between two processes until the worker finishes its
initialization allows to show at master CLI output worker's startup logs.
During the next reload master process should free the memory allocated for the
ring structure. Then after the execvp() it will reopen and map SHM area again
and it will reallocate again the ring structure.
When master enters in recovery mode after unsuccessfull reload
HAPROXY_LOAD_SUCCESS should be set as 0. Like this
cli_io_handler_show_cli_sock() could dump in master CLI its warnings and alerts,
saved in startup logs ring.
No need to backport this fix, as this is related to the previous patches in
this version to refactor master-worker architecture.
In case of daemon mode now daemonization fork happens in the early init stage
before parsing and applying the configuration, so we can't close
stdio/stderr/stdout immediately after forking. We keep it open until the most
of configuration, including chroot are applied in order to show alerts, if
there are some problems. To achieve this /dev/null is opened just before calling
chroot(), and after the chroot block it's used to close all standard outputs
and stdin. At this point we no longer need the fd of /dev/null, so we can close
it as well.
setenv/resetenv/presetenv/unsetenv keywords in the configuration modify the
process environment. In case of master-worker and programs we need to restore
the initial process environment before reload, as the configuration could
change in between and newly forked workers and programs should be launched
in the environment corresponded to this new configuration.
To achieve this we backup the initial process environment before the first
configuration read, when 'global' and 'program' sections are read. And then we
clean up master process environment and restore the initial one from the backup
in mworker_reexec().
Let's reintroduce systemd support in the refactored master-worker mode.
As for now, the master-worker fork happens during early initialization steps and
then the master process receieves the "READY" status message from the newly
forked worker, that has successfully loaded. Let's propagate this "READY" status
message at this moment to the systemd from the master process context
(_send_status()). We use the master process to send messages to systemd,
because it is only the process, monitored by systemd.
In master recovery mode, we also need to send to the systemd the "READY"
message, but with the status "Reload failed". "READY" will signal to systemd,
that master process is still alive, because it doesn't exit in recovery mode
and it keeps the existed worker. Status "Reload failed" will signal to user,
that something wrong has happened with the configuration. Same message logic
was originally preserved for the case, when the worker fails to read its
configuration, see on_new_child_failure() for more details.
This patch is a part of series to reintroduce the program support in the new
master-worker architecture.
Let's add here mworker_ext_launch_all() call before master-worker fork to
start external programs. We keep the order and the place of these two forks
(program and master-worker) the same as before the refactoring, in order to
avoid regressions.
This patch is a part of series to reintroduce the program support in the new
master-worker architecture.
For the moment we keep the order of program and worker forks the same as before
the refactoring, as we need to be sure that this won't introduce regressions.
So, programs are forked before the new worker process.
Before the program's fork we already need deserialized processes list to find
the programs launched before reload and to stop them. Processes list saved
before the reload in HAPROXY_PROCESSES variable. It should be deserialized
before the first configuration read in discovery mode, because resetenv keyword
could be presented in the global section.
So, let's move mworker_env_to_proc_list() from mworker_create_master_cli() to
main(). We need to call it only after reload in master-worker mode, thus
HAPROXY_MWORKER_REEXEC and HAPROXY_PROCESSES should be still presented in the
re-executing process environment before the first configuration read.
Let's encapsulate the logic to set verbosity modes (MODE_DEBUG and MODE_VERBOSE)
in a separate function set_verbosity(). This makes the code of main() more
readable and this allows to call set_verbosity() for master process in recovery
mode. So, in this mode, verbosity settings before the master re-execution will
be re-applied to master. set_verbosity() will be extended in future commits to
reduce the verbosiness of master in order not to dump pollers list and filters,
if it was started with -V or -d.
In this commit we add run_master_in_recovery_mode(), which groups all necessary
initialization steps, which master should perform to be able to enter in its
polling loop (run_master()), when it fails while parsing its new config.
As exit_on_failure() is now adapted for master recovery mode. Let's register
it as atexit handler, when master enters in this mode. And let's remove
atexit_flag variable for master, because we no longer use it.
We also slightly refactor here read_cfg_in_discovery_mode() in order to call
run_master_in_recovery_mode() for the case, described above. Warning messages
are mandatory before calling the run_master_in_recovery_mode() as this allows
to stop haproxy with error, if it was launched in zero-warning mode.
So, in recovery mode master does not launch any worker. It just performs its
necessary initialization routines and enters in its polling loop to continue
to monitor the existed worker process.
Master recovery mode replaces the former wait-mode with a difference, that
master in this case doesn't try to fork the new worker process. But it still
needs to enter to its polling loop in order to monitor the previous worker.
Master performs some initialization steps for this and it recreates its master
CLI. During its initialization steps, master could potentially fail again.
As we use for the moment for master init steps some common routines
(step_init_2() and step_init_3()), there is no way there to signal to user that
failure has happened for the master and in addition, in its recovery mode. So,
in such case exit_on_failure() can be still useful in order to print an
appropriate alert, as we can register this function as atexit handler for the
master.
Let's encapsulate here the code to load and to read the configuration at the
first time in MODE_DISCOVERY. This makes the code of main() more readable and
this adds the structure for adding necessary master initializations routines
to support master recovery mode.
Let's encapsulate master's code (steps which it does before entering in its
polling loop and deinitialization routines after) in a separate run_master()
function. This makes the code of main() more readable. In future we plan to put
in run_master() more master process related code, in order to clean completely
init_step_2(), init_step_3() and init_step_4().