haproxy/src/ev_epoll.c

/*
 * FD polling functions for Linux epoll
 *
 * Copyright 2000-2014 Willy Tarreau <w@1wt.eu>
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */

#include <unistd.h>
#include <sys/epoll.h>
#include <sys/time.h>
#include <sys/types.h>

#include <haproxy/activity.h>
#include <haproxy/api.h>
#include <haproxy/fd.h>
#include <haproxy/global.h>
#include <haproxy/signal.h>
#include <haproxy/ticks.h>
#include <haproxy/time.h>
#include <haproxy/tools.h>

/* private data */
static THREAD_LOCAL struct epoll_event *epoll_events = NULL;
static int epoll_fd[MAX_THREADS]; // per-thread epoll_fd

#ifndef EPOLLRDHUP
/* EPOLLRDHUP was defined late in libc, and it appeared in kernel 2.6.17 */
#define EPOLLRDHUP 0x2000
#endif

/*
 * Immediately remove file descriptor from epoll set upon close.
 * Since we forked, some fds share inodes with the other process, and epoll may
 * send us events even though this process closed the fd (see man 7 epoll,
 * "Questions and answers", Q 6).
 */
static void __fd_clo(int fd)
{
	if (unlikely(fdtab[fd].cloned)) {
		unsigned long m = polled_mask[fd].poll_recv | polled_mask[fd].poll_send;
		struct epoll_event ev;
		int i;

		for (i = global.nbthread - 1; i >= 0; i--)
			if (m & (1UL << i))
				epoll_ctl(epoll_fd[i], EPOLL_CTL_DEL, fd, &ev);
	}
}

/* Recomputes the epoll subscription of <fd> on the calling thread from its
 * current state and polled_mask, issuing at most one epoll_ctl() call.
 */
static void _update_fd(int fd)
{
	int en, opcode;
	struct epoll_event ev = { };

	en = fdtab[fd].state;

	/* Try to force EPOLLET on FDs that support it */
	if (fdtab[fd].et_possible) {
		/* already done ? */
		if (polled_mask[fd].poll_recv & polled_mask[fd].poll_send & tid_bit)
			return;

		/* enable ET polling in both directions */
		_HA_ATOMIC_OR(&polled_mask[fd].poll_recv, tid_bit);
		_HA_ATOMIC_OR(&polled_mask[fd].poll_send, tid_bit);
		opcode = EPOLL_CTL_ADD;
		ev.events = EPOLLIN | EPOLLRDHUP | EPOLLOUT | EPOLLET;
		goto done;
	}

	/* if we're already polling or are going to poll for this FD and it's
	 * neither active nor ready, force it to be active so that we don't
	 * needlessly unsubscribe then re-subscribe it.
	 */
	if (!(en & FD_EV_READY_R) &&
	    ((en & FD_EV_ACTIVE_W) ||
	     ((polled_mask[fd].poll_send | polled_mask[fd].poll_recv) & tid_bit)))
		en |= FD_EV_ACTIVE_R;

	if ((polled_mask[fd].poll_send | polled_mask[fd].poll_recv) & tid_bit) {
		if (!(fdtab[fd].thread_mask & tid_bit) || !(en & FD_EV_ACTIVE_RW)) {
			/* fd removed from poll list */
			opcode = EPOLL_CTL_DEL;
			if (polled_mask[fd].poll_recv & tid_bit)
				_HA_ATOMIC_AND(&polled_mask[fd].poll_recv, ~tid_bit);
			if (polled_mask[fd].poll_send & tid_bit)
				_HA_ATOMIC_AND(&polled_mask[fd].poll_send, ~tid_bit);
		}
		else {
			if (((en & FD_EV_ACTIVE_R) != 0) ==
			    ((polled_mask[fd].poll_recv & tid_bit) != 0) &&
			    ((en & FD_EV_ACTIVE_W) != 0) ==
			    ((polled_mask[fd].poll_send & tid_bit) != 0))
				return;

			if (en & FD_EV_ACTIVE_R) {
				if (!(polled_mask[fd].poll_recv & tid_bit))
					_HA_ATOMIC_OR(&polled_mask[fd].poll_recv, tid_bit);
			} else {
				if (polled_mask[fd].poll_recv & tid_bit)
					_HA_ATOMIC_AND(&polled_mask[fd].poll_recv, ~tid_bit);
			}

			if (en & FD_EV_ACTIVE_W) {
				if (!(polled_mask[fd].poll_send & tid_bit))
					_HA_ATOMIC_OR(&polled_mask[fd].poll_send, tid_bit);
			} else {
				if (polled_mask[fd].poll_send & tid_bit)
					_HA_ATOMIC_AND(&polled_mask[fd].poll_send, ~tid_bit);
			}

			/* fd status changed */
			opcode = EPOLL_CTL_MOD;
		}
	}
	else if ((fdtab[fd].thread_mask & tid_bit) && (en & FD_EV_ACTIVE_RW)) {
		/* new fd in the poll list */
		opcode = EPOLL_CTL_ADD;
		if (en & FD_EV_ACTIVE_R)
			_HA_ATOMIC_OR(&polled_mask[fd].poll_recv, tid_bit);
		if (en & FD_EV_ACTIVE_W)
			_HA_ATOMIC_OR(&polled_mask[fd].poll_send, tid_bit);
	}
	else {
		return;
	}

	/* construct the epoll events based on new state */
	if (en & FD_EV_ACTIVE_R)
		ev.events |= EPOLLIN | EPOLLRDHUP;

	if (en & FD_EV_ACTIVE_W)
		ev.events |= EPOLLOUT;

 done:
	ev.data.fd = fd;
	epoll_ctl(epoll_fd[tid], opcode, fd, &ev);
}

/*
 * Linux epoll() poller
 */
static void _do_poll(struct poller *p, int exp, int wake)
{
	int status;
	int fd;
	int count;
	int updt_idx;
	int wait_time;
	int old_fd;

	/* first, scan the update list to find polling changes */
	for (updt_idx = 0; updt_idx < fd_nbupdt; updt_idx++) {
		fd = fd_updt[updt_idx];

		_HA_ATOMIC_AND(&fdtab[fd].update_mask, ~tid_bit);
		if (!fdtab[fd].owner) {
			activity[tid].poll_drop_fd++;
			continue;
		}

		_update_fd(fd);
	}
	fd_nbupdt = 0;

	/* Scan the global update list */
	for (old_fd = fd = update_list.first; fd != -1; fd = fdtab[fd].update.next) {
		if (fd == -2) {
			fd = old_fd;
BUG/MEDIUM: epoll/threads: use one epoll_fd per thread

There currently is a problem regarding epoll(). While select() and poll() compute their polling state on the fly upon each call, epoll() keeps a shared state between all threads via the epoll_fd. The problem is that once an fd is registered on *any* thread, all other threads receive events for that FD as well. It is clearly visible when binding a listener to a single thread like in the configuration below where all 4 threads will work, 3 of them simply spinning to skip the event :

    global
        nbthread 4

    frontend foo
        bind :1234 process 1/1

The worst case happens when some slow operations are in progress on a busy thread, preventing it from processing its task and causing the other ones to wake up not being able to do anything with this event. Typically computing a large TLS key will delay processing of next events on the same thread while others will still wake up.

All this simply shows that the poller must remain thread-specific, with its own events and its own ability to sleep when it doesn't have anything to do.

This patch does exactly this. For this, it proceeds like this :

  - have one epoll_fd per thread instead of one per process
  - initialize these epoll_fd when threads are created.
  - mark all known FDs as updated so that the next invocation of _do_poll() recomputes their polling status (including a possible removal of undesired polling from the original FD) ;
  - use each fd's polled_mask to maintain an accurate status of the current polling activity for this FD.
  - when scanning updates, only focus on events whose new polling status differs from the existing one
  - during updates, always verify the thread_mask to resist migration
  - on __fd_clo(), for cloned FDs (typically listeners inherited from the parent during a graceful shutdown), run epoll_ctl(DEL) on all epoll_fd. This is the reason why epoll_fd is stored in a shared array and not in a thread_local storage. Note: maybe this can be moved to an update instead.

Interestingly, this shows that we don't need the FD's old state anymore and that we only use it to convert it to the new state based on stable information. It appears clearly that the FD code can be further improved by computing the final state directly when manipulating it.

With this change, the config above goes from 22000 cps at 380% CPU to 43000 cps at 100% CPU : not only are the 3 unused threads no longer activated, but they do not disturb the activity anymore.

The output of "show activity" before and after the patch on a 4-thread config where a first listener on thread 2 forwards over SSL to threads 3 & 4 shows a much smaller amount of undesired events (thread 1 doesn't wake up anymore, poll_skip remains zero, fd_skip stays low) :

    // before: 400% CPU, 7700 cps, 13 seconds
    loops: 11380717 65879 5733468 5728129
    wake_cache: 0 63986 317547 314174
    wake_tasks: 0 0 0 0
    wake_applets: 0 0 0 0
    wake_signal: 0 0 0 0
    poll_exp: 0 63986 317547 314174
    poll_drop: 1 0 49981 48893
    poll_dead: 65514 0 31334 31934
    poll_skip: 46293690 34071 22867786 22858208
    fd_skip: 66068135 174157 33732685 33825727
    fd_lock: 0 2 2809 2905
    fd_del: 0 494361 80890 79464
    conn_dead: 0 0 0 0
    stream: 0 407747 50526 49474
    empty_rq: 11380718 1914 5683023 5678715
    long_rq: 0 0 0 0

    // after: 200% cpu, 9450 cps, 11 seconds
    loops: 17 66147 1001631 450968
    wake_cache: 0 66119 865139 321227
    wake_tasks: 0 0 0 0
    wake_applets: 0 0 0 0
    wake_signal: 0 0 0 0
    poll_exp: 0 66119 865139 321227
    poll_drop: 6 5 38279 60768
    poll_dead: 0 0 0 0
    poll_skip: 0 0 0 0
    fd_skip: 54 172661 4411407 2008198
    fd_lock: 0 0 10890 5394
    fd_del: 0 492829 58965 105091
    conn_dead: 0 0 0 0
    stream: 0 406223 38663 61338
    empty_rq: 18 40 962999 390549
    long_rq: 0 0 0 0

This patch presents a few risks but fixes a real problem with threads, and as such it needs to be backported to 1.8. It depends on previous patch ("MINOR: fd: add a bitmask to indicate that an FD is known by the poller").

Special thanks go to Samuel Reed for providing a large amount of useful debugging information and for testing fixes.
2018-01-18 13:16:02 -05:00
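The isolation this commit message relies on can be checked outside HAProxy with a small Linux-only sketch. It assumes two epoll instances standing in for epoll_fd[0] and epoll_fd[1], and a pipe standing in for a listener. The names pending_events() and demo_per_thread_epoll() are hypothetical and do not exist in HAProxy; this only illustrates the kernel behaviour, it is not the patch itself.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: number of events immediately pending on the epoll
 * instance <epfd>, without sleeping (timeout 0).
 */
static int pending_events(int epfd)
{
	struct epoll_event evs[8];

	return epoll_wait(epfd, evs, 8, 0);
}

/* Registers the read side of a pipe on only one of two epoll instances,
 * makes it readable, then stores what each instance reports into <seen>.
 * Returns 0 on success, -1 if the setup failed.
 */
static int demo_per_thread_epoll(int seen[2])
{
	struct epoll_event ev = { .events = EPOLLIN };
	int ep[2] = { -1, -1 }; /* stand-ins for epoll_fd[0] and epoll_fd[1] */
	int p[2] = { -1, -1 };
	int ret = -1;

	assert(seen != NULL);

	if (pipe(p) < 0)
		goto out;
	ep[0] = epoll_create1(0);
	ep[1] = epoll_create1(0);
	if (ep[0] < 0 || ep[1] < 0)
		goto out;

	/* only "thread 0" subscribes to the FD */
	ev.data.fd = p[0];
	if (epoll_ctl(ep[0], EPOLL_CTL_ADD, p[0], &ev) < 0)
		goto out;

	/* make the read side readable */
	if (write(p[1], "x", 1) != 1)
		goto out;

	seen[0] = pending_events(ep[0]);
	seen[1] = pending_events(ep[1]);
	ret = 0;
out:
	if (ep[0] >= 0)
		close(ep[0]);
	if (ep[1] >= 0)
		close(ep[1]);
	if (p[0] >= 0)
		close(p[0]);
	if (p[1] >= 0)
		close(p[1]);
	return ret;
}
```

With one instance per thread, only the instance the FD was added to reports the event, so a thread that never subscribed has nothing to skip. With a single shared instance, every thread calling epoll_wait() on it can be handed the event.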
			continue;
		}
		else if (fd <= -3)
			fd = -fd -4;
		if (fd == -1)
			break;
		if (fdtab[fd].update_mask & tid_bit)
			done_update_polling(fd);
		else
			continue;
		if (!fdtab[fd].owner)
			continue;
		_update_fd(fd);
	}
	thread_harmless_now();
	/* now let's wait for polled events */
	wait_time = wake ? 0 : compute_poll_timeout(exp);
	tv_entering_poll();
	activity_count_runtime();
MINOR: polling: add an option to support busy polling

In some situations, especially when dealing with low latency on processors supporting a variable frequency or when running inside virtual machines, each time the process waits for an I/O using the poller, the processor goes back to sleep or is offered to another VM for a long time, and it causes excessively high latencies.

A solution to this provided by this patch is to enable busy polling using a global option. When busy polling is enabled, the pollers never sleep and loop over themselves waiting for an I/O event to happen or for a timeout to occur. On multi-processor machines it can significantly overheat the processor but it usually results in much lower latencies.

A typical test consisting in injecting traffic over a single connection at a time over the loopback shows a bump from 4640 to 8540 connections per second on forwarded connections, indicating a latency reduction of 98 microseconds for each connection, and a bump from 12500 to 21250 for locally terminated connections (redirects), indicating a reduction of 33 microseconds.

It is only usable with epoll and kqueue because select() and poll()'s API is not convenient for such usages, and the level of performance they are used in doesn't benefit from this anyway.

The option, which obviously remains disabled by default, can be turned on using "busy-polling" in the global section, and turned off later using "no busy-polling". Its status is reported in "show info" to help troubleshooting suspicious CPU spikes.
2018-11-22 12:07:59 -05:00
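The busy polling behaviour described in this commit message can be approximated with a small standalone sketch. It is a simplification under stated assumptions, not HAProxy's code: the signal and wake-up checks are left out, the deadline is computed with clock_gettime() instead of HAProxy's tick functions, and wait_events() and now_msec() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/epoll.h>
#include <time.h>

/* Hypothetical helper: monotonic clock in milliseconds. */
static long long now_msec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Waits for events on <epfd> for at most <wait_ms> milliseconds. When <busy>
 * is non-zero, epoll_wait() is always called with a null timeout and the
 * loop spins until an event shows up or the deadline is reached. Returns
 * the last epoll_wait() result and stores the number of calls in <calls>.
 */
static int wait_events(int epfd, struct epoll_event *evs, int max,
                       int wait_ms, int busy, int *calls)
{
	long long deadline = now_msec() + wait_ms;
	int status;

	assert(max > 0 && wait_ms >= 0 && calls);
	*calls = 0;
	do {
		int timeout = busy ? 0 : wait_ms;

		status = epoll_wait(epfd, evs, max, timeout);
		(*calls)++;
		if (status)               /* events were reported, or an error */
			break;
		if (timeout || !wait_ms)  /* really slept, or no wait requested */
			break;
		if (now_msec() >= deadline)
			break;
	} while (1);
	return status;
}
```

On an epoll instance with nothing registered, the non-busy mode performs a single sleeping call, while the busy mode issues many non-blocking calls until the deadline passes. This is where both the latency gain and the extra CPU usage come from.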
	do {
		int timeout = (global.tune.options & GTUNE_BUSY_POLLING) ? 0 : wait_time;
		status = epoll_wait(epoll_fd[tid], epoll_events, global.tune.maxpollevents, timeout);
		tv_update_date(timeout, status);
		if (status) {
			activity[tid].poll_io++;
			break;
		}
		if (timeout || !wait_time)
			break;
		if (signal_queue_len || wake)
			break;
		if (tick_isset(exp) && tick_is_expired(exp, now_ms))
			break;
	} while (1);
	tv_leaving_poll(wait_time, status);
	thread_harmless_end();
	if (sleeping_thread_mask & tid_bit)
		_HA_ATOMIC_AND(&sleeping_thread_mask, ~tid_bit);
	/* process polled events */
	for (count = 0; count < status; count++) {
		struct epoll_event ev;
		unsigned int n, e;
		e = epoll_events[count].events;
		fd = epoll_events[count].data.fd;
#ifdef DEBUG_FD
		_HA_ATOMIC_ADD(&fdtab[fd].event_count, 1);
#endif
		if (!fdtab[fd].owner) {
			activity[tid].poll_dead_fd++;
			continue;
		}
		if (!(fdtab[fd].thread_mask & tid_bit)) {
			/* FD has been migrated */
			activity[tid].poll_skip_fd++;
			epoll_ctl(epoll_fd[tid], EPOLL_CTL_DEL, fd, &ev);
			_HA_ATOMIC_AND(&polled_mask[fd].poll_recv, ~tid_bit);
			_HA_ATOMIC_AND(&polled_mask[fd].poll_send, ~tid_bit);
			continue;
		}
		n = ((e & EPOLLIN)    ? FD_EV_READY_R : 0) |
		    ((e & EPOLLOUT)   ? FD_EV_READY_W : 0) |
		    ((e & EPOLLRDHUP) ? FD_EV_SHUT_R  : 0) |
		    ((e & EPOLLHUP)   ? FD_EV_SHUT_RW : 0) |
		    ((e & EPOLLERR)   ? FD_EV_ERR_RW  : 0);
		if ((e & EPOLLRDHUP) && !(cur_poller.flags & HAP_POLL_F_RDHUP))
			_HA_ATOMIC_OR(&cur_poller.flags, HAP_POLL_F_RDHUP);
		fd_update_events(fd, n);
	}
	/* the caller will take care of cached events */
}
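The translation from epoll event bits to per-direction FD flags performed at the end of the polling loop above can be isolated as a pure function. The sketch below uses stand-in EV_* values chosen for illustration only; HAProxy's real FD_EV_* constants live in its own headers and may have different values. The function name epoll_to_fd_events() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Stand-in flag values, for illustration only (assumption: these are not
 * HAProxy's actual FD_EV_* definitions).
 */
enum {
	EV_READY_R = 0x01, /* data may be read */
	EV_READY_W = 0x02, /* data may be written */
	EV_SHUT_R  = 0x04, /* peer closed its write side */
	EV_SHUT_RW = 0x08, /* hang-up in both directions */
	EV_ERR_RW  = 0x10, /* error reported on the FD */
};

/* Maps the <events> field returned by epoll_wait() to the flag set above. */
static unsigned int epoll_to_fd_events(uint32_t e)
{
	return ((e & EPOLLIN)    ? EV_READY_R : 0) |
	       ((e & EPOLLOUT)   ? EV_READY_W : 0) |
	       ((e & EPOLLRDHUP) ? EV_SHUT_R  : 0) |
	       ((e & EPOLLHUP)   ? EV_SHUT_RW : 0) |
	       ((e & EPOLLERR)   ? EV_ERR_RW  : 0);
}

/* Usage example: each epoll bit contributes exactly one flag. */
static void epoll_to_fd_events_selftest(void)
{
	assert(epoll_to_fd_events(0) == 0);
	assert(epoll_to_fd_events(EPOLLIN | EPOLLRDHUP) == (EV_READY_R | EV_SHUT_R));
	assert(epoll_to_fd_events(EPOLLHUP | EPOLLERR) == (EV_SHUT_RW | EV_ERR_RW));
}
```

Keeping the mapping free of side effects makes it easy to test on its own. In the loop above the result is then handed to fd_update_events(), which is the only step that touches shared FD state.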
static int init_epoll_per_thread()
{
	int fd;
	epoll_events = calloc(1, sizeof(struct epoll_event) * global.tune.maxpollevents);
	if (epoll_events == NULL)
		goto fail_alloc;
	if (MAX_THREADS > 1 && tid) {
		epoll_fd[tid] = epoll_create(global.maxsock + 1);
		if (epoll_fd[tid] < 0)
			goto fail_fd;
	}
	/* we may have to unregister some events initially registered on the
	 * original fd when it was alone, and/or to register events on the new
	 * fd for this thread. Let's just mark them as updated, the poller will
	 * do the rest.
	 */
	for (fd = 0; fd < global.maxsock; fd++)
2018-01-18 13:16:02 -05:00
		updt_fd_polling(fd);

	return 1;
 fail_fd:
	free(epoll_events);
 fail_alloc:
	return 0;
}

static void deinit_epoll_per_thread()
{
	if (MAX_THREADS > 1 && tid)
		close(epoll_fd[tid]);

	free(epoll_events);
	epoll_events = NULL;
}

/*
 * Initialization of the epoll() poller.
 * Returns 0 in case of failure, non-zero in case of success. If it fails, it
 * disables the poller by setting its pref to 0.
 */
static int _do_init(struct poller *p)
{
	p->private = NULL;

	epoll_fd[tid] = epoll_create(global.maxsock + 1);
	if (epoll_fd[tid] < 0)
		goto fail_fd;

	hap_register_per_thread_init(init_epoll_per_thread);
	hap_register_per_thread_deinit(deinit_epoll_per_thread);

	return 1;

 fail_fd:
	p->pref = 0;
	return 0;
}

/*
 * Termination of the epoll() poller.
 * Memory is released and the poller is marked as unselectable.
 */
static void _do_term(struct poller *p)
{
	if (epoll_fd[tid] >= 0) {
		close(epoll_fd[tid]);
		epoll_fd[tid] = -1;
	}

	p->private = NULL;
	p->pref = 0;
}

/*
 * Check that the poller works.
 * Returns 1 if OK, otherwise 0.
 */
static int _do_test(struct poller *p)
{
	int fd;

	fd = epoll_create(global.maxsock + 1);
	if (fd < 0)
		return 0;
	close(fd);
	return 1;
}

/*
 * Recreate the epoll file descriptor after a fork(). Returns 1 if OK,
 * otherwise 0. This ensures that processes do not share their epoll_fd.
 * Sharing it caused side effects, such as epoll_wait() returning an FD
 * which was previously deleted.
 */
static int _do_fork(struct poller *p)
{
	if (epoll_fd[tid] >= 0)
		close(epoll_fd[tid]);
	epoll_fd[tid] = epoll_create(global.maxsock + 1);
	if (epoll_fd[tid] < 0)
		return 0;
	return 1;
}

/*
 * It is a constructor, which means that it will automatically be called before
 * main(). This is GCC-specific but it works at least since 2.95.
 * Special care must be taken so that it does not need any uninitialized data.
 */
__attribute__((constructor))
static void _do_register(void)
{
	struct poller *p;
	int i;

	if (nbpollers >= MAX_POLLERS)
		return;
	for (i = 0; i < MAX_THREADS; i++)
		epoll_fd[i] = -1;

	p = &pollers[nbpollers++];

	p->name = "epoll";
	p->pref = 300;
	p->flags = HAP_POLL_F_ERRHUP; // note: RDHUP might be dynamically added
	p->private = NULL;
	p->clo  = __fd_clo;
	p->test = _do_test;
	p->init = _do_init;
	p->term = _do_term;
	p->poll = _do_poll;
	p->fork = _do_fork;
}
/*
* Local variables:
* c-indent-level: 8
* c-basic-offset: 8
* End:
*/
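
The commit message attached to this function describes how each thread decides what to do with an FD in its own epoll set: consult the FD's polled_mask and thread_mask, skip entries whose status did not change, and remove the FD when it has migrated to another thread. The block below is a minimal sketch of that decision only, not the code from this file. Every name in it (`sketch_fd`, `sketch_op`, `sketch_update`, the `SKETCH_*` constants) is hypothetical, the operation enum merely stands in for `EPOLL_CTL_ADD`, `EPOLL_CTL_MOD` and `EPOLL_CTL_DEL`, and it assumes a single-threaded caller, so the atomic updates a threaded poller would need are left out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_THREADS 64   /* hypothetical limit, one bit per thread */

#define SKETCH_EV_RD 0x1u       /* hypothetical "poll for read" flag */
#define SKETCH_EV_WR 0x2u       /* hypothetical "poll for write" flag */

/* Hypothetical stand-ins for EPOLL_CTL_ADD / EPOLL_CTL_MOD / EPOLL_CTL_DEL. */
enum sketch_op {
	SKETCH_OP_NONE,  /* nothing to change in this thread's epoll set */
	SKETCH_OP_ADD,
	SKETCH_OP_MOD,
	SKETCH_OP_DEL
};

struct sketch_fd {
	uint64_t thread_mask;  /* threads allowed to process this FD */
	uint64_t polled_mask;  /* threads whose epoll set holds this FD */
};

/*
 * Decide which operation thread <tid> should apply to its own epoll set
 * for <fd>, given the previously polled events <prev> and the newly
 * wanted events <next>. Updates polled_mask to reflect the decision.
 */
enum sketch_op sketch_update(struct sketch_fd *fd, unsigned int tid,
                             unsigned int prev, unsigned int next)
{
	uint64_t tid_bit;
	int owned;

	assert(tid < SKETCH_MAX_THREADS);
	tid_bit = (uint64_t)1 << tid;
	owned = (fd->thread_mask & tid_bit) != 0;

	if (fd->polled_mask & tid_bit) {
		if (!owned || !next) {
			/* FD migrated away or no longer wanted: drop it */
			fd->polled_mask &= ~tid_bit;
			return SKETCH_OP_DEL;
		}
		if (prev == next)
			return SKETCH_OP_NONE;  /* status unchanged, skip */
		return SKETCH_OP_MOD;
	}

	if (owned && next) {
		/* first registration on this thread */
		fd->polled_mask |= tid_bit;
		return SKETCH_OP_ADD;
	}

	return SKETCH_OP_NONE;  /* not ours, or nothing to poll */
}
```

In this sketch, a caller would translate the returned value into the matching `epoll_ctl()` call on its own thread's epoll file descriptor. Keeping one bit per thread in `polled_mask` is what lets a thread notice that it still polls an FD it no longer owns.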