sepoll counts the number of speculative events it has processed in
order to remain fair with epoll_wait(). If a same FD is processed
both for read and for write, it is counted twice. Fix this.
Some I/O callbacks are able to close their socket themselves. We
want to check this before calling epoll_ctl(EPOLL_CTL_DEL), otherwise
we get a -1 EBADF. Right now is looks like this could not cause any
trouble but the case is racy enough to fix it.
There is already an optimisation in the speculative poller which
causes newly created FDs to be checked immediately after being
created. Unfortunately, this optimisation causes the whole spec
list to be re-checked while we're only interested in the new FDs.
Doing this minor change causes performance gains of up to 6% on
medium-sized objects with a few hundreds concurrent connections.
When an accept() creates a new FD, it is already marked as set for
reads. But the task will be woken up without first checking if the
socket could be read.
The speculative I/O gives us a chance to either read the FD if there
are data pending on it, or immediately mark it for poll mode if
nothing is pending.
Simply doing this reduces the number of calls to process_session
from 6 to 5 per session, 2 to 1 calls to process_request, 10% less
calls to epoll_ctl, fd_clr, fd_set, stream_sock_data_update, 20%
less eb32_insert/eb_delete, etc... General performance increase
seems to be around 3%.
If __fd_clo() was called on a file descriptor which was previously
disabled, it was not removed from the spec list. This apparently
could not happen on previous code because the TCP states prevented
this, but now it happens regularly. The effects are spec entries
stuck populated, leading to busy loops.
It should be stated as a rule that a C file should never
include types/xxx.h when proto/xxx.h exists, as it gives
less exposure to declaration conflicts (one of which was
caught and fixed here) and it complicates the file headers
for nothing.
Only types/global.h, types/capture.h and types/polling.h
have been found to be valid includes from C files.
This is the first attempt at moving all internal parts from
using struct timeval to integer ticks. Those provides simpler
and faster code due to simplified operations, and this change
also saved about 64 bytes per session.
A new header file has been added : include/common/ticks.h.
It is possible that some functions should finally not be inlined
because they're used quite a lot (eg: tick_first, tick_add_ifset
and tick_is_expired). More measurements are required in order to
decide whether this is interesting or not.
Some function and variable names are still subject to change for
a better overall logics.
The first implementation of the monotonic clock did not verify
forward jumps. The consequence is that a fast changing time may
expire a lot of tasks. While it does seem minor, in fact it is
problematic because most machines which boot with a wrong date
are in the past and suddenly see their time jump by several
years in the future.
The solution is to check if we spent more apparent time in
a poller than allowed (with a margin applied). The margin
is currently set to 1000 ms. It should be large enough for
any poll() to complete.
Tests with randomly jumping clock show that the result is quite
accurate (error less than 1 second at every change of more than
one second).
If the system date is set backwards while haproxy is running,
some scheduled events are delayed by the amount of time the
clock went backwards. This is particularly problematic on
systems where the date is set at boot, because it seldom
happens that health-checks do not get sent for a few hours.
Before switching to use clock_gettime() on systems which
provide it, we can at least ensure that the clock is not
going backwards and maintain two clocks : the "date" which
represents what the user wants to see (mostly for logs),
and an internal date stored in "now", used for scheduled
events.
Under some circumstances, a task may already lie in the run queue
(eg: inter-task wakeup). It is disastrous to wait for an event in
this case because some processing gets delayed.
If too many events are set for spec I/O, those ones can starve the
polled events. Experiments show that when polled events starve, they
quickly turn into spec I/O, making the situation even worse. While
we can reduce the number of polled events processed at once, we
cannot do this on speculative events because most of them are new
ones (avg 2/3 new - 1/3 old from experiments).
The solution against this problem relies on those two factors :
1) one FD registered as a spec event cannot be polled at the same time
2) even during very high loads, we will almost never be interested in
simultaneous read and write streaming on the same FD.
The first point implies that during starvation, we will not have more than
half of our FDs in the poll list, otherwise it means there is less than that
in the spec list, implying there is no starvation.
The second point implies that we're statically only interested in half of
the maximum number of file descriptors at once, because we will unlikely
have simultaneous read and writes for a same buffer during long periods.
So, if we make it possible to drain maxsock/2/2 during peak loads, then we
can ensure that there will be no starvation effect. This means that we must
always allocate maxsock/4 events for the poller.
Last, sepoll uses an optimization consisting in reducing the number of calls
to epoll_wait() to once every too polls. However, when dealing with many
spec events, we can wait very long and skipping epoll_wait() every second
time increases latency. For this reason, we try to detect if we are beyond
a reasonable limit and stop doing so at this stage.
Due to the way Linux delivers EPOLLIN and EPOLLHUP, a closed connection
received after some server data sometimes results in truncated responses
if the client disconnects before server starts to respond. The reason
is that the EPOLLHUP flag is processed as an indication of end of
transfer while some data may remain in the system's socket buffers.
This problem could only be triggered with sepoll, although nothing should
prevent it from happening with normal epoll. In fact, the work factoring
performed by sepoll increases the risk that this bug appears.
The fix consists in making FD_POLL_HUP and FD_POLL_ERR sticky and that
they are only checked if FD_POLL_IN is not set, meaning that we have
read all pending data.
That way, the problem is definitely fixed and sepoll still remains about
17% faster than epoll since it can take into account all information
returned by the kernel.
Under some circumstances, it was possible with speculative I/O to
reallocate multiple entries for the same FD if an fd_{set,clr,set}
or fd_{clr,set,clr} sequences were performed before a schedule.
Fix this by keeping a an allocation flag for each fd.
By default, epoll/kqueue used to return as many events as possible.
This could sometimes cause huge latencies (latencies of up to 400 ms
have been observed with many thousands of fds at once). Limiting the
number of events returned also reduces the latency by avoiding too
many blind processing. The value is set to 200 by default and can be
changed in the global section using the tune.maxpollevents parameter.
Recreate the epoll file descriptor after a fork(). It will ensure
that all processes will not share their epoll_fd. Some side effects
were encountered because of this, such as epoll_wait() returning an
FD which was previously deleted, in multi-process mode.
Introduction of timeval timers broke *poll-based pollers, because the call to
tv_ms_remain may return 0 while the event is not elapsed yet. Now we carefully
check for those cases and round the result up by 1 ms.
It was possible in ev_sepoll() to ignore certain events if
all speculative events had been processed at once, because
the epoll_wait() timeout was not cleared, thus delaying the
events delivery.
The state machine was complicated, it has been rewritten.
It seems faster and more correct right now.
The timeout functions were difficult to manipulate because they were
rounding results to the millisecond. Thus, it was difficult to compare
and to check what expired and what did not. Also, the comparison
functions were heavy with multiplies and divides by 1000. Now, all
timeouts are stored in timevals, reducing the number of operations
for updates and leading to cleaner and more efficient code.
Two states were missing in the speculative epoll state transition
matrix. This could cause some timeouts and unhandled events. The
problem showed up in TCP mode with a fast server at high session
rates, but could in theory also affect HTTP mode.
The principle behind speculative I/O is to speculatively try to
perform I/O before registering the events in the system. This
considerably reduces the number of calls to epoll_ctl() and
sometimes even epoll_wait(), and manages to increase overall
performance by about 10%.
The new poller has been called "sepoll". It is used by default
on Linux when it works. A corresponding option "nosepoll" and
the command line argument "-ds" allow to disable it.