In order to support delimiting output events with other characters than
just the LF, let's pass the delimiter through the API. The default remains
the LF, used by applet_append_line(), and ignored by the log forwarder.
ring_dispatch_messages() now takes an optional argument <processed> which
must point to a size_t counter when provided.
When provided, the value is updated to the number of messages processed
by the function.
Since 40d1c84bf0 ("BUG/MAJOR: ring: free the ring storage not the ring
itself when using maps"), munmap() call for startup_logs's ring and
file-backed rings fails to work (EINVAL) and causes memory leaks during
process cleanup.
munmap() fails because it is called with the ring's usable area pointer
which is an offset from the underlying original memory block allocated
using mmap(). Indeed, ring_area() helper function was misused because
it didn't explicitly mention that the returned address corresponds to
the usable storage's area, not the allocated one.
To fix the issue, we add an explicit ring_allocated_area() helper to
return the allocated area for the ring, just like we already have
ring_allocated_size() for the allocated size, and we properly use both
the allocated size and allocated area to manipulate them using munmap()
and msync().
No backport needed.
There's currently an abiguity around ring_size(), it's said to return
the allocated size but returns the usable size. We can't change it as
it's used everywhere in the code like this. Let's fix the comment and
add ring_allocated_size() instead for anything related to allocation.
We're now locking the tail while looking for some room in the ring. In
fact it's still while writing to it, but the goal definitely is to get
rid of the lock ASAP. For this we reserve the topmost bit of the tail
as a lock, which may have as a possible visible effect that buffers will
be limited to 2GB instead of 4GB on 32-bit machines (though in practise,
good luck for allocating more than 2GB contiguous on 32-bit), but in
practice since the size is read with atol() and some operating systems
limit it to LONG_MAX unless passing negative numbers, the limit is
already there.
For now the impact on x86_64 is significant (drop from 2.35 to 1.4M/s
on 48 threads on EPYC 24 cores) but this situation is only temporary
so that changes can be reviewable and bisectable.
Other approaches were attempted, such as using XCHG instead, which is
slightly faster on x86 with low thread counts (but causes more write
contention), and forces readers to stall under heavy traffic because
they can't access a valid value for the queue anymore. A CAS requires
preloading the value and is les good on ARMv8.1. XADD could also be
considered with 12-13 upper bits of the offset dedicated to locking,
but that looks overkill.
The purpose is to store a head and a tail that are independent so that
we can further improve the API to update them independently from each
other.
The struct was arranged like the original one so that as long as a ring
has its head set to zero (i.e. no recycling) it will continue to work.
The new format is already detectable thanks to the "rsvd" field which
indicates the number of reserved bytes at the beginning. It's located
where the buffer's area pointer previously was, so that older versions
of haring can continue to open the ring in repair mode, and newer ones
can use the fact that the upper bits of that variable are zero to guess
that it's working with the new format instead of the old one. Also let's
keep in mind that the layout will further change to place some alignment
constraints.
The haring tool will thus updated based on this and it detects that the
rsvd field is smaller than a page and that the sum of it with the size
equals the mapped size, in which case it uses the new dump_v2() function
instead of dump_v1(). The new function also creates a buffer from the
ring's area, size, head and tail and calls the generic one so that no
other code had to be adapted.
The code now looks cleaner and more easily shows what still needs to be
addressed. There are not that many changes in practice, these are mostly
mechanical, essentially hiding the buffer from the callers.
We'll need to add more complex structures in the ring, such as wait
queues. That's far too much to be stored into the area in case of
file-backed contents, so let's split the ring definition and its
storage once for all.
This patch introduces a struct ring_storage which is assigned to
ring->storage, which contains minimal information to represent the
storage layout, i.e. for now only the buffer, and all the rest
remains in the ring itself. The storage is appended immediately after
it and the buffer's pointer always points to that area. It has the
benefit of remaining 100% compatible with the existing file-backed
layout. In memory, the allocation loses the size of a struct buffer.
It's not even certain it's worth placing the size there, given that it's
constant and that a dump of a ring wouldn't really need it (the file size
is sufficient). But for now everything comes with the struct buffer, and
later this will change once split into head and tail. Also this area may
be completed with more information in the future (e.g. storage version,
format, endianness, word size etc).
This will mostly be used during reallocation and boot-time duplicates,
the purpose is simply to save the caller from having to know the details
of the internal representation.
This new function is made around the loop that scans a ring for new
messages and dispatches them to a message handler. It also takes
ring flags (WAIT, NEW, etc) and offset pointers that the caller will
use to initialize/reuse/update the current processing offset. The
caller is still responsible for presetting it to ~0 before the
first call if it wants the function to automatically adjust it (or set
it to the correct value). The function may also return the last_ofs
that was known before releasing the lock so that the caller knows
what to compare against and if it needs to restart processing or not.
The context remains a void* so that should not necessarily depend on
an appctx.
The current "show ring" code was ported to this and it continues to
work as expected.
Cast an unified ring + storage area to a ring from area, without
reinitializing the data buffer. Reinitialize the waiters and the lock.
It helps retrieving a previously allocated ring, from an mmap for
example.
Instead of allocating two parts, one for the ring struct itself and
one for the storage area, ring_make_from_area() will arrange the two
inside the same memory area, with the storage starting immediately
after the struct. This will allow to store a complete ring state in
shared memory areas for example.
The ring watch flags (wait, seek end) were dangerously passed via ctx.cli.i0
from "show buf" in sink.c:cli_parse_show_events(), or implicitly reset in
"show errors". That's very unconvenient, difficult to follow, and prone to
short-term breakage.
Let's pass an extra argument to ring_attach_cli() to take these flags, now
defined in ring-t.h as RING_WF_*, and let the function set them itself
where appropriate (still ctx.cli.i0 for now).