While here change the field size from long to int and move it into the
gap next to cn_flags.
Shrinks struct componentname from 64 to 56 bytes on amd64.
Currently, we parse notes for the values of ELF FreeBSD feature flags
and osrel. Knowing these values, or knowing that image does not carry
the note if pointers are NULL, is useful to decide which ABI variant
(brand) we want to activate for the image.
Right now this is only a plumbing change
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25273
It allows a sysent to share existing usermode data in shared page with
other sysent, assuming ABI differences are not in the layout of the
page.
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25273
Attempt of adding assertions that pgrp->pg_jobc counters do not
underflow in r361967, reverted in r362910, points out bugs in the
handling of job control. Peter Holm was able to narrow down the
problem to very easy reproduction with timeout(1) which uses reaping.
The following list of problems with calculation of pg_jobs which
directs SIGHUP/SIGCONT delivery for orphaned process group was
identified:
- Re-calculation of the orphaned status for children of exiting parent
was wrong, but mostly unnoticed when all children were reparented to
init(8). When child can be reparented to a different process which
could affect the child' job control state, it was not properly
accounted for in pg_jobc.
- Lockless check for exiting process' parent process group is racy
because nothing prevents the parent from changing its group
membership.
- Exited process is left in the process group, until waited. This
affects other calculations of pg_jobc.
Split handling of job control status on process changing its process
group, and process exiting. Calculate increments and decrements for
pg_jobs by exact checking the orphanage instead of assuming process
group membership for children and parent. Move the call to killjobc()
later under the proctree_lock. Mark exiting process in killjobc()
with a new flag P_TREE_GRPEXITED and skip it for all pg_jobc
calculations after the flag is set.
Add checker that independently recalculates pg_jobc value and compares
it with the memoized process group state. This is enabled under INVARIANTS.
Reviewed by: jilles
Discussed with: kevans
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D26116
hw.bus.devctl_disable has tagged been obsolete for a decade. Remove it. Also
remove some long obsolete comments. This was done and backed out once in 2014,
but we've had enough releases with the 'new' method of setting queue length that
we can just remove this sysctl now (stable/11, stable/12 and current all don't
reference it).
The actual bug is not yet addressed as it will get much easier after other
problems are addressed (most notably rename contract).
The only affected in-tree consumer is realpath. Everyone else happens to be
performing lookups within a mount point, having a side effect of ni_dvp being
set to mount point's root vnode in the worst case.
Reported by: pho
with vnode locked, use NOWAIT alloc and only report when we don't overflow.
These changes were accidentally omitted from r364402, except for the not
reporting on overflow. They were lumped in with a debugging commit in my tree
that I omitted w/o realizing this.
Other issues from the review are pending some other changes I need to do first.
data records.
The kernel RPC cannot process non-application data records when
using TLS. It must to an upcall to a userspace daemon that will
call SSL_read() to process them.
This patch adds a new flag called MSG_TLSAPPDATA that the kernel
RPC can use to tell sorecieve() to return ENXIO instead of a non-application
data record, when that is what is at the top of the receive queue.
I put the code in #ifdef KERN_TLS/#endif, although it will build without
that, so that it is recognized as only useful when KERN_TLS is enabled.
The alternative to doing this is to have the kernel RPC re-queue the
non-application data message after receiving it, but that seems more
complicated and might introduce message ordering issues when there
are multiple non-application data records one after another.
I do not know what, if any, changes will be required to support TLS1.3.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D25923
Report when a filesystem is mounted, remounted or unmounted via devd, along with
details about the mount point and mount options.
Discussed with: kib@
Reviewed by: kirk@ (prior version)
Sponsored by: Netflix
Diffential Revision: https://reviews.freebsd.org/D25969
While not strictly true this stops them from trying to use the KCSAN atomic
hooks and allows these to be compiled into the same kernel.
Sponsored by: Innovate UK
Some of the resulting fallout in CAM does not appear straightforward to
fix, so simply revert the commit for now in the absence of a better
solution.
Discussed with: mjg
Reported by: dhw
non-sleepable context. Previously only _sleep() would panic.
This will catch misuse of M_WAITOK at development stage rather
than at stress load stage.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D26027
If possible, i.e. if the requested range is resident valid in the vm
object queue, and some secondary conditions hold, copy data for
read(2) directly from the valid cached pages, avoiding vnode lock and
instantiating buffers. I intentionally do not start read-ahead, nor
handle the advises on the cached range.
Filesystems indicate support for VMIO reads by setting VIRF_PGREAD
flag, which must not be cleared until vnode reclamation.
Currently only filesystems that use vnode pager for v_objects can
enable it, due to reliance on vnp_size. There is a WIP to handle it
for tmpfs.
Reviewed by: markj
Discussed with: jeff
Tested by: pho
Benchmarked by: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25968
The first time Witness observes a lock order between two locks, it records
the caller's stack. On detected reversal, print out that previous observed
stack. It is quite possible that the reported "LOR" is the correct
ordering, and the violation was the observed earlier ordering.
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D26070
Avoid performing a potentially-blocking malloc for kenv lookups that will only
perform non-destructive integer conversions on the returned buffer. Instead,
perform the strtoq() in-place with the kenv lock held.
While here, factor the logic around kenv_lock acquire and release into
kenv_acquire() and kenv_release(), and use these functions for some light
cleanup. Collapse getenv_string_buffer() into kern_getenv(), as the former
no longer has any other callers and the only additional task performed by
the latter is a WITNESS check that hasn't been useful since r362231.
PR: 248250
Reported by: gbe
Reviewed by: mjg
Tested by: gbe
Differential Revision: https://reviews.freebsd.org/D26010
This is to avoid conflicts with a upcoming macro. pipe_pages is a
more accurate name since the field tracks pages wired into the kernel as
part of a process-to-process copy operation.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Add prng(9) as a replacement for random(9) in the kernel.
There are two major differences from random(9) and random(3):
- General prng(9) APIs (prng32(9), etc) do not guarantee an
implementation or particular sequence; they should not be used for
repeatable simulations.
- However, specific named API families are also exposed (for now: PCG),
and those are expected to be repeatable (when so-guaranteed by the named
algorithm).
Some minor differences from random(3) and earlier random(9):
- PRNG state for the general prng(9) APIs is per-CPU; this eliminates
contention on PRNG state in SMP workloads. Each PCPU generator in an
SMP system produces a unique sequence.
- Better statistical properties than the Park-Miller ("minstd") PRNG
(longer period, uniform distribution in all bits, passes
BigCrush/PractRand analysis).
- Faster than Park-Miller ("minstd") PRNG -- no division is required to
step PCG-family PRNGs.
For now, random(9) becomes a thin shim around prng32(). Eventually I
would like to mechanically switch consumers over to the explicit API.
Reviewed by: kib, markj (previous version both)
Discussed with: markm
Differential Revision: https://reviews.freebsd.org/D25916
The conversion was largely mechanical: sed(1) with:
-e 's|mtx_assert(&devmtx, MA_OWNED)|dev_lock_assert_locked()|g'
-e 's|mtx_assert(&devmtx, MA_NOTOWNED)|dev_lock_assert_unlocked()|g'
The definitions of these abstractions in fs/devfs/devfs_int.h are the
only non-mechanical change.
No functional change.
This removes a lot of special casing from the VFS layer.
Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D25612
namei de facto expects that the naimeidata object is properly initialized,
but at the same time it mixes consumer-passable and internal flags, while
tolerating this part by explicitly clearing some of them.
Tighten the interface instead.
While here renumber the flags and denote the gap between the 2 variants.
Try to piggy back th renumber on the just bumped __FreeBSD_version.
This means there is no expectation lookup will purge the terminal entry,
which simplifies lockless lookup.
Tested by: pho
Sponsored by: The FreeBSD Foundation
The current scheme of calling VOP_GETATTR adds avoidable overhead.
An example with tmpfs doing fstat (ops/s):
before: 7488958
after: 7913833
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D25910
Make sure to reclaim epoch structures when they are freed to support
dynamic allocation and freeing of epoch structures.
While at it, move the 64 supported epoch control structures to the
static memory domain. This overall simplifies the management and
debugging of system epoch's.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D25960
MFC after: 1 week
Sponsored by: Mellanox Technologies
- Convert panic() calls to INVARIANTS-only assertions. The PCTRIE code
provides some of the same protection since it will panic upon an
attempt to remove a non-resident buffer.
- Update the comment above reassignbuf() to reflect reality.
Reviewed by: cem, kib, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25965
As the 20-year old comment above it suggests, the counter is of dubious
value. Moreover, the (global) counter was not updated precisely and
hurts scalability.
Reviewed by: cem, kib, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25965
The routine is called on successful write and read, which on pipes happens a
lot and for small sizes.
Precision provided by default seems way bigger than necessary and it causes
problems in vms on amd64 (it rdtscp's which vmexits). getnanotime seems to
provide the level roughly in lines of Linux so we should be good here.
Sample result from will-it-scale pipe1_processes -t 1 (ops/s):
before: 426464
after: 3247421
Note the that atime handling for named pipes is broken with and without the
patch. The filesystem code is never used for updating atime and never looks
at the updated field. Consequently, while there are no provisions added to
handle named pipes separately, the change is a nop for that case.
Differential Revision: https://reviews.freebsd.org/D23964
This reduces struct namecache by sizeof(void *).
Negative side is that we have to find the previous element (if any) when
removing an entry, but since we normally don't expect collisions it should be
fine.
Note this adds cache_get_hash calls which can be eliminated.
It used to be sizeof of the given struct to accomodate for 32 bit mips
doing 64 bit loads, but the same can be achieved with requireing just
64 bit alignment.
While here reorder struct namecache so that most commonly used fields
are closer.
This removes flag setting/unsetting carried over from regular lookup.
Flags still get for compatibility when falling back.
Note .. and . handling can get partially folded together.
This allows making half-constructed entries visible to the lockless lookup,
which now can check for either "not yet fully constructed" and "no longer valid"
state.
This will be used for .. lookup.
cache_purge locklessly checks whether the vnode at hand has any namecache
entries. This can race with a concurrent purge which managed to remove
the last entry, but may not be done touching the vnode.
Make sure we observe the relevant vnode lock as not taken before proceeding
with vgone.
Paired with the fact that doomed vnodes cannnot receive entries this restores
the invariant that there are no namecache-related writing users past cache_purge
in vgone.
Reported by: pho
These functions were introduced before UMA started ensuring that freed
memory gets placed in domain-local caches. They no longer serve any
purpose since UMA now provides their functionality by default. Remove
them to simplyify the kernel memory allocator interfaces a bit.
Reviewed by: cem, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25937
The constant seems to exists on MacOS X >= 10.8.
Requested by: swills
Reviewed by: allanjude, kevans
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25933