Commit 197997a broke handling of the offset
arguments to copy_file_range() when specified non-NULL.
The code fails to update the offsets and, as such, a loop like:
do {
len = copy_file_range(infd, &inpos, outfd, &outpos,
SSIZE_MAX, 0);
} while (len > 0);
becomes an infinite loop, just doing the same copy over and
over again.
This patch fixes it.
The clause "(foffsets_locked || foffsets_set)" in the if is not
actually needed for correctness, but I thought it made the code
a little more readable and might avoid some static
analyzer from throwing a "used before being set" for
the savinoff and savoutoff variables.
Approved by: so
Security: FreeBSD-EN-25:16.vfs
(cherry picked from commit 4046ad6bb0)
(cherry picked from commit 2fd0083fcc23f4c25860b8890292448720a5961c)
This will be useful in an upcoming change. No functional change
intended.
Reviewed by: jamie
MFC after: 2 weeks
Sponsored by: Stormshield
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D51524
(cherry picked from commit 748a4ea1ca)
Replace priorities specified by a base priority and some hardcoded
offset value by symbolic constants. Hardcoded offsets prevent changing
the difference between priorities without changing their relative
ordering, and is generally a dangerous practice since the resulting
priority may inadvertently belong to a different selection policy's
range.
Since RQ_PPQ is 4, differences of less than 4 are insignificant, so just
remove them. These small differences have not been changed for years,
so it is likely they have no real meaning (besides having no practical
effect). One can still consult the changes history to recover them if
ever needed.
No functional change (intended).
MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45390
(cherry picked from commit 8ecc419180)
This commit also includes the original refactoring changes
This change allows the kernel to operate with the default netisr cpu-affinity settings while having RSS compiled in. Normally, RSS changes quite a bit of the behaviour of the kernel dispatch service - this change allows for reducing impact on incompatible hardware while preserving the option to boost throughput speeds based on packet flow CPU affinity.
Make sure to compile the following options in the kernel:
options RSS
As well as setting the following sysctls:
net.inet.rss.enabled: 1
net.isr.bindthreads: 1
net.isr.maxthreads: -1 (automatically sets it to the number of CPUs)
And optionally (to force a 1:1 mapping between CPUs and buckets):
net.inet.rss.bits: 3 (for 8 CPUs)
net.inet.rss.bits: 2 (for 4 CPUs)
etc.
Set pin_default_swi to 0 by default in the RSS case.
if allow.routing is set, the jail can modify the system routing table
even if it's not a VNET jail.
Reviewed by: kevans, des, adrian
Approved by: kevans (mentor), des (mentor)
Differential Revision: https://reviews.freebsd.org/D49843
(cherry picked from commit 3a53fe2cc4)
This is derived from swills@ fork of the Juniper virtfs with many
changes by me including bug fixes, style improvements, clearer layering
and more consistent logging. The filesystem is renamed to p9fs to better
reflect its function and to prevent possible future confusion with
virtio-fs.
Several updates and fixes from Juniper have been integrated into this
version by Val Packett and these contributions along with the original
Juniper authors are credited below.
To use this with bhyve, add 'virtio_p9fs_load=YES' to loader.conf. The
bhyve virtio-9p device allows access from the guest to files on the host
by mapping a 'sharename' to a host path. It is possible to use p9fs as a
root filesystem by adding this to /boot/loader.conf:
vfs.root.mountfrom="p9fs:sharename"
for non-root filesystems add something like this to /etc/fstab:
sharename /mnt p9fs rw 0 0
In both examples, substitute the share name used on the bhyve command
line.
The 9P filesystem protocol relies on stateful file opens which map
protocol-level FIDs to host file descriptors. The FreeBSD vnode
interface doesn't really support this and we use heuristics to guess the
right FID to use for file operations. This can be confused by privilege
lowering and does not guarantee that the FID created for a given file
open is always used for file operations, even if the calling process is
using the file descriptor from the original open call. Improving this
would involve changes to the vnode interface which is out-of-scope for
this import.
Differential Revision: https://reviews.freebsd.org/D41844
Reviewed by: kib, emaste, dch
MFC after: 3 months
Co-authored-by: Val Packett <val@packett.cool>
Co-authored-by: Ka Ho Ng <kahon@juniper.net>
Co-authored-by: joyu <joyul@juniper.net>
Co-authored-by: Kumara Babu Narayanaswamy <bkumara@juniper.net>
Having 'kern.maxvnodes' higher than 'kern.maxfiles' mitigates a scenario
where some processes can eat up all vnodes in the system, causing
a deadlock, as long as the kernel itself does not create too many vnodes
without creating some file descriptor in some process' FD table. A very
small percentage (~0.6%) of excess vnodes at infinity, coupled with
a large difference near the origin, should cover basic cases more than
enough. Note however that this measure can be defeated, e.g., by using
nullfs mounts with non-trivial file hierarchies.
MFC after: 5 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50314
(cherry picked from commit e9baf472a6)
(cherry picked from commit a6b05a35ce3c516cadfca49c310dffaabbe40440)
Approved by: re (cperciva)
Remove the parts that describe what is already in the code formula. Add
hints about which of 'physvnodes'/'virtvnodes' takes precedence, and
when 'desiredvnodes' becomes smaller than 'maxfiles'. These were both
computed analytically and verified experimentally.
Reviewed by: kib (older version)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50310
(cherry picked from commit bf8890c84b)
(cherry picked from commit 5b07152f3ec3eea78958796b45594c5bd96c0bfb)
Approved by: re (cperciva)
Change the formulae so that 'maxproc' and 'maxfiles' are computed based
on the physical memory amount, and not the corresponding number of
pages. Make sure that the practical results are unchanged for all
architectures (which is possible since all have 4096 as their
PAGE_SIZE). Despite this, we do this change to make it clearer how many
units of these values are computed per MB, which is easier on readers.
Change the comments accordingly and expand them to indicate which parts
of the computations are actually not used by default. In passing, fix the
comments about default values and limits for 'maxfiles' (they were off).
No functional change (intended).
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50124
(cherry picked from commit 73f9251c0b)
(cherry picked from commit e7d4e52eafb016707ef3dd3b778d76db280c06a4)
Approved by: re (cperciva)
Commit ab05a1cf32 intended to revert commit 8733bc277a
("vfs: don't provoke recycling non-free vnodes without a good reason"),
but due to intervening changes in commit 054f45e026 ("vfs: further
speed up continuous free vnode recycle"), it also had to revert part of
it. In particular, while removing the whole 'if (vn_alloc_cyclecount !=
0)' block, it inadvertantly removed the code block resetting
'vn_alloc_cyclecount' to 0 and skipping direct vnode reclamation (done
further below through vnlru_free_locked_direct()), which had been
outside the 'if' before the intervening commit.
Removing this block instead of reinstating it in practice causes
'vn_alloc_cyclecount' to (almost) never be 0, making vn_alloc() always
call vn_alloc_hard(), which takes the 'vnode_list_mtx' mutex. In other
words, this disables the fast path. [The reverted commit, which
introduced the 'if (vn_alloc_cyclecount != 0)' guarding this block,
actually never executed it because it also had the bug that
'vn_alloc_cyclecount' would always stay at 0, hiding its usefulness.]
Additionally, not skipping direct vnode reclamation even when there are
less vnodes than 'kern.maxvnodes' not only causes unnecessary contention
but also plain livelocks as vnlru_free_locked_direct() does not itself
check whether there are actually "free" (not referenced) vnodes to be
deallocated, and will blindly browse all the vnode list until it finds
one (which it may not, or only a few ones at the end). As the fast path
was disabled, all threads in the system would soon be competing for the
vnode list lock, outpacing the vnlru process that could never actually
recycle vnodes in a more agressive manner (i.e., even if they have
a non-zero hold count). And we could more easily get into this
situation, as each vnode allocation was reducing the count of "free"
vnodes, even if entirely new vnodes could be allocated instead. This
part was mitigated by the vnlru process (before the tipping point
described above), which explains why the mechanism would not always
livelock.
Not skipping direct vnode reclamation was arguably a bug introduced by
the intervening commit (054f45e026), but was mitigated by
vn_alloc_hard() not being called in the fast path. The revert commit,
by disabling the fast path, made it significantly annoying (to the point
of getting a few livelocks a week in some of my workloads).
Restore the reset of 'vn_alloc_cyclecount' to 0 and skip direct
reclamation when the current number of vnodes is below the
'kern.maxvnodes' limit, indicating we can start allocating a new 'struct
vnode' right away. While here, fix the comparison with the limit when
'bumped' is true.
Reviewed by: kib (older version), avg
Tested by: avg, olce
Fixes: ab05a1cf32 (revert of "vfs: don't provoke recycling non-free vnodes without a good reason")
Fixes: 054f45e026 ("vfs: further speed up continuous free vnode recycle")
MFC after: 5 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50338
(cherry picked from commit d90c9c24e2)
(cherry picked from commit ce6b7a5919da518715d6163baa42035d6d2c1bd0)
Approved by: re (cperciva)
Introduce two helpers, the more general SYSCTL_SIZEOF() and
a struct-specific one SYSCTL_SIZEOF_STRUCT() which prepends 'struct' in
the description and in the use of sizeof() but uses the raw structure
name as the knob's name. The size of the object/structure is exported
under 'debug.sizeof'.
Existing knobs under 'debug.sizeof' were all converted to use the
helpers.
Add a note before the helpers discouraging the introduction of new
leaves for ad-hoc reasons. List alternative means for developers to
obtain the size of arbitrary kernel structures easily (thanks to markj@
for providing these).
No functional change (intended).
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D50121
(cherry picked from commit 713abc9880)
(cherry picked from commit efce9f8a510b60736994e50288b78fc7b32b5d90)
Approved by: re (cperciva)
Print pointers to locks instead of their names to avoid a nested panic
if the lock object is corrupted.
Reviewed by: markj
Sponsored by: AFRL, DARPA
Differential Revision: https://reviews.freebsd.org/D49331
(cherry picked from commit a52a51a2d5)
Just print the pointer to the mutex instead of the name in case the
mutex is corrupted.
Reviewed by: olce, kib
Sponsored by: AFRL, DARPA
Differential Revision: https://reviews.freebsd.org/D49314
(cherry picked from commit 0ed104978c)
It is only (somewhat) safe to dereference lo_name if we know the mutex
has a specific lock class that is incorrect, not if just has "some"
incorrect lock class. In particular, in the case of memory
overwritten with 0xdeadc0de, the lock class won't match either mutex
type. However, trying to dereference lo_name via a 0xdeadc0de pointer
triggers a nested panic building the panicstr which then prevents a
crash dump.
Reviewed by: olce, kib, markj
Sponsored by: AFRL, DARPA
Differential Revision: https://reviews.freebsd.org/D49313
(cherry picked from commit dba45599c4)
Use a system-wide taskqueue for hot-plug events. This avoids possibly
blocking unrelated events on the thread taskqueue without requiring
multiple driver-specific taskqueues.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D49268
(cherry picked from commit 44d5f5ed1e)
so_unsplice() assumed that if SB_SPLICED is set in the receive buffer of
the first socket, then the splice is fully initialized. However, that's
not true, and it's possible for so_unsplice() to race ahead of
so_splice().
Modify so_unsplice() to simply bail if the splice state is embryonic.
Reported by: syzkaller
Reviewed by: gallatin
Fixes: a1da7dc1cd ("socket: Implement SO_SPLICE")
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49814
(cherry picked from commit 992b18a9ec)
There is a somewhat strange case where when writing to a POSIX shm
object, the object is not allowed to grow, and the I/O offset+length
overflows. In that case we simply truncate the I/O to the object size.
Later we write-lock the range [offset, objsize). However, we were not
checking whether offset > objsize, in which case we're writing zero
bytes but locking an invalid range.
Modify the range locking in shm_write() to take this possibility into
account. While here, rename a variable to make its purpose a bit more
clear, and add an assertion against negative offsets (which is supposed
to be enforced by the caller of fo_write for I/O to files that aren't
character devices).
Reported by: syzkaller
Reviewed by: kevans, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49673
(cherry picked from commit 82d8c609cf)
The check for range overlap did not correctly handle negative offests,
as the addition inoff + len is promoted to an unsigned type.
Reported by: syzkaller
Reviewed by: rmacklem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49674
(cherry picked from commit 1101d62822)
This is more useful and matches the documentation. While here, make it
settable as a tunable and add a sysctl description.
PR: 139425
MFC after: 2 weeks
(cherry picked from commit c5773d366e)
Extend malloc_usable_size() for contigmalloc; it seems the only
outside consumer is LinuxKPI ksize() which by itself has little to
no consumer either.
Sponsored by: The FreeBSD Foundation
Suggested by: jhb (see D46657)
Reviewed by: jhb, markj
Fixes: 9e6544dd6e
Differential Revision: https://reviews.freebsd.org/D49571
(cherry picked from commit c5cf4b64f4)
This structure collects count from multiple cred structures. Of course it
can't use a smaller type.
PR: 283747
Reviewed by: olce, mjg, markj
Differential Revision: https://reviews.freebsd.org/D49562
Fixes: 37337709d3
(cherry picked from commit cd46e98013)
During process exit, it's possible for the exiting thread to send a
signal to its process, via killjobc(). This happens after the itimer is
drained. If itimers are stopped, i.e., P2_ITSTOPPED is set, then
itimer_proc_continue() will resume the callout after it has been
drained.
Fix the problem by simply clearing P2_ITSTOPPED as part of the drain.
Then, a signal received after that point will not re-enable the callout.
For good measure, also make sure that we don't reset the itimer callout
in an exiting process.
Reported by: syzkaller
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49529
(cherry picked from commit a6268f89d5)
One can ask copy_file_range(2) to use the file offsets of the file
descriptions that it copies from and to. We were updating those offsets
without any locking, which is incorrect and can lead to unkillable loops
in the event of a race (e.g., the check for overlapping ranges in
kern_copy_file_range() is subject to a TOCTOU race with the following
loop which range-locks the input and output file).
Use foffset_lock() to serialize updates to the file descriptions, as we
do for other, similar system calls.
Reported by: syzkaller
Reviewed by: rmacklem, kib
MFC after: 2 weeks
Fixes: bbbbeca3e9 ("Add kernel support for a Linux compatible copy_file_range(2) syscall.")
Differential Revision: https://reviews.freebsd.org/D49440
(cherry picked from commit 197997a4c3)
This will be used in kern_copy_file_range(), which needs to lock two
offsets.
Reviewed by: kib, rmacklem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49440
(cherry picked from commit 12ecb0fe0a)
Add a default ctloutput handler and remove various NULL checks. This
fixes a problem wherein the generic SO_SETFIB handler did not check
whether the protocol has a ctloutput implementation before calling the
function pointer.
Reported by: syzkaller
Reviewed by: glebius
Fixes: caccbaef8e ("socket: Move SO_SETFIB handling to protocol layers")
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D49436
(cherry picked from commit 590b45036e)
When so_splice() links two sockets together, it first attaches the
splice control structure to the source socket; at that point, the splice
is in the idle state. After that point, a socket wakeup will queue up
work for a splice worker thread: in particular, so_splice_dispatch()
only queues work if the splice is idle.
Meanwhile, so_splice() continues initializing the splice, and finally
calls so_splice_xfer() to transfer any already buffered data. This
assumes that the splice is still idle, but that's not true if some async
work was already dispatched.
Solve the problem by introducing an initial "under construction" state
for the splice control structure, such that wakeups won't queue any work
until so_splice() has finished.
While here, remove an outdated comment from the beginning of
so_splice_xfer().
Reported by: syzkaller
Reviewed by: gallatin
Fixes: a1da7dc1cd ("socket: Implement SO_SPLICE")
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49437
(cherry picked from commit 5748163568)
When free() was adapted to support allocations originating from
contigmalloc(), redzone(9) support was not included. redzone(9)
involves adjusting the pointer to freed memory before looking up the
slab cookie, so it's not straightforward to make contigmalloc() opt out
of redzone support.
Thus, augment contigmalloc() to support redzone.
Reported by: glebius
Tested by: dhw
MFC after: 2 weeks
Fixes: 9e6544dd6e ("malloc(9): extend contigmalloc(9) by a "slab cookie"")
(cherry picked from commit 74361d693a)
Otherwise we end up searching for a mountpoint using an uninitialized
key, and likely failing the version test. This violates KMSAN's
invariants, so simply return immediately instead.
MFC after: 2 weeks
(cherry picked from commit d8703cd802)
This new system call allows to set all necessary credentials of
a process in one go: Effective, real and saved UIDs, effective, real and
saved GIDs, supplementary groups and the MAC label. Its advantage over
standard credential-setting system calls (such as setuid(), seteuid(),
etc.) is that it enables MAC modules, such as MAC/do, to restrict the
set of credentials some process may gain in a fine-grained manner.
Traditionally, credential changes rely on setuid binaries that call
multiple credential system calls and in a specific order (setuid() must
be last, so as to remain root for all other credential-setting calls,
which would otherwise fail with insufficient privileges). This
piecewise approach causes the process to transiently hold credentials
that are neither the original nor the final ones. For the kernel to
enforce that only certain transitions of credentials are allowed, either
these possibly non-compliant transient states have to disappear (by
setting all relevant attributes in one go), or the kernel must delay
setting or checking the new credentials. Delaying setting credentials
could be done, e.g., by having some mode where the standard system calls
contribute to building new credentials but without committing them. It
could be started and ended by a special system call. Delaying checking
could mean that, e.g., the kernel only verifies the credentials
transition at the next non-credential-setting system call (we just
mention this possibility for completeness, but are certainly not
endorsing it).
We chose the simpler approach of a new system call, as we don't expect
the set of credentials one can set to change often. It has the
advantages that the traditional system calls' code doesn't have to be
changed and that we can establish a special MAC protocol for it, by
having some cleanup function called just before returning (this is
a requirement for MAC/do), without disturbing the existing ones.
The mac_cred_check_setcred() hook is passed the flags received by
setcred() (including the version) and both the old and new kernel's
'struct ucred' instead of 'struct setcred' as this should simplify
evolving existing hooks as the 'struct setcred' structure evolves. The
mac_cred_setcred_enter() and mac_cred_setcred_exit() hooks are always
called by pairs around potential calls to mac_cred_check_setcred().
They allow MAC modules to allocate/free data they may need in their
mac_cred_check_setcred() hook, as the latter is called under the current
process' lock, rendering sleepable allocations impossible. MAC/do is
going to leverage these in a subsequent commit. A scheme where
mac_cred_check_setcred() could return ERESTART was considered but is
incompatible with proper composition of MAC modules.
While here, add missing includes and declarations for standalone
inclusion of <sys/ucred.h> both from kernel and userspace (for the
latter, it has been working thanks to <bsm/audit.h> already including
<sys/types.h>).
Reviewed by: brooks
Approved by: markj (mentor)
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D47618
(cherry picked from commit ddb3eb4efe)
If the lock is unowned (i.e., owner == UMUTEX_CONTESTED), we might get a
spurious failure, and in that case we need to retry the loop.
Otherwise, the calling thread can end up sleeping forever.
The same problem exists in do_set_ceiling(), which open-codes
do_lock_pp(), so fix it there too.
Reviewed by: olce
Reported by: Daniel King <dmking@adacore.com>
MFC after: 2 weeks
Sponsored by: Innovate UK
Differential Revision: https://reviews.freebsd.org/D49031
(cherry picked from commit 4b79443927)
This seems like a natural complement to umtxq_unbusy_unlocked(). No
functional change intended.
Reviewed by: olce, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D49124
(cherry picked from commit b01495caac)
This is an unlocked check, and after commit 34740937f7 the debug
checks in STAILQ_EMPTY may spuriously fail here. In particular, the per
process queue is updated under the global ktrace mutex, not held in
ktr_drain(). If a record is enqueued concurrently, the recording thread
will schedule an AST to drain the queue again, so it should not be
possible for a race to leave records in the queue indefinitely.
Reviewed by: kib, olce
Reported by: syzbot+d67eddd8c4923ee28bb7@syzkaller.appspotmail.com
MFC after: 2 weeks
Fixes: 34740937f7 ("queue: New debug macros for STAILQ")
Differential Revision: https://reviews.freebsd.org/D48899
(cherry picked from commit 36631977d8)
These correspond to the current implementations of
bus_generic_(probe|attach|detach) but with more accurate names and
semantics. The intention is to deprecate bus_generic_(probe|attach)
and reimplement bus_generic_detach in a future commit.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D47673
(cherry picked from commit 46297859a7)
bus_generic_(probe|attach|detach) will not be changed in stable/14,
but providing the new APIs in stable/14 permits drivers to use the new
APIs.