opnsense-src

mirror of https://github.com/opnsense/src.git synced 2026-06-14 19:20:18 -04:00

Author	SHA1	Message	Date
Rick Macklem	74bbb9e8b9	vfs_syscalls.c: Fix handling of offset args for copy_file_range Commit `197997a` broke handling of the offset arguments to copy_file_range() when specified non-NULL. The code fails to update the offsets and, as such, a loop like: do { len = copy_file_range(infd, &inpos, outfd, &outpos, SSIZE_MAX, 0); } while (len > 0); becomes an infinite loop, just doing the same copy over and over again. This patch fixes it. The clause "(foffsets_locked \|\| foffsets_set)" in the if is not actually needed for correctness, but I thought it made the code a little more readable and might avoid some static analyzer from throwing a "used before being set" for the savinoff and savoutoff variables. Approved by: so Security: FreeBSD-EN-25:16.vfs (cherry picked from commit `4046ad6bb0`) (cherry picked from commit 2fd0083fcc23f4c25860b8890292448720a5961c)	2025-10-06 16:53:35 +02:00
Mark Johnston	50cf88be6f	jail: Make prison_owns_vnet() operate on a prison instead of a ucred This will be useful in an upcoming change. No functional change intended. Reviewed by: jamie MFC after: 2 weeks Sponsored by: Stormshield Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D51524 (cherry picked from commit `748a4ea1ca`)	2025-09-08 12:00:14 +02:00
Olivier Certner	a29140025a	Internal scheduling priorities: Always use symbolic ones Replace priorities specified by a base priority and some hardcoded offset value by symbolic constants. Hardcoded offsets prevent changing the difference between priorities without changing their relative ordering, and is generally a dangerous practice since the resulting priority may inadvertently belong to a different selection policy's range. Since RQ_PPQ is 4, differences of less than 4 are insignificant, so just remove them. These small differences have not been changed for years, so it is likely they have no real meaning (besides having no practical effect). One can still consult the changes history to recover them if ever needed. No functional change (intended). MFC after: 1 month Event: Kitchener-Waterloo Hackathon 202506 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D45390 (cherry picked from commit `8ecc419180`)	2025-07-31 12:42:21 +02:00
Stephan de Wit	0d1029fedb	rss: add sysctl enable toggle This commit also includes the original refactoring changes This change allows the kernel to operate with the default netisr cpu-affinity settings while having RSS compiled in. Normally, RSS changes quite a bit of the behaviour of the kernel dispatch service - this change allows for reducing impact on incompatible hardware while preserving the option to boost throughput speeds based on packet flow CPU affinity. Make sure to compile the following options in the kernel: options RSS As well as setting the following sysctls: net.inet.rss.enabled: 1 net.isr.bindthreads: 1 net.isr.maxthreads: -1 (automatically sets it to the number of CPUs) And optionally (to force a 1:1 mapping between CPUs and buckets): net.inet.rss.bits: 3 (for 8 CPUs) net.inet.rss.bits: 2 (for 4 CPUs) etc. Set pin_default_swi to 0 by default in the RSS case.	2025-07-09 10:05:48 +02:00
Lexi Winter	b69907d463	jail: add allow.routing jail permission if allow.routing is set, the jail can modify the system routing table even if it's not a VNET jail. Reviewed by: kevans, des, adrian Approved by: kevans (mentor), des (mentor) Differential Revision: https://reviews.freebsd.org/D49843 (cherry picked from commit `3a53fe2cc4`)	2025-07-09 10:05:40 +02:00
Doug Rabson	1ca2f5d1f7	Add an implementation of the 9P filesystem This is derived from swills@ fork of the Juniper virtfs with many changes by me including bug fixes, style improvements, clearer layering and more consistent logging. The filesystem is renamed to p9fs to better reflect its function and to prevent possible future confusion with virtio-fs. Several updates and fixes from Juniper have been integrated into this version by Val Packett and these contributions along with the original Juniper authors are credited below. To use this with bhyve, add 'virtio_p9fs_load=YES' to loader.conf. The bhyve virtio-9p device allows access from the guest to files on the host by mapping a 'sharename' to a host path. It is possible to use p9fs as a root filesystem by adding this to /boot/loader.conf: vfs.root.mountfrom="p9fs:sharename" for non-root filesystems add something like this to /etc/fstab: sharename /mnt p9fs rw 0 0 In both examples, substitute the share name used on the bhyve command line. The 9P filesystem protocol relies on stateful file opens which map protocol-level FIDs to host file descriptors. The FreeBSD vnode interface doesn't really support this and we use heuristics to guess the right FID to use for file operations. This can be confused by privilege lowering and does not guarantee that the FID created for a given file open is always used for file operations, even if the calling process is using the file descriptor from the original open call. Improving this would involve changes to the vnode interface which is out-of-scope for this import. Differential Revision: https://reviews.freebsd.org/D41844 Reviewed by: kib, emaste, dch MFC after: 3 months Co-authored-by: Val Packett <val@packett.cool> Co-authored-by: Ka Ho Ng <kahon@juniper.net> Co-authored-by: joyu <joyul@juniper.net> Co-authored-by: Kumara Babu Narayanaswamy <bkumara@juniper.net>	2025-06-20 09:54:02 +02:00
Olivier Certner	6b6d6448fe	vfs: vntblinit(): Raise default 'kern.maxvnodes' higher than 'kern.maxfiles' Having 'kern.maxvnodes' higher than 'kern.maxfiles' mitigates a scenario where some processes can eat up all vnodes in the system, causing a deadlock, as long as the kernel itself does not create too many vnodes without creating some file descriptor in some process' FD table. A very small percentage (~0.6%) of excess vnodes at infinity, coupled with a large difference near the origin, should cover basic cases more than enough. Note however that this measure can be defeated, e.g., by using nullfs mounts with non-trivial file hierarchies. MFC after: 5 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D50314 (cherry picked from commit `e9baf472a6`) (cherry picked from commit a6b05a35ce3c516cadfca49c310dffaabbe40440) Approved by: re (cperciva)	2025-05-21 12:07:52 +02:00
Olivier Certner	0b4d5c560a	vfs: vntblinit(): Rework comments on the computation of 'kern.maxvnodes' Remove the parts that describe what is already in the code formula. Add hints about which of 'physvnodes'/'virtvnodes' takes precedence, and when 'desiredvnodes' becomes smaller than 'maxfiles'. These were both computed analytically and verified experimentally. Reviewed by: kib (older version) MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D50310 (cherry picked from commit `bf8890c84b`) (cherry picked from commit 5b07152f3ec3eea78958796b45594c5bd96c0bfb) Approved by: re (cperciva)	2025-05-21 12:07:45 +02:00
Olivier Certner	fbc73f17ba	Compute 'maxproc'/'maxfiles' from memory amount; Expand/fix comments Change the formulae so that 'maxproc' and 'maxfiles' are computed based on the physical memory amount, and not the corresponding number of pages. Make sure that the practical results are unchanged for all architectures (which is possible since all have 4096 as their PAGE_SIZE). Despite this, we do this change to make it clearer how many units of these values are computed per MB, which is easier on readers. Change the comments accordingly and expand them to indicate which parts of the computations are actually not used by default. In passing, fix the comments about default values and limits for 'maxfiles' (they were off). No functional change (intended). MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D50124 (cherry picked from commit `73f9251c0b`) (cherry picked from commit e7d4e52eafb016707ef3dd3b778d76db280c06a4) Approved by: re (cperciva)	2025-05-21 12:07:36 +02:00
Olivier Certner	7a9ea03e4b	vfs: vn_alloc(): Stop always calling vn_alloc_hard() and direct reclaiming Commit `ab05a1cf32` intended to revert commit `8733bc277a` ("vfs: don't provoke recycling non-free vnodes without a good reason"), but due to intervening changes in commit `054f45e026` ("vfs: further speed up continuous free vnode recycle"), it also had to revert part of it. In particular, while removing the whole 'if (vn_alloc_cyclecount != 0)' block, it inadvertantly removed the code block resetting 'vn_alloc_cyclecount' to 0 and skipping direct vnode reclamation (done further below through vnlru_free_locked_direct()), which had been outside the 'if' before the intervening commit. Removing this block instead of reinstating it in practice causes 'vn_alloc_cyclecount' to (almost) never be 0, making vn_alloc() always call vn_alloc_hard(), which takes the 'vnode_list_mtx' mutex. In other words, this disables the fast path. [The reverted commit, which introduced the 'if (vn_alloc_cyclecount != 0)' guarding this block, actually never executed it because it also had the bug that 'vn_alloc_cyclecount' would always stay at 0, hiding its usefulness.] Additionally, not skipping direct vnode reclamation even when there are less vnodes than 'kern.maxvnodes' not only causes unnecessary contention but also plain livelocks as vnlru_free_locked_direct() does not itself check whether there are actually "free" (not referenced) vnodes to be deallocated, and will blindly browse all the vnode list until it finds one (which it may not, or only a few ones at the end). As the fast path was disabled, all threads in the system would soon be competing for the vnode list lock, outpacing the vnlru process that could never actually recycle vnodes in a more agressive manner (i.e., even if they have a non-zero hold count). And we could more easily get into this situation, as each vnode allocation was reducing the count of "free" vnodes, even if entirely new vnodes could be allocated instead. This part was mitigated by the vnlru process (before the tipping point described above), which explains why the mechanism would not always livelock. Not skipping direct vnode reclamation was arguably a bug introduced by the intervening commit (`054f45e026`), but was mitigated by vn_alloc_hard() not being called in the fast path. The revert commit, by disabling the fast path, made it significantly annoying (to the point of getting a few livelocks a week in some of my workloads). Restore the reset of 'vn_alloc_cyclecount' to 0 and skip direct reclamation when the current number of vnodes is below the 'kern.maxvnodes' limit, indicating we can start allocating a new 'struct vnode' right away. While here, fix the comparison with the limit when 'bumped' is true. Reviewed by: kib (older version), avg Tested by: avg, olce Fixes: `ab05a1cf32` (revert of "vfs: don't provoke recycling non-free vnodes without a good reason") Fixes: `054f45e026` ("vfs: further speed up continuous free vnode recycle") MFC after: 5 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D50338 (cherry picked from commit `d90c9c24e2`) (cherry picked from commit ce6b7a5919da518715d6163baa42035d6d2c1bd0) Approved by: re (cperciva)	2025-05-21 12:07:15 +02:00
Olivier Certner	de6a4418ee	sysctl(9): Ease exporting struct sizes; Discourage doing that Introduce two helpers, the more general SYSCTL_SIZEOF() and a struct-specific one SYSCTL_SIZEOF_STRUCT() which prepends 'struct' in the description and in the use of sizeof() but uses the raw structure name as the knob's name. The size of the object/structure is exported under 'debug.sizeof'. Existing knobs under 'debug.sizeof' were all converted to use the helpers. Add a note before the helpers discouraging the introduction of new leaves for ad-hoc reasons. List alternative means for developers to obtain the size of arbitrary kernel structures easily (thanks to markj@ for providing these). No functional change (intended). Reviewed by: kib, markj MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D50121 (cherry picked from commit `713abc9880`) (cherry picked from commit efce9f8a510b60736994e50288b78fc7b32b5d90) Approved by: re (cperciva)	2025-05-13 14:41:33 +02:00
John Baldwin	6fb471682a	rwmlock/rwlock/sx: Print the pointer of destroyed locks in panic messages Suggested by: markj Reviewed by: kib, markj Sponsored by: AFRL, DARPA Differential Revision: https://reviews.freebsd.org/D49332 (cherry picked from commit `00d78c5007`)	2025-04-29 10:44:08 -04:00
John Baldwin	33bf58fbd8	lockmgr/rmlock/rwlock/sx: Make various assertions more robust Print pointers to locks instead of their names to avoid a nested panic if the lock object is corrupted. Reviewed by: markj Sponsored by: AFRL, DARPA Differential Revision: https://reviews.freebsd.org/D49331 (cherry picked from commit `a52a51a2d5`)	2025-04-29 10:44:01 -04:00
John Baldwin	d9ce37179a	mtx: Include the mutex pointer in the panic message for destroyed locks Reviewed by: olce, kib, markj Sponsored by: AFRL, DARPA Differential Revision: https://reviews.freebsd.org/D49315 (cherry picked from commit `43a15a22c6`)	2025-04-29 10:43:48 -04:00
John Baldwin	7a7a7cd850	mtx: Make idle thread assertions more robust Just print the pointer to the mutex instead of the name in case the mutex is corrupted. Reviewed by: olce, kib Sponsored by: AFRL, DARPA Differential Revision: https://reviews.freebsd.org/D49314 (cherry picked from commit `0ed104978c`)	2025-04-29 10:43:42 -04:00
John Baldwin	27fb895f9e	mtx: Avoid nested panics on lock class mismatch assertions It is only (somewhat) safe to dereference lo_name if we know the mutex has a specific lock class that is incorrect, not if just has "some" incorrect lock class. In particular, in the case of memory overwritten with 0xdeadc0de, the lock class won't match either mutex type. However, trying to dereference lo_name via a 0xdeadc0de pointer triggers a nested panic building the panicstr which then prevents a crash dump. Reviewed by: olce, kib, markj Sponsored by: AFRL, DARPA Differential Revision: https://reviews.freebsd.org/D49313 (cherry picked from commit `dba45599c4`)	2025-04-29 10:43:36 -04:00
John Baldwin	3a54ca5e28	new-bus: Add taskqueue_bus to process hot-plug device events Use a system-wide taskqueue for hot-plug events. This avoids possibly blocking unrelated events on the thread taskqueue without requiring multiple driver-specific taskqueues. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D49268 (cherry picked from commit `44d5f5ed1e`)	2025-04-29 10:34:05 -04:00
Mark Johnston	c0c5d01e53	so_splice: Synchronize so_unsplice() with so_splice() so_unsplice() assumed that if SB_SPLICED is set in the receive buffer of the first socket, then the splice is fully initialized. However, that's not true, and it's possible for so_unsplice() to race ahead of so_splice(). Modify so_unsplice() to simply bail if the splice state is embryonic. Reported by: syzkaller Reviewed by: gallatin Fixes: `a1da7dc1cd` ("socket: Implement SO_SPLICE") MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D49814 (cherry picked from commit `992b18a9ec`)	2025-04-29 00:44:04 +00:00
Mark Johnston	1065a7ebde	posixshm: Fix range locking in shm_write() There is a somewhat strange case where when writing to a POSIX shm object, the object is not allowed to grow, and the I/O offset+length overflows. In that case we simply truncate the I/O to the object size. Later we write-lock the range [offset, objsize). However, we were not checking whether offset > objsize, in which case we're writing zero bytes but locking an invalid range. Modify the range locking in shm_write() to take this possibility into account. While here, rename a variable to make its purpose a bit more clear, and add an assertion against negative offsets (which is supposed to be enforced by the caller of fo_write for I/O to files that aren't character devices). Reported by: syzkaller Reviewed by: kevans, kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D49673 (cherry picked from commit `82d8c609cf`)	2025-04-24 13:20:52 +00:00
Mark Johnston	fb405ecd9f	copy_file_range: Fix overlap checking The check for range overlap did not correctly handle negative offests, as the addition inoff + len is promoted to an unsigned type. Reported by: syzkaller Reviewed by: rmacklem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D49674 (cherry picked from commit `1101d62822`)	2025-04-24 13:20:52 +00:00
Mark Johnston	1b7e2cc108	ktr: Make debug.ktr.verbose available whenever KTR is configured This is more useful and matches the documentation. While here, make it settable as a tunable and add a sysctl description. PR: 139425 MFC after: 2 weeks (cherry picked from commit `c5773d366e`)	2025-04-24 13:20:52 +00:00
Bjoern A. Zeeb	580c891f5a	malloc: extend malloc_usable_size() for contigmalloc Extend malloc_usable_size() for contigmalloc; it seems the only outside consumer is LinuxKPI ksize() which by itself has little to no consumer either. Sponsored by: The FreeBSD Foundation Suggested by: jhb (see D46657) Reviewed by: jhb, markj Fixes: `9e6544dd6e` Differential Revision: https://reviews.freebsd.org/D49571 (cherry picked from commit `c5cf4b64f4`)	2025-04-18 14:35:56 +00:00
Gleb Smirnoff	fca3395674	cred: fix struct credbatch to use long for refcount This structure collects count from multiple cred structures. Of course it can't use a smaller type. PR: 283747 Reviewed by: olce, mjg, markj Differential Revision: https://reviews.freebsd.org/D49562 Fixes: `37337709d3` (cherry picked from commit `cd46e98013`)	2025-04-15 12:40:02 -07:00
Mark Johnston	74a0b89084	ptrace: Do not pass a negative resid to proc_rwmem() While here, avoid truncting uio_resid in proc_rwmem(). Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D49479 (cherry picked from commit `1a80a157cb`)	2025-04-15 02:25:24 +00:00
Mark Johnston	6936644826	proc: Disallow re-enabling of process itimers during exit During process exit, it's possible for the exiting thread to send a signal to its process, via killjobc(). This happens after the itimer is drained. If itimers are stopped, i.e., P2_ITSTOPPED is set, then itimer_proc_continue() will resume the callout after it has been drained. Fix the problem by simply clearing P2_ITSTOPPED as part of the drain. Then, a signal received after that point will not re-enable the callout. For good measure, also make sure that we don't reset the itimer callout in an exiting process. Reported by: syzkaller Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D49529 (cherry picked from commit `a6268f89d5`)	2025-04-15 02:25:24 +00:00
Mark Johnston	1d2fd8c9cf	file: Fix offset handling in kern_copy_file_range() One can ask copy_file_range(2) to use the file offsets of the file descriptions that it copies from and to. We were updating those offsets without any locking, which is incorrect and can lead to unkillable loops in the event of a race (e.g., the check for overlapping ranges in kern_copy_file_range() is subject to a TOCTOU race with the following loop which range-locks the input and output file). Use foffset_lock() to serialize updates to the file descriptions, as we do for other, similar system calls. Reported by: syzkaller Reviewed by: rmacklem, kib MFC after: 2 weeks Fixes: `bbbbeca3e9` ("Add kernel support for a Linux compatible copy_file_range(2) syscall.") Differential Revision: https://reviews.freebsd.org/D49440 (cherry picked from commit `197997a4c3`)	2025-04-15 02:25:24 +00:00
Mark Johnston	c1aa97cf79	file: Add foffset_lock_pair() This will be used in kern_copy_file_range(), which needs to lock two offsets. Reviewed by: kib, rmacklem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D49440 (cherry picked from commit `12ecb0fe0a`)	2025-04-15 02:25:24 +00:00
Mark Johnston	5246532449	socket: Handle the possibility of a protocol with no ctloutput Add a default ctloutput handler and remove various NULL checks. This fixes a problem wherein the generic SO_SETFIB handler did not check whether the protocol has a ctloutput implementation before calling the function pointer. Reported by: syzkaller Reviewed by: glebius Fixes: `caccbaef8e` ("socket: Move SO_SETFIB handling to protocol layers") MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D49436 (cherry picked from commit `590b45036e`)	2025-04-15 02:25:24 +00:00
Konstantin Belousov	70ba4df540	stat(2): add st_bsdflags field (cherry picked from commit `86db734ae2`)	2025-04-09 03:53:17 +03:00
Konstantin Belousov	fdbf48835d	ptrace(2): expand ability to fetch syscall parameters (cherry picked from commit `fd5bc306ff`)	2025-04-07 04:28:22 +03:00
Konstantin Belousov	40c6cb628a	kern_proc_kqueues_out(): maxlen == -1 means there is no maxlen (cherry picked from commit `c18a16ebcf`)	2025-04-07 04:28:22 +03:00
Konstantin Belousov	2170bf801d	sysctl KERN_PROC_KQUEUE: treat omitted kqfd as request to dump all kqueues (cherry picked from commit `ab9b296498`)	2025-04-07 04:28:22 +03:00
Konstantin Belousov	5e6b89bd56	Add NT_PROCSTAT_KQUEUES core note (cherry picked from commit `5e7c43ff02`)	2025-04-07 04:28:22 +03:00
Konstantin Belousov	f4ea3580d6	sysctl KERN_PROC_KQUEUE: implement compat32 (cherry picked from commit `fa8fdd80df`)	2025-04-07 04:28:21 +03:00
Konstantin Belousov	c69b90d08b	kinfo_knote: add knt_kq_fd member (cherry picked from commit `9d1e7a7e8e`)	2025-04-07 04:28:21 +03:00
Konstantin Belousov	52464e913a	sysctl_kern_proc_kqueue(): convert to use sbuf (cherry picked from commit `fee6a52e43`)	2025-04-07 04:28:21 +03:00
Konstantin Belousov	e260058f33	descriptors: add fget_remote_foreach() (cherry picked from commit `4b69f1fab6`)	2025-04-07 04:28:21 +03:00
Konstantin Belousov	e2af76d470	Add sysctl kern.proc.kqueue (cherry picked from commit `e60f608eb9`)	2025-04-07 04:28:20 +03:00
Mark Johnston	f2214e48d0	socket: Fix a race in the SO_SPLICE state machine When so_splice() links two sockets together, it first attaches the splice control structure to the source socket; at that point, the splice is in the idle state. After that point, a socket wakeup will queue up work for a splice worker thread: in particular, so_splice_dispatch() only queues work if the splice is idle. Meanwhile, so_splice() continues initializing the splice, and finally calls so_splice_xfer() to transfer any already buffered data. This assumes that the splice is still idle, but that's not true if some async work was already dispatched. Solve the problem by introducing an initial "under construction" state for the splice control structure, such that wakeups won't queue any work until so_splice() has finished. While here, remove an outdated comment from the beginning of so_splice_xfer(). Reported by: syzkaller Reviewed by: gallatin Fixes: `a1da7dc1cd` ("socket: Implement SO_SPLICE") MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D49437 (cherry picked from commit `5748163568`)	2025-04-06 13:54:20 +00:00
Mark Johnston	43a32f790b	malloc: Fix DEBUG_REDZONE for contigmalloc() When free() was adapted to support allocations originating from contigmalloc(), redzone(9) support was not included. redzone(9) involves adjusting the pointer to freed memory before looking up the slab cookie, so it's not straightforward to make contigmalloc() opt out of redzone support. Thus, augment contigmalloc() to support redzone. Reported by: glebius Tested by: dhw MFC after: 2 weeks Fixes: `9e6544dd6e` ("malloc(9): extend contigmalloc(9) by a "slab cookie"") (cherry picked from commit `74361d693a`)	2025-04-06 13:54:05 +00:00
Mark Johnston	014ce35640	vfs: Return early from sysctl_vfs_ctl() if no input was given Otherwise we end up searching for a mountpoint using an uninitialized key, and likely failing the version test. This violates KMSAN's invariants, so simply return immediately instead. MFC after: 2 weeks (cherry picked from commit `d8703cd802`)	2025-04-06 13:54:03 +00:00
Olivier Certner	c1d7552ddd	New setcred() system call and associated MAC hooks This new system call allows to set all necessary credentials of a process in one go: Effective, real and saved UIDs, effective, real and saved GIDs, supplementary groups and the MAC label. Its advantage over standard credential-setting system calls (such as setuid(), seteuid(), etc.) is that it enables MAC modules, such as MAC/do, to restrict the set of credentials some process may gain in a fine-grained manner. Traditionally, credential changes rely on setuid binaries that call multiple credential system calls and in a specific order (setuid() must be last, so as to remain root for all other credential-setting calls, which would otherwise fail with insufficient privileges). This piecewise approach causes the process to transiently hold credentials that are neither the original nor the final ones. For the kernel to enforce that only certain transitions of credentials are allowed, either these possibly non-compliant transient states have to disappear (by setting all relevant attributes in one go), or the kernel must delay setting or checking the new credentials. Delaying setting credentials could be done, e.g., by having some mode where the standard system calls contribute to building new credentials but without committing them. It could be started and ended by a special system call. Delaying checking could mean that, e.g., the kernel only verifies the credentials transition at the next non-credential-setting system call (we just mention this possibility for completeness, but are certainly not endorsing it). We chose the simpler approach of a new system call, as we don't expect the set of credentials one can set to change often. It has the advantages that the traditional system calls' code doesn't have to be changed and that we can establish a special MAC protocol for it, by having some cleanup function called just before returning (this is a requirement for MAC/do), without disturbing the existing ones. The mac_cred_check_setcred() hook is passed the flags received by setcred() (including the version) and both the old and new kernel's 'struct ucred' instead of 'struct setcred' as this should simplify evolving existing hooks as the 'struct setcred' structure evolves. The mac_cred_setcred_enter() and mac_cred_setcred_exit() hooks are always called by pairs around potential calls to mac_cred_check_setcred(). They allow MAC modules to allocate/free data they may need in their mac_cred_check_setcred() hook, as the latter is called under the current process' lock, rendering sleepable allocations impossible. MAC/do is going to leverage these in a subsequent commit. A scheme where mac_cred_check_setcred() could return ERESTART was considered but is incompatible with proper composition of MAC modules. While here, add missing includes and declarations for standalone inclusion of <sys/ucred.h> both from kernel and userspace (for the latter, it has been working thanks to <bsm/audit.h> already including <sys/types.h>). Reviewed by: brooks Approved by: markj (mentor) Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D47618 (cherry picked from commit `ddb3eb4efe`)	2025-04-03 21:31:03 +02:00
Konstantin Belousov	4b3d44f3b4	kern/kern_descrip.c: remove unneeded headers (cherry picked from commit `6eb7cba469`)	2025-03-28 02:32:28 +02:00
Konstantin Belousov	09b8745670	kern___acl_aclcheck_path: vrele the vnode after namei() (cherry picked from commit `7fad17a3e6`)	2025-03-28 02:32:28 +02:00
Konstantin Belousov	60855d8993	kern/sys_eventfd.c: fix includes (cherry picked from commit `0a7c164d91`)	2025-03-20 05:13:38 +02:00
Konstantin Belousov	eae0129ba7	mb_unmapped_to_ext(): fix m_free()/m_freem() use (cherry picked from commit `c5a96f09df`)	2025-03-17 02:48:32 +02:00
Mark Johnston	0ce82e9346	umtx: Fix a bug in do_lock_pp() If the lock is unowned (i.e., owner == UMUTEX_CONTESTED), we might get a spurious failure, and in that case we need to retry the loop. Otherwise, the calling thread can end up sleeping forever. The same problem exists in do_set_ceiling(), which open-codes do_lock_pp(), so fix it there too. Reviewed by: olce Reported by: Daniel King <dmking@adacore.com> MFC after: 2 weeks Sponsored by: Innovate UK Differential Revision: https://reviews.freebsd.org/D49031 (cherry picked from commit `4b79443927`)	2025-03-08 00:37:37 +00:00
Mark Johnston	c813157a15	umtx: Add a helper for unlocked umtxq_busy() calls This seems like a natural complement to umtxq_unbusy_unlocked(). No functional change intended. Reviewed by: olce, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D49124 (cherry picked from commit `b01495caac`)	2025-03-07 22:51:48 +00:00
Mark Johnston	ffb4d07771	ktr: Use STAILQ_EMPTY_ATOMIC when checking for records in ktr_drain() This is an unlocked check, and after commit `34740937f7` the debug checks in STAILQ_EMPTY may spuriously fail here. In particular, the per process queue is updated under the global ktrace mutex, not held in ktr_drain(). If a record is enqueued concurrently, the recording thread will schedule an AST to drain the queue again, so it should not be possible for a race to leave records in the queue indefinitely. Reviewed by: kib, olce Reported by: syzbot+d67eddd8c4923ee28bb7@syzkaller.appspotmail.com MFC after: 2 weeks Fixes: `34740937f7` ("queue: New debug macros for STAILQ") Differential Revision: https://reviews.freebsd.org/D48899 (cherry picked from commit `36631977d8`)	2025-03-07 22:51:48 +00:00
John Baldwin	4aed8b3b61	new-bus: Add bus_(identify\|attach\|detach)_children These correspond to the current implementations of bus_generic_(probe\|attach\|detach) but with more accurate names and semantics. The intention is to deprecate bus_generic_(probe\|attach) and reimplement bus_generic_detach in a future commit. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D47673 (cherry picked from commit `46297859a7`) bus_generic_(probe\|attach\|detach) will not be changed in stable/14, but providing the new APIs in stable/14 permits drivers to use the new APIs.	2025-02-27 10:19:24 -05:00

1 2 3 4 5 ...

20200 commits