19665 commits

Author SHA1 Message Date
Doug Moore
8df38859d0 radix_trie: replace node count with popmap
Replace the 'count' field in a trie node with a bitmap that
identifies non-NULL children. Drop the 'last' field, and use the
last bit set in the bitmap instead.  In lookup_le, lookup_ge,
remove, and reclaim_all, use the bitmap to find the
previous/next/only/every non-null child in constant time by
examining the bitmask instead of looping across array elements
and null-checking them one-by-one.

A buildworld test suggests that this reduces the cycle count on
those functions that eliminate some null-checks by 4.9%, 1.5%,
0.0% and 13.3%.
Reviewed by:	alc
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D40775
2023-07-07 11:09:36 -05:00
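
The bitmap idea in the commit above can be sketched as follows. This is a minimal illustration, not the FreeBSD code: `struct node`, `NODE_SLOTS`, `node_set`, `next_child_slot` and `last_child_slot` are hypothetical names, a 32-bit `unsigned int` is assumed, and `__builtin_ctz`/`__builtin_clz` are GCC/Clang builtins standing in for the kernel's ffs/fls helpers.

```c
#include <stddef.h>
#include <stdint.h>

#define NODE_SLOTS 16	/* hypothetical fan-out; one popmap bit per slot */

struct node {
	uint16_t popmap;		/* bit i set <=> child[i] != NULL */
	void *child[NODE_SLOTS];
};

/* Store a child and keep the bitmap in sync with the array. */
static void
node_set(struct node *n, int slot, void *p)
{
	n->child[slot] = p;
	if (p != NULL)
		n->popmap |= (uint16_t)(1u << slot);
	else
		n->popmap &= (uint16_t)~(1u << slot);
}

/* Lowest populated slot >= 'slot', or -1; no per-slot NULL checks. */
static int
next_child_slot(const struct node *n, int slot)
{
	unsigned int mask = n->popmap & (~0u << slot);

	return (mask != 0 ? __builtin_ctz(mask) : -1);
}

/* Highest populated slot, standing in for the dropped 'last' field. */
static int
last_child_slot(const struct node *n)
{
	return (n->popmap != 0 ? 31 - __builtin_clz(n->popmap) : -1);
}
```

Both lookups cost a mask and one bit-scan regardless of how sparse the node is, which is the constant-time property the message describes.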
Mike Karels
be30fd3ab2 KERN_PROC_VM_LAYOUT sysctl: fix bug in 32-bit-compatible path
vmspace_free() is called redundantly in the 32-bit-compatible
path in sysctl_kern_proc_vm_layout(), causing a premature free
(possibly for the current address space).  Remove the extra call.

PR:		272401
Reported by:	marklmi at yahoo.com
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D40908
2023-07-07 08:37:16 -05:00
Ka Ho Ng
034c085601 modules: fix freebsd32_modstat on big endian platforms
The layout of modspecific_t on both little endian and big endian is as
follows:
|0|1|2|3|4|5|6|7|
+-------+-------+
|uintval|       |
+-------+-------+
|ulongval       |
+-------+-------+

For the following code snippet:
        CP(mod->data, data32, longval);
        CP(mod->data, data32, ulongval);
This only works on little endian platforms, where the copy automatically
truncates away the upper 32 bits (bytes 4-7). On big endian platforms,
however, the copy takes bytes 4-7 instead. This eventually returns a garbage
syscall number to the 32-bit userland.

Since modspecific_t is currently used only for syscall modules,
we only initialize modspecific32_t with uintval. Now on both BE and LE
64-bit platforms it always picks up the first 4 bytes.

Sponsored by:	Juniper Networks, Inc.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D40814
MFC after:	1 week
2023-07-07 00:22:59 -04:00
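
The endianness problem described in the commit above can be reproduced in userland with a small sketch. `union modspec64` is a hypothetical stand-in for the 64-bit layout drawn in the message (a 32-bit member and a 64-bit member sharing offset 0), and both function names are illustrative only.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the layout in the message: both members start at offset 0. */
union modspec64 {
	uint32_t uintval;
	uint64_t ulongval;
};

/* Fixed approach: copy the 32-bit member, i.e. the first 4 bytes. */
static uint32_t
copy_uintval(uint32_t syscall_no)
{
	union modspec64 d;

	memset(&d, 0, sizeof(d));
	d.uintval = syscall_no;
	return (d.uintval);
}

/*
 * Old approach: truncate the 64-bit member.  Truncation keeps the low-order
 * 32 bits of the value, which live in bytes 0-3 on little endian but in
 * bytes 4-7 on big endian.
 */
static uint32_t
copy_truncated_ulongval(uint32_t syscall_no)
{
	union modspec64 d;

	memset(&d, 0, sizeof(d));
	d.uintval = syscall_no;
	return ((uint32_t)d.ulongval);
}
```

On a little endian host both functions return the stored number; on a big endian host the truncating variant returns whatever sits in bytes 4-7 (zero here, because the union was cleared first).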
Mateusz Guzik
80bd5ef070 vfs: factor out mount point traversal to a dedicated routine
While here tidy up asserts in the area.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D40883
2023-07-07 01:15:04 +00:00
Mateusz Guzik
ebf37c3fed vfs: drop LK_RETRY when crossing mount points in vfs_lookup
vn_lock already returns the expected error.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D40883
2023-07-07 01:14:55 +00:00
Doug Moore
d4e236c70b inline_ffs: remove backup binary implementation
There is no longer any point in maintaining a binary search routine
for ffs; inlines will always do it as well or better.
Reviewed by:	mhorne
Differential Revision:	https://reviews.freebsd.org/D40703
2023-07-06 13:36:12 -05:00
Doug Moore
6419ed7ee7 inline_fls: drop compile-time check
HAVE_INLINE_FLSLL is #defined always. This change assumes that where
__HAVE_INLINE_FLSLL is tested, the two leading underscores are a
mistake, and that the code will be better for using the efficient
flsll implementation.
Reviewed by:	markj, mhorne
Differential Revision:	https://reviews.freebsd.org/D40705
2023-07-06 13:32:59 -05:00
John Baldwin
3a9e3ed6b0 ddb: Always terminate DB_SHOW_ALIAS_FLAGS with a semi-colon.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D40818
2023-07-05 16:00:31 -07:00
Mateusz Guzik
0724cf3862 vfs: whack dpunlocked var in vfs_lookup
It is redundant given the bad_unlocked goto label.
2023-07-05 21:55:24 +00:00
Mateusz Guzik
ba8cc6d727 vfs: use __enum_uint8 for vtype and vstate
This whacks hackery around only reading v_type once.

Bump __FreeBSD_version to 1400093
2023-07-05 15:06:30 +00:00
Olivier Certner
5842f73dbc vfs: compute_lk_cnflags(): Remove unused argument 'cnflags'; Rename
Argument unused since commit 93a0ba8f49.

Rename it to enforce_lkflags(), which seems to more aptly describe what it does.

[mjg: massaged the commit message a little]
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D40848
2023-07-05 13:43:38 +00:00
Konstantin Belousov
658e762067 kern_lockf.c: fix typo
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-07-05 02:11:37 +03:00
Konstantin Belousov
d7614c010c vn_path_to_global_path_hardlink(): initialize len
before calling vn_fullpath_hardlink().  Otherwise we get random failures
when the len is automatically clipped.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2023-07-04 19:00:24 +03:00
Konstantin Belousov
81a37995c7 killpg(): close a race with fork(), part 2
When we are sending a terminating signal to the group, killpg() needs to
guarantee that all group members are to be terminated (it does not need
to ensure that they are terminated on return from killpg()).  The
pg_killsx change eliminates the largest window there, but still, if a
multithreaded process is signalled, the following could happen:
- thread 1 is selected for the signal delivery and gets descheduled
- thread 2 waits for pg_killsx lock, obtains it and forks
- thread 1 continues executing and terminates the process
This scenario still allows the child to escape.

To fix it, count the number of signals sent to the process with
killpg(2), in p_killpg_cnt variable, which is incremented in killpg()
and decremented after signal handler frame is created or in exit1()
after single-threading.  This way we avoid forking if the termination is
due.

Noted and reviewed by:	markj (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D40493
2023-07-04 06:43:16 +03:00
Konstantin Belousov
3360b48525 killpg(2): close a race with fork(2), part1
If the process group member performs fork(), the child could escape
signalling from killpg(). Prevent it by introducing an sx process group
lock pg_killsx which is taken interruptibly shared around fork. If there
is a pending signal, do the trip through userspace with ERESTART to
handle signal ASTs. The lock is taken exclusively during killpg().

The lock is also locked exclusive when the process changes group
membership, to avoid escaping a signal by this means, by ensuring that
the process group is stable during fork.

Note that the new lock is before proctree lock, so in some situations we
could only do trylocking to obtain it.

This relatively simple approach cannot work for REAP_KILL, because a
process potentially belongs to more than one reaper tree by having
sub-reapers.

Reported by:	dchagin
Tested by:	dchagin, pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D40493
2023-07-04 06:21:53 +03:00
Konstantin Belousov
4b59d1724b killpg1(): update the herald comment
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D40493
2023-07-04 06:21:53 +03:00
Konstantin Belousov
d6b900c915 vn_path_to_global_path_hardlink(): avoid freeing non-initialized pointer
Reported by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2023-07-04 06:19:47 +03:00
Ka Ho Ng
005aa1743b modules: bzero the modspecific_t
Per https://reviews.llvm.org/D68115, only the first field is
zero-initialized, meanwhile other fields are undef.

The same pattern can be observed with clang: when
-ftrivial-auto-var-init=pattern is specified, non-active fields are
filled with 0xaa; otherwise they are zero-initialized.
Technically both are acceptable when using clang. However, it is
better to simply bzero the modspecific_t in this case to
comply strictly with the standard.

MFC with:	2cab2d43b8
MFC after:	1 day
Sponsored by:	Juniper Networks, Inc.
Reviewed by:	delphij
Differential Revision:	https://reviews.freebsd.org/D40830
2023-07-01 18:58:46 -04:00
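
A minimal sketch of the "clear the whole union first" pattern from the commit above. `union modspec` is a hypothetical stand-in for modspecific_t, and `modspec_init` and `tail_is_zero` are illustrative helpers; `memset` plays the role of bzero.

```c
#include <string.h>

/* Hypothetical stand-in for modspecific_t. */
union modspec {
	int intval;
	unsigned int uintval;
	long longval;
	unsigned long ulongval;
};

/* Clear every byte of the union, then set the one member in use. */
static void
modspec_init(union modspec *d, unsigned int value)
{
	memset(d, 0, sizeof(*d));
	d->uintval = value;
}

/* Check that the bytes beyond the active 'uintval' member are all zero. */
static int
tail_is_zero(const union modspec *d)
{
	const unsigned char *p = (const unsigned char *)d;
	size_t i;

	for (i = sizeof(d->uintval); i < sizeof(*d); i++) {
		if (p[i] != 0)
			return (0);
	}
	return (1);
}
```

Clearing by size rather than relying on an initializer means the result does not depend on how the compiler treats the non-active members.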
Ka Ho Ng
2cab2d43b8 syscalls: fix modspecific_t stack content leak
Zero-initialize the whole modspecific_t so that no kernel stack
content leaks through the unused part.

Sponsored by:	Juniper Networks, Inc.
MFC after:	1 day
Differential Revision:	https://reviews.freebsd.org/D40815
2023-07-01 14:38:11 -04:00
Andrew Turner
9beb195fd9 Continue searching for an irq map from the start
When searching for a free irq map location, continue the search from the
beginning of the list. There may be holes in the map before
irq_map_first_free_idx, e.g. removing entries in order will
increase the index past the current free entry.

PR:		271990
Reviewed by:	mhorne
Sponsored by:	Arm Ltd

Differential Revision:	https://reviews.freebsd.org/D40768
2023-06-28 18:03:08 +01:00
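
The wrap-around search described in the commit above might look like the sketch below. `MAP_SIZE` and `find_free_slot` are hypothetical, and the real irq map code differs in its types and locking; only the scan order is illustrated.

```c
#include <stddef.h>

#define MAP_SIZE 8	/* hypothetical table size */

/*
 * Return the index of a free (NULL) slot, scanning from 'hint' to the end
 * of the table and then wrapping to the start; -1 if the table is full.
 */
static int
find_free_slot(void *const map[MAP_SIZE], int hint)
{
	int i, idx;

	for (i = 0; i < MAP_SIZE; i++) {
		idx = (hint + i) % MAP_SIZE;
		if (map[idx] == NULL)
			return (idx);
	}
	return (-1);
}
```

A scan that stopped at the end of the table would report it as full even though a hole exists below the hint; wrapping finds that hole.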
Andrew Turner
1e0ba9d43c Hide irq_next_free, it's not used out of this file
Reviewed by:	mhorne
Sponsored by:	Arm Ltd
Differential Revision:	https://reviews.freebsd.org/D40767
2023-06-28 18:03:08 +01:00
Igor Ostapenko
7b5a1c39f1 vfs: bring vfs_lookup() description comment up to date
Signed-off-by: Igor Ostapenko <pm@igoro.pro>
Reviewed by: imp, mhorne
Pull Request: https://github.com/freebsd/freebsd-src/pull/737
2023-06-27 16:34:02 -06:00
Igor Ostapenko
5958cd88f2 vfs: fix description comment of vfs_lookup()
Signed-off-by: Igor Ostapenko <pm@igoro.pro>
Reviewed by: imp, mhorne
Pull Request: https://github.com/freebsd/freebsd-src/pull/737
2023-06-27 16:33:25 -06:00
Doug Moore
da72505f9c radix_trie: pass fewer params to node_get
Let node_get calculate its own owner value. Don't pass the count
parameter, since it's always 2. Save 16 bytes in insert(). Move,
without modifying, slot and trimkey to handle use-before-declaration
problem.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D40723
2023-06-27 12:21:11 -05:00
Doug Moore
9cfed089ac radix_trie: clean up overlong lines
This is purely a cosmetic change. vm_radix.c has lines that reach past
column 80 and this change cleans that up. The associated changes to
subr_pctrie.c are just to keep mirroring vm_radix.c.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D40764
2023-06-27 12:01:33 -05:00
Konstantin Belousov
4a402dfe0b VFS: Remove VV_READLINK flag
since its only reason to exist is removed.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D40700
2023-06-27 13:43:25 +03:00
Doug Moore
72c3a43b16 radix_trie: skip compare in lookup_le, lookup_ge
In _lookup_ge, where a loop "looks for an available edge or val within
the current bisection node" (to quote the code comment), the value of
index has already been modified to guarantee that it is the least
value that can be found in the non-NULL child node being
examined. Therefore, if the non-NULL child is a leaf, there's no need
to compare 'index' to anything, and the value can just be returned.

The same is true for _lookup_le with 'most' replacing 'least'.
Reviewed by:	alc
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D40746
2023-06-27 00:42:41 -05:00
Doug Moore
a42d8fe001 radix_trie: simplify trimkey functions
Replacing a branch and two shifts with a single masking operation saves
64 bytes in the pair of functions lookup_le and lookup_ge on amd64.
Refresh the associated comments.
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D40722
2023-06-25 12:49:15 -05:00
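
One way to read "a branch and two shifts" versus "a single masking operation" in the commit above is sketched here. `RADIX_WIDTH` and both function names are assumptions for illustration; the kernel's actual trimkey functions may be written differently.

```c
#include <stdint.h>

#define RADIX_WIDTH 4	/* assumed number of key bits consumed per trie level */

/* Earlier shape: a branch and two shifts clear the low level*WIDTH bits. */
static uint64_t
trimkey_shifts(uint64_t index, unsigned int level)
{
	if (level > 0) {
		index >>= level * RADIX_WIDTH;
		index <<= level * RADIX_WIDTH;
	}
	return (index);
}

/*
 * Single mask: negating the unit value for 'level' (in unsigned arithmetic)
 * yields a mask with every bit at or above that position set.
 */
static uint64_t
trimkey_mask(uint64_t index, unsigned int level)
{
	return (index & -((uint64_t)1 << (level * RADIX_WIDTH)));
}
```

At level 0 the mask is all ones, so the special case that needed the branch disappears.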
Doug Moore
e8efee297c radix_trie: avoid reloading radix node
In the vm_radix:remove loop that searches for the last child, load
that child once, without loading it again after the search is over.
Change KASSERTS from index check to NULL node check.
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D40721
2023-06-23 18:47:23 -05:00
Mark Johnston
712079d381 unix: Fix uipc_peeraddr() to handle self-connected sockets
Reported by:	syzbot+c2da2dbae5fe006556bc@syzkaller.appspotmail.com
Reported by:	syzbot+b4d6b093b1d78bfa859b@syzkaller.appspotmail.com
Fixes:		e8f6e5b2d9 ("unix: Fix locking in uipc_peeraddr()")
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2023-06-21 14:38:26 -04:00
Doug Moore
05963ea4d1 radix_trie: eliminate iteration in keydiff
Use flsll(), instead of a loop, to find where two keys differ, and
then arithmetic to transform that to a trie level.
Approved by:	alc, markj
Differential Revision:	https://reviews.freebsd.org/D40585
2023-06-20 11:30:29 -05:00
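
The loop-free keydiff from the commit above can be sketched as follows. `RADIX_WIDTH`, `my_flsll`, `keydiff_loop` and `keydiff_fls` are hypothetical names; `my_flsll` is a stand-in for flsll() built on the GCC/Clang builtin `__builtin_clzll`, and both keydiff variants assume the two keys differ.

```c
#include <stdint.h>

#define RADIX_WIDTH 4	/* assumed number of key bits consumed per trie level */

/* Stand-in for flsll(): 1-based index of the highest set bit, 0 if none. */
static int
my_flsll(uint64_t x)
{
	return (x != 0 ? 64 - __builtin_clzll(x) : 0);
}

/* Loop version: walk down from the top level until the keys differ. */
static int
keydiff_loop(uint64_t a, uint64_t b)
{
	int level;

	for (level = 64 / RADIX_WIDTH - 1; level > 0; level--) {
		if (((a ^ b) >> (level * RADIX_WIDTH)) != 0)
			break;
	}
	return (level);
}

/* Arithmetic version: highest differing bit, divided down to a level. */
static int
keydiff_fls(uint64_t a, uint64_t b)
{
	return ((my_flsll(a ^ b) - 1) / RADIX_WIDTH);
}
```

The XOR isolates the differing bits, the bit-scan finds the highest one, and the division converts a bit position into a trie level.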
Dmitry Chagin
cea7c564c7 namei: Reset the lookup to start from the real root for abs symlink target
Since fd745e1d, the Linux ABI specifies an alternative root directory to reroot
lookups. First, an attempt is made to look up the file in /ABI/original-path.
If that fails, the lookup is done in /original-path. When looking up a
symbolic link whose target has a leading /, namei() fails because the reroot
reloads the original file name.
To avoid this, handle the restart in a special manner, without reloading the
original path name.

Reported by:		Goran Mekić, Vincent Milum Jr
Tested by:		Goran Mekić
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D40479
2023-06-13 15:24:25 +03:00
Dmitry Chagin
861abdadf9 namei: Add a comment explaining ISRESTARTED flag
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D40494
2023-06-13 15:22:09 +03:00
Dmitriy Alexandrov
6016aedba1 uipc_syscalls: removed unnecessary check in accept1() function
Signed-off-by: Dmitriy Alexandrov <d06alexandrov@gmail.com>
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/773
2023-06-12 08:49:13 -06:00
Dmitriy Alexandrov
af9ce4e9bb kern_ntptime: Fix undefined behavior of the shift operator
When the L_LINT macro is used with negative numbers [i.e.
L_LINT(time_freq, -MAXFREQ)], it can cause undefined
behavior. It should be similar to the L_RSHIFT(v, n) macro.

MFC after:	2 weeks
Reviewed by:	cy
Pull Request:	https://github.com/freebsd/freebsd-src/pull/769
Signed-off-by: Dmitriy Alexandrov <d06alexandrov@gmail.com>
2023-06-09 14:04:54 -07:00
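
Left-shifting a negative signed value is undefined behavior in C, which is the hazard the commit above addresses. The sketch below shows one common way around it, shifting the magnitude and negating the result; `int_to_fixed` is a hypothetical function rather than the kernel macro, and the 32-bit fraction is an assumption.

```c
#include <stdint.h>

/*
 * Convert an integer to a fixed-point value with 32 fraction bits without
 * ever left-shifting a negative operand.
 */
static int64_t
int_to_fixed(int32_t a)
{
	if (a >= 0)
		return ((int64_t)a << 32);
	/* Assumes a > INT32_MIN so that the shifted magnitude fits in int64_t. */
	return (-((int64_t)(-(int64_t)a) << 32));
}
```

The negative branch mirrors the usual treatment of right shifts on negative values: operate on the positive magnitude, then restore the sign.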
Warner Losh
9121945d70 Regenerate sysent stuff after $FreeBSD$ removal
Sponsored by:		Netflix
2023-06-09 07:28:27 -06:00
John Baldwin
653738e895 ptrace: Clear TDB_BORN during PT_DETACH.
If a debugger detaches from a process that has a new thread that has
not yet executed, the new thread will raise a SIGTRAP signal to report
its thread birth event even after the detach.  With the debugger
detached, this results in a SIGTRAP sent to the process and typically
a core dump.  Fix this by clearing TDB_BORN from any new threads
during detach.

Bump __FreeBSD_version for debuggers to notice when the fix is
present.

Reported by:	GDB's testsuite
Reviewed by:	kib, markj (previous version)
Differential Revision:	https://reviews.freebsd.org/D39856
2023-06-07 12:28:36 -07:00
Colin Percival
9d6ae1e3c2 Revert "Revert "tslog: Annotate some early boot functions""
Now that <sys/tslog.h> is wrapped in #ifdef _KERNEL, it's safe to have
tslog annotations in files which might be built from userland (i.e. in
subr_boot.c, which is built as part of the boot loader).

This reverts commit 59588a546f.
2023-06-04 22:49:38 -07:00
Colin Percival
59588a546f Revert "tslog: Annotate some early boot functions"
The change to subr_boot.c broke the libsa build because the TSLOG
macros have their own definitions for the boot loader -- I didn't
realize that the loader code used subr_boot.c.

I'm currently testing a fix and I'll revert this revert once I'm
satisfied that everything works, but I don't want to leave the
tree broken for too long.

This reverts commit 469cfa3c30.
2023-06-04 11:39:45 -07:00
Colin Percival
45cc8519f5 tslog: Annotate parts of SYSINIT cpu
Booting an amd64 kernel on Firecracker with 1 CPU and 128 MB of RAM,
SYSINIT cpu takes roughly 2770 us:
* 2280 us in vm_ksubmap_init
  * 535 us in kmem_malloc
    * 450 us in pmap_zero_page
  * 1720 us in pmap_growkernel
    * 1620 us in pmap_zero_page
* 80 us in bufinit
* 480 us in cpu_setregs
  * 430 us in cpu_setregs calling load_cr0

Much of this is hypervisor overhead: load_cr0 is slow because it traps
to the hypervisor, and 99% of the time in pmap_zero_page is spent when
we first touch the page, presumably due to the host Linux kernel
faulting in backing pages one by one.

Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D40327
2023-06-04 10:16:35 -07:00
Colin Percival
469cfa3c30 tslog: Annotate some early boot functions
Booting an amd64 kernel on Firecracker with 1 CPU and 128 MB of RAM,
hammer_time takes roughly 2740 us:
* 55 us in xen_pvh_parse_preload_data
  * 20 us in boot_parse_cmdline_delim
  * 20 us in boot_env_to_howto
* 15 us in identify_hypervisor
* 1320 us in link_elf_reloc
  * 1310 us in relocate_file1 handling ef->rela
* 25 us in init_param1
* 30 us in dpcpu_init
* 355 us in initializecpu
  * 255 us in initializecpu calling load_cr4
* 425 us in getmemsize
  * 280 us in pmap_bootstrap
    * 205 us in create_pagetables
* 10 us in init_param2
* 25 us in pci_early_quirks
* 60 us in cninit
* 90 us in kdb_init
* 105 us in msgbufinit
* 20 us in fpuinit
* 205 us elsewhere in hammer_time

Some of these are unavoidable (e.g. identify_hypervisor uses CPUID and
load_cr4 loads the CR4 register, both of which trap to the hypervisor)
but others may deserve attention.

Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D40325
2023-06-04 10:16:22 -07:00
Colin Percival
02d9045866 tslog: Handle curthread equal to NULL
Early in the kernel boot, curthread goes through three stages:

1. Kernel crash when you try to access it, because PCPU doesn't exist.
2. NULL, because PCPU exists but isn't initialized.
3. &thread0, which is where most of the kernel boot process runs.

This broke TSLOG from inside hammer_time since the scripts which parse
logged records didn't understand that NULL meant &thread0.

Tell tslog to record &thread0 as the active thread if passed NULL.

Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D40324
2023-06-04 10:16:22 -07:00
Mark Johnston
67f938c5ff kevent: Make references to filter definitions const
Follow-up revisions can make individual filter definitions const.  No
functional change intended.

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35842
2023-06-02 13:43:15 -04:00
Mark Johnston
3080f82b8b ktrace: Make the data lengths table const
No functional change intended.

MFC after:	1 week
2023-06-01 17:18:23 -04:00
Mark Johnston
0d3f1b4f25 signal: Make the signal disposition table const
No functional change intended.

MFC after:	1 week
2023-06-01 17:18:23 -04:00
Baptiste Daroussin
afbb26b58b devctl: allow to register a hook to receive the events
In preparation for netlink sysvent, add a function that allows
registering a function to hook the events and also send them via
another kernel module (nlsysvent will be that module).

Prepare a static list of known existing events in the kernel that
will be used to prepopulate the nlsysvent multicast groups (one per event).
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D37573
2023-06-01 23:01:40 +02:00
Jessica Clarke
94426d21bf pmc: Rework PROCEXEC event to support PIEs
Currently the PROCEXEC event only reports a single address, entryaddr,
which is the entry point of the interpreter in the typical dynamic case,
and used solely to calculate the base address of the interpreter. For
PDEs this is fine, since the base address is known from the program
headers, but for PIEs the base address varies at run time based on where
the kernel chooses to load it, and so pmcstat has no way of knowing the
real address ranges for the executable. This was less of an issue in the
past since PIEs were rare, but now they're on by default on 64-bit
architectures it's more of a problem.

To solve this, pass through what was picked for et_dyn_addr by the
kernel, and use that as the offset for the executable's start address
just as is done for everything in the kernel. Since we're changing this
interface, sanitise the way we determine the interpreter's base address
by passing it through directly rather than indirectly via the entry
point and having to subtract off whatever the ELF header's e_entry is
(and anything that wants the entry point in future can still add that
back on as needed; this merely changes the interface to directly provide
the underlying variables involved).

This will be followed up by a bump to the pmc major version.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D39595
2023-05-31 00:20:36 +01:00
Jessica Clarke
659a0041dd imgact: Make et_dyn_addr part of image_params
This already gets passed around between various imgact_elf functions, so
moving it removes an argument from all those places. A future commit
will make use of this for hwpmc, though, to provide the load base for
PIEs, which currently isn't available to tools like pmcstat.

Reviewed by:	kib, markj, jhb
Differential Revision:	https://reviews.freebsd.org/D39594
2023-05-31 00:15:43 +01:00
Jessica Clarke
53d0b9e438 pmc: Provide full path to modules from kernel linker
This unifies the user object and kernel module paths in libpmcstat,
allows modules loaded from non-standard locations (e.g. from a user's
home directory when testing) to be found and, since buffer is what all
the warnings here use (they were never updated when buffer_modules were
added to pick based on where the file was found) has the side-effect of
ensuring the messages are correct.

This includes obsoleting the now-superfluous -k option in pmcstat.

This change breaks the hwpmc ABI and will be followed by a bump to the
pmc major version.

Reviewed by:	jhb, jkoshy, mhorne
Differential Revision:	https://reviews.freebsd.org/D40048
2023-05-31 00:15:34 +01:00
Mark Johnston
4e78addbef buf: Make the number of pbufs slightly more dynamic
Various subsystems pre-allocate a set of pbufs, allocated to implement
I/O operations.  pbuf allocations are transient, unlike most buf
allocations.

Most subsystems preallocate nswbuf or nswbuf/2 pbufs each.  The
preallocation ensures that pbuf allocation will succeed in low memory
conditions, which might help avoid deadlocks.  Currently we initialize
nswbuf = min(nbuf / 4, 256).

nbuf/4 > 256 on anything but the smallest systems.  For example,
nswbuf is 256 in a VM with 128MB of memory.  In this configuration, a
firecracker VM with one CPU preallocates over 900 pbufs.  This consumes
2MB of RAM and adds several milliseconds to the kernel's (very small)
boot time.

Scale nswbuf by ncpu in the common case.  I think this makes more sense
than scaling by the amount of RAM, since pbuf allocations are transient
and aren't used for caching.  With the change, we get nswbuf=256 with 8
CPUs.  With fewer than 8 CPUs we'll preallocate fewer pbufs than before,
and with more we'll preallocate more.

Event:		BSDCan 2023
Reported by:	cperciva
Reviewed by:	glebius, kib
MFC after:	2 months
Differential Revision:	https://reviews.freebsd.org/D40216
2023-05-30 15:11:32 -04:00
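
The sizing change in the commit above can be illustrated with a short sketch. The old formula is taken from the message; the new one is an assumption: a per-CPU factor of 32 is inferred only from "nswbuf=256 with 8 CPUs", the nbuf/4 cap is assumed to remain, and any minimum floor is left out. `imin`, `nswbuf_old`, `nswbuf_scaled` and `PBUFS_PER_CPU` are hypothetical names.

```c
/* Smaller of two ints. */
static int
imin(int a, int b)
{
	return (a < b ? a : b);
}

/* Previous sizing as described in the message: min(nbuf / 4, 256). */
static int
nswbuf_old(int nbuf)
{
	return (imin(nbuf / 4, 256));
}

/* Assumed per-CPU factor, chosen so that 8 CPUs yields 256. */
#define PBUFS_PER_CPU 32

/* Sketch of CPU-scaled sizing; linear scaling is an assumption. */
static int
nswbuf_scaled(int nbuf, int ncpu)
{
	return (imin(nbuf / 4, PBUFS_PER_CPU * ncpu));
}
```

Under these assumptions a single-CPU VM would get 32 instead of 256, while a 16-CPU machine would get 512.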