Replace the 'count' field in a trie node with a bitmap that
identifies non-NULL children. Drop the 'last' field, and use the
last bit set in the bitmap instead. In lookup_le, lookup_ge,
remove, and reclaim_all, use the bitmap to find the
previous/next/only/every non-null child in constant time by
examining the bitmask instead of looping across array elements
and null-checking them one-by-one.
A buildworld test suggests that eliminating these null-checks
reduces the cycle count of those four functions by 4.9%, 1.5%,
0.0% and 13.3%, respectively.
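A minimal sketch of the bitmap idea, not the code from this change.
The struct layout, the 16-child fan-out and every sketch_* name are
assumptions made for illustration, and __builtin_ctz/__builtin_clz
are GCC/Clang builtins standing in for the kernel's ffs/fls helpers:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_COUNT 16	/* children per node; an assumed value */

/* Hypothetical node: bit i of popmap is set iff child[i] is non-NULL. */
struct sketch_node {
	uint16_t	 popmap;
	void		*child[SKETCH_COUNT];
};

/* Least occupied slot at or above 'slot' (0..15), or -1 if none. */
static int
sketch_next_child(const struct sketch_node *n, int slot)
{
	unsigned int mask;

	mask = n->popmap & (~0u << slot);
	return (mask != 0 ? __builtin_ctz(mask) : -1);
}

/* Greatest occupied slot at or below 'slot' (0..15), or -1 if none. */
static int
sketch_prev_child(const struct sketch_node *n, int slot)
{
	unsigned int mask;

	mask = n->popmap & ((2u << slot) - 1);
	return (mask != 0 ? 31 - __builtin_clz(mask) : -1);
}
```

Each lookup is a mask plus one bit-scan, with no loop over the child
array.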
Reviewed by: alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D40775
vmspace_free() is called redundantly in the 32-bit-compatible
path in sysctl_kern_proc_vm_layout(), causing a premature free
(possibly for the current address space). Remove the extra call.
PR: 272401
Reported by: marklmi at yahoo.com
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D40908
The layout of modspecific_t is the same on little-endian and
big-endian platforms:
|0|1|2|3|4|5|6|7|
+-------+-------+
|uintval|       |
+-------+-------+
|   ulongval    |
+-------+-------+
For the following code snippet:
CP(mod->data, data32, longval);
CP(mod->data, data32, ulongval);
This works only on little-endian platforms, where truncating the
64-bit value keeps bytes 0-3. On big-endian platforms the truncation
keeps bytes 4-7 instead, which eventually returns a garbage syscall
number to 32-bit userland.
Since modspecific_t is currently used only by syscall modules, we
initialize modspecific32_t with uintval only. Now on both BE and LE
64-bit platforms it always picks up the first 4 bytes.
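The difference can be sketched in userland as follows. The two unions
are hypothetical stand-ins reduced to two members each (the real
modspecific_t has more), and the sketch_* names are illustrative only:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the 64-bit modspecific_t. */
union sketch_modspecific {
	uint32_t	uintval;
	uint64_t	ulongval;
};

/* Hypothetical stand-in for the 32-bit compat modspecific32_t. */
union sketch_modspecific32 {
	uint32_t	uintval;
	uint32_t	ulongval;
};

/* Copy via the 32-bit member: reads bytes 0-3 on either byte order. */
static void
sketch_copyout32(const union sketch_modspecific *src,
    union sketch_modspecific32 *dst)
{
	dst->uintval = src->uintval;
}

/*
 * Copy via the 64-bit member, as CP(..., ulongval) would: the value is
 * truncated, so the result depends on which bytes hold the low half.
 */
static uint32_t
sketch_truncated(const union sketch_modspecific *src)
{
	return ((uint32_t)src->ulongval);
}
```

With the union zeroed and uintval set to a syscall number,
sketch_copyout32() returns that number on any platform, while
sketch_truncated() returns it only on little-endian machines.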
Sponsored by: Juniper Networks, Inc.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D40814
MFC after: 1 week
There is no longer any point to maintaining a binary search routine
for ffs; inlines will always do it as well or better.
Reviewed by: mhorne
Differential Revision: https://reviews.freebsd.org/D40703
HAVE_INLINE_FLSLL is #defined always. This change assumes that where
__HAVE_INLINE_FLSLL is tested, the two leading underscores are a
mistake, and that the code will be better for using the efficient
flsll implementation.
Reviewed by: markj, mhorne
Differential Revision: https://reviews.freebsd.org/D40705
Argument unused since commit 93a0ba8f49.
Rename it to enforce_lkflags(), which seems to more aptly describe what it does.
[mjg: massaged the commit message a little]
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D40848
before calling vn_fullpath_hardlink(). Otherwise we get random failures
when the len is automatically clipped.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
When we are sending a terminating signal to the group, killpg() needs to
guarantee that all group members are to be terminated (it does not need
to ensure that they are terminated on return from killpg()). The
pg_killsx change eliminates the largest window there, but still, if a
multithreaded process is signalled, the following could happen:
- thread 1 is selected for the signal delivery and gets descheduled
- thread 2 waits for pg_killsx lock, obtains it and forks
- thread 1 continues executing and terminates the process
This scenario still allows the child to escape.
To fix it, count the number of signals sent to the process with
killpg(2) in the p_killpg_cnt variable, which is incremented in killpg()
and decremented after the signal handler frame is created or in exit1()
after single-threading. This way we avoid forking if the termination is
due.
Noted and reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D40493
If a process group member performs fork(), the child could escape
signalling from killpg(). Prevent it by introducing an sx process group
lock pg_killsx which is taken interruptibly shared around fork. If there
is a pending signal, do the trip through userspace with ERESTART to
handle signal ASTs. The lock is taken exclusively during killpg().
The lock is also locked exclusive when the process changes group
membership, to avoid escaping a signal by this means, by ensuring that
the process group is stable during fork.
Note that the new lock is ordered before the proctree lock, so in some
situations we can only trylock it.
This relatively simple approach cannot work for REAP_KILL, because a
process potentially belongs to more than one reaper tree by having
sub-reapers.
Reported by: dchagin
Tested by: dchagin, pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D40493
Per https://reviews.llvm.org/D68115, only the first field is
zero-initialized, while the other fields are undef.
The same pattern can be observed with clang: when
-ftrivial-auto-var-init=pattern is specified, the non-active fields
are filled with 0xaa; otherwise they are zero-initialized.
Technically both are acceptable when using clang. However, it
would be good to simply bzero the modspecific_t in such a case to
conform strictly to the standard.
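A sketch of the explicit-clear approach, using memset() as a portable
equivalent of bzero(). The union is a reduced, hypothetical stand-in
for modspecific_t and sketch_init() is not a kernel function:

```c
#include <string.h>

/* Hypothetical stand-in; the members have different sizes on LP64. */
union sketch_modspecific {
	int		intval;
	unsigned long	ulongval;
};

/*
 * Clear every byte of the union first instead of relying on an
 * initializer, which only covers the first member, then store the
 * one member that is actually used.
 */
static void
sketch_init(union sketch_modspecific *ms, int value)
{
	memset(ms, 0, sizeof(*ms));
	ms->intval = value;
}
```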
MFC with: 2cab2d43b8
MFC after: 1 day
Sponsored by: Juniper Networks, Inc.
Reviewed by: delphij
Differential Revision: https://reviews.freebsd.org/D40830
Zero-initialize the whole modspecific_t so that there would
not be kernel stack content leak in the unused part.
Sponsored by: Juniper Networks, Inc.
MFC after: 1 day
Differential Revision: https://reviews.freebsd.org/D40815
When searching for a free irq map location, continue the search from
the beginning of the list. There may be holes in the map before
irq_map_first_free_idx, e.g. removing entries in order will increase
the index past the current free entry.
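The wrap-around search can be sketched as below. The structure and
function are hypothetical and much simpler than the real irq map code;
only the search order is the point:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the irq map and its first-free hint. */
struct sketch_irq_map {
	bool	*used;
	size_t	 count;
	size_t	 first_free_idx;
};

/* Return the index of a free slot, or 'count' if the map is full. */
static size_t
sketch_find_free(const struct sketch_irq_map *m)
{
	size_t i;

	/* First pass: from the hint to the end of the map. */
	for (i = m->first_free_idx; i < m->count; i++)
		if (!m->used[i])
			return (i);
	/* Second pass: wrap around to catch holes before the hint. */
	for (i = 0; i < m->first_free_idx && i < m->count; i++)
		if (!m->used[i])
			return (i);
	return (m->count);
}
```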
PR: 271990
Reviewed by: mhorne
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D40768
Let node_get calculate its own owner value. Don't pass the count
parameter, since it's always 2. Save 16 bytes in insert(). Move,
without modifying, slot and trimkey to handle a use-before-declaration
problem.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D40723
This is purely a cosmetic change. vm_radix.c has lines that reach past
column 80 and this change cleans that up. The associated changes to
subr_pctrie.c are just to keep mirroring vm_radix.c.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D40764
since its only reason to exist is removed.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D40700
In _lookup_ge, where a loop "looks for an available edge or val within
the current bisection node" (to quote the code comment), the value of
index has already been modified to guarantee that it is the least
value that can be found in the non-NULL child node being
examined. Therefore, if the non-NULL child is a leaf, there's no need
to compare 'index' to anything, and the value can just be returned.
The same is true for _lookup_le with 'most' replacing 'least'.
Reviewed by: alc
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D40746
Replacing a branch and two shifts with a single masking operation
saves 64 bytes in the pair of functions lookup_le and lookup_ge on
amd64. Refresh the associated comments.
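One plausible shape of such a transformation is sketched here; it is
not copied from the change. The 4-bit level width and the sketch_*
names are assumptions, and level * SKETCH_WIDTH must stay below 64:

```c
#include <stdint.h>

#define SKETCH_WIDTH 4	/* key bits per trie level; an assumed value */

/* Branch-and-shift form: clear the key bits below 'level'. */
static uint64_t
sketch_trim_shift(uint64_t index, unsigned int level)
{
	if (level > 0) {
		index >>= level * SKETCH_WIDTH;
		index <<= level * SKETCH_WIDTH;
	}
	return (index);
}

/*
 * Masking form: negating a power of two (modulo 2^64) yields a mask
 * with that bit and all higher bits set.
 */
static uint64_t
sketch_trim_mask(uint64_t index, unsigned int level)
{
	return (index & -((uint64_t)1 << (level * SKETCH_WIDTH)));
}
```

Both forms return the same value for every level in range; the second
needs no branch for the level 0 case.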
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D40722
In the vm_radix:remove loop that searches for the last child, load
that child once, without loading it again after the search is over.
Change KASSERTS from index check to NULL node check.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D40721
Use flsll(), instead of a loop, to find where two keys differ, and
then arithmetic to transform that to a trie level.
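A sketch of that computation, assuming a 4-bit level width and levels
numbered from the least significant digit. sketch_flsll() is a
portable stand-in for flsll(3) built on a GCC/Clang builtin, and both
function names are illustrative:

```c
#include <stdint.h>

#define SKETCH_WIDTH 4	/* key bits per trie level; an assumed value */

/* 1-based position of the highest set bit, or 0 if x is zero. */
static int
sketch_flsll(uint64_t x)
{
	return (x == 0 ? 0 : 64 - __builtin_clzll(x));
}

/*
 * Level of the highest digit in which two keys differ.
 * The keys must not be equal.
 */
static int
sketch_keydiff_level(uint64_t a, uint64_t b)
{
	return ((sketch_flsll(a ^ b) - 1) / SKETCH_WIDTH);
}
```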
Approved by: alc, markj
Differential Revision: https://reviews.freebsd.org/D40585
Since fd745e1d, the Linux ABI specifies an alternative root directory to
reroot lookups. First, an attempt is made to look up the file in
/ABI/original-path. If that fails, the lookup is done in /original-path.
When looking up a symbolic link whose target has a leading /, namei()
fails because the reroot reloads the original file name.
To avoid this, handle the restart in a special manner, without reloading
the original path name.
Reported by: Goran Mekić, Vincent Milum Jr
Tested by: Goran Mekić
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D40479
When the L_LINT macro is used with negative numbers [e.g.
L_LINT(time_freq, -MAXFREQ)], it can cause undefined
behavior. It should be similar to the L_RSHIFT(v, n) macro.
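The undefined behavior comes from left-shifting a negative signed
value. A sketch of the negate, shift, negate pattern follows, written
as a hypothetical function rather than the macro itself, and assuming
a 32.32 fixed-point result:

```c
#include <stdint.h>

/*
 * Convert an integer to 32.32 fixed point without ever left-shifting
 * a negative value. Not valid for INT32_MIN, whose magnitude would
 * overflow the shift.
 */
static int64_t
sketch_lint(int32_t a)
{
	if (a < 0)
		return (-((int64_t)(-(int64_t)a) << 32));
	return ((int64_t)a << 32);
}
```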
MFC after: 2 weeks
Reviewed by: cy
Pull Request: https://github.com/freebsd/freebsd-src/pull/769
Signed-off-by: Dmitriy Alexandrov <d06alexandrov@gmail.com>
If a debugger detaches from a process that has a new thread that has
not yet executed, the new thread will raise a SIGTRAP signal to report
its thread birth event even after the detach. With the debugger
detached, this results in a SIGTRAP sent to the process and typically
a core dump. Fix this by clearing TDB_BORN from any new threads
during detach.
Bump __FreeBSD_version for debuggers to notice when the fix is
present.
Reported by: GDB's testsuite
Reviewed by: kib, markj (previous version)
Differential Revision: https://reviews.freebsd.org/D39856
Now that <sys/tslog.h> is wrapped in #ifdef _KERNEL, it's safe to have
tslog annotations in files which might be built from userland (i.e. in
subr_boot.c, which is built as part of the boot loader).
This reverts commit 59588a546f.
The change to subr_boot.c broke the libsa build because the TSLOG
macros have their own definitions for the boot loader -- I didn't
realize that the loader code used subr_boot.c.
I'm currently testing a fix and I'll revert this revert once I'm
satisfied that everything works, but I don't want to leave the
tree broken for too long.
This reverts commit 469cfa3c30.
Booting an amd64 kernel on Firecracker with 1 CPU and 128 MB of RAM,
SYSINIT cpu takes roughly 2770 us:
* 2280 us in vm_ksubmap_init
  * 535 us in kmem_malloc
    * 450 us in pmap_zero_page
  * 1720 us in pmap_growkernel
    * 1620 us in pmap_zero_page
* 80 us in bufinit
* 480 us in cpu_setregs
  * 430 us in cpu_setregs calling load_cr0
Much of this is hypervisor overhead: load_cr0 is slow because it traps
to the hypervisor, and 99% of the time in pmap_zero_page is spent when
we first touch the page, presumably due to the host Linux kernel
faulting in backing pages one by one.
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D40327
Booting an amd64 kernel on Firecracker with 1 CPU and 128 MB of RAM,
hammer_time takes roughly 2740 us:
* 55 us in xen_pvh_parse_preload_data
  * 20 us in boot_parse_cmdline_delim
  * 20 us in boot_env_to_howto
* 15 us in identify_hypervisor
* 1320 us in link_elf_reloc
  * 1310 us in relocate_file1 handling ef->rela
* 25 us in init_param1
* 30 us in dpcpu_init
* 355 us in initializecpu
  * 255 us in initializecpu calling load_cr4
* 425 us in getmemsize
  * 280 us in pmap_bootstrap
    * 205 us in create_pagetables
* 10 us in init_param2
* 25 us in pci_early_quirks
* 60 us in cninit
* 90 us in kdb_init
* 105 us in msgbufinit
* 20 us in fpuinit
* 205 us elsewhere in hammer_time
Some of these are unavoidable (e.g. identify_hypervisor uses CPUID and
load_cr4 loads the CR4 register, both of which trap to the hypervisor)
but others may deserve attention.
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D40325
Early in the kernel boot, curthread goes through three stages:
1. Kernel crash when you try to access it, because PCPU doesn't exist.
2. NULL, because PCPU exists but isn't initialized.
3. &thread0, which is where most of the kernel boot process runs.
This broke TSLOG from inside hammer_time since the scripts which parse
logged records didn't understand that NULL meant &thread0.
Tell tslog to record &thread0 as the active thread if passed NULL.
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D40324
In preparation for netlink sysvent, add a function that allows
registering a function to hook the events and also send them via
another kernel module (nlsysvent will be that module).
Prepare a static list of known existing events in the kernel that
will be used to prepopulate the nlsysvent multicast groups (one per
event).
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D37573
Currently the PROCEXEC event only reports a single address, entryaddr,
which is the entry point of the interpreter in the typical dynamic case,
and used solely to calculate the base address of the interpreter. For
PDEs this is fine, since the base address is known from the program
headers, but for PIEs the base address varies at run time based on where
the kernel chooses to load it, and so pmcstat has no way of knowing the
real address ranges for the executable. This was less of an issue in the
past since PIEs were rare, but now that they're on by default on 64-bit
architectures it's more of a problem.
To solve this, pass through what was picked for et_dyn_addr by the
kernel, and use that as the offset for the executable's start address
just as is done for everything in the kernel. Since we're changing this
interface, sanitise the way we determine the interpreter's base address
by passing it through directly rather than indirectly via the entry
point and having to subtract off whatever the ELF header's e_entry is
(and anything that wants the entry point in future can still add that
back on as needed; this merely changes the interface to directly provide
the underlying variables involved).
This will be followed up by a bump to the pmc major version.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D39595
This already gets passed around between various imgact_elf functions, so
moving it removes an argument from all those places. A future commit
will make use of this for hwpmc, though, to provide the load base for
PIEs, which currently isn't available to tools like pmcstat.
Reviewed by: kib, markj, jhb
Differential Revision: https://reviews.freebsd.org/D39594
This unifies the user object and kernel module paths in libpmcstat,
allows modules loaded from non-standard locations (e.g. from a user's
home directory when testing) to be found and, since buffer is what all
the warnings here use (they were never updated when buffer_modules was
added to pick based on where the file was found), has the side-effect of
ensuring the messages are correct.
This includes obsoleting the now-superfluous -k option in pmcstat.
This change breaks the hwpmc ABI and will be followed by a bump to the
pmc major version.
Reviewed by: jhb, jkoshy, mhorne
Differential Revision: https://reviews.freebsd.org/D40048
Various subsystems pre-allocate a set of pbufs, allocated to implement
I/O operations. pbuf allocations are transient, unlike most buf
allocations.
Most subsystems preallocate nswbuf or nswbuf/2 pbufs each. The
preallocation ensures that pbuf allocation will succeed in low memory
conditions, which might help avoid deadlocks. Currently we initialize
nswbuf = min(nbuf / 4, 256).
nbuf/4 > 256 on anything but the smallest systems. For example,
nswbuf is 256 in a VM with 128MB of memory. In this configuration, a
Firecracker VM with one CPU preallocates over 900 pbufs. This consumes
2MB of RAM and adds several milliseconds to the kernel's (very small)
boot time.
Scale nswbuf by ncpu in the common case. I think this makes more sense
than scaling by the amount of RAM, since pbuf allocations are transient
and aren't used for caching. With the change, we get nswbuf=256 with 8
CPUs. With fewer than 8 CPUs we'll preallocate fewer pbufs than before,
and with more we'll preallocate more.
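The arithmetic can be sketched as follows. The figure of 32 pbufs per
CPU is inferred from "nswbuf=256 with 8 CPUs" above, and keeping the
nbuf/4 bound in the new formula is an assumption; the function names
are illustrative only:

```c
/* Inferred from "nswbuf=256 with 8 CPUs"; an assumption. */
#define SKETCH_PBUFS_PER_CPU 32

static int
sketch_min(int a, int b)
{
	return (a < b ? a : b);
}

/* Old sizing: capped at a fixed 256. */
static int
sketch_nswbuf_old(int nbuf)
{
	return (sketch_min(nbuf / 4, 256));
}

/* New sizing: the cap scales with the number of CPUs. */
static int
sketch_nswbuf_new(int nbuf, int ncpu)
{
	return (sketch_min(nbuf / 4, SKETCH_PBUFS_PER_CPU * ncpu));
}
```

Under these assumptions a one-CPU VM gets nswbuf=32 instead of 256,
an 8-CPU machine is unchanged, and a 16-CPU machine gets 512 provided
nbuf/4 is at least that large.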
Event: BSDCan 2023
Reported by: cperciva
Reviewed by: glebius, kib
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D40216