it. The introduction of lockless switch in r355784 created a race to
re-use the exiting thread that was only possible to hit on a hypervisor.
Reported/Tested by: rlibby
Discussed with: rlibby, jhb
Intel Speed Shift is Intel's technology to control frequency in hardware,
with hints from software.
Let's get a working version of this in the tree and we can refine it from
here.
Submitted by: bwidawsk, scottph
Reviewed by: bcr (manpages), myself
Discussed with: jhb, kib (earlier versions)
With feedback from: Greg V, gallatin, freebsdnewbie AT freenet.de
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18028
The vnode pager does not want the object lock held. Moving this out allows
further object lock scope reduction in callers. While here add some missing
paging in progress calls and an assert. The object handle is now protected
explicitly with pip.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23033
Since r356672 ("vfs: rework vnode list management") there is nothing to do
apart from altering freevnodes count, but this much can be safely done based
on the result of atomic_fetchadd.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23186
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.
Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.
Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.
Reviewed by: brooks, imp
Nice!: emaste
Differential Revision: https://reviews.freebsd.org/D23197
It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.
This soothes a global serialisation point for all 0<->1 hold count transitions.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23235
Prior to introduction of this op libc's readdir would call fstatfs(2), in
effect unnecessarily copying kilobytes of data just to check fs name and a
mount flag.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D23162
The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.
The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.
Note the entire scheme needs a rewrite, the above just reduces it's SMP impact.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23140
Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D23158
Take advantage of global ordering introduced in r356672.
Reviewed by: mckusick (previous version)
Differential Revision: https://reviews.freebsd.org/D23067
ordering to allocate early pages in the same way boot pages were but only
as needed. After the KVA allocator has started up we allocate the KVA that
we consumed during boot. This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.
Parts of this patch were written and tested by markj.
Reviewed by: glebius, markj
Differential Revision: https://reviews.freebsd.org/D23102
mount point while numerous tests are running that are writing to
files on that mount point cause the unmount(8) to hang forever.
The unmount(8) system call is handled in the kernel by the dounmount()
function. The cause of the hang is that prior to dounmount() calling
VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT
flag indicates that VFS_SYNC() should not return until all the dirty
buffers associated with the mount point have been written to disk.
Because user processes are allowed to continue writing and can do
so faster than the data can be written to disk, the call to VFS_SYNC()
can never finish.
Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes
when they request to do a write thus having a finite number of dirty
buffers to write that cannot be expanded. There is no need to call
VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs
to flush everything again anyway after suspending writes, to catch
anything that was dirtied between the VFS_SYNC() and writes being
suspended.
The fix is to simply remove the unnecessary call to VFS_SYNC() from
dounmount().
Reported by: Peter Holm
Analysis by: Chuck Silvers
Tested by: Peter Holm
MFC after: 7 days
Sponsored by: Netflix
subsystems tend to need to know about it, and including if_var.h is
huge header pollution for them. Polluting possible non-network
users with single symbol seems much lesser evil.
- Remove non-preemptible network epoch. Not used yet, and unlikely
to get used in close future.
The only reason to vlazy there is to (overzealously) ensure all vnodes
which need to be visited by msync scan can be found there.
In particluar this is of no use zfs and tmpfs.
While here depessimize the check.
Remove assumptions about the minimum MINALLOCSIZE, in order to allow
testing of smaller MINALLOCSIZE. A following patch will lower the
MINALLOCSIZE, but not so much that the present patch is required for
correctness at these sites.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.
Per-cpu areas are locked in order to synchronize against UMA freeing
memory.
vnode's v_mflag is converted to short to prevent the struct from
growing.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22998
The current notion of an active vnode is eliminated.
Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.
Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22997
This obviates the need to scan the entire active list looking for vnodes
of interest.
msync is handled by adding all vnodes with write count to the lazy list.
deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.
Vnodes get dequeued from the list when their hold count reaches 0.
Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22995
Treat it as a synonym for GRND_NONBLOCK. The reasoning is this:
We have two choices for handling Linux's GRND_INSECURE API flag.
1. We could ignore it completely (like GRND_RANDOM). However, this might
produce the surprising result of GRND_INSECURE requests blocking, when the
Linux API does not block.
2. Alternatively, we could treat GRND_INSECURE requests as requests for
GRND_NONBLOCk. Here, the surprising result for Linux programs is that
invocations with unseeded random(4) will produce EAGAIN, rather than
garbage.
Honoring the flag in the way Linux does seems fraught. If we actually use
the output of a random(4) implementation prior to seeding, we leak some
entropy (in an information theory and also practical sense) from what will
be the initial seed to attackers (or allow attackers to arbitrary DoS
initial seeding, if we don't leak). This seems unacceptable -- it defeats
the purpose of blocking on initial seeding.
Secondary to that concern, before seeding we may have arbitrarily little
entropy collected; producing output from zero or a handful of entropy bits
does not seem particularly useful to userspace.
If userspace can accept garbage, insecure, non-random bytes, they can create
their own insecure garbage with srandom(time(NULL)) or similar. Any program
which would be satisfied with a 3-bit key CTR stream has no need for CSPRNG
bytes. So asking the kernel to produce such an output from the secure
getrandom(2) API seems inane.
For now, we've elected to emulate GRND_INSECURE as an alternative spelling
of GRND_NONBLOCK (2). Consider this API not-quite stable for now. We
guarantee it will never block. But we will attempt to monitor actual port
uptake of this bizarre API and may revise our plans for the unseeded
behavior (prior stable/13 branching).
Approved by: csprng(markm), manpages(bcr)
See also: https://lwn.net/ml/linux-kernel/cover.1577088521.git.luto@kernel.org/
See also: https://lwn.net/ml/linux-kernel/20200107204400.GH3619@mit.edu/
Differential Revision: https://reviews.freebsd.org/D23130
This creates a dedicated routine (vn_alloc) to allocate vnodes.
As a side effect code duplicationw with getnewvnode_reserve is eleminated.
Add vn_free for symmetry.
Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.
Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.
When either makesyscalls.lua or syscalls.master changes, all of the
${GENERATED} targets are now out-of-date. With make jobs > 1, this means we
will run the makesyscalls script in parallel for the same ABI, generating
the same set of output files.
Prior to r356603 , there is a large window for interlacing output for some
of the generated files that we were generating in-place rather than staging
in a temp dir. After that, we still should't need to run the script more
than once per-ABI as the first invocation should update all of them. Add
.ORDER to do so cleanly.
Reviewed by: brooks
Discussed with: sjg
Differential Revision: https://reviews.freebsd.org/D23099
Otherwise the malloc type accounting in malloc_domainset(9) is wrong
after r355203.
Reviewed by: rlibby
Reported by: kaktus
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23095
Other mechanisms that resize the shmfd grab a write lock from 0 to OFF_MAX
for safety, so we still get proper synchronization of shmfd->shm_size in
effect. There's no need to block readers/writers of earlier segments when
we're just reserving more space, so narrow the scope -- it would likely be
safe to narrow it completely to just the section of the range that extends
beyond our current size, but this likely isn't worth it since the size isn't
stable until the writelock is granted the first time.
Suggested by: cem (passing comment)
Linux expects to be able to use posix_fallocate(2) on a memfd. Other places
would use this with shm_open(2) to act as a smarter ftruncate(2).
Test has been added to go along with this.
Reviewed by: kib (earlier version)
Differential Revision: https://reviews.freebsd.org/D23042