Commit graph

17117 commits

Author SHA1 Message Date
Jeff Roberson
91e31c3c08 Consistently use busy and vm_page_valid() rather than touching page bits
directly.  This improves API compliance, asserts, etc.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23283
2020-01-23 04:54:49 +00:00
Jeff Roberson
1eb13fce84 Block the thread lock in sched_throw() and use cpu_switch() to unblock
it.  The introduction of lockless switch in r355784 created a race to
re-use the exiting thread that was only possible to hit on a hypervisor.

Reported/Tested by:	rlibby
Discussed with:	rlibby, jhb
2020-01-23 03:36:50 +00:00
Gleb Smirnoff
ad3980121b DEVICE_POLLING is an alternative to network interrupts and also
needs to enter epoch.  Assert that in the netisr_poll() and do
the work for the idle poll routine.
2020-01-23 01:30:50 +00:00
Gleb Smirnoff
511d1afb6b Enter the network epoch for interrupt handlers of INTR_TYPE_NET.
Provide tunable to limit how many times handlers may be executed
without reentering epoch.

Differential Revision:	https://reviews.freebsd.org/D23242
2020-01-23 01:24:47 +00:00
Gleb Smirnoff
c4eb66309f Add ie_hflags to struct intr_event, which accumulates flags from all
handlers on this event.  For now handle only IH_ENTROPY in that manner.
2020-01-23 01:20:59 +00:00
Conrad Meyer
4577cf3744 cpufreq(4): Add support for Intel Speed Shift
Intel Speed Shift is Intel's technology to control frequency in hardware,
with hints from software.

Let's get a working version of this in the tree and we can refine it from
here.

Submitted by:	bwidawsk, scottph
Reviewed by:	bcr (manpages), myself
Discussed with:	jhb, kib (earlier versions)
With feedback from:	Greg V, gallatin, freebsdnewbie AT freenet.de
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D18028
2020-01-22 23:28:42 +00:00
Hans Petter Selasky
1f69a50940 Make sure the VNET is properly set when calling tcp_drop() from
the ktls taskqueue callback function.

A valid VNET is needed when updating statistics.

panic()
tcp_state_change()
tcp_drop()
ktls_reset_send_tag()
taskqueue_run_locked()
taskqueue_thread_loop()

Sponsored by:	Mellanox Technologies
2020-01-21 11:43:25 +00:00
Mateusz Guzik
6403455301 cache: revert r352613 now that vhold does not take locks
2020-01-20 19:52:23 +00:00
Mateusz Guzik
8bba93c7e0 cache: make numcachehv use counter(9) on all archs
Requested by:	kib
2020-01-20 14:42:11 +00:00
Jeff Roberson
d6e13f3b4d Don't hold the object lock while calling getpages.
The vnode pager does not want the object lock held.  Moving this out allows
further object lock scope reduction in callers.  While here add some missing
paging in progress calls and an assert.  The object handle is now protected
explicitly with pip.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23033
2020-01-19 23:47:32 +00:00
Mateusz Guzik
a9099e5b10 vfs: switch vop_stdunlock to call lockmgr_unlock
Since the flags argument is now always 0 the new call provides the same
behavior.
2020-01-19 21:41:34 +00:00
Jeff Roberson
811d05fcb7 Provide an API for interlocked refcount sleeps.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D22908
2020-01-19 18:18:17 +00:00
Mateusz Guzik
28479aaae2 vfs: allow v_holdcnt to transition 0->1 without the interlock
Since r356672 ("vfs: rework vnode list management") there is nothing to do
apart from altering freevnodes count, but this much can be safely done based
on the result of atomic_fetchadd.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23186
2020-01-19 17:47:04 +00:00
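The lock-free 0->1 transition described above can be illustrated with a minimal userland sketch. It uses C11 atomics in place of the kernel's atomic(9) routines, and every name in it (`toy_vnode`, `toy_vhold`, `toy_freevnodes`) is hypothetical; it is not the code from the commit, only the general pattern of acting on the value returned by a fetch-and-add.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for a vnode and the global free-vnode count. */
struct toy_vnode {
	atomic_int holdcnt;
};

static atomic_long toy_freevnodes;

/* Register a vnode that starts with no holds, so it counts as free. */
static void
toy_vnode_init(struct toy_vnode *vp)
{
	atomic_init(&vp->holdcnt, 0);
	atomic_fetch_add(&toy_freevnodes, 1);
}

/*
 * Take a hold without any lock.  The value returned by the fetch-and-add
 * tells the caller whether it performed the 0->1 transition; only that
 * caller adjusts the free count.
 */
static void
toy_vhold(struct toy_vnode *vp)
{
	int old = atomic_fetch_add(&vp->holdcnt, 1);

	assert(old >= 0);
	if (old == 0)
		atomic_fetch_sub(&toy_freevnodes, 1);
}
```

Because exactly one caller observes the old value 0, the free count is decremented once per transition even when several threads take holds at the same time.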
Mateusz Guzik
059cb4843b cache: counter_u64_add_protected -> counter_u64_add
Fixes booting on RISC-V where it does happen to not be equivalent.

Reported by:	lwhsu
2020-01-19 17:05:26 +00:00
Mateusz Guzik
1399033590 cache: convert numcachehv to counter(9) on 64-bit platforms
2020-01-19 05:37:27 +00:00
Mateusz Guzik
512fa9a4e0 vfs: plug a conditional assignment of lo_name in getnewvnode
It only matters for witness. No functional changes.
2020-01-19 05:36:45 +00:00
Kyle Evans
05d7dd739c sysent targets: further cleanup and deduplication
r355473 vastly improved the readability and cleanliness of these Makefiles.
Every single one of them follows the same pattern and duplicates the exact
same logic.

Now that we have GENERATED/SRCS, split SRCS up into the two parameters we'll
use for ${MAKESYSCALLS} rather than assuming a specific ordering of SRCS and
include a common sysent.mk to handle the rest. This makes it less tedious to
make sweeping changes.

Some default values are provided for GENERATED/SYSENT_*; almost all of these
just use a 'syscalls.master' and 'syscalls.conf' in cwd, and they all use
effectively the same filenames with an arbitrary prefix. Most ABIs will be
able to get away with just setting GENERATED_PREFIX and including
^/sys/conf/sysent.mk, while others only need light additions. kern/Makefile
is the notable exception, as it doesn't take a SYSENT_CONF and the generated
files are spread out between ^/sys/kern and ^/sys/sys, but it otherwise fits
the pattern enough to use the common version.

Reviewed by:	brooks, imp
Nice!:		emaste
Differential Revision:	https://reviews.freebsd.org/D23197
2020-01-18 20:37:45 +00:00
Mateusz Guzik
2d0c620272 vfs: distribute freevnodes counter per-cpu
It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.

This soothes a global serialisation point for all 0<->1 hold count transitions.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23235
2020-01-18 01:29:02 +00:00
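A rough single-threaded sketch of a per-CPU counter with a rolled-up global and a bounded-error read follows. Everything here is an assumption made for illustration: the names, the fixed CPU count, and in particular the threshold-based rollup (the commit rolls up when deferred requeueing is performed). Real per-CPU code would also need to pin the thread or enter a critical section, which this sketch omits.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_NCPU  4	/* hypothetical CPU count */
#define TOY_BATCH 8	/* hypothetical per-CPU limit before a rollup */

static long toy_free_global;		/* rolled-up value */
static long toy_free_percpu[TOY_NCPU];	/* per-CPU deltas not yet rolled up */

/* Fold one CPU's pending delta into the global value. */
static void
toy_free_rollup(int cpu)
{
	toy_free_global += toy_free_percpu[cpu];
	toy_free_percpu[cpu] = 0;
}

/* Adjust the counter on the given CPU; roll up once the delta gets large. */
static void
toy_free_adjust(int cpu, long delta)
{
	assert(cpu >= 0 && cpu < TOY_NCPU);
	toy_free_percpu[cpu] += delta;
	if (labs(toy_free_percpu[cpu]) >= TOY_BATCH)
		toy_free_rollup(cpu);
}

/* Cheap read: off by fewer than TOY_NCPU * TOY_BATCH. */
static long
toy_free_read_fast(void)
{
	return (toy_free_global);
}

/* Exact read: global value plus all pending per-CPU deltas. */
static long
toy_free_read_exact(void)
{
	long sum = toy_free_global;

	for (int i = 0; i < TOY_NCPU; i++)
		sum += toy_free_percpu[i];
	return (sum);
}
```

The trade-off is that most updates touch only CPU-local state, while readers that need a tighter answer pay for walking every per-CPU slot.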
Mateusz Guzik
d3cc535474 vfs: provide F_ISUNIONSTACK as a kludge for libc
Prior to introduction of this op libc's readdir would call fstatfs(2), in
effect unnecessarily copying kilobytes of data just to check fs name and a
mount flag.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D23162
2020-01-17 14:42:25 +00:00
Mateusz Guzik
1ad72b270c vfs: shorten lock hold time in vdbatch_process
2020-01-17 14:39:00 +00:00
Gleb Smirnoff
66c6c556b6 Change argument order of epoch_call() to more natural, first function,
then its argument.

Reviewed by:	imp, cem, jhb
2020-01-17 06:10:24 +00:00
Mateusz Guzik
66f67d5e5e vfs: increment numvnodes without the vnode list lock unless under pressure
The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.

The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.

Note the entire scheme needs a rewrite; the above just reduces its SMP impact.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23140
2020-01-16 21:45:21 +00:00
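The "atomic increment first, lock only under pressure" idea can be sketched in userland C as below. This is an illustration under stated assumptions, not the commit's code: `toy_numvnodes`, `toy_limit`, and `toy_vn_alloc_fast` are hypothetical names, and backing the increment out before falling to a slow path is a choice made for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_long toy_numvnodes;	/* hypothetical allocated-vnode count */
static long toy_limit = 4;		/* hypothetical desired maximum */

/*
 * Fast path: bump the count with a single atomic operation.  Returns true
 * if the new count is within the limit.  Otherwise the increment is undone
 * and the caller is expected to take the slow path (acquire the list lock,
 * reclaim free vnodes or wake the reclaim thread).
 */
static bool
toy_vn_alloc_fast(void)
{
	long n = atomic_fetch_add(&toy_numvnodes, 1) + 1;

	if (n <= toy_limit)
		return (true);
	atomic_fetch_sub(&toy_numvnodes, 1);
	return (false);
}
```

While the system stays under the limit, allocation never touches the contended lock at all.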
Mateusz Guzik
b7f50b9ad1 vfs: refactor vnode allocation
Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D23158
2020-01-16 21:43:13 +00:00
Mateusz Guzik
875cfc082d vfs: reimplement vlrureclaim to actually use LRU
Take advantage of global ordering introduced in r356672.

Reviewed by:	mckusick (previous version)
Differential Revision:	https://reviews.freebsd.org/D23067
2020-01-16 10:44:02 +00:00
Jeff Roberson
a81c400e75 Simplify VM and UMA startup by eliminating boot pages. Instead use careful
ordering to allocate early pages in the same way boot pages were but only
as needed.  After the KVA allocator has started up we allocate the KVA that
we consumed during boot.  This also makes the boot pages freeable since they
have vm_page structures allocated with the rest of memory.

Parts of this patch were written and tested by markj.

Reviewed by:	glebius, markj
Differential Revision:	https://reviews.freebsd.org/D23102
2020-01-16 05:01:21 +00:00
Kirk McKusick
bbb1e07d65 Peter Holm reports that his test that does an umount(8) on an active
mount point while numerous tests are running that are writing to
files on that mount point causes the umount(8) to hang forever.

The unmount(2) system call is handled in the kernel by the dounmount()
function. The cause of the hang is that prior to dounmount() calling
VFS_UNMOUNT() it is calling VFS_SYNC(mp, MNT_WAIT). The MNT_WAIT
flag indicates that VFS_SYNC() should not return until all the dirty
buffers associated with the mount point have been written to disk.
Because user processes are allowed to continue writing and can do
so faster than the data can be written to disk, the call to VFS_SYNC()
can never finish.

Unlike VFS_SYNC(), the VFS_UNMOUNT() routine can suspend all processes
when they request to do a write thus having a finite number of dirty
buffers to write that cannot be expanded. There is no need to call
VFS_SYNC() before calling VFS_UNMOUNT(), because VFS_UNMOUNT() needs
to flush everything again anyway after suspending writes, to catch
anything that was dirtied between the VFS_SYNC() and writes being
suspended.

The fix is to simply remove the unnecessary call to VFS_SYNC() from
dounmount().

Reported by:  Peter Holm
Analysis by:  Chuck Silvers
Tested by:    Peter Holm
MFC after:    7 days
Sponsored by: Netflix
2020-01-15 18:53:32 +00:00
Gleb Smirnoff
9074694339 Since this code uses if_ref()/if_rele() it must include if_var.h
explicitly, not via header pollution.
2020-01-15 03:39:11 +00:00
Gleb Smirnoff
3264dcadc9 - Move global network epoch definition to epoch.h, as more different
  subsystems tend to need to know about it, and including if_var.h is
  huge header pollution for them.  Polluting possible non-network
  users with single symbol seems much lesser evil.
- Remove non-preemptible network epoch.  Not used yet, and unlikely
  to get used in close future.
2020-01-15 03:34:21 +00:00
Mateusz Guzik
cda3176851 vfs: in vop_stdadd_writecount only vlazy vnodes on mounts using msync
The only reason to vlazy there is to (overzealously) ensure all vnodes
which need to be visited by msync scan can be found there.

In particular this is of no use for zfs and tmpfs.

While here depessimize the check.
2020-01-15 01:34:05 +00:00
Ryan Libby
51871224c0 malloc: remove assumptions about MINALLOCSIZE
Remove assumptions about the minimum MINALLOCSIZE, in order to allow
testing of smaller MINALLOCSIZE.  A following patch will lower the
MINALLOCSIZE, but not so much that the present patch is required for
correctness at these sites.

Reviewed by:	jeff, markj
Sponsored by:	Dell EMC Isilon
2020-01-14 02:14:02 +00:00
Konstantin Belousov
fedab1b499 Code must not unlock a mutex while owning the thread lock.
Reviewed by:	hselasky, markj
Sponsored by:	Mellanox Technologies
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D23150
2020-01-13 14:30:19 +00:00
Mateusz Guzik
0c236d3d52 vfs: per-cpu batched requeuing of free vnodes
Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.

Per-cpu areas are locked in order to synchronize against UMA freeing
memory.

vnode's v_mflag is converted to short to prevent the struct from
growing.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock:   122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22998
2020-01-13 02:39:41 +00:00
Mateusz Guzik
cc3593fbd9 vfs: rework vnode list management
The current notion of an active vnode is eliminated.

Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.

Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock:   118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22997
2020-01-13 02:37:25 +00:00
Mateusz Guzik
57083d2576 vfs: add per-mount vnode lazy list and use it for deferred inactive + msync
This obviates the need to scan the entire active list looking for vnodes
of interest.

msync is handled by adding all vnodes with write count to the lazy list.

deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.

Vnodes get dequeued from the list when their hold count reaches 0.

Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22995
2020-01-13 02:34:02 +00:00
Conrad Meyer
365cd52245 Fix a typo in r356667 comment
No functional change.

Reported by:	bdragon
Approved by:	csprng(markm), earlier version
X-MFC-With:	r356667
2020-01-12 23:52:16 +00:00
Conrad Meyer
86def3dcd6 getrandom(2): Add Linux GRND_INSECURE API flag
Treat it as a synonym for GRND_NONBLOCK.  The reasoning is this:

We have two choices for handling Linux's GRND_INSECURE API flag.

1. We could ignore it completely (like GRND_RANDOM).  However, this might
produce the surprising result of GRND_INSECURE requests blocking, when the
Linux API does not block.

2. Alternatively, we could treat GRND_INSECURE requests as requests for
GRND_NONBLOCK.  Here, the surprising result for Linux programs is that
invocations with unseeded random(4) will produce EAGAIN, rather than
garbage.

Honoring the flag in the way Linux does seems fraught.  If we actually use
the output of a random(4) implementation prior to seeding, we leak some
entropy (in an information theory and also practical sense) from what will
be the initial seed to attackers (or allow attackers to arbitrarily DoS
initial seeding, if we don't leak).  This seems unacceptable -- it defeats
the purpose of blocking on initial seeding.

Secondary to that concern, before seeding we may have arbitrarily little
entropy collected; producing output from zero or a handful of entropy bits
does not seem particularly useful to userspace.

If userspace can accept garbage, insecure, non-random bytes, they can create
their own insecure garbage with srandom(time(NULL)) or similar.  Any program
which would be satisfied with a 3-bit key CTR stream has no need for CSPRNG
bytes.  So asking the kernel to produce such an output from the secure
getrandom(2) API seems inane.

For now, we've elected to emulate GRND_INSECURE as an alternative spelling
of GRND_NONBLOCK (2).  Consider this API not-quite stable for now.  We
guarantee it will never block.  But we will attempt to monitor actual port
uptake of this bizarre API and may revise our plans for the unseeded
behavior (prior to stable/13 branching).

Approved by:	csprng(markm), manpages(bcr)
See also:	https://lwn.net/ml/linux-kernel/cover.1577088521.git.luto@kernel.org/
See also:	https://lwn.net/ml/linux-kernel/20200107204400.GH3619@mit.edu/
Differential Revision:	https://reviews.freebsd.org/D23130
2020-01-12 20:47:38 +00:00
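From the caller's side, the behavior chosen above means a non-blocking request on an unseeded random(4) fails with EAGAIN instead of returning weak bytes. A minimal userland sketch of handling that case follows; the wrapper name `toy_try_random` is hypothetical, and the sketch passes GRND_NONBLOCK because the commit treats GRND_INSECURE as a synonym for it and older headers may not define GRND_INSECURE.

```c
#include <sys/types.h>
#include <sys/random.h>
#include <errno.h>
#include <stddef.h>

/*
 * Try to fill buf without ever blocking.
 * Returns the number of bytes written, 0 if the generator is not yet
 * seeded (EAGAIN), or -1 on any other error with errno left set.
 */
static ssize_t
toy_try_random(void *buf, size_t len)
{
	ssize_t n = getrandom(buf, len, GRND_NONBLOCK);

	if (n == -1 && errno == EAGAIN)
		return (0);	/* unseeded: caller decides whether to retry */
	return (n);
}
```

A caller that can tolerate waiting can simply retry later or fall back to a blocking call with flags set to 0.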
Edward Tomasz Napierala
ca603bb1ee Add kern_getpriority(), make Linuxulator use it.
Reviewed by:	kib, emaste
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22842
2020-01-12 14:25:44 +00:00
Edward Tomasz Napierala
7a0ef283e6 Add kern_setpriority(), use it in Linuxulator.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22841
2020-01-12 13:38:51 +00:00
Mateusz Guzik
d199ad3b44 Add "panicked" boolean which can be tested instead of panicstr
The test is performed all the time and reading entire panicstr to do it
wastes space.
2020-01-12 06:09:10 +00:00
Mateusz Guzik
879e0604ee Add KERNEL_PANICKED macro for use in place of direct panicstr tests
2020-01-12 06:07:54 +00:00
Mateusz Guzik
91de98e6d4 vfs: only recalculate watermarks when limits are changing
Previously they would get recalculated all the time, in particular in:
getnewvnode -> vcheckspace -> vspace
2020-01-11 23:00:57 +00:00
Mateusz Guzik
e6ae744e0e vfs: deduplicate vnode allocation logic
This creates a dedicated routine (vn_alloc) to allocate vnodes.

As a side effect code duplication with getnewvnode_reserve is eliminated.

Add vn_free for symmetry.
2020-01-11 22:59:44 +00:00
Mateusz Guzik
b52d50cf69 vfs: prealloc vnodes in getnewvnode_reserve
Having a reserved vnode count does not guarantee that getnewvnode won't
block later. Said blocking partially defeats the purpose of reserving in
the first place.

Preallocate instead. The only consumer was always passing "1" as count
and never nesting reservations.
2020-01-11 22:58:14 +00:00
Mateusz Guzik
6928306764 vfs: incomplete pass at converting more ints to u_long
Most notably numvnodes and freevnodes were u_long, but parameters used to
govern them remained as ints.
2020-01-11 22:56:20 +00:00
Mateusz Guzik
bf62296f35 vfs: add missing CTLFLAG_MPSAFE annotations
This covers all kern/vfs_*.c files.
2020-01-11 22:55:12 +00:00
Kyle Evans
1171c633fb Set .ORDER for makesyscalls generated files
When either makesyscalls.lua or syscalls.master changes, all of the
${GENERATED} targets are now out-of-date. With make jobs > 1, this means we
will run the makesyscalls script in parallel for the same ABI, generating
the same set of output files.

Prior to r356603, there was a large window for interlacing output for some
of the generated files that we were generating in-place rather than staging
in a temp dir. After that, we still shouldn't need to run the script more
than once per-ABI as the first invocation should update all of them. Add
.ORDER to do so cleanly.

Reviewed by:	brooks
Discussed with:	sjg
Differential Revision:	https://reviews.freebsd.org/D23099
2020-01-10 18:24:17 +00:00
Mark Johnston
dc727127f1 Change malloc_domain() to return the allocation size to the caller.
Otherwise the malloc type accounting in malloc_domainset(9) is wrong
after r355203.

Reviewed by:	rlibby
Reported by:	kaktus
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23095
2020-01-09 15:02:48 +00:00
Kyle Evans
6a38cd3a54 kern/Makefile: systrace_args.c is also generated
2020-01-09 06:10:25 +00:00
Kyle Evans
39eae263cd shmfd: posix_fallocate(2): only take rangelock for section we need
Other mechanisms that resize the shmfd grab a write lock from 0 to OFF_MAX
for safety, so we still get proper synchronization of shmfd->shm_size in
effect. There's no need to block readers/writers of earlier segments when
we're just reserving more space, so narrow the scope -- it would likely be
safe to narrow it completely to just the section of the range that extends
beyond our current size, but this likely isn't worth it since the size isn't
stable until the writelock is granted the first time.

Suggested by:	cem (passing comment)
2020-01-09 04:03:17 +00:00
Kyle Evans
f10405323a posixshm: implement posix_fallocate(2)
Linux expects to be able to use posix_fallocate(2) on a memfd. Other places
would use this with shm_open(2) to act as a smarter ftruncate(2).

Test has been added to go along with this.

Reviewed by:	kib (earlier version)
Differential Revision:	https://reviews.freebsd.org/D23042
2020-01-08 19:08:44 +00:00