- Change vm_page_reclaim_contig[_domain] to return an errno instead
of a boolean. 0 indicates a successful reclaim, ENOMEM indicates
lack of available memory to reclaim, with any other error (currently
only ERANGE) indicating that reclamation is impossible for the
specified address range. Change all callers to only follow
up with vm_page_wait* in the ENOMEM case.
- Introduce vm_domainset_iter_ignore(), which marks the specified
domain as unavailable for further use by the iterator. Use this
function to ignore domains that can't possibly satisfy a physical
allocation request. Since WAITOK allocations run the iterators
repeatedly, this avoids the possibility of infinitely spinning
in domain iteration if no available domain can satisfy the
allocation request.
PR: 274252
Reported by: kevans
Tested by: kevans
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D42706
(cherry picked from commit 2619c5ccfe1f7889f0241916bd17d06340142b05)
MFCed as a prerequisite for further MFC of VM domainset changes. Based
on analysis, it would not hurt, and I have been using it in production
for months now.
Resolved the trivial conflict due to commit 718d1928f874 ("LinuxKPI:
make linux_alloc_pages() honor __GFP_NORETRY") having been MFCed before
this one.
Suppose a vnode is mapped with PROT_READ and MAP_PRIVATE, mlock() is
called on the mapping, and then the vnode is truncated such that the
last page of the mapping becomes invalid. The now-invalid page will be
unmapped, but stays resident in the VM object to preserve the invariant
that a range of pages mapped by a wired map entry is always resident.
This invariant is checked by vm_object_unwire(), for example.
Then, suppose that the mapping is upgraded to PROT_READ|PROT_WRITE. We
will copy the invalid page into a new anonymous VM object. If the
process then forks, vm_object_split() may then be called on the object.
Upon encountering an invalid page, vm_object_split() removes it rather
than moving it into the destination object. However, this is wrong when the
entry is wired, since the invalid page's wiring belongs to the map
entry; this behaviour also violates the invariant mentioned above.
Fix this by moving invalid pages into the destination object if the map
entry is wired. In this case we must not dirty the page, so add a flag
to vm_page_iter_rename() to control this.
Reported by: syzkaller
Reviewed by: dougm, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D49443
(cherry picked from commit 43c1eb894a57ef30562a02708445c512610d4f02)
Change the comment before this block of code, and separate the latter
from the preceding one by an empty line.
Move the loop on phys_avail[] to compute the minimum and maximum memory
physical addresses closer to the initialization of 'low_avail' and
'high_avail', so that it's immediately clear why the loop starts at
2 (and remove the related comment).
While here, fuse the additional loop in the VM_PHYSSEG_DENSE case that
is used to compute the exact physical memory size.
This change suppresses one occurrence of detecting whether at least one
of VM_PHYSSEG_DENSE or VM_PHYSSEG_SPARSE is defined at compile time, but
there is still another one in PHYS_TO_VM_PAGE().
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D48632
(cherry picked from commit 16317a174a5288f0377f8d40421b5c7821d57ac2)
Previously, such requests would lead to a panic. The only caller so far
(vm_phys_early_startup()) actually faces the case where some address can
be one of the chunk's boundaries and has to test it by hand. Moreover,
a later commit will introduce vm_phys_early_alloc_ex(), which will also
have to deal with such boundary cases.
Consequently, make this function handle boundaries by not splitting the
chunk and returning EJUSTRETURN instead of 0 to distinguish this case
from the "was split" result.
While here, expand the panic message when the address to split is not in
the passed chunk with available details.
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D48630
(cherry picked from commit e1499bfff8b8c128d7b3d330f95e0c67d7c1fa77)
On improper termination of phys_avail[] (two consecutive 0 entries
starting at an even index), this function would unnecessarily continue
searching for the termination markers even after the index went out of
bounds.
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D48629
(cherry picked from commit 291b7bf071e8b50f2b7877213b2d3307ae5d3e38)
Segments are passed by machine-dependent routines, so explicit checks
will make debugging much easier on very weird machines or when someone
is tweaking these machine-dependent routines. Additionally, this
operation is not performance-sensitive.
For the same reasons, test that we don't reach the maximum number of
physical segments (the compile-time size of the internal storage) in
production kernels (replaces the existing KASSERT()).
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D48628
(cherry picked from commit 8a14ddcc1d8e4384d8ad77c5536c916c6e9a7d65)
A specification is bad if 'start' is strictly greater than 'end', or if
the bounds are not page-aligned.
The latter was already tested under INVARIANTS, but will now also be
tested on production kernels. The reason is that vm_phys_early_startup()
pours
early segments into the final phys_segs[] array via vm_phys_add_seg(),
but vm_phys_early_add_seg() did not check their validity. Checking
segments once and for all in vm_phys_add_seg() avoids duplicating
validity tests and is possible since early segments are not used before
being poured into phys_segs[]. Finally, vm_phys_add_seg() is not
performance critical.
Allow empty segments and discard them (silently, unless 'bootverbose' is
true): vm_page_startup() was already testing for this case before
calling vm_phys_add_seg(), and the same test seemed warranted in
vm_phys_early_startup(). As a consequence, remove the empty-segment
test from vm_page_startup().
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D48627
(cherry picked from commit f30309abcce4cec891413da5cba2db92dd6ab0d7)
The passed index must be the start of a chunk in phys_avail[], so must
be even. Test for that and print a separate panic message.
While here, fix panic messages: In one, the wrong chunk boundary was
printed, and in another, the desired but not the actual condition was
printed, possibly leading to confusion.
Reviewed by: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D48626
(cherry picked from commit 125ef4e041fed40fed2d00b0ddd90fa0eb7b6ac3)
After commit 389a3fa693, uma_reclaim_domain(UMA_RECLAIM_DRAIN_CPU)
calls uma_zone_reclaim_domain(UMA_RECLAIM_DRAIN_CPU) twice on each zone
in addition to globally draining per-CPU caches. This was unintended
and is unnecessarily slow; in particular, draining per-CPU caches
requires binding to each CPU.
Stop draining per-CPU caches when visiting each zone, just do it once in
pcpu_cache_drain_safe() to minimize the amount of expensive sched_bind()
calls.
Fixes: 389a3fa693 ("uma: Add UMA_ZONE_UNMANAGED")
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: NetApp, Inc.
Reviewed by: gallatin, kib
Differential Revision: https://reviews.freebsd.org/D49349
(cherry picked from commit f506d5af50fccc37f5aa9fe090e9a0d5f05506c8)
fork() may allocate a new thread in one of two ways: from UMA, or cached
in a freed proc that was just allocated from UMA. In either case, KASAN
and KMSAN need to initialize some state; in particular they need to
initialize the shadow mapping of the new thread's stack.
This is done differently between KASAN and KMSAN, which is confusing.
This patch improves things a bit:
- Add a new thread_recycle() function, which moves all kernel stack
handling out of kern_fork.c, since it doesn't really belong there.
- Then, thread_alloc_stack() has only one local caller, so just inline
it.
- Avoid redundant shadow stack initialization: thread_alloc()
initializes the KMSAN shadow stack (via kmsan_thread_alloc()) even
though vm_thread_new() already did that.
- Add kasan_thread_alloc(), for consistency with kmsan_thread_alloc().
No functional change intended.
Reviewed by: khng
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44891
(cherry picked from commit 800da341bc4a35f4b4d82d104b130825d9a42ffa)
Right now we have the vm.pageout_cpus_per_thread tunable which controls
the number of threads to start up per CPU per NUMA domain, but after
booting, it's not possible to disable multi-threaded scanning.
There is at least one workload where this mechanism doesn't work well;
let's make it possible to disable it without a reboot, to simplify
troubleshooting.
Reviewed by: dougm, kib
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D48377
(cherry picked from commit 55b343f4f9bc586eba5e26a2524a35f04dd60c65)
Readahead/behind pages are handled by the swap pager, but the get_pages
caller is responsible for putting fetched pages into queues (or wiring
them beforehand).
Note that the VM object lock prevents the newly queued page from being
immediately reclaimed in the window before it is marked dirty by
swap_pager_swapoff_object().
Reported by: pho
Tested by: pho
Reviewed by: dougm, alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D47526
(cherry picked from commit d11d407aee4835fd50811a5980125bb46748fa0b)
Pages in PQ_UNSWAPPABLE should be considered part of the laundry.
Otherwise, on systems with no swap, the total amount of memory visible
to tools like top(1) decreases.
It doesn't seem very useful to have a dedicated counter for unswappable
pages, and updating applications accordingly would be painful, so just
lump them in with laundry for now.
PR: 280846
Reviewed by: bnovkov, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D47216
(cherry picked from commit 6a07e67fb7a8b5687a492d9d70a10651d5933ff5)
When releasing a page reference, we have logic for various cases, based
on the value of the counter. But, the implementation fails to take into
account the possibility that the VPRC_BLOCKED flag is set, which is ORed
into the counter for short windows when removing mappings of a page. If
the flag is set while the last reference is being released, we may fail
to add the page to a page queue when the last wiring reference is
released.
Fix the problem by performing comparisons with VPRC_BLOCKED masked off.
While here, add a related assertion.
Reviewed by: dougm, kib
Tested by: pho
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D46944
(cherry picked from commit c59166e5b4e8821556a3d23af7bd17ca556f2e22)
Take advantage of a nearby 2-byte hole to avoid growing the struct.
This way, only the offsets of "flags" and "pg_color" change. Bump
__FreeBSD_version since some out-of-tree kernel modules may access these
fields, though I haven't found any examples so far.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35905
(cherry picked from commit 9d52823bf1dfac237e58b5208299aaa5e2df42e9)
Make sure that the compiler loads the initial value only once.
Because atomic_fcmpset is used to load the value for subsequent
iterations, this is probably not needed, but we should not rely on that.
I verified that code generated for an amd64 GENERIC kernel does not
change.
Reviewed by: dougm, alc, kib
Tested by: pho
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D46943
(cherry picked from commit d8b32da2354d2fd72ae017fd63affa3684786e1f)
Add a function like kva_alloc that allows us to specify the alignment
of the virtual address space returned.
Reviewed by: alc, kib, markj
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D42788
(cherry picked from commit 839999e7efdc980d5ada92ea93719c7e29765809)
The kernel_arena used in kva_alloc has the qcache disabled. vmem_alloc
will first try to use the qcache before falling back to vmem_xalloc.
Rather than trying to use the qcache in vmem_alloc just call
vmem_xalloc directly.
Reviewed by: alc, kib, markj
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D42831
(cherry picked from commit 8daee410d2c13b4e8530b00e7877eeecf30bb064)
Fixes: bec000c9c1ef409989685bb03ff0532907befb4a
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 9c5d7e4a0c02bc45b61f565586da2abcc65d70fa)
Turn the TLB shootdown function into a pointer. By default, it still
points to the system function smp_targeted_tlb_shootdown(). This allows
other implementations to override it in the future.
Reviewed by: kib
Tested by: whu
Authored-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Co-Authored-by: Erni Sri Satya Vennela <ernis@microsoft.com>
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D45174
(cherry picked from commit bec000c9c1ef409989685bb03ff0532907befb4a)
I cannot find a time when the function was not named this.
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45383
(cherry picked from commit deab57178f0b06eab56d7811674176985a8ea98d)
The swap pager itself allocates readahead pages, so should take care to
unbusy them after a read error, just as it does in the non-error case.
PR: 277538
Reviewed by: olce, dougm, alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44646
(cherry picked from commit 4696650782e2e5cf7ae5823f1de04550c05b5b75)