Replace several sequential searches for a segment that contains a
phyiscal address with a call to a function that does it by binary
search. In vm_page_reclaim_contig_domain_ext, find the first segment
to reclaim from, and reclaim from each subsequent appropriate segment.
Eliminate vm_phys_scan_contig.
Reviewed by: alc, markj
Differential Revision: https://reviews.freebsd.org/D40058
This is in keeping with the trend of removing uses of boolean_t, and the
sole caller was implicitly converting it to a "bool".
No functional change intended.
Reviewed by: dougm, alc, imp, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40401
Implement vm_page_reclaim_contig_domain_ext() to reclaim multiple
contiguous regions at once. This makes it more efficient for users
that need multiple contiguous regions to reclaim those regions
efficiently.
This is needed because callers like ktls may need to reclaim many
contiguous regions, and each scan of physical memory can take
multiple seconds on a large memory machine (order of 100GB of
RMA). Rather than modifying the core algorithm, I extended
vm_page_reclaim_contig_domain() to take a "desired_runs" argument to
allow the caller to request that it reclaim more than just a single
run. There is no functional change intended for all existing
callers.
The first user for this interface is the ktls code
(https://reviews.freebsd.org/D39421). By reclaiming multiple runs,
ktls goes from consuming hours of CPU to refill its buffer zone to
just seconds or minutes.
Differential Revision: https://reviews.freebsd.org/D39739
Sponsored by: Netflix
Reviewed by: alc, jhb, markj
Same as it is done in other error return cases. Callers depend on error
case returning NULL, e.g. vm_imgact_hold_page().
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37719
Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for for our busy sendfile() based web workload.
It also greadly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.
So that the system does not loose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.
This has been run for several months on a busy Netflix server, as well
as on my personal desktop.
Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37305
markj says:
...the assertion is incorrect and should simply be removed.
It has been racy since we removed the use of the page hash
lock to synchronize wiring of pages.
PR: 267621
Reviewed by: markj, Anton Rang <rang@acm.org>
MFC after: 1 week
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D37320
Use it and several other vm_page_*_valid() functions in more places.
Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37024
As an optimization, vm_page_activate() avoids requeuing a page that's
already in the active queue. A page's location in the active queue is
mostly unimportant.
When a page is unwired and placed back in the page queues,
vm_page_unwire() avoids moving pages out of PQ_ACTIVE to honour the
request, the idea being that they're likely mapped and so will simply
get bounced back in to PQ_ACTIVE during a queue scan.
In both cases, if the page was logically in PQ_ACTIVE but had not yet
been physically enqueued (i.e., the page is in a per-CPU batch), we
would end up clearing PGA_REQUEUE from the page. Then, batch processing
would ignore the page, so it would end up unwired and not in any queues.
This can arise, for example, when a page is allocated and then
vm_page_activate() is called multiple times in quick succession. The
result is that the page is hidden from the page daemon, so while it will
be freed when its VM object is destroyed, it cannot be reclaimed under
memory pressure.
Fix the bug: when checking if a page is in PQ_ACTIVE, only perform the
optimization if the page is physically enqueued.
PR: 256507
Fixes: f3f38e2580 ("Start implementing queue state updates using fcmpset loops.")
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: E-CARD Ltd.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36839
This reverts commit f6ffed44a8.
fetchadd will fail the waiters flag, which can cause other
code to wait when it should not with nothing clear it
Revert until I sort this out.
Reported by: markj
This also fixes a bug where not-last unbusy failed to post a release
fence.
Reviewed by: markj (previous version), kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36084
This is not completely exhaustive, but covers a large majority of
commands in the tree.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583
Now that OBJT_DEFAULT objects can't be instantiated, we can simplify
checks of the form object->type == OBJT_DEFAULT || (object->flags &
OBJ_SWAP) != 0. No functional change intended.
Reviewed by: alc, kib
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35788
The assertion was added in commit 1771e987ca. After that, vm_wait()
and friends were refactored such that the actual sleep happens
elsewhere. Now the assertion condition is not checked when
vm_wait_doms() is called directly, and it is checked even if we are not
going to sleep (because vm_page_count_min_set(wdoms) is false).
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34909
Define simple functions for alignment and boundary checks and use them
everywhere instead of having slightly different implementations
scattered about. Define them in vm_extern.h and use them where
possible where vm_extern.h is included.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D33685
fb38b29b56 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc
should have moved a comment block when it moved the function call that followed it.
Move the comment block now.
Function vm_reserv_reclaim_contig breaks a reservation with enough
free space to satisfy an allocation request and returns the free space
to the buddy allocator. Change the function to allocate the request
memory from the reservation before breaking it, and return that memory
to the caller. That avoids a second call to the buddy allocator and
guarantees successful allocation after breaking the reservation, where
that success is not currently guaranteed.
Reviewed by: alc, kib (previous version)
Differential Revision: https://reviews.freebsd.org/D33644
Fix a very recent change that introduced a page accounting error in
case of a reserveration being broken.
Reviewed by: alc
Fixes: fb38b29b56 (page_alloc_br) vm_page: Remove extra test, dup code from page alloc
Differential Revision: https://reviews.freebsd.org/D33645
Extract code common to functions vm_page_alloc_contig_domain and
vm_page_alloc_noobj_contig_domain into a new function. Do so in a way
that eliminates a bound-to-fail reservation test after a reservation
is broken by a call from vm_page_alloc_contig_domain.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33551
A page must not become invalid while vm_fault_soft_fast() is attempting
to map unbusied pages for reading.
Note that all callers hold the object write lock already, and
vm_page_set_invalid() asserts the object write lock.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33250
- Modify vm_page_busy_sleep() and vm_page_busy_sleep_unlocked() to take
a VM_ALLOC_* flag indicating whether to sleep on shared-busy, and fix
up callers.
- Modify vm_page_busy_sleep() to return a status indicating whether the
object lock was dropped, and fix up callers.
- Convert callers of vm_page_sleep_if_busy() to use vm_page_busy_sleep()
instead.
- Remove vm_page_sleep_if_(x)busy().
No functional change intended.
Obtained from: jeff (object_concurrency patches)
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D32947
We added _NORECLAIM to request that kmem_alloc_contig_pages() not spend
time scanning physical memory for candidates to reclaim. In some
situations the scanning can induce large amounts of undesirable latency,
and it's less important that the request be satisfied than it is that we
not spend many milliseconds scanning.
The problem extends to vm_reserv_reclaim_contig(), which unlike
vm_reserv_reclaim() may have to scan the entire list of partially
populated reservations. Use VM_ALLOC_NORECLAIM to request that this
scan not be executed.[1]
As a side effect, this fixes a regression in 02fb0585e7 ("vm_page:
Drop handling of VM_ALLOC_NOOBJ in vm_page_alloc_contig_domain()")
where VM_ALLOC_CONTIG was not included in VPAC_FLAGS or VPANC_FLAGS even
though it is not masked by kmem_alloc_contig_pages().[2]
Reported by: gallatin [1], glebius [2]
Reviewed by: alc, glebius, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32899
vm_reserv_reclaim_*() will release pages to the default freepool, not
the direct freepool from which noobj allocations are drawn. But if both
pools are empty, the noobj allocator variants must break reservations to
make progress.
Reported by: cy
Reviewed by: kib (previous version)
Fixes: b498f71bc5 ("vm_page: Add a new page allocator interface for unnamed pages")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32592
As in vm_page_alloc_domain_after(), unconditionally preserve PG_ZERO.
Implement vm_page_alloc_noobj_contig_domain().
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32034
This makes the allocator simpler since it can assume object != NULL.
Also modify the function to unconditionally preserve PG_ZERO, so
VM_ALLOC_ZERO is effectively ignored (and still must be implemented by
the caller for now).
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32033
This is the same as vm_page_alloc_noobj(), but allocates physically
contiguous runs of memory. For now it is implemented in terms of
vm_page_alloc_contig(), with the difference that
vm_page_alloc_noobj_contig() implements VM_ALLOC_ZERO by zeroing the
page.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32005
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.
Similarly, convert vm_page_alloc_domain() callers.
Note that callers are now responsible for assigning the pindex.
Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986
The diff adds vm_page_alloc_noobj() and vm_page_alloc_noobj_domain().
These mostly correspond to vm_page_alloc() and vm_page_alloc_domain()
when no VM object is specified, with the exception that they handle
VM_ALLOC_ZERO by zeroing the page, rather than by preserving PG_ZERO.
This simplifies callers and will permit simplification of the
vm_page_alloc_domain() definition.
Since the new allocator variant is similar to vm_page_alloc_freelist(),
implement both of them using a common backend allocator function. No
functional change intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31985
If phys_avail[] segment only intersect with some vm_phys segment, add
pages from it to the free list that belong to the given vm_phys_seg,
instead of dropping them.
The vm_phys segments are generally result of subdivision of phys_avail
segments, for instance DMA32 or LOWMEM boundaries split them. On
amd64, after UEFI in-place kernel activation (copy_staging disable)
was enabled, we typically have a large phys_avail[] segment below 4G
which crosses LOWMEM (1M) boundary. With the current way of requiring
phys_avail[] fully fit into vm_phys_seg, this memory was ignored.
Reported by: madpilot
Reviewed by: markj
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31958
In fee2a2fa39 the KASSERTs in
vm_page_unwire_noq() changed from "vm_page_unwire" to "vm_page_unref".
While the former no longer was part of that function the latter does
not exist as a function and is highly confusing when hit when using
tools to lookup the functions and not doing a full-text search.
Use %s __func__ for printing the function name, as that will do the
right thing as code moves around and functions get renamed.
Hit: while debugging a wired page leak with linuxkpi/iwlwifi
Sponsored by: The FreeBSD Foundation
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D31635
which is the place to put MD asserts about allocated pages.
On amd64, verify that allocated page does not belong to the kernel
(text, data) or early allocated pages.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31121
This is OBJT_SWAP pager, specialized for tmpfs. Right now, both swap pager
and generic vm code have to explicitly handle swap objects which are tmpfs
vnode v_object, in the special ways. Replace (almost) all such places with
proper methods.
Since VM still needs a notion of the 'swap object', regardless of its
use, add yet another type-classification flag OBJ_SWAP. Set it in
vm_object_allocate() where other type-class flags are set.
This change almost completely eliminates the knowledge of tmpfs from VM,
and opens a way to make OBJT_SWAP_TMPFS loadable from tmpfs.ko.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30070
When searching for runs to reclaim, we need to ensure that the entire
run will be added to the buddy allocator as a single unit. Otherwise,
it will not be visible to vm_phys_alloc_contig() as it is currently
implemented. This is a problem for allocation requests that are not a
power of 2 in size, as with 9KB jumbo mbuf clusters.
Reported by: alc
Reviewed by: alc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28924
This KASSERT is overzealous because of the following race condition:
1) A managed page which is currently in PQ_LAUNDRY is freed.
vm_page_free_prep calls vm_page_dequeue_deferred()
The page state is:
PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
2) The laundry worker comes around and pick up the page and calls
vm_pageout_defer(m, PQ_LAUNDRY, true) to check if page is still in the
queue. We do a vm_page_astate_load and get
PQ_LAUNDRY, PGA_DEQUEUE|PGA_ENQUEUED
as per above.
3) The laundry worker is pre-empted and another thread allocates our page
from the free pool. For example vm_page_alloc_domain_after calls
vm_page_dequeue() and sets VPO_UNMANAGED because we are allocating for
an OBJT_UNMANAGED object.
The page state is:
PQ_NONE, 0 - VPO_UNMANAGED
4) The laundry worker resumes, and processes vm_pageout_defer based on the
stale astate which leads to a call to vm_page_pqbatch_submit, which will
trip on the KASSERT.
Submitted by: mlaier
Reviewed by: markj, rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28563
This flag indicates that the page should be enqueued near the head of
the inactive queue, skipping the LRU queue. It is used when unwiring
pages from the buffer cache following direct I/O or after I/O when
POSIX_FADV_NOREUSE or _DONTNEED advice was specified, or when
sendfile(SF_NOCACHE) completes. For the direct I/O and sendfile cases
we only enqueue the page if we decide not to free it, typically because
it's mapped.
Pass "noreuse" through to vm_page_release_toq() so that we actually
honour the desired LRU policy for these scenarios.
Reported by: bdrewery
Reviewed by: alc, kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28555
In vm_page_busy_acquire(), load the object pointer using
atomic_load_ptr() as we do elsewhere. Per the comment, the object
identity must be consistent across sleeps.
In vm_page_grab_sleep(), pass the correct pindex to
_vm_page_busy_sleep(). The pindex is used to re-check the page's
identity before going to sleep. In particular, vm_page_grab_sleep() is
used in unlocked grab, so the object lock is not necessarily held when
verifying the page's identity, and the pindex may change if the page is
moved, or freed and re-allocated. I believe this can result in spurious
VM_PAGER_FAILs from vm_page_grab_valid_unlocked() or early termination
of vm_page_grab_pages_unlocked().
In vm_page_grab_pages(), pass the correct pindex to
vm_page_grab_sleep(). Otherwise I believe vm_page_grab_pages() will
effectively spin when attempting to busy a busy page after the first
index in the range.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27607
It can useful for code outside the VM system to look up the NUMA domain
of a page backing a virtual or physical address, specifically when
creating NUMA-aware data structures. We have _vm_phys_domain() for
this, but the leading underscore implies that it's an internal function,
and vm_phys.h has dependencies on a number of other headers.
Rename vm_phys_domain() to vm_page_domain(), and _vm_phys_domain() to
vm_phys_domain(). Make the latter an inline function.
Add _vm_phys.h and define struct vm_phys_seg there so that it's easier
to use in other headers. Include it from vm_page.h so that
vm_page_domain() can be defined there.
Include machine/vmparam.h from _vm_phys.h since it depends directly on
some constants defined there.
Reviewed by: alc
Reviewed by: dougm, kib (earlier versions)
Differential Revision: https://reviews.freebsd.org/D27207
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h. This fixes default gcc build for mips.
Reviewed by: alc, scottph
Tested by: kevans (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26741
On Ampere Altra systems, the sparse population of RAM within the
physical address space causes the vm_page_dump bitmap to be much
larger than necessary, increasing the size from ~8 Mib to > 2 Gib
(and overflowing `int` for the size).
Changing the page dump bitmap also changes the minidump file
format, so changes are also necessary in libkvm.
Reviewed by: jhb
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26131