In particular, vnode must be exclusively locked when the tmpfs vnode
and object are divorced. When the vnode is opened, the object must be
still alive, since only live vnode can be opened, and the tmpfs node
owns a reference on the object.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
a vnode until it is verified that the vnode indeed belongs to tmpfs
mount. Otherwise, it might access random memory, at least in the
debug kernel.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
permissions test, forgotten in r164033.
Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once. Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).
Reported by: bde
Reviewed by: bde, trasz
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
lookup cookies to be less obscure.
No functional change.
Since r245115, cnt has not really been needed in tmpfs_dir_getdents(). Keep
it for the MPASS() for now though.
Sponsored by: EMC / Isilon Storage Division
MFC after: 2 weeks
UF_SYSTEM, UF_SPARSE, UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY,
and UF_HIDDEN.
Sort the file flags tmpfs supports alphabetically. tmpfs now
supports the same flags as UFS, with the exception of SF_SNAPSHOT.
Reported by: bdrewery, antoine
Sponsored by: Spectra Logic
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.
Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag
The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl
in particular, from the tmpfs_lookup VOP method. If LK_NOWAIT is not
specified in the lkflags, the lookup is supposed to return an alive
vnode whenever the underlying node is valid.
Currently, the tmpfs_alloc_vp() returns ENOENT if the vnode attached
to node exists and is being reclaimed. This causes spurious ENOENT
errors from lookup on tmpfs and corresponding random 'No such file'
failures from syscalls working with tmpfs files.
Fix this by waiting for the doomed vnode to be detached from the tmpfs
node if sleepable allocation is requested.
Note that filesystems which use vfs_hash.c, correctly handle the case
due to vfs_hash_get() looping when vget() returns ENOENT for sleepable
requests.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.
Before this patch is reinserted we need to break this ordering.
Sponsored by: EMC / Isilon storage division
Reported by: kib
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages
Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).
After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).
Fixing such primitive can bring to complete removal of the page hold
mechanism.
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho
kern_sendfile() which is unnecessary.
The page is already wired so it will not be subjected to pagefault.
The content cannot be effectively protected as it is full of races
already.
Multiple accesses to the same indexes are serialized through vn_rdwr().
Sponsored by: EMC / Isilon storage division
Reviewed by: alc, jeff
Tested by: pho
the page. This both reduces the number of queues locking and avoids
moving the active page to inactive list just because the page was read
or written.
Based on the suggestion by: alc
Reviewed by: alc
Tested by: pho
insmntque() is called. The standard insmntque destructor resets the
vop vector to deadfs one, and calls vgone() on the vnode. As result,
v_object is kept unchanged, which triggers an assertion in the reclaim
code, on instmntque() failure. Also, in this case, OBJ_TMPFS flag on
the backed vm object is not cleared.
Provide the tmpfs insmntque() destructor which properly clears
OBJ_TMPFS flag and resets v_object.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
vnode v_object to avoid double-buffering. Use the same object both as
the backing store for tmpfs node and as the v_object.
Besides reducing memory use up to 2x times for situation of mapping
files from tmpfs, it also makes tmpfs read and write operations copy
twice bytes less.
VM subsystem was already slightly adapted to tolerate OBJT_SWAP object
as v_object. Now the vm_object_deallocate() is modified to not
reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle
VV_TEXT flag on the last dereference of the tmpfs backing object.
Reviewed by: alc
Tested by: pho, bf
MFC after: 1 month
u_long. Before this change it was of type int for syscalls, but prototypes
in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
for lchflags(2)) stated that it was u_long. Now some related functions
use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.
Discussed on: arch
Sponsored by: The FreeBSD Foundation
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.
The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
tmpfs_mapped{read, write}() functions:
- tmpfs_mapped{read, write}() are only called within VOP_{READ, WRITE}(),
which check before-hand to work only on valid VREG vnodes. Also the
vnode is locked for the duration of the work, making vnode reclaiming
impossible, during the operation. Hence, vobj can never be NULL.
- Currently check on resident pages and cached pages without vm object
lock held is racy and can do even more harm than good, as a page could
be transitioning between these 2 pools and then be skipped entirely.
Skip the checks as lookups on empty splay trees are very cheap.
Discussed with: alc
Tested by: flo
MFC after: 2 weeks
Use file name hash as a tree key, handle duplicate keys. Both VOP_LOOKUP
and VOP_READDIR operations utilize same tree for search. Directory
entry offset (cookie) is either file name hash or incremental id in case
of hash collisions (duplicate-cookies). Keep sorted per directory list
of duplicate-cookie entries to facilitate cookie number allocation.
Don't fail if previous VOP_READDIR() offset is no longer valid, start
with next dirent instead. Other file system handle it similarly.
Workaround race prone tn_readdir_last[pn] fields update.
Add tmpfs_dir_destroy() to free all dirents.
Set NFS cookies in tmpfs_dir_getdents(). Return EJUSTRETURN from
tmpfs_dir_getdents() instead of hard coded -1.
Mark directory traversal routines static as they are no longer
used outside of tmpfs_subr.c
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.
Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.
Suggested and reviewed by: alc
MFC after: 2 weeks
Since r230208 update mounts were allowed if the list of mount options
contained the "export" option. This is not correct as tmpfs doesn't
really support updating all options.
Reviewed by: kevlo, trociny
accesses of the cache member of vm_object objects.
- Use novel vm_page_is_cached() for checks outside of the vm subsystem.
Reviewed by: alc
MFC after: 2 weeks
X-MFC: r234039
allocator.
Replace UINT32_MAX checks with INT_MAX. Keeping more than 2^31 nodes in
memory is not likely to become possible in foreseeable feature and would
require new unit number allocator.
Discussed with: delphij
MFC after: 2 weeks
Doomed vnode is hardly of any use here, besides all callers handle error
case. vfs_hash_get() does the same.
Don't mess with vnode holdcount, vget() takes care of it already.
Approved by: mdf (mentor)
associated with the previous vnode (if any) associated with the target of
a rename(). Otherwise, a lookup of the target pathname concurrent with a
rename() could re-add a name cache entry after the namei(RENAME) lookup
in kern_renameat() had purged the target pathname.
MFC after: 2 weeks
vm_object_pip_{add,subtract}() on the swap object because the swap
object can't be destroyed while the vnode is exclusively locked.
Moreover, even if the swap object could have been destroyed during
tmpfs_nocacheread() and tmpfs_mappedwrite() this code is broken
because vm_object_pip_subtract() does not wake up the sleeping thread
that is trying to destroy the swap object.
Free invalid pages after an I/O error. There is no virtue in keeping
them around in the swap object creating more work for the page daemon.
(I believe that any non-busy page in the swap object will now always
be valid.)
vm_pager_get_pages() does not return a standard errno, so its return
value should not be returned by tmpfs without translation to an errno
value.
There is no reason for the wakeup on vpg in tmpfs_mappedwrite() to
occur with the swap object locked.
Eliminate printf()s from tmpfs_nocacheread() and tmpfs_mappedwrite().
(The swap pager already spam your console if data corruption is
imminent.)
Reviewed by: kib
MFC after: 3 weeks
operation on POSIX shared memory objects and tmpfs. Previously, neither of
these modules correctly handled the case in which the new size of the object
or file was not a multiple of the page size. Specifically, they did not
handle partial page truncation of data stored on swap. As a result, stale
data might later be returned to an application.
Interestingly, a data inconsistency was less likely to occur under tmpfs
than POSIX shared memory objects. The reason being that a different mistake
by the tmpfs truncation operation helped avoid a data inconsistency. If the
data was still resident in memory in a PG_CACHED page, then the tmpfs
truncation operation would reactivate that page, zero the truncated portion,
and leave the page pinned in memory. More precisely, the benevolent error
was that the truncation operation didn't add the reactivated page to any of
the paging queues, effectively pinning the page. This page would remain
pinned until the file was destroyed or the page was read or written. With
this change, the page is now added to the inactive queue.
Discussed with: jhb
Reviewed by: kib (an earlier version)
MFC after: 3 weeks
tmpfs_nocacheread(). It is both unnecessary and a pessimization. It
results in either the page being zeroed twice or zeroed first and then
overwritten by an I/O operation.
MFC after: 3 weeks
- Don't deduct wired pages from total usable counts because it does not
make any sense. To make things worse, on systems where swap size is
smaller than physical memory and use a lot of wired pages (e.g. ZFS),
tmpfs can suddenly have free space of 0 because of this;
- Count cached pages as available; [1]
- Don't count inactive pages as available, technically we could but that
might be too aggressive; [1]
[1] Suggested by kib@
MFC after: 1 week
-- highly experimental even. So far the closest to a bug in TMPFS that people
have gotten to relates to how ZFS can take away from the memory that TMPFS
needs. One can argue that such is not a bug in TMPFS. Irrespective, even if
there is a bug here and there in TMPFS, it's not in our own advantage to
scare people away from using TMPFS. I for one have been using it, even with
ZFS, very successfully.
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify afalgs.
Document the changes to flags field to only require the page lock.
Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.
Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.
This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.
Update all of the existing assertions on pmap_remove_all() to reflect this
change.
Reviewed by: kib
method, so that callers can indicate the minimum vnode
locking requirement. This will allow some file systems to choose
to return a LK_SHARED locked vnode when LK_SHARED is specified
for the flags argument. This patch only adds the flag. It
does not change any file system to use it and all callers
specify LK_EXCLUSIVE, so file system semantics are not changed.
Reviewed by: kib
file's last accessed, modified, and changed times:
TMPFS_NODE_ACCESSED and TMPFS_NODE_CHANGED should be set unconditionally
in tmpfs_remove() without regard to the number of hard links to the file.
Otherwise, after the last directory entry for a file has been removed, a
process that still has the file open could read stale values for the last
accessed and changed times with fstat(2).
Similarly, tmpfs_close() should update the time-related fields even if all
directory entries for a file have been removed. In this case, the effect
is that the time-related fields will have values that are later than
expected. They will correspond to the time at which fstat(2) is called.
In collaboration with: kib
MFC after: 1 week
object's size field. Previously, that field was always zero, even
when the object tn_reg.tn_aobj contained numerous pages.
Apply style fixes to tmpfs_reg_resize().
In collaboration with: kib
either overflow the supplied buffer, or cause uiomove fail.
Do not advance cached de when directory entry was not copied out.
Do not return EOF when no entries could be copied due to first entry
too large for supplied buffer, signal EINVAL instead.
Reported by: Beat G?tzi <beat chruetertee ch>
MFC after: 1 week
Otherwise, adding insult to injury, in addition to double-caching of data
we would always copy the data into a vnode's vm object page from backend.
This is specific to sendfile case only (VOP_READ with UIO_NOCOPY).
PR: kern/141305
Reported by: Wiktor Niesiobedzki <bsd@vink.pl>
Reviewed by: alc
Tested by: tools/regression/sockets/sendfile
MFC after: 2 weeks
Right now unionfs only allows filesystems to be mounted on top of
another if it supports whiteouts. Even though I have sent a patch to
daichi@ to let unionfs work without it, we'd better also add support for
whiteouts to tmpfs.
This patch implements .vop_whiteout and makes necessary changes to
lookup() and readdir() to take them into account. We must also make sure
that when adding or removing a file, we honour the componentname's
DOWHITEOUT and ISWHITEOUT, to prevent duplicate filenames.
MFC after: 1 month
to unconditionally set PG_REFERENCED on a page before sleeping. In many
cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by
the page daemon, before the caller to vm_page_sleep() is reawakened.
Instead, we now explicitly set PG_REFERENCED in those cases where having
the page persist until the caller is awakened is clearly desirable. Note,
however, that setting PG_REFERENCED on the page is still only a hint,
and not a guarantee that the page should persist.
maximum file size limit. Default is UINT64_MAX when the option is
not specified. It was useless to set the limit to the total amount of
memory and swap in the system.
Use tmpfs_mem_info() rather than get_swpgtotal() in tmpfs_mount() to
check if there is enough memory available.
Remove now unused get_swpgtotal().
Reviewed by: Gleb Kurtsou
Approved by: trasz (mentor)
convert its value. [1]
- Set default tm_nodes_max to min(pages + 3, UINT32_MAX). It's more
reasonable than the old four nodes per page (with page size 4096) because
non-empty regular files always use at least one page. This fixes possible
overflow in the calculation. [2]
- Don't allow more than tm_nodes_max nodes allocated in tmpfs_alloc_node().
PR: kern/138367
Suggested by: bde [1], Gleb Kurtsou [2]
Approved by: trasz (mentor)
with vfs_scanopt(9) using the "%qu" format string.
- Limit the maximum value of size_max to (SIZE_MAX - PAGE_SIZE) to
prevent overflow in howmany() macro.
PR: kern/141194
Approved by: trasz (mentor)
MFC after: 2 weeks
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.
The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.
The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.
The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).
Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.
In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.
While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.
VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.
both node pointer and name component. This does the right thing for
hardlinks to the same node in the same directory.
Submitted by: Yoshihiro Ota <ota j email ne jp>
PR: kern/131356
MFC after: 2 weeks
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.
Approved by: rwatson (mentor)
and bcmp are not the same thing. 'man bcmp' states that the return is
"non-zero" if the two byte strings are not identical. Where as,
'man memcmp' states that the return is the "difference between the
first two differing bytes (treated as unsigned char values" if the
two byte strings are not identical.
So provide a proper memcmp(9), but it is a C implementation not a tuned
assembly implementation. Therefore bcmp(9) should be preferred over memcmp(9).
initialize the vattr structure in VOP_GETATTR() with VATTR_NULL(),
vattr_null() or by zeroing it. Remove these to allow preinitialization
of fields work in vn_stat(). This is needed to get birthtime initialized
correctly.
Submitted by: Jaakko Heinonen <jh saunalahti fi>
Discussed on: freebsd-fs
MFC after: 1 month