Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
If bumping the counter goes over the limit we have to decrement it back.
Previous code would only bump the counter after adding the entry (thus allowing
the cache to go over the limit).
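The bump-then-undo pattern described here can be sketched in userland C as follows. This is only an illustration under stated assumptions: the names numcache, cache_limit, cache_reserve_slot() and cache_release_slot() are hypothetical, and C11 atomics stand in for whatever counter primitive the kernel code actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names: numcache and cache_limit are illustrative only. */
static atomic_long numcache;
static const long cache_limit = 4;

/*
 * Reserve a slot before inserting. The counter is bumped first; if that
 * pushes it past the limit, the bump is undone and the caller must not
 * add the entry.
 */
static bool
cache_reserve_slot(void)
{
	long newval = atomic_fetch_add(&numcache, 1) + 1;

	if (newval > cache_limit) {
		atomic_fetch_sub(&numcache, 1);
		return (false);
	}
	return (true);
}

/* Release a slot when an entry is removed. */
static void
cache_release_slot(void)
{
	long old = atomic_fetch_sub(&numcache, 1);

	assert(old > 0);
	(void)old;
}
```

Because the reservation happens before the entry is linked in, the number of linked entries never exceeds the limit, which is the property the previous ordering could not guarantee.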
Sponsored by: The FreeBSD Foundation
cache_lookup's documentation got dislocated by r324378. Relocate and expand
it.
Reviewed by: jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Implement a ddb function walking the namecache to do this.
Reviewed by: jhb, mjg
Inspired by: gdb macro from jhb (old version)
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D14898
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Since the case of an empty chain was already covered, it is very likely
that the existing entry is matching. Skipping readlocking saves on lock
upgrade.
It is used on each new entry addition to decide whether to whack an existing
negative entry in order to prevent a blow out in size, but the parameter was
set years ago and never revisited.
Building with poudriere results in about 400 evictions per second which
unnecessarily grab entries from the hot list.
With the new parameter there are next to no evictions of the sort.
Lookups of the sort are rare compared to regular ones and successful ones
result in removing entries from the cache.
In the current code buckets are rlocked and a trylock dance is performed,
which can fail and cause a restart. Fixing it will require a little bit
of surgery and in order to keep the code maintainable the 2 cases have
to be split.
MFC after: 1 week
This fixes kernel crashes due to misaligned accesses to the 64-bit
time_t embedded in struct namecache_ts in MIPS n32 kernels.
MFC after: 1 week
Sponsored by: DARPA / AFRL
namecache_ts differs from mere namecache by a few fields placed mid struct.
The access to the last element (the name) is thus special-cased.
The standard solution is to put new fields at the very beginning and
embed the original struct. The pointer shuffled around points to the
embedded part. If needed, access to new fields can be gained through
__containerof.
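A minimal userland sketch of this layout, assuming simplified stand-in types: struct entry and struct entry_ts are hypothetical and much smaller than the real structures, and the containerof() macro below only approximates the kernel's __containerof.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Userland stand-in for the kernel's __containerof(); an assumption. */
#define containerof(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified base entry; the real struct namecache has more fields. */
struct entry {
	unsigned char	namelen;
	char		name[16];	/* the name stays the last element */
};

/* Extended variant: new fields first, base struct embedded last. */
struct entry_ts {
	struct timespec	time;
	struct entry	base;
};

/* Code that only knows about struct entry needs no special casing. */
static const char *
entry_name(const struct entry *e)
{
	return (e->name);
}

/* Recover the timestamp from a pointer to the embedded part. */
static struct timespec *
entry_time(struct entry *e)
{
	assert(e != NULL);
	return (&containerof(e, struct entry_ts, base)->time);
}
```

Only callers that know an entry was allocated as the extended variant may call entry_time(); the sketch does not check this.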
MFC after: 1 week
All hash sizes are power-of-2, but the compiler does not know that for sure
and 'foo % size' forces doing a division.
Store the size - 1 and use 'foo & hash' instead, which needs only a bitwise AND.
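As an illustration of the mask approach, here is a small standalone sketch. The struct hashtab type and both helper names are hypothetical and are not taken from the namecache code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table descriptor; only the mask (size - 1) is stored. */
struct hashtab {
	uint32_t mask;
};

static void
hashtab_init(struct hashtab *ht, uint32_t size)
{
	/* The trick is only valid for power-of-2 sizes. */
	assert(size != 0 && (size & (size - 1)) == 0);
	ht->mask = size - 1;
}

/* Same result as 'hash % size' for power-of-2 sizes, without a division. */
static uint32_t
hashtab_bucket(const struct hashtab *ht, uint32_t hash)
{
	return (hash & ht->mask);
}
```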
The size can be changed by side effect of modifying kern.maxvnodes.
Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.
Force the number of buckets to be not smaller.
Note this should not matter for real world cases.
Reported and tested by: pho
The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand entry removal possibly sets the NCF_DVDROP
without aforementioned locks held prior to detaching it from the respective
neglist, which can lose the update made by the shrinker.
Reported and tested by: truckman
vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag. See comments for
explanation why unlocked check for the flag is considered safe.
Reported and tested by: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.
Fix the problem by initializing to NULL on entry.
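The class of bug and its fix can be shown with a simplified sketch. The list layout and the names neglists and neg_pick() are hypothetical; the real code walks locked tail queues rather than an array of head pointers.

```c
#include <assert.h>
#include <stddef.h>

#define NLISTS	4

struct negentry {
	int id;
};

/* Hypothetical per-list heads; NULL means the list is empty. */
static struct negentry *neglists[NLISTS];

/*
 * Return the first entry found on any list, or NULL if all are empty.
 */
static struct negentry *
neg_pick(void)
{
	/* The fix: a defined result even when no list yields an entry. */
	struct negentry *ncp = NULL;
	int i;

	for (i = 0; i < NLISTS; i++) {
		if (neglists[i] != NULL) {
			ncp = neglists[i];
			break;
		}
	}
	return (ncp);
}
```

Without the initializer, a caller testing the result against NULL would act on stack garbage whenever every list is empty.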
Reported by: glebius
This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.
Create N dedicated lists for new negative entries.
Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.
Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.
Reviewed by: kib
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.
Reported and tested by: jkim
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
purgevfs is purely optional and induces lock contention in workloads
which frequently mount and unmount filesystems.
In particular, poudriere will do this for filesystems with 4 vnodes or
less. Full cache scan is clearly wasteful.
Since there is no explicit counter for namecache entries, the number of
vnodes used by the target fs is checked.
The default limit is the number of bucket locks.
Reviewed by: kib
Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.
Lookups still require the relevant bucket to be locked.
Discussed with: kib
Tested by: pho
An array of bucket locks is added.
All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently to the writer as long as they use a
different lock. See the added comment for more details.
This is an intermediate step towards removal of the global lock.
Reviewed by: kib
Tested by: pho
Since negative entries are managed with a LRU list, a hit requires a
modification.
Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.
Provide a dedicated lock for use when the cache is only shared-locked.
Reviewed by: kib
MFC after: 1 week
the virtvnodes calculation. Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode. Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.
Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects. Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.
Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.
The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.
Reported and tested by: marius (i386, previous version)
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
1. vhold and zap immediately instead of postponing few lines later
2. increment numneg after new entry is added
No functional changes.
No objections: kib
Previously the code would just increment statistics while only holding a
shared lock, in effect losing updates.
Separate tracking for nchstats is removed as values can be obtained from
existing counters. Note that some fields are updated by external
consumers and are left unfixed. This should not be a serious issue as
this structure looks quite obsolete.
No strong objections: kib
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.
MFC after: 1 week
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.
Perhaps SDT_PROBE should be made a private implementation detail.
MFC after: 20 days
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.
Reviewed by: kib
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after: 2 weeks
Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.
Reviewed by: kib (earlier version)
Tested by: pho
argument. This will be used for the Linux emulation layer - for Linux,
PATH_MAX is 4096 and not 1024.
Differential Revision: https://reviews.freebsd.org/D2335
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
sysctl_debug_hashstat_nchash() and sysctl_debug_hashstat_rawnchash().
These changes are in preparation for allowing changes in the size
of the vnode hash tables driven by increases and decreases in the
maximum number of vnodes in the system.
Reviewed by: kib@
Phabric: D2265
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.
These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.
While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.
Pointed out by: bde
MFC after: 1 week
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
Having ncneg diverge with the actual length of the ncneg tailq causes
NULL dereference.
Add assertion that an entry taken from ncneg queue is indeed negative.
Reported by and discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
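The naming convention can be illustrated with a small string helper. This is a sketch of the mapping only, not the SDT framework's implementation; probe_name_decode() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src to dst, turning each "__" into '-'. dst must hold at least
 * strlen(src) + 1 bytes. Returns the length of the result.
 */
static size_t
probe_name_decode(char *dst, const char *src)
{
	size_t n = 0;

	assert(dst != NULL && src != NULL);
	while (*src != '\0') {
		if (src[0] == '_' && src[1] == '_') {
			dst[n++] = '-';
			src += 2;
		} else {
			dst[n++] = *src++;
		}
	}
	dst[n] = '\0';
	return (n);
}
```

A single underscore is copied through unchanged, so only doubled underscores are treated as an encoded dash.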
Reviewed by: markj
MFC after: 3 weeks
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invocation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.
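The walk itself, leaving aside where the ".." links come from, can be sketched as below. The struct dirnode type and this checkpath() are hypothetical in-memory stand-ins; the real ufs_checkpath() works on inodes and, with this change, consults the name cache before falling back to directory I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory directory node; parent is NULL only at the root. */
struct dirnode {
	struct dirnode *parent;
};

/*
 * Walk ".." links from target up to the root. Returns false if source is
 * encountered, meaning the rename would move a directory into its own
 * subtree.
 */
static bool
checkpath(const struct dirnode *source, const struct dirnode *target)
{
	const struct dirnode *dp;

	assert(source != NULL && target != NULL);
	for (dp = target; dp != NULL; dp = dp->parent) {
		if (dp == source)
			return (false);
	}
	return (true);
}
```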
Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.
Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks
appropriate timestamps. Restore the assertions which verify that
NCF_TS is set when timestamp is asked for.
Reviewed by: jhb (previous version)
MFC after: 2 weeks
consistently, creating some namecache entries without NCF_TS flag.
This causes panic due to failed assertion.
As a temporary relief, remove the assert. Return epoch timestamp for
the entries without timestamp if asked.
While there, consolidate the code which returns timestamps, into a
helper cache_out_ts().
Discussed with: jhb
MFC after: 2 weeks
provide struct namecache_ts which is the old struct namecache. Only
allocate struct namecache_ts if non-null struct timespec *tsp was
passed to cache_enter_time, otherwise use struct namecache.
Change struct namecache allocation and deallocation macros into static
functions, since logic becomes somewhat twisty. Provide accessor for
the nc_name member of struct namecache to hide difference between
struct namecache and namecache_ts.
The aim of the change is to not waste 20 bytes per small namecache
entry.
Reviewed by: jhb
MFC after: 2 weeks
X-MFC-note: after r230394
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.
MFC after: 2 weeks
This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.
In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.
Unbreaks jailed zfs(8) with enforce_statfs set to 1.
Reviewed by: kib
MFC after: 1 month
nullfs. The problem is that resulting vnode is only required to be
held on return from the successful call to vop, instead of being
referenced.
Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or like.
Change the interface for VOP_VPTOCNP(), now the dvp must be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(),
if any, should have no trouble with the fix.
Tested by: pho
Reviewed by: mckusick
MFC after: 3 weeks (subject of re approval)
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.
Reviewed by: rwatson
Approved by: re (bz)
Move debug.ncnegfactor to vfs.ncnegfactor [1].
Provide some descriptions for the namecache related sysctls [1].
Based on the submission by: Rogier R. Mulhuijzen <drwilco drwilco net> [1]
MFC after: 2 weeks
X-MFC-note: remove debug.ncnegfactor in HEAD after MFC
use '-' in probe names, matching the probe names in Solaris.[1]
Add userland SDT probes definitions to sys/sdt.h.
Sponsored by: The FreeBSD Foundation
Discussed with: rwatson [1]
Assert this.
In the reported panic, vdestroy() fired the assertion "vp has namecache
for ..", because pseudofs may end up doing cache_enter() with reclaimed
dvp, after dotdot lookup temporarily unlocked dvp.
Similar problem exists in ufs_lookup() for "." lookup, when vnode
lock needs to be upgraded.
Verify that dvp is not reclaimed before calling cache_enter().
Reported and tested by: pho
Reviewed by: kan
MFC after: 2 weeks
vn_open_cred in default implementation. Valid struct ucred is needed for
audit and MAC, and curthread credentials may be wrong.
This further requires modifying the interface of vn_fullpath(9), but it
is out of scope of this change.
Reviewed by: rwatson
and calls to vn_vptocnp() by moving more of the common code to
vn_vptocnp(). Rename vn_vptocnp() to vn_vptocnp_locked() to signify that
cache is locked around the call.
Do not track buffer position by both the pointer and offset, use only
buflen to record the start of the free space.
Export vn_vptocnp() for external consumers as a wrapper around
vn_vptocnp_locked() that locks the cache and handles hold counts.
Tested by: pho
not populated in parent directory if negative entry was being
created, yet entry itself was added to the nc_neg list. It was
possible for parent vnode to get discarded later, leaving negative
entry pointing to now unused memory block.
Reported by: dho
Reviewed by: kib
Check the condition and return ENOENT then.
In nfs_lookup(), respect ENOENT return from cache_lookup() when it is caused
by dvp reclaim.
Reported and tested by: pho
the size and cost of name cache entries, but make adding debugging
and tracing easier.
Add SDT DTrace probes for various namecache events:
vfs:namecache:enter:done - new entry in the name cache, passed parent
directory vnode pointer, name added to the cache, and child vnode
pointer.
vfs:namecache:enter_negative:done - new negative entry in the name cache,
passed parent vnode pointer, name added to the cache.
vfs:namecache:fullpath:enter - call to vn_fullpath1() is made, passed
the vnode to resolve to a name.
vfs:namecache:fullpath:hit - vn_fullpath1() successfully resolved a
search for the parent of an object using the namecache, passed the
discovered parent directory vnode pointer, name, and child vnode
pointer.
vfs:namecache:fullpath:miss - vn_fullpath1() failed to resolve a search
for the parent of an object using the namecache, passed the child
vnode pointer.
vfs:namecache:fullpath:return - vn_fullpath1() has completed, passed the
error number, and if that is zero, the vnode to resolve, and the
returned path.
vfs:namecache:lookup:hit - positive name cache entry hit, passed the
parent directory vnode pointer, name, and child vnode pointer.
vfs:namecache:lookup:hit_negative - negative name cache entry hit,
passed the parent directory vnode pointer and name.
vfs:namecache:lookup:miss - name cache miss, passed the parent directory
pointer and the full remaining component name (not terminated after the
cache miss component).
vfs:namecache:purge:done - name cache purge for a vnode, passed the vnode
pointer to purge.
vfs:namecache:purge_negative:done - name cache purge of negative entries
for children of a vnode, passed the vnode pointer to purge.
vfs:namecache:purgevfs - name cache purge for a mountpoint, passed the
mount pointer. Separate probes will also be invoked for each cache
entry zapped.
vfs:namecache:zap:done - name cache entry zapped, passed the parent
directory vnode pointer, name, and child vnode pointer.
vfs:namecache:zap_negative:done - negative name cache entry zapped,
passed the parent directory vnode pointer and name.
For any probes involving an extant name cache entry (enter, hit, zap),
we use the nul-terminated string for the name component. For misses,
the remainder of the path, including later components, is provided as
an argument instead since there is no handy nul-terminated version of
the string around. This is arguably a bug.
MFC after: 1 month
Sponsored by: Google, Inc.
Reviewed by: jhb, kan, kib (earlier version)
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.
This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stopped being type stable.
Reviewed by: kib
debug.hashstat.rawnchash sysctl in particular as taking 7 milliseconds on
a 3GHz Intel Xeon (4x2) running 7.1. It accounted for almost a quarter of
the total runtime of 'sysctl -a'. It also performs lots of copyout's while
holding the namecache lock (this does not attempt to fix that).
MFC after: 2 weeks
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked. If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC. This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.
Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by: mohans
mutex to a reader/writer lock. Lookup operations first grab a read lock and
perform the lookup. If the operation results in a need to modify the cache,
then it tries to do an upgrade. If that fails, it drops the read lock,
obtains a write lock, and redoes the lookup.
inside the SYSCTL() macros and thus does not need to be done for
all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.
In normal operation, the number of cache entries is roughly equal to the
number of active vnodes. However, when most of the recently accessed
vnodes have many hard links, the number of cache entries can be 32000
times as large, exhausting kernel memory and provoking a panic in
kmem_malloc().
MFC after: 2 weeks
did not compare nc_dvp with supplied parent directory vnode pointer.
Add the check and note that now branches for vp != NULL and vp == NULL
are the same, thus can be merged.
Reported and reviewed by: kan
Tested by: pho
MFC after: 2 weeks
looked up would have v_dd set to a non-NULL value. This fixes a panic
seen when running installworld on a diskless system with a separate /usr
file system.
Submitted by: cracauer
Approved by: kib
on a best-effort basis. Teach vn_fullpath to use this new VOP if a
regular VFS cache lookup fails. This VOP is designed to supplement the
VFS cache to provide a better chance that a vnode-to-name lookup will
succeed.
Currently, an implementation for devfs is being committed. The default
implementation is to return ENOENT.
A big thanks to kib for the mentorship on this, and to pho for running it
through his stress test suite.
Reviewed by: arch
Approved by: kib
entries for one name. Then, creating inode with that name would remove
one entry, leaving others dormant. Reclaiming the vnode would uncover
negative entries, causing false return of ENOENT from the calls like
stat, that do not create inode.
Prevent creation of the duplicated negative entries.
Reported and debugged with: pho
Reviewed by: jhb
X-MFC: after shared lookup changes
advance of teaching vn_fullpath1() how to query file systems for
vnode-to-name mappings when cache lookups fail.
Thanks to kib for guidance and patience on this process.
Reviewed by: kib
Approved by: kib
unmounts. When we upgrade a vnode lock from shared to exclusive during
a name cache lookup, fail the lookup with EBADF if the vnode is invalidated
while we are waiting for the exclusive lock.
Also, for correctness (though I'm not sure it can occur in practice),
downgrade an exclusively locked vnode if it should be share locked.
Tested by: pho
not in the namecache when shared lookups are enabled (vfs.lookup_shared=1,
it is currently off by default) and the filesystem supports shared lookups
(e.g. NFS client). Specifically, if multiple concurrent LOOKUPs both miss
in the name cache in parallel, each of the lookups may end up adding an
entry to the namecache resulting in duplicate entries in the namecache
for the same pathname. A subsequent removal of the mapping of that
pathname to that vnode (via remove or rename) would only evict one of the
entries from the name cache. As a result, subsequent lookups for that
pathname would still return the old vnode.
This race was observed with shared lookups over NFS where a file was updated
by writing a new file out to a temporary file name and then renaming that
temporary file to the "real" file to effect atomic updates of a file. Other
processes on the same client that were periodically reading the file would
occasionally receive an ESTALE error from open(2) because the VOP_GETATTR()
in nfs_open() would receive that error when given the stale vnode.
The fix here is to check for duplicates in cache_enter() and just return
if an entry for this same directory and leaf file name for this vnode is
already in the cache. The check for duplicates is done by walking the
per-vnode list of name cache entries. It is expected that this list should
be very small in the common case (usually 0 or 1 entries during a
cache_enter() since most files only have 1 "leaf" name).
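A simplified sketch of such a duplicate check follows. The types and the name names_enter() are hypothetical; the real cache_enter() operates on the kernel's own list macros and runs with the cache lock held, which this userland version leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct ncentry {
	struct ncentry	*next;	/* next entry naming the same vnode */
	const void	*dvp;	/* parent directory, opaque here */
	const char	*name;	/* leaf name */
};

/* Hypothetical stand-in for a vnode's list of cache entries. */
struct vnode_names {
	struct ncentry	*head;
};

/*
 * Link ncp unless an entry with the same directory and name already
 * exists. Returns true if inserted, false if it was a duplicate.
 */
static bool
names_enter(struct vnode_names *vn, struct ncentry *ncp)
{
	struct ncentry *n;

	assert(vn != NULL && ncp != NULL);
	for (n = vn->head; n != NULL; n = n->next) {
		if (n->dvp == ncp->dvp && strcmp(n->name, ncp->name) == 0)
			return (false);
	}
	ncp->next = vn->head;
	vn->head = ncp;
	return (true);
}
```

The linear walk is acceptable for the reason given above: the list is expected to hold 0 or 1 entries in the common case.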
Reviewed by: ups, scottl
MFC after: 2 months
processes are not producing absolute pathname tokens. It is required
that audited pathnames are generated relative to the global root mount
point. This modification changes our implementation of audit_canon_path(9)
and introduces a new function: vn_fullpath_global(9) which performs a
vnode -> pathname translation relative to the global mount point based
on the contents of the name cache. Much like vn_fullpath,
vn_fullpath_global is a wrapper function which calls vn_fullpath1.
Further, the string parsing routines have been converted to use the
sbuf(9) framework. This change also removes the conditional acquisition
of Giant, since the vn_fullpath1 method will not dip into file system
dependent code.
The vnode locking was modified to use vhold()/vdrop() instead of vref()
and vrele(). This will modify the hold count instead of modifying the
user count. This makes more sense since it's the kernel that requires
the reference to the vnode. This also makes sure that the vnode does not
get recycled while we hold the reference to it. [1]
Discussed with: rwatson
Reviewed by: kib [1]
MFC after: 2 weeks
no longer needed, but for now we still want to be consistent with other
similar checks in the tree.
- Call ASSERT_VOP_ELOCKED() only when vget() returns 0.
Reviewed by: jeff
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.
MFC after: 1 month
Discussed with: imp, rink
always curthread.
As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.
Tested by: Andrea Barberio <insomniac at slackware dot it>
conjunction with 'thread' argument passing which is always curthread.
Remove the useless extra argument and explicitly pass curthread to lower
layer functions, when necessary.
The KPI is broken by this change, which should affect several ports, so
the version bump and manpage update will be committed separately.
Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modification makes the code cleaner and in
particular removes an annoying dependency, helping the next lockmgr() cleanup.
The KPI is, obviously, changed.
Manpage and FreeBSD_version will be updated through further commits.
As a side note, it is worth saying that the next commits will address
a similar cleanup of VFS methods, in particular vop_lock1 and
vop_unlock.
Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
parent vnode and relock it after locking child vnode. The problem was that
we always relock it exclusively, even when it was share-locked.
Discussed with: jeff
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are important for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.
- Generally eliminate the distinction between "fast" and regular
acquisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.
- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.
- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).
- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.
In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).
Tested by: kris
Discussed with: jhb, kris, attilio, jeff
- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.
filesystem agnostic. We are not touching any file system specific functions
in this code path. Since we have a cache lock, there is really no need to
keep Giant around here.
This eliminates Giant acquisitions for any syscall which is auditing pathnames.
Discussed with: jeff
cache_zap() to clear the v_dd pointers when a directory vnode is forcibly
discarded. For this to work, all vnodes with v_dd pointers to a directory
must also have name cache entries linked via v_cache_dst to that dvp
otherwise we could not find them at cache_purge() time. The following
code snippet could break this guarantee by unlinking a directory before
fetching its dotdot. The dotdot lookup would initialize the v_dd field
of the unlinked directory which could never be cleared. To fix this
we don't initialize v_dd for orphaned vnodes.
printf("rmdir: %d\n", rmdir("../foo")); /* foo is cwd */
printf("chdir: %d\n", chdir(".."));
printf("%s\n", getwd(NULL));
Sponsored by: Isilon Systems, Inc.
Discovered by: kkenn
Approved by: re (blanket vfs)