If a pNFS server's DS runs out of disk space, it replies
NFSERR_NOSPC to the client doing the writing. The Linux
client then sends a LayoutError RPC to the server to
tell it about the error and keeps retrying, doing repeated
LayoutGet and Write RPCs to the DS. The Linux client is
"stuck" until disk space on the DS is freed up.
For a mirrored server configuration, the first mirror that
ran out of space was taken offline. This does not make
much sense, since the other mirror(s) will run out of space
soon and the fix is a manual cleanup of disk space.
This patch changes the pNFS server to not disable a mirror
for the mirrored case when this occurs.
Further work is needed, since the Linux client expects the
MDS to reply NFSERR_NOSPC to LayoutGets once the DS is out
of space. Without this further change, the above mentioned
looping occurs.
Found during a recent IETF NFSv4 working group testing event.
(cherry picked from commit a7e014eee5)
When a mount with the "pnfs" and "nconnect" options specified
does an I/O operation, it erroneously uses a TCP connection
to the MDS when it is meant to be a DS operation and, as such,
needs to use a TCP connection to the DS. This patch fixes this.
When the "pnfs" and "nconnect" options are specified for a
NFSv4.1/4.2 mount, there probably should be N connections
established to each DS for I/O RPCs. This is a fair amount
of work and may be done in a future commit.
This problem was found during a recent IETF NFSv4 working
group testing event.
(cherry picked from commit 80e5955b08)
Although I was not able to cause a failure during testing, there
are places in nfscl_removedeleg() and nfscl_renamedeleg() where
I think a forced dismount could get hung. This patch fixes those.
This patch only affects forced dismount and only if the NFSv4
server is issuing delegations to the client.
Found by code inspection.
(cherry picked from commit f5d5164fb6)
When a forced dismount is done and the "nconnect" mount
option was used, the additional connections must be closed.
This patch does that.
Found during a recent IETF NFSv4 working group testing event.
(cherry picked from commit ae49051c03)
When a forced dismount is in progress, it is possible to
end up looping, retrying commits that fail.
This patch fixes the problem by pretending
that commits succeeded when a forced dismount is in progress.
(cherry picked from commit 6b67753488)
When a forced dismount is done and delegations are being
issued by the server (disabled by default for FreeBSD
servers), the delegation structure is freed before the
loop calling vflush(). This could result in a use after
free of the delegation structure.
This patch changes the code so that the delegation
structures are not freed until after the vflush()
loop for forced dismounts.
Found during a recent IETF NFSv4 working group testing event.
(cherry picked from commit 4412225859)
The nfscl_getref() function is called within nfscl_doiods() when
the NFSv4.1/4.2 pNFS client is doing I/O on a DS. As such,
nfscl_getref() needs to check for a forced dismount.
This patch adds that check.
Found during a recent IETF NFSv4 working group testing event.
(cherry picked from commit 331883a2f2)
For NFS RPCs that receive a NFSERR_DELAY reply, the delay time
is initially 1sec and then increases exponentially to NFS_TRYLATERDEL.
It was found that this delay time is excessive for some NFSv4
servers, which work well with a 1msec delay.
A 1sec delay resulted in very slow performance for Remove and
Rename when delegations and pNFS were enabled.
This patch decreases the initial delay time to 1msec.
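The backoff described above can be sketched as follows. This is a
userland illustration only, not the kernel code: the function name
next_delay_us() and the MAX_DELAY_US value are hypothetical, and only
the 1msec starting point comes from the text above.

```c
#include <stdint.h>

/* Assumed values for illustration; only the 1msec start is from the text. */
#define	INITIAL_DELAY_US	1000u		/* 1msec initial delay */
#define	MAX_DELAY_US		4000000u	/* hypothetical cap, standing in for NFS_TRYLATERDEL */

/*
 * Return the delay to use for the next retry: start at the initial
 * delay, then double each time, but never exceed the cap.
 * A current delay of 0 means "this is the first retry".
 */
static uint32_t
next_delay_us(uint32_t cur_us)
{
	if (cur_us == 0)
		return (INITIAL_DELAY_US);
	if (cur_us >= MAX_DELAY_US / 2)
		return (MAX_DELAY_US);
	return (cur_us * 2);
}
```

With a small starting value, a server that only needs a brief pause is
retried almost immediately, while a server that stays busy still sees
the retry rate fall off quickly.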
Found during a recent IETF NFSv4 working group testing event.
(cherry picked from commit 5a95a6e8e4)
Commit 5e5ca4c8fc added a NFSMNTP_DELEGISSUED flag to indicate when
a delegation has been issued to the mount. For the common case
where an NFSv4 server is not issuing delegations, this flag
can be checked to avoid acquisition of the NFSCLSTATEMUTEX.
This patch adds checks for NFSMNTP_DELEGISSUED being set
to two more functions.
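The pattern is roughly the following. This is a userland sketch using
pthreads and C11 atomics rather than the kernel primitives; the
structure and function names are hypothetical, and lock_count exists
only to make the effect visible.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for the per-mount state. */
struct mount_state {
	atomic_bool	deleg_issued;	/* plays the role of NFSMNTP_DELEGISSUED */
	pthread_mutex_t	state_mtx;	/* plays the role of NFSCLSTATEMUTEX */
	int		ndeleg;		/* delegations held, protected by state_mtx */
	int		lock_count;	/* times state_mtx was taken (illustration only) */
};

/*
 * Return true if the mount currently holds a delegation.
 * When no delegation has ever been issued, the mutex is not touched.
 */
static bool
mount_has_deleg(struct mount_state *ms)
{
	bool found;

	/* Fast path: the flag was never set, so the list must be empty. */
	if (!atomic_load(&ms->deleg_issued))
		return (false);

	pthread_mutex_lock(&ms->state_mtx);
	ms->lock_count++;
	found = (ms->ndeleg > 0);
	pthread_mutex_unlock(&ms->state_mtx);
	return (found);
}
```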
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
for the common case where the server is not issuing delegations.
(cherry picked from commit dc6dd769de)
PR#259071 provides a test program that fails for the NFS client.
Testing with it, there appears to be a race between Lookup
and VOPs like Setattr-of-size, where Lookup ends up loading
stale attributes (including what might be the wrong file size)
into the NFS vnode's attribute cache.
The race occurs when the modifying VOP, which holds a lock
on the vnode, blocks the acquisition of the vnode in Lookup
after the Lookup RPC has completed, so the attributes in the
RPC reply are now potentially stale.
Here's what seems to happen:
Child                               Parent
does stat(), which does
VOP_LOOKUP(), doing the Lookup
RPC with the directory vnode
locked, acquiring file attributes
valid at this point in time
blocks waiting for locked file      does ftruncate(), which
vnode                               does VOP_SETATTR() of Size,
                                    changing the file's size
                                    while holding an exclusive
                                    lock on the file's vnode
                                    releases the vnode lock
acquires file vnode and fills in
now stale attributes including
the old wrong Size
does a read() which returns
wrong data size
This patch fixes the problem by saving a timestamp in the NFS vnode
in the VOPs that modify the file (Setattr-of-size, Allocate).
Then, once it has acquired the file's vnode, lookup/readdirplus compares
that timestamp with the time it sampled just before starting the RPC.
If the modifying RPC occurred during the Lookup, the attributes
in the RPC reply are discarded, since they might be stale.
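The comparison can be sketched as below. This is a simplified userland
model, not the NFS client code: struct node, its fields, the function
names and the use of plain nanosecond counters are all assumptions made
for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the NFS vnode's private data. */
struct node {
	uint64_t localmod_ns;	/* time of the last local modifying VOP */
	uint64_t cached_size;	/* file size held in the attribute cache */
};

/* Called by a modifying operation, for example a truncate. */
static void
node_setsize(struct node *np, uint64_t size, uint64_t now_ns)
{
	np->cached_size = size;
	np->localmod_ns = now_ns;	/* remember when we changed the file */
}

/*
 * Called by lookup once it holds the node. rpc_start_ns was sampled
 * just before the Lookup RPC was sent. Returns true if the attributes
 * from the reply were loaded, false if they were discarded.
 */
static bool
node_loadattr(struct node *np, uint64_t rpc_size, uint64_t rpc_start_ns)
{
	/* A local change at or after the RPC start makes the reply suspect. */
	if (np->localmod_ns >= rpc_start_ns)
		return (false);
	np->cached_size = rpc_size;
	return (true);
}
```

Discarding the reply is safe because the modifying VOP has already
left its own, newer attributes in the cache.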
With this patch the test program works as expected.
Note that the test program does not fail on a July stable/12,
although this race is in the NFS client code. I suspect a
fairly recent change to the name caching code exposed this
bug.
PR: 259071
(cherry picked from commit 2be417843a)
Similar to commit 2be417843a, I believe there could be a race between
the NFS client VOP_LOOKUP() and file Writing that could result in stale
file attributes being loaded into the NFS vnode by VOP_LOOKUP().
I have not been able to reproduce a failure due to this race, but
I believe that there are two possibilities:
1. The Lookup RPC happens while VOP_WRITE() is being executed and loads
   stale file attributes after VOP_WRITE() returns when it has already
   completed the Write/Commit RPC(s).
   --> For this case, setting the local modify timestamp at the end of
   VOP_WRITE() should ensure that stale file attributes are not loaded.
2. The Lookup RPC occurs after VOP_WRITE() has returned, while
   asynchronous Write/Commit RPCs are in progress and then is
   blocked by the vnode held by VOP_OPEN/VOP_CLOSE/VOP_FSYNC which
   will flush writes via ncl_flush() or ncl_vinvalbuf(), clearing the
   NMODIFIED flag (which indicates Writes-in-progress). The VOP_LOOKUP()
   then acquires the NFS vnode lock and fills in stale file attributes.
   --> Setting the local modify timestamp in ncl_flush() and ncl_vinvalbuf()
   when they clear NMODIFIED should ensure that stale file attributes
   are not loaded.
This patch does the above.
PR: 259071
(cherry picked from commit 50dcff0816)
For pNFS servers that specify that Layouts are to be returned
upon close, they may expect that LayoutReturn to happen before
the associated Close.
This patch modifies the NFSv4.1/4.2 pNFS client so that this
is done. This only affects a pNFS mount against a non-FreeBSD
NFSv4.1/4.2 server that specifies return_on_close in LayoutGet
replies.
Found during a recent IETF NFSv4 working group testing event.
(cherry picked from commit d5d2ce1c85)
There was a case in nfscl_doiods() where the function would return
without releasing the delegation shared lock, if it was acquired by
the call to nfscl_getstateid(). This patch adds that release.
I have never observed a failure due to this missing release, so I
do not know if it ever happens in practice. However, since the pNFS
client is not yet heavily used, it might be the case.
Found by code inspection during a recent NFSv4 IETF working group
testing event.
(cherry picked from commit 23024f004a)
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.
Similarly, convert vm_page_alloc_domain() callers.
Note that callers are now responsible for assigning the pindex.
Reviewed by: alc, hselasky, kib
Sponsored by: The FreeBSD Foundation
(cherry picked from commit a4667e09e6)
Without this patch, if a NFSv4.1/4.2 server replies NFSERR_DELAY to
a Close operation, the client loops retrying the Close while holding
a shared lock on the clientID. This shared lock blocks returns of
delegations, even though the server has issued a CB_RECALL to request
the delegation return.
This patch delays doing a retry of a Close that received a reply of
NFSERR_DELAY until after the shared lock on the clientID is released,
for NFSv4.1/4.2. To fix this for NFSv4.0 would be very difficult and
since the only known NFSv4 server to reply NFSERR_DELAY to Close only
does NFSv4.1/4.2, this fix is hoped to be sufficient.
This problem was detected during a recent IETF working group NFSv4
testing event.
(cherry picked from commit 52dee2bc03)
This patch modifies the function that does the Close RPC (nfsrpc_closerpc)
so that it does not use the open_owner (nfso_own) for NFSv4.1/4.2.
Use of the seqid in the open_owner structure is only needed for NFSv4.0.
Same applies to a NFSERR_STALESTATEID reply, which should only happen
for NFSv4.0. This allows nfsrpc_closerpc() to be called when nfso_own
is no longer valid. This, in turn, allows nfsrpc_closerpc() to be called
after the shared lock on the clientID is released, for NFSv4.1/4.2.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
(cherry picked from commit d95c0a12a2)
This patch moves release of the shared clientID lock from nfsrpc_close(),
just after the nfscl_doclose() call, to the end of nfscl_doclose().
This does make the code cleaner, since the shared lock is acquired at
the beginning of nfscl_doclose(). The only semantics change is that
the code no longer drops and reacquires the NFSCLSTATELOCK() mutex,
which I do not believe will have a negative effect on the NFSv4 client.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
(cherry picked from commit e2aab5e2d7)
This patch adds a new argument to nfscl_tryclose() to indicate
whether or not it should loop when a NFSERR_DELAY reply is received
from the NFSv4 server. Since this new argument is always passed in
as "true" at this time, no semantics change should occur.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
(cherry picked from commit 77c595ce33)
This patch factors the unlinking of the nfsclopen structure out of
nfscl_freeopen() into a separate function called nfscl_unlinkopen().
It also adds a new argument to nfscl_freeopen() to conditionally do
the unlink. Since this new argument is always passed in as "true"
at this time, no semantics change should occur.
This is being done to prepare the code for a future patch that fixes
the case where an NFSv4.1/4.2 server replies NFSERR_DELAY to a Close
operation.
(cherry picked from commit 6495766acf)
Without this patch, if a pNFS read layout has already been acquired
for a file, writes would be redirected to the Metadata Server (MDS),
because nfscl_getlayout() would not acquire a read/write layout for
the file. This happened because there was no "mode" argument to
nfscl_getlayout() to indicate whether reading or writing was being done.
Since doing I/O through the Metadata Server is not encouraged for some
pNFS servers, it is preferable to get a read/write layout for writes
instead of redirecting the write to the MDS.
This patch adds an access mode argument to nfscl_getlayout() and
nfsrpc_getlayout(), so that nfscl_getlayout() knows to acquire a read/write
layout for writing, even if a read layout has already been acquired.
This patch only affects NFSv4.1/4.2 client behaviour when pNFS ("pnfs" mount
option against a server that supports pNFS) is in use.
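The decision that the new argument makes possible looks roughly like
this. It is only a sketch: the enum and the need_layoutget() helper are
hypothetical names, and it assumes that a read/write layout also covers
reads.

```c
#include <stdbool.h>

/* Hypothetical access modes; the real client uses its own constants. */
enum layout_mode {
	LAYOUT_NONE = 0,	/* no layout held for the file */
	LAYOUT_READ = 1,	/* read layout */
	LAYOUT_RW = 2		/* read/write layout (assumed to cover reads) */
};

/*
 * Decide whether a LayoutGet is needed: it is when the layout already
 * held does not cover the access being attempted.
 */
static bool
need_layoutget(enum layout_mode held, enum layout_mode wanted)
{
	return (held < wanted);
}
```

Without the mode, a caller can only ask "is there a layout?", which is
why a write found the existing read layout and went to the MDS instead.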
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
(cherry picked from commit 24af0fcdfc)
Some exported file systems, such as ZFS ones, cannot do VOP_ALLOCATE().
Since an NFSv4.2 server must either support the Allocate operation for
all file systems or not support it at all, define a sysctl called
vfs.nfsd.enable_v42allocate to enable the Allocate operation.
This sysctl is false by default and can only be set true if all
exported file systems (or all DSs for a pNFS server) can perform
VOP_ALLOCATE().
Unfortunately, there is no way to know if a ZFS file system will
be exported once the nfsd is operational, even if there are none
exported when the nfsd is started up, so enabling Allocate must
be done manually for a server configuration.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
(cherry picked from commit dfe887b7d2)
Without this patch, nfs_allocate() fell back on using vop_stdallocate()
for NFS mounts without Allocate operation support. This was incorrect,
since some file systems, such as ZFS, cannot do allocate via
vop_stdallocate(), which uses writes to try and allocate blocks.
Also, fix nfs_allocate() to return EINVAL when mounts cannot do Allocate,
since that is the correct error for posix_fallocate(2).
Note that Allocate is only supported by some NFSv4.2 servers.
(cherry picked from commit 235891a127)
Without this patch, it is possible to hang the NFSv4 client,
when a rename/remove is being done on a file where the client
holds a delegation, if pNFS is being used. For a delegation
to be returned, dirty data blocks must be flushed to the NFSv4
server. When pNFS is in use, a shared lock on the clientID
must be acquired while doing a write to the DS(s).
However, if rename/remove is doing the delegation return,
an exclusive lock will be acquired on the clientID, preventing
the write to the DS(s) from acquiring a shared lock on the clientID.
This patch stops rename/remove from doing a delegation return
if pNFS is enabled. Since doing delegation return in the same
compound as rename/remove is only an optimization, not doing
so should not cause problems.
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
(cherry picked from commit b82168e657)
Without this patch, it is possible for a process doing an NFSv4
Open/create of a file to block, so that another process can
acquire the exclusive lock on the clientID, while it is itself
still holding a shared lock on the clientID. As such, both processes
deadlock, with one wanting the exclusive lock, while the
other holds the shared lock. This deadlock is unlikely to occur
unless delegations are in use on the NFSv4 mount.
This patch fixes the problem by not deferring to the process
waiting for the exclusive lock when a shared lock (reference cnt)
is already held by the process.
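A simplified model of the rule, with no sleeping or wakeups: struct
sxlock and sx_tryref() are hypothetical names, and the caller is assumed
to know whether it already holds a reference.

```c
#include <stdbool.h>

/* Hypothetical reference-counted shared/exclusive lock state. */
struct sxlock {
	int	refcnt;		/* number of shared references held */
	bool	want_excl;	/* a thread is waiting for the exclusive lock */
};

/*
 * Try to take a shared reference. A caller that holds no reference
 * defers to a waiting exclusive locker (returns false, meaning "wait").
 * A caller that already holds one must not defer, because the
 * exclusive locker is waiting for that very reference to be dropped.
 */
static bool
sx_tryref(struct sxlock *lp, bool caller_holds_ref)
{
	if (lp->want_excl && !caller_holds_ref)
		return (false);
	lp->refcnt++;
	return (true);
}
```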
This problem was detected during a recent NFSv4 interoperability
testing event held by the IETF working group.
(cherry picked from commit 120b20bdf4)
Commit 5e5ca4c8fc added a flag to a NFSv4 mount point that is set when
the first delegation is acquired from the NFSv4 server.
For a common case where delegations are not being issued by the
NFSv4 server, the nfscl_removedeleg() code acquires the mutex lock for
open/lock state, finds the delegation list empty, then just unlocks the
mutex and returns. This patch adds a check of the flag to avoid the
need to acquire the mutex for this common case.
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
for the common case where the server is not issuing delegations.
This commit should not affect the high level semantics of delegation
handling.
(cherry picked from commit 62c5be4ab4)
These began to become obsolete in d6d64f0f2c (r137739) and the deal
was later sealed in 003e18aef4 (r137801) when vfs.fifofs.fops was
dropped and vop-bypass for pipes became mandatory.
PR: 225934
(cherry picked from commit 6b88668f0b)
During VOP_GETPAGES, fusefs needs to determine the file's length, which
could require a FUSE_GETATTR operation. If that fails, it's better to
SIGBUS than panic.
Sponsored by: Axcient
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D31994
(cherry picked from commit 4f917847c9)
Unlike Copy, the NFSv4.2 Allocate operation does not
allow a reply with partial completion. As such, the only way to
limit the time the operation takes to provide a reasonable RPC RTT
is to limit the size of the allocation in the NFSv4.2
client.
This patch adds a sysctl called vfs.nfs.maxalloclen to set
the limit on the size of the Allocate operation.
There is no way to know how long a server will take to do an
allocate operation, but 64Mbytes results in a reasonable
RPC RTT for the slow hardware I test on, so that is what
the default value for vfs.nfs.maxalloclen is set to.
For an 8Gbyte allocation, the elapsed time for doing it in 64Mbyte
chunks was the same as the elapsed time taken for a single large
allocation operation for a FreeBSD server with a UFS file system.
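The chunking arithmetic can be sketched as follows. The helper names
are hypothetical and the sketch only counts RPCs instead of sending
them; "64Mbytes" and "8Gbyte" are taken to mean powers of two.

```c
#include <stdint.h>

/* Length of the next Allocate RPC: what is left, capped at the limit. */
static uint64_t
alloc_chunk_len(uint64_t remaining, uint64_t maxalloclen)
{
	return (remaining < maxalloclen ? remaining : maxalloclen);
}

/* Count the Allocate RPCs needed to cover len bytes. */
static uint64_t
alloc_rpc_count(uint64_t len, uint64_t maxalloclen)
{
	uint64_t n = 0;

	/* A zero limit would never make progress; this sketch just bails out. */
	if (maxalloclen == 0)
		return (0);
	while (len > 0) {
		len -= alloc_chunk_len(len, maxalloclen);
		n++;
	}
	return (n);
}
```

Under those assumptions, the 8Gbyte allocation mentioned above becomes
128 RPCs of 64Mbytes each.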
(cherry picked from commit 9ebe4b8c67)
As of commit 103b207536, the NFSv4.2 server will limit the size
of a Copy operation based upon a 1 second timeout. The Linux 5.2
kernel server also limits Copy operation size to 4Mbytes.
As such, the NFSv4.2 client can attempt a large Copy without
resulting in a long RPC RTT for these servers.
This patch changes vfs.nfs.maxcopyrange to 64bits and sets
the default to the maximum possible size of SSIZE_MAX, since
a larger size makes the Copy operation more efficient and
allows for copying to complete with fewer RPCs.
The sysctl may need to be made smaller for other non-FreeBSD
NFSv4.2 servers.
(cherry picked from commit 55089ef4f8)
By default, the NFS server reports the host UUID value as the scope
and owner major, and zero as the owner minor. This works well for a
standalone server. But for a CARP-based HA cluster failover, the values
should remain persistent, otherwise some clients, such as VMware ESXi,
get confused by the change and fail to reconnect automatically.
This patch makes the server scope, major owner and minor owner values
configurable via sysctls. If they are not set (the default), the host
UUID value is still used.
Reviewed by: rmacklem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31952
(cherry picked from commit 272c4a4dc5)
Although it is not specified in the RFCs, the concept that
the NFSv4 server should reply to an RPC request within a
reasonable time is accepted practice within the NFSv4 community.
Without this patch, the NFSv4.2 server attempts to reply to
a Copy operation within 1 second by limiting the copy to
vfs.nfs.maxcopyrange bytes (default 10Mbytes). This is crude at
best, given the large variation in I/O subsystem performance.
This patch uses the COPY_FILE_RANGE_TIMEO1SEC flag added by
commit c5128c48df to limit the reply time for a Copy
operation to approximately 1 second.
(cherry picked from commit 103b207536)
The NFSv4.2 Deallocate operation loops on VOP_DEALLOCATE()
while progress is being made (remaining length decreasing).
This patch changes the loop on VOP_ALLOCATE() for the NFSv4.2
Allocate operation to do the same, instead of stopping after
an arbitrary 20 iterations.
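The shape of such a loop, as a userland sketch: alloc_step_fn is a
hypothetical stand-in for VOP_ALLOCATE() (assumed to advance the offset
and reduce the remaining length on partial completion), and step_fixed()
is a toy implementation included only so the loop can be exercised.

```c
#include <stdint.h>

/*
 * Hypothetical step function standing in for VOP_ALLOCATE(): it may
 * allocate only part of the range, advancing *offp and reducing *lenp.
 * Returns 0 on success or an errno value.
 */
typedef int (*alloc_step_fn)(uint64_t *offp, uint64_t *lenp, void *arg);

/*
 * Keep calling the step function while it succeeds and the remaining
 * length keeps shrinking. Returns the last error; *lenp is what is left.
 */
static int
alloc_until_stalled(alloc_step_fn step, uint64_t *offp, uint64_t *lenp,
    void *arg)
{
	uint64_t olen;
	int error;

	do {
		olen = *lenp;
		error = step(offp, lenp, arg);
	} while (error == 0 && *lenp > 0 && *lenp < olen);
	return (error);
}

/* Toy step: allocates at most *(uint64_t *)arg bytes per call. */
static int
step_fixed(uint64_t *offp, uint64_t *lenp, void *arg)
{
	uint64_t n = *(uint64_t *)arg;

	if (n > *lenp)
		n = *lenp;
	*offp += n;
	*lenp -= n;
	return (0);
}
```

The loop ends on an error, on completion, or as soon as a call makes no
progress, so no iteration limit is needed to guarantee termination.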
(cherry picked from commit 13914e51eb)
The partial page invalidation code is factored out to be a separate
helper from tmpfs_reg_resize().
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31683
(cherry picked from commit 399be91098)
In tmpfs_link() error was erroneously cleared in commit c12118f6ce.
Sponsored by: The FreeBSD Foundation
MFC with: c12118f6ce
(cherry picked from commit a48416f844)