This argument is not a bitmask of flags, but only accepts a single value.
Fail with EINVAL if an invalid value is passed to 'flag'. Rename the
'flags' argument to getmntinfo(3) to 'mode' as well to match.
This is a followup to r308088.
Reviewed by: kib
MFC after: 1 month
Instead of failing with ENAMETOOLONG, which is swallowed by
pthread_set_name_np() anyway, truncate the given name to MAXCOMLEN+1
bytes. This is more likely what the user wants, and saves the
caller from truncating it before the call (which was the only
recourse).
Polish pthread_set_name_np(3) and add a .Xr to thr_set_name(2)
so the user might find the documentation for this behavior.
Reviewed by: jilles
MFC after: 3 days
Sponsored by: Dell EMC
As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:
1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
reclaimed, and this work may be unfairly deferred to the vnlru process
or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
the laundry thread needs to launder pages from an unreferenced such
vnode, it will reactivate and deactivate the vnode with each laundering,
potentially resulting in a large number of expensive scans.
Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are
released.
Reviewed by: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8641
We return [EMLINK] instead of [ELOOP] when trying to open a symlink with
O_NOFOLLOW, so that the original case of [ELOOP] can be distinguished. Code
like cmp -h and xz takes advantage of this.
PR: 214633
Reviewed by: kib, imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D8586
do any speculations about readahead, and use exactly the amount of readahead
specified by user. E.g. setting SF_FLAGS(0, SF_USER_READAHEAD) will guarantee
that no readahead at all will be performed.
Previously the flag returned by cap_getmode was not described explicitly
in the man page.
Reviewed by: wblock
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D7822
The initial value of NOASM is nearly the same in all cases and the
initial value of PSEUDO is the same in all cases so reduce duplication
(and hopefully, future merge conflicts) by machine independent defaults.
Also document the PSEUDO variable.
Reviewed by: jhb, kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D7820
Besides removing hand-translation to assembler, this also adds missing
wrappers for arm64 and risc-v.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7694
Since ptrace(2) syscall can return -1 for non-error situations, libc
wrappers set errno to 0 before performing the syscall, as the service
to the caller. On both i386 and amd64, the errno symbol was directly
referenced, which only works correctly in single-threaded process.
Change assembler wrappers for ptrace(2) to get current thread errno
location by calling __error(). Allow __error interposing, as
currently allowed in cerror().
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
- Avoid double use of "request" in a single sentence. Instead, describe
aio_sigevent as being used to request notification of the associated
operation's completion. This matches the language used to describe
aio_sigevent in aio(4).
- Simplify the prohibition on modifying buffers while requests are in
flight.
- Fix case mismatch.
- Drop note about not using stack variables. C programmers should be able
to figure out if a stack variable is safe based on the later warning
about the life cycle requirements of control blocks.
- Remove prohibition on modifying the I/O buffer for aio_fsync() since
it does not use an I/O buffer. For aio_mlock(), prohibit modifications
to the mapping (e.g. due to mprotect, munmap, mmap, etc.) but do not
prohibit modifications to the memory backing the buffer (stores into
the pages backing the buffer).
Requested by: wblock (1,2), kib (4)
Reviewed by: kib, rpokala, wblock
MFC after: 3 days
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7462
Right now, userspace (fast) gettimeofday(2) on x86 only works for
RDTSC. For older machines, like Core2, where RDTSC is not C2/C3
invariant, and which fall to HPET hardware, this means that the call
has both the penalty of the syscall and of the uncached hw behind the
QPI or PCIe connection to the sought bridge. Nothing can me done
against the access latency, but the syscall overhead can be removed.
System already provides mappable /dev/hpetX devices, which gives
straight access to the HPET registers page.
Add yet another algorithm to the x86 'vdso' timehands. Libc is updated
to handle both RDTSC and HPET. For HPET, the index of the hpet device
to mmap is passed from kernel to userspace, index might be changed and
libc invalidates its mapping as needed.
Remove cpu_fill_vdso_timehands() KPI, instead require that
timecounters which can be used from userspace, to provide
tc_fill_vdso_timehands{,32}() methods. Merge i386 and amd64
libc/<arch>/sys/__vdso_gettc.c into one source file in the new
libc/x86/sys location. __vdso_gettc() internal interface is changed
to move timecounter algorithm detection into the MD code.
Measurements show that RDTSC even with the syscall overhead is faster
than userspace HPET access. But still, userspace HPET is three-four
times faster than syscall HPET on several Core2 and SandyBridge
machines.
Tested by: Howard Su <howard0su@gmail.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D7473
The syscall is a trivial wrapper around new VOP_FDATASYNC(), sharing
code with fsync(2). For all filesystems, this commit provides the
implementation which delegates the work of VOP_FDATASYNC() to
VOP_FSYNC(). This is functionally correct but not efficient.
This is not yet POSIX-compliant implementation, because it does not
ensure that queued AIO requests are completed before returning.
Reviewed by: mckusick
Discussed with: avg (ZFS), jhb (AIO part)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7471
Our mprotect() function seems to take a "const void *" address to the
pages whose permissions need to be adjusted. POSIX uses "void *". Simply
stick to the POSIX one to prevent us from writing unportable code.
PR: 211423 (exp-run)
Tested by: antoine@ (Thanks!)
It looks like the msgrcv() system call is already written in such a way
that the size is internally computed as a size_t and written into all of
td_retval[0]. This means that it is effectively already returning
ssize_t. It's just that the userspace prototype doesn't match up.
The asynchronous I/O changes made previously result in different
behavior out of the box. Previously all AIO requests failed with
ENOSYS / SIGSYS unless aio.ko was explicitly loaded. Now, some AIO
requests complete and others ("unsafe" requests) fail with EOPNOTSUPP.
Reword the introductory paragraph in aio(4) to add a general
description of AIO before describing the vfs.aio.enable_unsafe sysctl.
Remove the ENOSYS error description from aio_fsync(2), aio_read(2),
and aio_write(2) and replace it with a description of EOPNOTSUPP.
Remove the ENOSYS error description from aio_mlock(2).
Log a message to the system log the first time a process requests an
"unsafe" AIO request that fails with EOPNOTSUPP. This is modeled on
the log message used for processes using the legacy pty devices.
Reviewed by: kib (earlier version)
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D7151
First, PL_FLAG_FORKED events now also set a PL_FLAG_VFORKED flag when
the new child was created via vfork() rather than fork(). Second, a
new PL_FLAG_VFORK_DONE event can now be enabled via the PTRACE_VFORK
event mask. This new stop is reported after the vfork parent resumes
due to the child calling exit or exec. Debuggers can use this stop to
reinsert breakpoints in the vfork parent process before it resumes.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7045
ptrace() now stores a mask of optional events in p_ptevents. Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.
Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.
The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.
The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX. PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.
The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.
While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.
Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7044
- Add a sigevent(3) manpage to give a general overview of the sigevent
structure and the available notification mechanisms.
- Document that AIO requests contain a nested sigevent structure that can
be used to request completion notification.
- Expand the sigevent details in other manuals to note details such as
the extra values stored in a queued signal's information or in a posted
kevent.
Reviewed by: kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D7122
Since r302216, thread suspension causes advisory file locks to restart
(instead of continuing to wait) and for a long time SA_RESTART has
affected advisory file locks. These are both not compliant to POSIX.1.
To clarify that restarting means something, add a paragraph about fair
queuing. Note that the network lock manager does not implement fair
queuing.
Reviewed by: kib (previous version)
Approved by: re (gjb)
value.
This eliminates the need for machine dependant assembly wrappers for
pipe(2).
It also make passing an invalid address to pipe(2) return EFAULT rather
than triggering a segfault. Document this behavior (which was already
true for pipe2(2), but undocumented).
Reviewed by: andrew
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D6815
Setting time by seconds or microseconds may cause unexpected effects
especially if sysctl vfs.timestamp_precision=3 (not default).
Calling the obsolete functions with NULL timestamps is acceptable.
Add some missing errno values to thr_new(2) and pthread_create(3).
In particular, EDEADLK was not documented in the latter.
While I'm here, improve some English and cross-references.
Reviewed by: kib
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D6663
Add text to thr_exit(2) and thr_new(2) discouraging their use in
applications since calling these in a process with libthr loaded will
confuse libthr and is likely to cause hangs or crashes.
The thr_kill2(2) call is not used by libthr and may be useful in special
applications.
The other calls can be used in applications but it should not be necessary.
While there, order EVFILT_VNODE notes descriptions alphabetically.
Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks
the monitored directory as the result of rename(2) operation. The
renames staying in the directory are not reported.
Submitted by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks
a basic usage example. Although it is an
untypical example for the use of kqueue, it is
better than nothing and should get people started.
PR: 196844
Submitted by: fernando.apesteguia@gmail.com
Reviewed by: kib
Approved by: kib
MFC after: 5 days
Differential Revision: https://reviews.freebsd.org/D6082
First, update the return types of aio_return() and aio_waitcomplete() to
ssize_t.
POSIX requires aio_return() to return a ssize_t so that it can represent
all return values from read() and write(). aio_waitcomplete() should use
ssize_t for the same reason.
aio_return() has used ssize_t in <aio.h> since r31620 but the manpage and
system call entry were not updated. aio_waitcomplete() has always
returned int.
Note that this does not require new system call stubs as this is
effectively only an API change in how the compiler interprets the return
value.
Second, allow aio_nbytes values up to IOSIZE_MAX instead of just INT_MAX.
aio_read/write should now honor the same length limits as normal read/write.
Third, use longs instead of ints in the aio_return() and aio_waitcomplete()
system call functions so that the 64-bit size_t in the in-kernel aiocb
isn't truncated to 32-bits before being copied out to userland or
being returned.
Finally, a simple test has been added to verify the bounds checking on the
maximum read size from a file.
These entries should have never been present since they only exist for
compat with FreeBSD 6.x (and older) binaries. This was missed in r296572.
Technically this breaks the ABI by removing versioned symbols. However,
no binaries should be linked against these symbols. No release has
shipped with a header that contained a prototype for these functions.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5615
documented and easy to miss.
At the same time, it's pretty important for anyone who is trying to use
SEEK_HOLE/SEEK_DATA in real app. Try to bridge that gap by making that
description more pronounced and also document how it affects failure codes.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5162
do not participate in the global symbols namespace, but rtld locks are
still replaced and functions are interposed. In particular,
__pthread_map_stacks_exec is resolved to the libc version. If a
library is loaded later, which requires adjustment of the stack
protection mode, rtld calls into libc __pthread_map_stacks_exec due to
the symbols scope. The libc version might recurse into binder and
recursively acquire rtld bind lock, causing the hang.
Make libc __pthread_map_stacks_exec() interposed, which synchronizes
rtld locks and version of the stack exec hook when libthr loaded,
regardless of the symbol scope control or symbol resolution order.
The __pthread_map_stacks_exec() symbol is removed from the private
version in libthr since libc symbol now operates correctly in presence
of libthr.
Reported and tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
wraps sendmsg(2) and recvmsg(2) into batch send and receive operation.
The goal of this implementation is only to provide API compatibility
with Linux.
The cancellation behaviour of the functions is not quite right, but
due to relative rare use of cancellation it is considered acceptable
comparing with the complexity of the correct implementation. If
functions are reimplemented as syscalls, the fix would come almost
trivial. The direct use of the syscall trampolines instead of libc
wrappers for sendmsg(2) and recvmsg(2) is to avoid data loss on
cancellation.
Submitted by: Boris Astardzhiev <boris.astardzhiev@gmail.com>
Discussed with: jilles (cancellation behaviour)
MFC after: 1 month
intended behaviour in its man page. Simplify tty_drain() to match.
Don't call ttydevsw methods in tty_flush() if the device is gone
since we now sometimes call it then.
The flushing was supposed to be implemented by passing the FNONBLOCK
flag to VOP_CLOSE() for revoke(). The tty driver is one of the few
that can block in close and was one of the fewer that knew about this.
This almost worked in FreeBSD-1 and similarly in Net/2. These
versions only almost worked because there was and is considerable
confusion between IO_NDELAY and FNONBLOCK (aka O_NONBLOCK). IO_NDELAY
is only valid for VOP_READ() and VOP_WRITE(). For other VOPs it has
the same value as O_SHLOCK. But since vfs_subr.c and tty.c
consistently used the wrong flag and the O_SHLOCK flag is rarely set,
this mostly worked. It also gave the feature than applications could
get the non-blocking close by abusing O_SHLOCK.
This was first broken then fixed in 1995. I changed only the tty
driver to use FNONBLOCK, as a hack to get non-blocking via the normal
flag FNONBLOCK for last closes. I didn't know about revoke()'s use
of IO_NDELAY or change it to be consistent, so revoke() was broken.
Then I changed revoke() to match.
This was next broken in 1997 then fixed in 1998. Importing Lite2 made
the flags inconsistent again by undoing the fix only in vfs_subr.c.
This was next broken in 2008 by replacing everything in tty.c and not
checking any flags in last close. Other bugs in draining limited the
resulting unbounded waits to drain in some cases.
It is now possible to fix this better using the new FREVOKE flag.
Just restore flushing for revoke() for now. Don't restore or undo any
hacks for ordinary last closes yet. But remove dead code in the
1-second relative timeout (r272789). This did extra work to extend
the buggy draining for revoke() for as long as possible. The 1-second
timeout made this not very long by usually flushing after 1 second.
Submitted by: bde
MFC after: 2 weeks
* Fix __FreeBSD_version check.
* Update history section in man page.
An MFC of this commit to stable/10 will allow using the new system calls
instead of the fallback.
MFC after: 3 days
up to now.
The new sendfile is the code that Netflix uses to send their multiple tens
of gigabits of data per second. The new implementation features asynchronous
I/O, when I/O operations are launched, but not awaited to be complete. An
explanation of why such behavior is beneficial compared to old one is
going to be too long for a commit message, so we will skip it here.
Additional features of new syscall are extra flags, which provide an
application more control over data sent. The SF_NOCACHE flag tells
kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
macro allows to specify readahead size in pages.
The new syscalls is a drop in replacement. No modifications are required
to applications. One can take nginx binary for stable/10 and run it
successfully on head. Although SF_NODISKIO lost its original sense, as now
sendfile doesn't block, and now means something completely different (tm),
using the new sendfile the old way is absolutely safe.
Celebrates: Netflix global launch!
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
Relnotes: yes
(CLOCK_REALTIME case) system calls is non negative.
This commit hides a kernel panic in atrtc_settime() as the clock_ts_to_ct()
does not properly convert negative tv_sec.
ps. in my opinion clock_ts_to_ct() should be rewritten to properly handle
negative tv_sec values.
Differential Revision: https://reviews.freebsd.org/D4714
Reviewed by: kib
MFC after: 1 week
Depending on system configuration and parameters, clock_gettime() and
gettimeofday() may not be system calls. If so, passing an invalid pointer
will cause a signal and not an [EFAULT] error.
From a standards perspective, this is OK since passing an invalid pointer is
undefined behaviour.
MFC after: 1 week
system call information such as system call arguments. Initially this
will consist of pulling duplicated code out of truss and kdump though it
may prove useful for other utilities in the future.
This commit moves the shared utrace(2) record parser out of kdump into
the library and updates kdump and truss to use it. One difference from
the previous version is that the library version treats unknown events
that start with the "RTLD" signature as unknown events. This simplifies
the interface and allows the consumer to decide how to handle all
non-recognized events. Instead, this function only generates a string
description for known malloc() and RTLD records.
Reviewed by: bdrewery
Differential Revision: https://reviews.freebsd.org/D4537
This removes the need for manually changing this flag for Google Chrome
users. It also improves compatibility with Linux applications running under
Linuxulator compatibility layer, and possibly also helps in porting software
from Linux.
Generally speaking, the flag allows applications to create the shared memory
segment, attach it, remove it, and then continue to use it and to reattach it
later. This means that the kernel will automatically "clean up" after the
application exits.
It could be argued that it's against POSIX. However, SUSv3 says this
about IPC_RMID: "Remove the shared memory identifier specified by shmid from
the system and destroy the shared memory segment and shmid_ds data structure
associated with it." From my reading, we break it in any case by deferring
removal of the segment until it's detached; we won't break it any more
by also deferring removal of the identifier.
This is the behaviour exhibited by Linux since... probably always, and
also by OpenBSD since the following commit:
revision 1.54
date: 2011/10/27 07:56:28; author: robert; state: Exp; lines: +3 -8;
Allow segments to be used even after they were marked for deletion with
the IPC_RMID flag.
This is permitted as an extension beyond the standards and this is similar
to what other operating systems like linux do.
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3603
This uses the kdump(1) utrace support code directly until a common library
is created.
This allows malloc(3) tracing with MALLOC_CONF=utrace:true and rtld tracing
with LD_UTRACE=1. Unknown utrace(2) data is just printed as hex.
PR: 43819 [inspired by]
Reviewed by: jhb
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D3819