Note that the mountlist manipulations are somewhat fragile, and not very
pretty. The reason for this is to avoid changing vfs_mountroot(), which
is (obviously) rather mission-critical, but not very well documented,
and thus hard to test properly. It might be possible to rework it to use
its own simple root mount mechanism instead of vfs_mountroot().
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D2698
are aliases for the syscall stubs and are plt-interposed, to the
libc-private aliases of internally interposed sigprocmask() etc.
Since e.g. _sigaction is not interposed by libthr, calling signal()
removes thr_sighandler() from the handler slot etc. The result was
breaking signal semantic and rtld locking.
The added __libc_sigprocmask and other symbols are hidden, they are
not exported and cannot be called through PLT. The setjmp/longjmp
functions for x86 were changed to use direct calls, and since
PIC_PROLOGUE only needed for functional PLT indirection on i386, it is
removed as well.
The PowerPC bug of calling the syscall directly in the setjmp/longjmp
implementation is kept as is.
Reported by: Pete French <petefrench@ingresso.co.uk>
Tested by: Michiel Boland <boland37@xs4all.nl>
Reviewed by: jilles (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
having some children, the children' reaper is not reset to the parent.
This allows for the situation where reaper has children but not
descendands and the too strict asserts in the reap_status() fire.
Remove the wrong asserts, add some clarification for the situation to
the procctl(2) REAP_STATUS.
Reported and tested by: feld
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The comment in the libc/sys symbol map referenced the generated symbols
for the syscall trampolines. Such comment was out of place in the secure
symbol map so remove the stale comment and attempt to clarify the old one
to avoid risks of confusion.
Pointed out by: kib
As part of the code refactoring to support FORTIFY_SOURCE we want
a new subdirectory "secure" to keep the files related to security.
Move the stack protector functions to this new directory.
No functional change.
Differential Review: https://reviews.freebsd.org/D3333
It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.
Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.
Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3303
of the timehands, from the kern_tc.c implementation to vdso. Add
comments giving hints where to look for the algorithm explanation.
To compensate the removal of rmb() in userspace binuptime(), add
explicit lfence instruction before rdtsc. On i386, add usual
complications to detect SSE2 presence; assume that old CPUs which do
not implement SSE2 also execute rdtsc almost in order.
Reviewed by: alc, bde (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Summary:
Back in 2005, maxim@ attempted to fix shutdown() to return ENOTCONN in case the socket was not connected (r150152). This had to be rolled back (r150155), as it broke some of the existing programs that depend on this behavior. I reapplied this change on my system and indeed, syslogd failed to start up. I fixed this back in February (279016) and MFC'ed it to the supported stable branches. Apart from that, things seem to work out all right.
Since at least Linux and Mac OS X do the right thing, I'd like to go ahead and give this another try. To keep old copies of syslogd working, only start returning ENOTCONN for recent binaries.
I took a look at the XNU sources and they seem to test against both SS_ISCONNECTED, SS_ISCONNECTING and SS_ISDISCONNECTING, instead of just SS_ISCONNECTED. That seams reasonable, so let's do the same.
Test Plan:
This issue was uncovered while writing tests for shutdown() in CloudABI:
https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/sys/socket/shutdown_test.c#L26
Reviewers: glebius, rwatson, #manpages, gnn, #network
Reviewed By: gnn, #network
Subscribers: bms, mjg, imp
Differential Revision: https://reviews.freebsd.org/D3039
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).
Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig. p_xexit contains complete status
and copied out into si_status.
Requested by: Joerg Schilling
Reviewed by: jilles (previous version), pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.
* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.
This is only relevant for very specific workloads.
This doesn't pretend to be a final NUMA solution.
The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.
This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.
Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.
Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.
Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.
Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!
Tested:
* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)
* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)
* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.
Verified:
* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.
Review:
This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.
Notes:
* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.
* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.
Sponsored by: Norse Corp, Inc.; Dell
Use a constant array for the MIB. Newer LLVM decided that mib[] warranted
stack protections, with the obvious crash after the setup was done.
As a positive side effect, code size shrinks a bit.
I'm not sure why this hasn't bitten us yes, but it is certainly possible and
there are no real drawbacks to this change anyway.
Submitted by: pfg
Obtained from: NetBSD
MFC after: 1 week
- Note that ftruncate(2) can operate on shared memory objects and cross
reference shm_open(2).
- Note that ftruncate(2) does not change the file position pointer (aka
seek pointer) of the file descriptor.
- ftruncate(2) will fail with EINVAL for all sorts of other fd types than
just sockets, so instead note that it fails for all but regular files and
shared memory objects.
- Note that ftruncate(2) also appeared in 4.2BSD along with truncate(2).
(Or at least the manpage for both appeared in 4.2, I did not check the
kernel code itself to see if either predated 4.2.)
PR: 199472 (2)
Submitted by: andrew@ugh.net.au (2)
MFC after: 1 week
domain, not a file descriptor. Use 'domain' instead of the original 'd'
for this argument to match socket(2).
PR: 199491
Reported by: sp55aa@qq.com
MFC after: 1 week
pwrite(2) syscalls are wrapped to provide compatibility with pre-7.x
kernels which required padding before the off_t parameter. The
fcntl(2) contains compatibility code to handle kernels before the
struct flock was changed during the 8.x CURRENT development. The
shims were reasonable to allow easier revert to the older kernel at
that time.
Now, two or three major releases later, shims do not serve any
purpose. Such old kernels cannot handle current libc, so revert the
compatibility code.
Make padded syscalls support conditional under the COMPAT6 config
option. For COMPAT32, the syscalls were under COMPAT6 already.
Remove WITHOUT_SYSCALL_COMPAT build option, which only purpose was to
(partially) disable the removed shims.
Reviewed by: jhb, imp (previous versions)
Discussed with: peter
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
waitid() function is required to be cancellable by the standard. The
wait6() and ppoll() follow the other syscalls in their groups.
Reviewed by: jhb, jilles (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Both .weak and .alias assembler directives only work when assembling
the file which defines the symbol.
Reported and tested by: andrew
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Note that to cancel blocked kevent(2) call, changelist must be empty,
since we cannot cancel a call which already made changes to the
process state. And in reverse, call which only makes changes to the
kqueue state, without waiting for an event, is not cancellable. This
makes a natural usage model to migrate kqueue loop to support
cancellation, where existing single kevent(2) call must be split into
two: first uncancellable update of kqueue, then cancellable wait for
events.
Note that this is ABI-incompatible change, but it is believed that
there is no cancel-safe code that relies on kevent(2) not being a
cancellation point. Option to preserve the ABI would be to keep
kevent(2) as is, but add new call with flags to specify cancellation
behaviour, which only value seems to add complications.
Suggested and reviewed by: jilles
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
particular, stdio locking was affected.
Reported and tested by: "Matthew D. Fuller" <fullermd@over-yonder.net>
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
The core kernel part is patch file utimes.2008.4.diff from
pluknet@FreeBSD.org. I updated the code for API changes, added the manual
page and added compatibility code for old kernels. There is also audit and
Capsicum support.
A new UTIME_* constant might allow setting birthtimes in future.
Differential Revision: https://reviews.freebsd.org/D1426
Submitted by: pluknet (partially)
Reviewed by: delphij, pluknet, rwatson
Relnotes: yes
attachment to the process. Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.
Discussed with: jilles, rwatson
Man page fixes by: rwatson
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
cancellation-handling code in the libthr. Translate some syscalls
into their more generic counterpart, and remove translated syscalls
from the table.
List of the affected syscalls:
creat, open -> openat
raise -> thr_kill
sleep, usleep -> nanosleep
pause -> sigsuspend
wait, wait3, waitpid -> wait4
Suggested and reviewed by: jilles (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
though GOT, by staticizing and hiding. Add setter for
__error_selector to hide it as well.
Suggested and reviewed by: jilles
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
(or loading a dso linked to libthr.so into process which was not
linked against threading library).
- Remove libthr interposers of the libc functions, including
__error(). Instead, functions calls are indirected through the
interposing table, similar to how pthread stubs in libc are already
done. Libc by default points either to syscall trampolines or to
existing libc implementations. On libthr load, libthr rewrites the
pointers to the cancellable implementations already in libthr. The
interposition table is separate from pthreads stubs indirection
table to not pull pthreads stubs into static binaries.
- Postpone the malloc(3) internal mutexes initialization until libthr
is loaded. This avoids recursion between calloc(3) and static
pthread_mutex_t initialization.
- Reinstall signal handlers with wrapper on libthr load. The
_rtld_is_dlopened(3) is used to avoid useless calls to sigaction(2)
when libthr is statically referenced from the main binary.
In the process, fix openat(2), swapcontext(2) and setcontext(2)
interposing. The libc symbols were exported at different versions
than libthr interposers. Export both libc and libthr versions from
libc now, with default set to the higher version from libthr.
Remove unused and disconnected swapcontext(3) userspace implementation
from libc/gen.
No objections from: deischen
Tested by: pho, antoine (exp-run) (previous versions)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
the orphaned descendants. Base of the API is modelled after the same
feature from the DragonFlyBSD.
Requested by: bapt
Reviewed by: jilles (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
It is automatically set when -fPIC is passed to the compiler.
Reviewed by: dim, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D1179
POSIX compliance and to improve compatibility with Linux and NetBSD
The issue was identified with lib/libc/sys/t_access:access_inval from
NetBSD
Update the manpage accordingly
PR: 181155
Reviewed by: jilles (code), jmmv (code), wblock (manpage), wollman (code)
MFC after: 4 weeks
Phabric: D678 (code), D786 (manpage)
Sponsored by: EMC / Isilon Storage Division
- Fail with EINVAL if an invalid protection mask is passed to mmap().
- Fail with EINVAL if an unknown flag is passed to mmap().
- Fail with EINVAL if both MAP_PRIVATE and MAP_SHARED are passed to mmap().
- Require one of either MAP_PRIVATE or MAP_SHARED for non-anonymous
mappings.
Reviewed by: alc, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D698
Define the precision macros as bits sets to conform with XNU equivalent.
Test fflags passed for EVFILT_TIMER and return EINVAL in case an invalid flag
is passed.
Phabric: https://phabric.freebsd.org/D421
Reviewed by: kib
o Document PF_LOCAL as being an alias for PF_UNIX
o Document POSIX standardization of this interface using AF_*
constants rather than PF_* constants, and note the three particular
families which POSIX standardizes.
o Note anticipated POSIX standardization of SOCK_CLOEXEC.
o Delete from listing protocol families that FreeBSD doesn't support
(in some cases, like PF_PUP, has never supported).
o Add to listing some current protocol families that have been
introduced in the last decade or so.
o Document the correspondence of PF_* and AF_* constants.
We should probably change the documentation to make the AF_* constants
primary, but this commit does not do so.
Reviewed by: kevlo@
MFC after: 1 month
and prevents the request from deleting existing mappings in the
region, failing instead.
Reviewed by: alc
Discussed with: jhb
Tested by: markj, pho (previous version, as part of the bigger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
vm.max_wired is a system-wide limit, not per-process. Reword the
section to make this more clear.
PR: docs/189214
Submitted by: Lawrence Chen (original text)
Approved by: hrs (mentor)
allowed range or when one or more pages are not mapped. This according to
The Open Group Base Specifications Issue 7.
Discussed with: attilio, Bruce Evans
Reviewed by: alc, Garrett Cooper
Reported by: ATF
MFC after: 2 weeks
Sponsored by: EMC / Isilon storage division
kqueue(2) already supports EVFILT_PROC. Add an EVFILT_PROCDESC that
behaves the same, but operates on a procdesc(4) instead. Only implement
NOTE_EXIT for now. The nice thing about NOTE_EXIT is that it also
returns the exit status of the process, meaning that we can now obtain
this value, even if pdwait4(2) is still unimplemented.
Notes:
- Simply reuse EVFILT_NETDEV for EVFILT_PROCDESC. As both of these will
be used on totally different descriptor types, this should not clash.
- Let procdesc_kqops_event() reuse the same structure as filt_proc().
The only difference is that procdesc_kqops_event() should also be able
to deal with the case where the process was already terminated after
registration. Simply test this when hint == 0.
- Fix some style(9) issues in filt_proc() to keep it consistent with the
newly added procdesc_kqops_event().
- Save the exit status of the process in pd->pd_xstat, as we cannot pick
up the proctree_lock from within procdesc_kqops_event().
Discussed on: arch@
Reviewed by: kib@
The pdfork(2) man page states:
"pdfork() returns a PID, 0 or -1, as fork(2) does."
As it returns a PID, the return type should obviously be pid_t. As int
and pid_t have the same size on all architectures, this change does not
affect the ABI in any way.
if not already defined. This allows building libc from outside of
lib/libc using a reach-over makefile.
A typical use-case is to build a standard ILP32 version and a COMPAT32
version in a single iteration by building the COMPAT32 version using a
reach-over makefile.
Obtained from: Juniper Networks, Inc.
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.
MFC after: 3 days
user. Kqueue now saves the ucred of the allocating thread, to
correctly decrement the counter on close.
Under some specific and not real-world use scenario for kqueue, it is
possible for the kqueues to consume memory proportional to the square
of the number of the filedescriptors available to the process. Limit
allows administrator to prevent the abuse.
This is kernel-mode side of the change, with the user-mode enabling
commit following.
Reported and tested by: pho
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
The accept(2) man page warns that O_NONBLOCK and other properties on the
new socket may vary across implementations. However, this issue only
applies to accept() and not to accept4(). On the other hand, accept4()
is not commonly available yet.
Reported by: pluknet
Reviewed by: bjk
Approved by: re (kib)
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.
Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month
an address in the first 2GB of the process's address space. This flag should
have the same semantics as the same flag on Linux.
To facilitate this, add a new parameter to vm_map_find() that specifies an
optional maximum virtual address. While here, fix several callers of
vm_map_find() to use a VMFS_* constant for the findspace argument instead of
TRUE and FALSE.
Reviewed by: alc
Approved by: re (kib)
in the future in a backward compatible (API and ABI) way.
The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.
The structure definition looks like this:
struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};
The initial CAP_RIGHTS_VERSION is 0.
The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.
The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.
To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:
#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
There is new API to manage the new cap_rights_t structure:
cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);
bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);
Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:
cap_rights_t rights;
cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);
There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:
#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:
cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);
Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.
This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.
Sponsored by: The FreeBSD Foundation
and CIFS file attributes as BSD stat(2) flags.
This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.
The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.
The summary of the flags is as follows:
UF_SYSTEM: Command line name: "system" or "usystem"
ZFS name: XAT_SYSTEM, ZFS_SYSTEM
Windows: FILE_ATTRIBUTE_SYSTEM
This flag means that the file is used by the
operating system. FreeBSD does not enforce any
special handling when this flag is set.
UF_SPARSE: Command line name: "sparse" or "usparse"
ZFS name: XAT_SPARSE, ZFS_SPARSE
Windows: FILE_ATTRIBUTE_SPARSE_FILE
This flag means that the file is sparse. Although
ZFS may modify this in some situations, there is
not generally any special handling for this flag.
UF_OFFLINE: Command line name: "offline" or "uoffline"
ZFS name: XAT_OFFLINE, ZFS_OFFLINE
Windows: FILE_ATTRIBUTE_OFFLINE
This flag means that the file has been moved to
offline storage. FreeBSD does not have any special
handling for this flag.
UF_REPARSE: Command line name: "reparse" or "ureparse"
ZFS name: XAT_REPARSE, ZFS_REPARSE
Windows: FILE_ATTRIBUTE_REPARSE_POINT
This flag means that the file is a Windows reparse
point. ZFS has special handling code for reparse
points, but we don't currently have the other
supporting infrastructure for them.
UF_HIDDEN: Command line name: "hidden" or "uhidden"
ZFS name: XAT_HIDDEN, ZFS_HIDDEN
Windows: FILE_ATTRIBUTE_HIDDEN
This flag means that the file may be excluded from
a directory listing if the application honors it.
FreeBSD has no special handling for this flag.
The name and bit definition for UF_HIDDEN are
identical to the definition in MacOS X.
UF_READONLY: Command line name: "urdonly", "rdonly", "readonly"
ZFS name: XAT_READONLY, ZFS_READONLY
Windows: FILE_ATTRIBUTE_READONLY
This flag means that the file may not written or
appended, but its attributes may be changed.
ZFS currently enforces this flag, but Illumos
developers have discussed disabling enforcement.
The behavior of this flag is different than MacOS X.
MacOS X uses UF_IMMUTABLE to represent the DOS
readonly permission, but that flag has a stronger
meaning than the semantics of DOS readonly permissions.
UF_ARCHIVE: Command line name: "uarch", "uarchive"
ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE
Windows name: FILE_ATTRIBUTE_ARCHIVE
The UF_ARCHIVED flag means that the file has changed and
needs to be archived. The meaning is same as
the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute.
msdosfs and ZFS have special handling for this flag.
i.e. they will set it when the file changes.
sys/param.h: Bump __FreeBSD_version to 1000047 for the
addition of new stat(2) flags.
chflags.1: Document the new command line flag names
(e.g. "system", "hidden") available to the
user.
ls.1: Reference chflags(1) for a list of file flags
and their meanings.
strtofflags.c: Implement the mapping between the new
command line flag names and new stat(2)
flags.
chflags.2: Document all of the new stat(2) flags, and
explain the intended behavior in a little
more detail. Explain how they map to
Windows file attributes.
Different filesystems behave differently
with respect to flags, so warn the
application developer to take care when
using them.
zfs_vnops.c: Add support for getting and setting the
UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.
All of these flags are implemented using
attributes that ZFS already supports, so
the on-disk format has not changed.
ZFS currently doesn't allow setting the
UF_REPARSE flag, and we don't really have
the other infrastructure to support reparse
points.
msdosfs_denode.c,
msdosfs_vnops.c: Add support for getting and setting
UF_HIDDEN, UF_SYSTEM and UF_READONLY
in MSDOSFS.
It supported SF_ARCHIVED, but this has been
changed to be UF_ARCHIVE, which has the same
semantics as the DOS archive attribute instead
of inverse semantics like SF_ARCHIVED.
After discussion with Bruce Evans, change
several things in the msdosfs behavior:
Use UF_READONLY to indicate whether a file
is writeable instead of file permissions, but
don't actually enforce it.
Refuse to change attributes on the root
directory, because it is special in FAT
filesystems, but allow most other attribute
changes on directories.
Don't set the archive attribute on a directory
when its modification time is updated.
Windows and DOS don't set the archive attribute
in that scenario, so we are now bug-for-bug
compatible.
smbfs_node.c,
smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM,
UF_READONLY and UF_ARCHIVE in SMBFS.
This is similar to changes that Apple has
made in their version of SMBFS (as of
smb-583.8, posted on opensource.apple.com),
but not quite the same.
We map SMB_FA_READONLY to UF_READONLY,
because UF_READONLY is intended to match
the semantics of the DOS readonly flag.
The MacOS X code maps both UF_IMMUTABLE
and SF_IMMUTABLE to SMB_FA_READONLY, but
the immutable flags have stronger meaning
than the DOS readonly bit.
stat.h: Add definitions for UF_SYSTEM, UF_SPARSE,
UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
and UF_HIDDEN.
The definition of UF_HIDDEN is the same as
the MacOS X definition.
Add commented-out definitions of
UF_COMPRESSED and UF_TRACKED. They are
defined in MacOS X (as of 10.8.2), but we
do not implement them (yet).
ufs_vnops.c: Add support for getting and setting
UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
Alphabetize the flags that are supported.
These new flags are only stored, UFS does
not take any action if the flag is set.
Sponsored by: Spectra Logic
Reviewed by: bde (earlier version)
address alignment of mappings.
- MAP_ALIGNED(n) requests a mapping aligned on a boundary of (1 << n).
Requests for n >= number of bits in a pointer or less than the size of
a page fail with EINVAL. This matches the API provided by NetBSD.
- MAP_ALIGNED_SUPER is a special case of MAP_ALIGNED. It can be used
to optimize the chances of using large pages. By default it will align
the mapping on a large page boundary (the system is free to choose any
large page size to align to that seems best for the mapping request).
However, if the object being mapped is already using large pages, then
it will align the virtual mapping to match the existing large pages in
the object instead.
- Internally, VMFS_ALIGNED_SPACE is now renamed to VMFS_SUPER_SPACE, and
VMFS_ALIGNED_SPACE(n) is repurposed for specifying a specific alignment.
MAP_ALIGNED(n) maps to using VMFS_ALIGNED_SPACE(n), while
MAP_ALIGNED_SUPER maps to VMFS_SUPER_SPACE.
- mmap() of a device object now uses VMFS_OPTIMAL_SPACE rather than
explicitly using VMFS_SUPER_SPACE. All device objects are forced to
use a specific color on creation, so VMFS_OPTIMAL_SPACE is effectively
equivalent.
Reviewed by: alc
MFC after: 1 month
- NOTE_TRACK has never triggered a NOTE_TRACK event from the parent pid.
If NOTE_FORK is set, the listener will get a NOTE_FORK event from
the parent pid, but not a separate NOTE_TRACK event.
- Explicitly note that the event added to monitor the child process
preserves the fflags from the original event.
- Move the description of NOTE_TRACKERR under NOTE_TRACK as it is not a
bit for the user to set (which is what this list pupports to be).
Also, explicitly note that if an error occurs, the NOTE_CHILD event
will not be generated.
MFC after: 1 week
The pipe2() function is similar to pipe() but allows setting FD_CLOEXEC and
O_NONBLOCK (on both sides) as part of the function.
If p points to two writable ints, pipe2(p, 0) is equivalent to pipe(p).
If the pointer is not valid, behaviour differs: pipe2() writes into the
array from the kernel like socketpair() does, while pipe() writes into the
array from an architecture-specific assembler wrapper.
Reviewed by: kan, kib
The accept4() function, compared to accept(), allows setting the new file
descriptor atomically close-on-exec and explicitly controlling the
non-blocking status on the new socket. (Note that the latter point means
that accept() is not equivalent to any form of accept4().)
The linuxulator's accept4 implementation leaves a race window where the new
file descriptor is not close-on-exec because it calls sys_accept(). This
implementation leaves no such race window (by using falloc() flags). The
linuxulator could be fixed and simplified by using the new code.
Like accept(), accept4() is async-signal-safe, a cancellation point and
permitted in capability mode.
There are no getdtablesize() bounds on the file descriptor to be duplicated;
it only has to be open. If the RLIMIT_NOFILE rlimit was decreased after
opening the file descriptor, it may be greater than or equal to
getdtablesize() but still valid.
MFC after: 1 week
extattr_set_{fd,file,link} is logically a write(2)-like operation and
should return ssize_t, just like extattr_get_*. Also, the user-space
utility was using an int for the return value of extattr_get_* and
extattr_list_*, both of which return an ssize_t.
MFC after: 1 week
While almost nobody uses O_ASYNC, and rightly so, the inheritance of the
related properties across accept() is a portability issue like the
inheritance of O_NONBLOCK.
u_long. Before this change it was of type int for syscalls, but prototypes
in sys/stat.h and documentation for chflags(2) and fchflags(2) (but not
for lchflags(2)) stated that it was u_long. Now some related functions
use u_long type for flags (strtofflags(3), fflagstostr(3)).
- Make path argument of type 'const char *' for consistency.
Discussed on: arch
Sponsored by: The FreeBSD Foundation
This change allows creating file descriptors with close-on-exec set in some
situations. SOCK_CLOEXEC and SOCK_NONBLOCK can be OR'ed in socket() and
socketpair()'s type parameter, and MSG_CMSG_CLOEXEC to recvmsg() makes file
descriptors (SCM_RIGHTS) atomically close-on-exec.
The numerical values for SOCK_CLOEXEC and SOCK_NONBLOCK are as in NetBSD.
MSG_CMSG_CLOEXEC is the first free bit for MSG_*.
The SOCK_* flags are not passed to MAC because this may cause incorrect
failures and can be done later via fcntl() anyway. On the other hand, audit
is expected to cope with the new flags.
For MSG_CMSG_CLOEXEC, unp_externalize() is extended to take a flags
argument.
Reviewed by: kib
int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);
which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.
- Add manual pages for the new syscalls.
- Make the new syscalls available for processes in capability mode sandbox.
- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.
- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.
- Update procstat(1) to recognize the new capability rights.
- Document the new capability rights in cap_rights_limit(2).
Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des
- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.
- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.
- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.
- The cap_getrights(2) syscall is renamed to cap_rights_get(2).
- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.
- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).
- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.
- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.
- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:
CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.
Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).
Added CAP_SYMLINKAT:
- Allow for symlinkat(2).
Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).
Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.
Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.
Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.
Removed CAP_MAPEXEC.
CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.
Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).
Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.
CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).
CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
Added convinient defines:
#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE
#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)
Added defines for backward API compatibility:
#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
system call, which has a nice property - it never fails, so it is a bit
easier to use. If there is no support for capability mode in the kernel
the function will return false (not in a sandbox). If the kernel is compiled
with the support for capability mode, the function will return true or false
depending if the calling process is in the capability mode sandbox or not
respectively.
Sponsored by: The FreeBSD Foundation
now disables read-ahead. It used to effectively restore the system default
readahead hueristic if it had been changed; a negative value now restores
the default.
Reviewed by: kib
but use normal references instead of weak. This makes the statically
linked binaries to use fast gettimeofday(2) by forcing the linker to
resolve references and providing the neccessary functions.
Reported by: bde
Tested by: marius (sparc64)
MFC after: 2 weeks
output and replace it with a new visible sysctl kern.ipc.acceptqueue
of the same functionality. It specifies the maximum length of the
accept queue on a listen socket.
The old kern.ipc.somaxconn remains available for reading and writing
for compatibility reasons so that existing programs, scripts and
configurations continue to work. There no plans to ever remove the
orginal and now hidden kern.ipc.somaxconn.
Passing an invalid pointer results in undefined behaviour.
The wrappers in libthr access some of the data pointed to by the arguments
in userland, so that an invalid pointer will cause a signal and not an
[EFAULT] error return.
Furthermore, if the [EFAULT] error occurs when the kernel is writing, it is
not a proper error in the sense that the call still commits (changing the
signal disposition or accepting the signal).
MFC after: 1 week
Append '__' prefix to the tag of struct oflock, and put it under BSD
namespace. Structure is needed both by libc and kernel, thus cannot be
hidden under #ifdef _KERNEL.
Move a set of non-standard F_* and O_* constants into BSD namespace.
SUSv4 explicitely allows implemenation to pollute F_* and O_* names
after fcntl.h is included, but it costs us nothing to adhere
to the specification if exact POSIX compliance level is requested by
user code.
Change some spaces after #define to tabs.
Noted by and discussed with: bde
MFC after: 1 week
clock_gettime(2) functions if supported. The speedup seen in
microbenchmarks is in range 4x-7x depending on the hardware.
Only amd64 and i386 architectures are supported. Libc uses rdtsc and
kernel data to calculate current time, if enabled by kernel.
Hopefully, this code is going to migrate into vdso in some future.
Discussed with: bde
Reviewed by: jhb
Tested by: flo
MFC after: 1 month
First, extend the changes in r230782 to better handle the common case
of using NOREUSE with sequential reads. A NOREUSE file descriptor
will now track the last implicit DONTNEED request it made as a result
of a NOREUSE read. If a subsequent NOREUSE read is adjacent to the
previous range, it will apply the DONTNEED request to the entire range
of both the previous read and the current read. The effect is that
each read of a file accessed sequentially will apply the DONTNEED
request to the entire range that has been read. This allows NOREUSE
to properly handle misaligned reads by flushing each buffer to cache
once it has been completely read.
Second, apply the same changes made to read(2) by r230782 and this
change to writes. This provides much better performance in the
sequential write case as it allows writes to still be clustered. It
also provides much better performance for misaligned writes. It does
mean that NOREUSE will be generally ineffective for non-sequential
writes as the current implementation relies on a future NOREUSE
write's implicit DONTNEED request to flush the dirty buffer from the
current write.
MFC after: 2 weeks