This is a minor follow-up to r297422, prompted by a Coverity warning. (It's
not a real defect, just a code smell.) OSD slot array reservations are an
array of pointers (void **) but were cast to void* and back unnecessarily.
Keep the correct type from reservation to use.
osd.9 is updated to match, along with a few trivial igor fixes.
Reported by: Coverity
CID: 1353811
Sponsored by: EMC / Isilon Storage Division
1. Limit secs to INT32_MAX / 2 to avoid errors from kern_setitimer().
Assert that kern_setitimer() returns 0.
Remove bogus cast of secs.
Fix style(9) issues.
2. Increment the return value if the remaining tv_usec value more than 500000 as a Linux does.
Pointed out by: [1] Bruce Evans
MFC after: 1 week
is always successfull.
So, ignore any errors and return 0 as a Linux do.
XXX. Unlike POSIX, Linux in case when the invalid seconds value specified
always return 0, so in that case Linux does not return proper remining time.
MFC after: 1 week
need to include it explicitly when <vm/vm_param.h> is already included.
Suggested by: alc
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D5379
and replace crcopysafe by crcopy as crcopysafe is is not intended to be
safe in a threaded environment, it drops PROC_LOCK() in while() that
can lead to unexpected results, such as overwrite kernel memory.
In my POV crcopysafe() needs special attention. For now I do not see
any problems with this function, but who knows.
Submitted by: dchagin
Found by: trinity
Security: SA-16:04.linux
The set_robust_list system call request the kernel to record the head
of the list of robust futexes owned by the calling thread. The head
argument is the list head to record.
The get_robust_list system call should return the head of the robust
list of the thread whose thread id is specified in pid argument.
The list head should be stored in the location pointed to by head
argument.
In contrast, our implemenattion of get_robust_list system call copies
the known portion of memory pointed by recorded in set_robust_list
system call pointer to the head of the robust list to the location
pointed by head argument.
So, it is possible for a local attacker to read portions of kernel
memory, which may result in a privilege escalation.
Submitted by: mjg
Security: SA-16:03.linux
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.
MFC after: 1 week
r161611 added some of the code from sys_vfork() directly into the Linux
module wrappers since they use RFSTOPPED. In r232240, the RFFPWAIT handling
was moved to syscallret(), thus this code in the Linux module is no longer
needed as it will be called later.
This also allows the Linux wrappers to benefit from the fix in r275616 for
threads not getting suspended if their vforked child is stopped while they
wait on them.
Reviewed by: jhb, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3828
SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.
Perhaps SDT_PROBE should be made a private implementation detail.
MFC after: 20 days
On CloudABI we want to create file descriptors with just the minimal set
of Capsicum rights in place. The reason for this is that it makes it
easier to obtain uniform behaviour across different operating systems.
By explicitly whitelisting the operations, we can return consistent
error codes, but also prevent applications from depending OS-specific
behaviour.
Extend kern_kqueue() to take an additional struct filecaps that is
passed on to falloc_caps(). Update the existing consumers to pass in
NULL.
Differential Revision: https://reviews.freebsd.org/D3259
On CloudABI, the rights bits returned by cap_rights_get() match up with
the operations that you can actually perform on the file descriptor.
Limiting the rights is good, because it makes it easier to get uniform
behaviour across different operating systems. If process descriptors on
FreeBSD would suddenly gain support for any new file operation, this
wouldn't become exposed to CloudABI processes without first extending
the rights.
Extend fork1() to gain a 'struct filecaps' argument that allows you to
construct process descriptors with custom rights. Use this in
cloudabi_sys_proc_fork() to limit the rights to just fstat() and
pdwait().
Obtained from: https://github.com/NuxiNL/freebsd
Summary:
Pipes in CloudABI are unidirectional. The reason for this is that
CloudABI attempts to provide a uniform runtime environment across
different flavours of UNIX.
Instead of implementing a custom pipe that is unidirectional, we can
simply reuse Capsicum permission bits to support this. This is nice,
because CloudABI already attempts to restrict permission bits to
correspond with the operations that apply to a certain file descriptor.
Replace kern_pipe() and kern_pipe2() by a single kern_pipe() that takes
a pair of filecaps. These filecaps are passed to the newly introduced
falloc_caps() function that creates the descriptors with rights in
place.
Test Plan:
CloudABI pipes seem to be created with proper rights in place:
https://github.com/NuxiNL/cloudlibc/blob/master/src/libc/unistd/pipe_test.c#L44
Reviewers: jilles, mjg
Reviewed By: mjg
Subscribers: imp
Differential Revision: https://reviews.freebsd.org/D3236
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).
Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig. p_xexit contains complete status
and copied out into si_status.
Requested by: Joerg Schilling
Reviewed by: jilles (previous version), pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Use the same scheme implemented to manage credentials.
Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.
Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
writes the remaining time into the structure pointed to by rmtp
unless rmtp is NULL. The value of *rmtp can then be used to call
nanosleep() again and complete the specified pause if the previous
call was interrupted.
Note. clock_nanosleep() with an absolute time value does not write
the remaining time.
While here fix whitespaces and typo in SDT_PROBE.
implemented via ioctl interface. First of all return ENOTSUP for this
operation as a cp fallback to usual method in that case. Secondly, do
not print out the message about unimplemented operation.
1. Linux sigset always 64 bit on all platforms. In order to move Linux
sigset code to the linux_common module define it as 64 bit int. Move
Linux sigset manipulation routines to the MI path.
2. Move Linux signal number definitions to the MI path. In general, they
are the same on all platforms except for a few signals.
3. Map Linux RT signals to the FreeBSD RT signals and hide signal conversion
tables to avoid conversion errors.
4. Emulate Linux SIGPWR signal via FreeBSD SIGRTMIN signal which is outside
of allowed on Linux signal numbers.
PR: 197216
argument is not a null pointer, and the ss_flags member pointed to by ss
contains flags other than SS_DISABLE. However, in fact, Linux also
allows SS_ONSTACK flag which is simply ignored.
For buggy apps (at least mono) ignore other than SS_DISABLE
flags as a Linux do.
While here move MI part of sigaltstack code to the appropriate place.
Reported by: abi at abinet dot ru
"highly experimental" remove /dev/shm magic commited
in r218497 and convert tmpfs type to an expected magic number.
Differential Revision: https://reviews.freebsd.org/D1497
Reviewed by: emaste, trasz
as it already zeroed by malloc with M_ZERO flag
and move zeroing to the proper place in exec path.
Differential Revision: https://reviews.freebsd.org/D1462
Reviewed by: trasz
around kqueue() to implement epoll subset of functionality.
The kqueue user data are 32bit on i386 which is not enough for
epoll user data, so we keep user data in the proc emuldata.
Initial patch developed by rdivacky@ in 2007, then extended
by Yuri Victorovich @ r255672 and finished by me
in collaboration with mjg@ and jillies@.
Differential Revision: https://reviews.freebsd.org/D1092
Check wait options as a Linux do.
Linux always set WEXITED option not a WUNTRACED|WNOHANG
which is a strange bug.
Differential Revision: https://reviews.freebsd.org/D1085
Reviewed by: trasz
The AT_EACCESS and AT_SYMLINK_NOFOLLOW flags are actually implemented
within the glibc wrapper function for faccessat(). If either of these
flags are specified, then the wrapper function employs fstatat() to
determine access permissions.
Differential Revision: https://reviews.freebsd.org/D1078
Reviewed by: trasz
thread emuldata to proc emuldata as it was originally intended.
As we can have both 64 & 32 bit Linuxulator running any eventhandler
can be called twice for us. To prevent this move eventhandlers code
from linux_emul.c to the linux_common.ko module.
Differential Revision: https://reviews.freebsd.org/D1073
following primary purposes:
1. Remove the dependency of linsysfs and linprocfs modules from linux.ko,
which will be architecture specific on amd64.
2. Incorporate into linux_common.ko general code for platforms on which
we'll support two Linuxulator modules (for both instruction set - 32 & 64 bit).
3. Move malloc(9) declaration to linux_common.ko, to enable getting memory
usage statistics properly.
Currently linux_common.ko incorporates a code from linux_mib.c and linux_util.c
and linprocfs, linsysfs and linux kernel modules depend on linux_common.ko.
Temporarily remove dtrace garbage from linux_mib.c and linux_util.c
Differential Revision: https://reviews.freebsd.org/D1072
In collaboration with: Vassilis Laganakos.
Reviewed by: trasz
Move struct ipc_perm definition to the MD path as it differs for 64 and
32 bit platform.
Differential Revision: https://reviews.freebsd.org/D1068
Reviewed by: trasz
All fields of type l_int in struct statfs are defined
as l_long on i386 and amd64.
Differential Revision: https://reviews.freebsd.org/D1064
Reviewed by: trasz
exposes functions from kernel with proper DWARF CFI information so that
it becomes easier to unwind through them.
Using vdso is a mandatory for a thread cancelation && cleanup
on a modern glibc.
Differential Revision: https://reviews.freebsd.org/D1060
Use it in linux_wait4() system call and move linux_wait4() to the MI path.
While here add a prototype for the static bsd_to_linux_rusage().
Differential Revision: https://reviews.freebsd.org/D2138
Reviewed by: trasz
of harcoded pr_osrelease, pr_osrel values. This will be used later in
the VDSO.
Differential Revision: https://reviews.freebsd.org/D1042
Reviewed by: trasz
of multiple simultaneous calls to pthread_join() specifying the same
target thread are undefined wake up the one thread.
Differential Revision: https://reviews.freebsd.org/D1040
The reasons:
1. Get rid of the stubs/quirks with process dethreading,
process reparent when the process group leader exits and close
to this problems on wait(), waitpid(), etc.
2. Reuse our kernel code instead of writing excessive thread
managment routines in Linuxulator.
Implementation details:
1. The thread is created via kern_thr_new() in the clone() call with
the CLONE_THREAD parameter. Thus, everything else is a process.
2. The test that the process has a threads is done via P_HADTHREADS
bit p_flag of struct proc.
3. Per thread emulator state data structure is now located in the
struct thread and freed in the thread_dtor() hook.
Mandatory holdig of the p_mtx required when referencing emuldata
from the other threads.
4. PID mangling has changed. Now Linux pid is the native tid
and Linux tgid is the native pid, with the exception of the first
thread in the process where tid and pid are one and the same.
Ugliness:
In case when the Linux thread is the initial thread in the thread
group thread id is equal to the process id. Glibc depends on this
magic (assert in pthread_getattr_np.c). So for system calls that
take thread id as a parameter we should use the special method
to reference struct thread.
Differential Revision: https://reviews.freebsd.org/D1039
threads refactor kern_sched_rr_get_interval() and sys_sched_rr_get_interval().
Add a kern_sched_rr_get_interval() counterpart which takes a targettd
parameter to allow specify target thread directly by callee (new Linuxulator).
Linuxulator temporarily uses first thread in proc.
Move linux_sched_rr_get_interval() to the MI part.
Differential Revision: https://reviews.freebsd.org/D1032
Reviewed by: trasz
threads introduce linux_exit() stub instead of sys_exit() call
(which terminates process).
In the new linuxulator exit() system call terminates the calling
thread (not a whole process).
Differential Revision: https://reviews.freebsd.org/D1027
Reviewed by: trasz
argument. This will be used for the Linux emulation layer - for Linux,
PATH_MAX is 4096 and not 1024.
Differential Revision: https://reviews.freebsd.org/D2335
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
The goal here is to provide one place altering process credentials.
This eases debugging and opens up posibilities to do additional work when such
an action is performed.
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.
These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.
While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.
Pointed out by: bde
MFC after: 1 week
- Threads lifetime cycle, in particular, counting of the threads in
the process, and interlocking with process mutex and thread lock.
The main reason of this is that turnstile locks are after thread
locks, so you e.g. cannot unlock blockable mutex (think process
mutex) while owning thread lock.
- Virtual and profiling itimers, since the timers activation is done
from the clock interrupt context. Replace the p_slock by p_itimmtx
and PROC_ITIMLOCK().
- Profiling code (profil(2)), for similar reason. Replace the p_slock
by p_profmtx and PROC_PROFLOCK().
- Resource usage accounting. Need for the spinlock there is subtle,
my understanding is that spinlock blocks context switching for the
current thread, which prevents td_runtime and similar fields from
changing (updates are done at the mi_switch()). Replace the p_slock
by p_statmtx and PROC_STATLOCK().
The split is done mostly for code clarity, and should not affect
scalability.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
have both kern_open() and kern_openat(); change the callers to use
kern_openat().
This removes one (sometimes two) levels of indirection and
consolidates arguments checks.
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Update linux compat minimum revision to match linux-c6 now in ports. This
is a candidate for 10.1 R as it matches the current state of supported
linux compat packages in the ports tree.
PR: 187786
Reviewed by: xmj
MFC after: 2 days
Relnotes: yes
for amd64/linux32. Fix the entirely bogus (untested) version from
r161310 for i386/linux using the same shared code in compat/linux.
It is unclear to me if we could support more clock mappings but
the current set allows me to successfully run commercial
32bit linux software under linuxolator on amd64.
Reviewed by: jhb
Differential Revision: D784
MFC after: 3 days
Sponsored by: DARPA, AFRL
Make it really work for native FreeBSD programs. Before this it was broken
for years due to different number of pointer dereferences in Linux and
FreeBSD IOCTL paths, permanently returning errors to FreeBSD programs.
This change breaks the driver FreeBSD IOCTL ABI, making it more strict,
but since it was not working any way -- who bother.
Add shims for 32-bit programs on 64-bit host, translating the argument
of the SG_IO IOCTL for both FreeBSD and Linux ABIs.
With this change I was able to run 32-bit Linux sg3_utils tools and simple
32 and 64-bit FreeBSD test tools on both 32 and 64-bit FreeBSD systems.
MFC after: 1 month
flag has been added instead of FUTEX_WAIT to replace the FUTEX_WAIT
logic which needs to do gettimeofday() calls before the futex syscall
to convert the absolute timeout to a relative timeout.
Before this the CLOCK_MONOTONIC used by the FUTEX_WAIT_BITSET op.
When the FUTEX_CLOCK_REALTIME is specified the timeout is an absolute
time, not a relative time. Rework futex_wait to handle this.
On the side fix the futex leak in error case and remove useless
parentheses.
Properly calculate the timeout for the CLOCK_MONOTONIC case.
MFC after: 3 days
Some Linux futex ops atomically verifies that the futex address uaddr
(uval) contains the value val. Comparing signed uval and unsigned val
may lead to an unexpected result, mostly to a deadlock.
So copyin uaddr to an unsigned int to compare the parameters correctly.
While here change ktr records to print parameters in more readable format.
Tested by eadler@
MFC after: 3 days
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
Reviewed by: markj
MFC after: 3 weeks
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip