file and processes information retrieval from the running kernel via sysctl
in the form of new library, libprocstat. The library also supports KVM backend
for analyzing memory crash dumps. Both procstat(1) and fstat(1) utilities have
been modified to take advantage of the library (as the bonus point the fstat(1)
utility no longer need superuser privileges to operate), and the procstat(1)
utility is now able to display information from memory dumps as well.
The newly introduced fuser(1) utility also uses this library and able to operate
via sysctl and kvm backends.
The library is by no means complete (e.g. KVM backend is missing vnode name
resolution routines, and there're no manpages for the library itself) so I
plan to improve it further. I'm commiting it so it will get wider exposure
and review.
We won't be able to MFC this work as it relies on changes in HEAD, which
was introduced some time ago, that break kernel ABI. OTOH we may be able
to merge the library with KVM backend if we really need it there.
Discussed with: rwatson
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).
Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.
The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN
while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.
Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now
The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.
Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno
structure, which acts as a proxy between them. This makes jail rules
persistent, i.e. they can be added before jail gets created, and they
don't disappear when the jail gets destroyed.
kern.sched.ipiwakeup.onecpu
kern.sched.ipiwakeup.htt2
Because they are absolutely obsolete. Probabilly the whole wakeup
forward mechanism should be revisited for a better fitting in modern
hw.
- As map2 variable is no longer used rename map3 to map2
- Fix a string by making more informative the msg and removing the
arguments passing
Approved by: julian
wrapper around rman_adjust_resource(). Include a generic implementation,
bus_generic_adjust_resource() which passes the request up to the parent
bus. There is currently no default implementation. A
bus_adjust_resource() wrapper is provided for use in drivers.
Specifically, these changes allow a resource to back a relocatable and
resizable resource such as the I/O window decoders in PCI-PCI bridges.
- rman_adjust_resource() can adjust the start and end address of an
existing resource. It only succeeds if the newly requested address
space is already free. It also supports shrinking a resource in
which case the freed space will be marked unallocated in the rman.
- rman_first_free_region() and rman_last_free_region() return the
start and end addresses for the first or last unallocated region in
an rman, respectively. This can be used to determine by how much
the resource backing an rman must be adjusted to accomodate an
allocation request that does not fit into the existing rman.
While here, document the rm_start and rm_end fields in struct rman,
rman_is_region_manager(), the bound argument to
rman_reserve_resource_bound(), and rman_init_from_resource().
constraints on the rman and reject attempts to manage a region that is out
of range.
- Fix various places that set rm_end incorrectly (to ~0 or ~0u instead of
~0ul).
- To preserve existing behavior, change rman_init() to set rm_start and
rm_end to allow managing the full range (0 to ~0ul) if they are not set by
the caller when rman_init() is called.
disk dumping.
With the option SW_WATCHDOG on, these operations are doomed to let
watchdog fire, fi they take too long.
I implemented the stubs this way because I really want wdog_kern_*
KPI to not be dependant by SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also avoid to
call them when not necessary (avoiding not-volountary watchdog
activations).
Sponsored by: Sandvine Incorporated
Discussed with: emaste, des
MFC after: 2 weeks
bound to an AP before SMP has started, the system will panic when we try
to touch per-CPU state for that AP because that state has not been
initialized yet. Fix this in the same way as ULE: place all threads in
the global run queue before SMP has started.
Reviewed by: jhb
MFC after: 1 month
mechanism. The caller may specify a timeout in ticks after which the
task will be scheduled.
Sponsored by: The FreeBSD Foundation
Reviewed by: jeff, jhb
MFC after: 1 month
the mutexes in the wrong order for the case where the
MBF_MNTLSTLOCK is set. I believe this did have the
potential for deadlock. For example, if multiple nfsd threads
called vfs_busyfs(), which calls vfs_busy() with MBF_MNTLSTLOCK.
Thanks go to pho for catching this during his testing.
Tested by: pho
Submitted by: kib
MFC after: 2 weeks
vfs_sanitizeopts() can handle "ro" and "rw" options properly, there is
no more need to add "noro" in vfs_donmount() to cancel "ro".
This also fixes a problem of canceling options beginning with "no".
For example, "noatime" didn't cancel "nonoatime". Thus it was possible
that both "noatime" and "nonoatime" were active at the same time.
Reviewed by: bde
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.
Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.
Also reserve space in the syscall table for posix_fadvise(2).
Reviewed by: -arch (previous version)
The code provides information on how the signal was generated.
Formerly, the code was only logged for traps, much like only signal handlers
for traps received a meaningful si_code before FreeBSD 7.0.
In rare cases, no information is available and 0 is still logged.
MFC after: 1 week
details of each rman header, but not the contents of all rman structures
in the system. This is especially useful on platforms where some rmans
have many thousands of entries in rmans, making scrolling through the
output of "show all rman" impractical. Individual rmans can then be viewed
including their contents with "show rman 0xaddr" as usual.
Reviewed by: jhb
for a detailed explanation of the problems).
The only difference with the previous fix is in Solution2:
CPUBLOCK is no longer set when exiting from callout_reset_*() functions,
which avoid the deadlock (leading to r217161).
There is no need to CPUBLOCK there because the running-and-migrating
assumption is strong enough to avoid problems there.
Furthermore add a better !SMP compliancy (leading to shrinked code and
structures) and facility macros/functions.
Tested by: gianni, pho, dim
MFC after: 3 weeks
for racct.
Note that after this commit, ipcs(1) needs to be rebuilt. Otherwise, it will
fail with "ipcs: sysctlbyname: kern.ipc.msqids: Cannot allocate memory".
Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)
In particular:
- implement compat shims for old stat(2) variants and ogetdirentries(2);
- implement delivery of signals with ancient stack frame layout and
corresponding sigreturn(2);
- implement old getpagesize(2);
- provide a user-mode trampoline and LDT call gate for lcall $7,$0;
- port a.out image activator and connect it to the build as a module
on amd64.
The changes are hidden under COMPAT_43.
MFC after: 1 month
too much time. This can finish in a scheduler deadlock with ping-pong
between two threads.
One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP or no SMP.
As requested by jhb@ 4BSD have received the same type of fix instead of
propagating the flag to the new thread.
Reviewed by: jhb, jeff
MFC after: 1 month
on the set of rules it maintains and the current resource usage. It also
privides userland API to manage that ruleset.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)
and per-loginclass resource accounting information, to be used by the new
resource limits code. It's connected to the build, but the code that
actually calls the new functions will come later.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)
The new function fallocf(9), that is renamed falloc(9) with added
flag argument, is provided to facilitate the merge to stable branch.
Reviewed by: jhb
MFC after: 1 week
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.
MFC after: 1 week
vfs_equalopts(). This allows vfs_sanitizeopts() to filter redundant
occurrences of these options. It was possible that for example both "ro"
and "rw" options became active concurrently.
PR: kern/133614
Discussed on: freebsd-hackers
MFC after: 1 month
Also, express this new maximum as a fraction of the kernel's address
space size rather than a constant so that increasing KVA_PAGES will
automatically increase this maximum. As a side-effect of this change,
kern.maxvnodes will automatically increase by a proportional amount.
While I'm here ensure that this change doesn't result in an unintended
increase in maxpipekva on i386. Calculate maxpipekva based upon the
size of the kernel address space and the amount of physical memory
instead of the size of the kmem map. The memory backing pipes is not
allocated from the kmem map. It is allocated from its own submap of
the kernel map. In short, it has no real connection to the kmem map.
(In fact, the commit messages for the maxpipekva auto-sizing talk
about using the kernel map size, cf. r117325 and r117391, even though
the implementation actually used the kmem map size.) Although the
calculation is now done differently, the resulting value for
maxpipekva should remain almost the same on i386. However, on amd64,
the value will be reduced by 2/3. This is intentional. The recent
change to VM_KMEM_SIZE_SCALE on amd64 for the benefit of ZFS also had
the unnecessary side-effect of increasing maxpipekva. This change is
effectively restoring maxpipekva on amd64 to its prior value.
Eliminate init_param3() since it is no longer used.
since before r127501. Strictly speaking, the buffer pages are not
"wired". They remain in the paging queues. However, they are pinned in
memory using vm_page_hold().
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.
Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.
While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.
Discussed with: kib
MFC after: 2 Week
possible option and script path in the place of argv[0] supplied to
execve(2). It is possible and valid for the substitution to be shorter
then the argv[0].
Avoid signed underflow in this case.
Submitted by: Devon H. O'Dell <devon.odell gmail com>
PR: kern/155321
MFC after: 1 week
The reason for this is a bug at ktrops() where process dereferenced
without having a lock. This might cause a panic if ktrace was runned
with -p flag and the specified process exited between the dropping
a lock and writing sv_flags.
Since it is impossible to acquire sx lock while holding mtx switch
to use asynchronous enqueuerequest() instead of writerequest().
Rename ktr_getrequest_ne() to more understandable name [1].
Requested by: jhb [1]
MFC after: 1 Week
it possible for the kernel to track login class the process is assigned to,
which is required for RCTL. This change also make setusercontext(3) call
setloginclass(2) and makes it possible to retrieve current login class using
id(1).
Reviewed by: kib (as part of a larger patch)
a driver during kldunload. Specifically, recursively walk the tree of
subclasses of a given driver attachment's bus device class detaching all
instances of that driver for each class and its subclasses.
Reported by: bschmidt
Reviewed by: imp
MFC after: 1 week
If a system call wasn't listed in capabilities.conf, return ECAPMODE at
syscall entry.
Reviewed by: anderson
Discussed with: benl, kris, pjd
Sponsored by: Google, Inc.
Obtained from: Capsicum Project
MFC after: 3 months
Add a new system call flag, SYF_CAPENABLED, which indicates that a
particular system call is available in capability mode.
Add a new configuration file, kern/capabilities.conf (similar files
may be introduced for other ABIs in the future), which enumerates
system calls that are available in capability mode. When a new
system call is added to syscalls.master, it will also need to be
added here (if needed). Teach sysent parts to use this file to set
values for SYF_CAPENABLED for the native ABI.
Reviewed by: anderson
Discussed with: benl, kris, pjd
Obtained from: Capsicum Project
MFC after: 3 months
compiled conditionally on options CAPABILITIES:
Add a new credential flag, CRED_FLAG_CAPMODE, which indicates that a
subject (typically a process) is in capability mode.
Add two new system calls, cap_enter(2) and cap_getmode(2), which allow
setting and querying (but never clearing) the flag.
Export the capability mode flag via process information sysctls.
Sponsored by: Google, Inc.
Reviewed by: anderson
Discussed with: benl, kris, pjd
Obtained from: Capsicum Project
MFC after: 3 months
traced process by adding two new events which records value of process
sv_flags to the trace file at process creation/execing/exiting time.
MFC after: 1 Month.
1) do not take a lock around the single atomic operation.
2) do not lose the invariant of lock by dropping/acquiring
ktrace_mtx around free() or malloc().
MFC after: 1 Month.
PMC/SYSV/...).
No FreeBSD version bump, the userland application to query the features will
be committed last and can serve as an indication of the availablility if
needed.
Sponsored by: Google Summer of Code 2010
Submitted by: kibab
Reviewed by: arch@ (parts by rwatson, trasz, jhb)
X-MFC after: to be determined in last commit with code from this project
file where they are used. Declare the kern.threads sysctl node at the
same location. Since no external use for the variables exists, make them
static.
Discussed with: dchagin
MFC after: 1 week
vfs_export() fails. Restoring old options and flags after successful
VFS_MOUNT(9) call may cause the file system internal state to become
inconsistent with mount options and flags. Specifically the FFS super
block fs_ronly field and the MNT_RDONLY flag may get out of sync.
PR: kern/133614
Discussed on: freebsd-hackers
the debugger back-end has changed. This means that switching from ddb
to gdb no longer requires a "step" which can be dangerous on an
already-crashed kernel.
Also add a capability to get from the gdb back-end back to ddb, by
typing ^C in the console window.
While here, simplify kdb_sysctl_available() by using
sbuf_new_for_sysctl(), and use strlcpy() instead of strncpy() since the
strlcpy semantic is desired.
MFC after: 1 month
VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.
While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.
The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec
Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks
Catch a set vnet upon return to user space. This usually
means return paths with CURVNET_RESTORE() missing.
If VNET_DEBUG is turned on we can even tell the function
that did the CURVNET_SET() which is really helpful; else
we print "N/A".
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
MFC after: 11 days
KASSERT()s and eliminate the rest.
Replace excessive printf()s and a panic() in bufdone_finish() with a
KASSERT() in vm_page_io_finish().
Reviewed by: kib
Make VNET_ASSERT() available with either VNET_DEBUG or INVARIANTS.
Change the syntax to match KASSERT() to allow more flexible panic
messages rather than having a printf with hardcoded arguments
before panic.
Adjust the few assertions we have to the new format (and enhance
the output).
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
MFC after: 2 weeks
attributes for preloaded modules/images. In particular, MODINFO_ADDR has
the added complexity of not always being relocated properly. Rather than
kluging this in the various components that are affected, we handle it
in a centralized place (preload_fetch_addr()). To that end, expose a new
variable, preload_addr_relocate, that MD initialization code can set and
that turns the address attribute into a valid kernel VA.
Architectures that need the relocation: arm & powerpc (at least).
Components that can utilize this: acpi(4), md(4), fb(4), pci(4), ZFS, geli.
Sponsored by: Juniper Networks
misnamed since it was introduced and should not be globally exposed
with this name. The equivalent functionality is now available using
kern_yield(curthread->td_user_pri). The function remains
undocumented.
Bump __FreeBSD_version.
- entirely eliminate some calls to uio_yeild() as being unnecessary,
such as in a sysctl handler.
- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h
- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().
- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.
- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.
- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.
MI ucontext_t and x86 MD parts.
Kernel allocates the structures on the stack, and not clearing
reserved fields and paddings causes leakage.
Noted and discussed with: bde
MFC after: 2 weeks
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().
Change several checks for a magic number of iterations to use
should_yield() instead.
MFC after: 1 week
collect phases. The unp_discard() function executes
unp_externalize_fp(), which might make the socket eligible for gc-ing,
and then, later, taskqueue will close the socket. Since unp_gc()
dropped the list lock to do the malloc, close might happen after the
mark step but before the collection step, causing collection to not
find the socket and miss one array element.
I believe that the race was there before r216158, but the stated
revision made the window much wider by postponing the close to
taskqueue sometimes.
Only process as much array elements as we find the sockets during
second phase of gc [1]. Take linkage lock and recheck the eligibility
of the socket for gc, as well as call fhold() under the linkage lock.
Reported and tested by: jmallett
Submitted by: jmallett [1]
Reviewed by: rwatson, jeff (possibly)
MFC after: 1 week
each of the threads needs more while current pool of the buffers is
exhausted, then neither thread can make progress.
Switch to nowait allocations after we got first buffer already.
Reported by: az
Reviewed by: alc (previous version)
Tested by: pho
MFC after: 1 week
The fdcheckstd() function makes sure fds 0, 1 and 2 are open by opening
/dev/null. If this fails (e.g. missing devfs or wrong permissions),
fdcheckstd() will return failure and the process will exit as if it received
SIGABRT. The KASSERT is only to check that kern_open() returns the expected
fd, given that it succeeded.
Tripping the KASSERT is most likely if fd 0 is open but fd 1 or 2 are not.
MFC after: 2 weeks
sbuf_new_for_sysctl(9). This allows using an sbuf with a SYSCTL_OUT
drain for extremely large amounts of data where the caller knows that
appropriate references are held, and sleeping is not an issue.
Inspired by: rwatson
unfunctional. Wiring the user buffer has only been done explicitly
since r101422.
Mark the kern.disks sysctl as MPSAFE since it is and it seems to have
been mis-using the NOLOCK flag.
Partially break the KPI (but not the KBI) for the sysctl_req 'lock'
field since this member should be private and the "REQ_LOCKED" state
seems meaningless now.
before checking the validity of the next buffer pointer. Otherwise, the
buffer might be reclaimed after the check, causing iteration to run into
wrong buffer.
Reported and tested by: pho
MFC after: 1 week
- Move the realtime priority range up above kernel sleep priorities and
just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
the timesharing priority band can be increased. The new timeshare range
is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
interactive threads. Instead, the larger timeshare range is now split
into separate subranges for interactive and non-interactive ("batch")
threads. The end result is that interactive threads and non-interactive
threads still use the same priority ranges as before, but realtime
threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
or via cv_broadcastpri(). Realtime and idle priority threads will
no longer have their priorities affected by sleeping in the kernel.
Reviewed by: jeff
interactive timeshare threads (PRI_*_INTERACTIVE) and non-interactive
timeshare threads (PRI_*_BATCH) and use these instead of PRI_*_REALTIME
and PRI_*_TIMESHARE. No functional change.
Reviewed by: jeff
PI_DISKLOW. While here, rename PI_TTYLOW to PI_TTY.
- Add a macro PI_SWI() that takes a SWI_* constant as an argument and
returns the suitable thread priority.
That revision is introducing a bug which is more visible than problems
it is trying to fix.
As long as my time is very limited in this period I am going to
commit back this patch just once it is fully fixed.
Reported by: dim, Nicholas Esborn
PT_GNU_STACK program header, if present and enabled. Two new sysctls
are provided, kern.elf32.nxstack and kern.elf64.nxstack, that allow to
enable PT_GNU_STACK for ABIs of specified bitsize, if ABI decided to
support shared page.
Inform rtld about access mode of the stack initial mapping by
AT_STACKPROT aux vector.
At the moment, the default is disabled, waiting for the usermode
support bits.
setting SV_SHP flag and providing pointer to the vm object and mapping
address. Provide simple allocator to carve space in the page, tailored
to put the code with alignment restrictions.
Enable shared page use for amd64, both native and 32bit FreeBSD
binaries. Page is private mapped at the top of the user address
space, moving a start of the stack one page down. Move signal
trampoline code from the top of the stack to the shared page.
Reviewed by: alc
to match the desired priority in td_priority. Otherwise the first time
thread0 used a borrowed priority it would drop down to PUSER instead of
PVM.
- Explicitly initialize the starting priority of new kprocs to PVM to
avoid inheriting some random priority from thread0.
MFC after: 2 weeks
thread and proc have been copied and zeroed from the old thread and
proc. Otherwise attempts to modify thread or process data in sched_fork()
could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
sched_fork_thread() in ULE. This is already done courtesy the bcopy()
of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
new thread's base priority (td_base_pri) to avoid bogusly inheriting a
borrowed priority from the parent thread.
MFC after: 2 weeks
This was lost when it was converted to using a condition variable instead
of lbolt.
- Drop the priority of flowtable down to PPAUSE when it is idle as well
since it is a similar background task.
MFC after: 2 weeks
the dependency of which was preloaded, but failed to initialize. Previously,
kernel dereferenced NULL pointer returned by modlist_lookup2(); now, when this
happens, we unload the dependent module. Since the depended_files list is
sorted in dependency order, this properly propagates, unloading modules that
depend on failed ones.
From the user point of view, this prevents the kernel from panicing when
trying to boot kernel compiled without KDTRACE_HOOKS with dtraceall_load="YES"
in /boot/loader.conf.
Reviewed by: kib