opnsense-src

mirror of https://github.com/opnsense/src.git synced 2026-07-15 04:01:09 -04:00

Author	SHA1	Message	Date
David Xu	945488297b	Make UMTX_OP_WAIT_UINT actually wait for an unsigned integer on 64-bit machines. MFC after: 1 week	2009-04-13 05:21:17 +00:00
Kip Macy	f0b9868d3a	sendfile doesn't modify the vnode - acquire vnode lock shared Reviewed by: ups, jeffr	2009-04-12 05:19:35 +00:00
Robert Watson	89f28b1b86	Remove conditionally compiled time counter statistics; tools like DTrace, kernel profiling, etc, can provide this information without the overhead. MFC after: 3 days Suggested by: bde	2009-04-11 22:01:40 +00:00
Alexander Kabaev	9d75482f99	Fix v_cache_dd handling for negative entries. The v_cache_dd pointer was not populated in the parent directory if a negative entry was being created, yet the entry itself was added to the nc_neg list. It was possible for the parent vnode to get discarded later, leaving the negative entry pointing to a now unused memory block. Reported by: dho Reviewed by: kib	2009-04-11 20:23:08 +00:00
Konstantin Belousov	fd409594c6	When zapping v_cache_dd for !MAKEENTRY case in cache_lookup(), we shall lock cache as writer. Reviewed by: kan	2009-04-11 16:12:20 +00:00
Marko Zec	bfe1aba468	Introduce vnet module registration / initialization framework with dependency tracking and ordering enforcement. With this change, per-vnet initialization functions introduced with r190787 are no longer directly called from traditional initialization functions (which cc in most cases inlined to pre-r190787 code), but are instead registered via the vnet framework first, and are invoked only after all prerequisite modules have been initialized. In the long run, this framework should allow us to both initialize and dismantle multiple vnet instances in a correct order. The problem this change aims to solve is how to replay the initialization sequence of various network stack components, which have been traditionally triggered via different mechanisms (SYSINIT, protosw). Note that this initialization sequence was and still can be subtly different depending on whether certain pieces of code have been statically compiled into the kernel, loaded as modules by boot loader, or kldloaded at run time. The approach is simple - we record the initialization sequence established by the traditional mechanisms whenever vnet_mod_register() is called for a particular vnet module. The vnet_mod_register_multi() variant allows a single initializer function to be registered multiple times but with different arguments - currently this is only used in kern/uipc_domain.c by net_add_domain() with different struct domain * as arguments, which allows for protosw-registered initialization routines to be invoked in a correct order by the new vnet initialization framework. For the purpose of identifying vnet modules, each vnet module has to have a unique ID, which is statically assigned in sys/vimage.h. Dynamic assignment of vnet module IDs is not supported yet. A vnet module may specify a single prerequisite module at registration time by filling in the vmi_dependson field of its vnet_modinfo struct with the ID of the module it depends on. Unless specified otherwise, all vnet modules depend on VNET_MOD_NET (container for ifnet list head, rt_tables etc.), which thus has to and will always be initialized first. The framework will panic if it detects any unresolved dependencies before completing system initialization. Detection of unresolved dependencies for vnet modules registered after boot (kldloaded modules) is not provided. Note that the fact that each module can specify only a single prerequisite may become problematic in the long run. In particular, INET6 depends on INET being already instantiated, due to TCP / UDP structures residing in INET container. IPSEC also depends on INET, which will in turn additionally complicate making INET6-only kernel configs a reality. The entire registration framework can be compiled out by turning on the VIMAGE_GLOBALS kernel config option. Reviewed by: bz Approved by: julian (mentor)	2009-04-11 05:58:58 +00:00
Robert Watson	885868cd8f	Remove VOP_LEASE and supporting functions. This hasn't been used since the removal of NQNFS, but was left in in case it was required for NFSv4. Since our new NFSv4 client and server can't use it for their requirements, GC the old mechanism, as well as other unused lease- related code and interfaces. Due to its impact on kernel programming and binary interfaces, this change should not be MFC'd. Proposed by: jeff Reviewed by: jeff Discussed with: rmacklem, zach loafman @ isilon	2009-04-10 10:52:19 +00:00
Konstantin Belousov	3f54086eba	Cache_lookup() for DOTDOT drops dvp vnode lock, allowing dvp to be reclaimed. Check the condition and return ENOENT then. In nfs_lookup(), respect ENOENT return from cache_lookup() when it is caused by dvp reclaim. Reported and tested by: pho	2009-04-10 10:22:44 +00:00
Andrew Thompson	853a10a581	Revert r190676,190677 The geom and CAM changes for root_hold are the wrong solution for USB design quirks. Requested by: scottl	2009-04-10 04:08:34 +00:00
Ed Schouten	e3b0b98073	Fix tty_wait_background() to comply with standards. It turns out my handling of SIGTTOU and SIGTTIN didn't entirely comply with the standards. It is true that in the SIGTTOU case we should not return EIO when the signal is ignored/blocked, but in the SIGTTIN case we must. See also: POSIX issue 7 section 11.1.4	2009-04-08 15:56:50 +00:00
Robert Watson	5d5c174869	Nul-terminate strings in the VFS name cache, which negligibly changes the size and cost of name cache entries, but makes adding debugging and tracing easier. Add SDT DTrace probes for various namecache events: vfs:namecache:enter:done - new entry in the name cache, passed parent directory vnode pointer, name added to the cache, and child vnode pointer. vfs:namecache:enter_negative:done - new negative entry in the name cache, passed parent vnode pointer, name added to the cache. vfs:namecache:fullpath:enter - call to vn_fullpath1() is made, passed the vnode to resolve to a name. vfs:namecache:fullpath:hit - vn_fullpath1() successfully resolved a search for the parent of an object using the namecache, passed the discovered parent directory vnode pointer, name, and child vnode pointer. vfs:namecache:fullpath:miss - vn_fullpath1() failed to resolve a search for the parent of an object using the namecache, passed the child vnode pointer. vfs:namecache:fullpath:return - vn_fullpath1() has completed, passed the error number, and if that is zero, the vnode to resolve, and the returned path. vfs:namecache:lookup:hit - positive name cache entry hit, passed the parent directory vnode pointer, name, and child vnode pointer. vfs:namecache:lookup:hit_negative - negative name cache entry hit, passed the parent directory vnode pointer and name. vfs:namecache:lookup:miss - name cache miss, passed the parent directory pointer and the full remaining component name (not terminated after the cache miss component). vfs:namecache:purge:done - name cache purge for a vnode, passed the vnode pointer to purge. vfs:namecache:purge_negative:done - name cache purge of negative entries for children of a vnode, passed the vnode pointer to purge. vfs:namecache:purgevfs - name cache purge for a mountpoint, passed the mount pointer. Separate probes will also be invoked for each cache entry zapped. vfs:namecache:zap:done - name cache entry zapped, passed the parent directory vnode pointer, name, and child vnode pointer. vfs:namecache:zap_negative:done - negative name cache entry zapped, passed the parent directory vnode pointer and name. For any probes involving an extant name cache entry (enter, hit, zap), we use the nul-terminated string for the name component. For misses, the remainder of the path, including later components, is provided as an argument instead since there is no handy nul-terminated version of the string around. This is arguably a bug. MFC after: 1 month Sponsored by: Google, Inc. Reviewed by: jhb, kan, kib (earlier version)	2009-04-07 20:58:56 +00:00
Robert Watson	4b4e58badf	Add SDT DTrace probes for namei(): vfs:namei:lookup:entry takes parent directory vnode pointer, path to look up, and lookup flags. vfs:namei:lookup:return takes an error value, and if successful, the returned vnode pointer. MFC after: 1 month	2009-04-06 10:32:40 +00:00
Dmitry Chagin	cd899aad76	Fix KBI breakage by r190520 which affects older linux.ko binaries: 1) Move the new field (brand_note) to the end of the Brandinfo structure. 2) Add a new flag BI_BRAND_NOTE that indicates that the brand_note pointer is valid. 3) Use the brand_note field if the flag BI_BRAND_NOTE is set and as old modules won't have the flag set, so the new field brand_note would be ignored. Suggested by: jhb Reviewed by: jhb Approved by: kib (mentor) MFC after: 6 days	2009-04-05 09:27:19 +00:00
Alexander Kabaev	bb6418cbe3	Revert change 190655 temporarily. It breaks many setups where nullfs is used and needs to be revisited.	2009-04-04 17:48:38 +00:00
Marcel Moolenaar	27457a80e2	PowerPC, meet kernel core dumps. The support is based on a generic dumper that creates an ELF core file and uses PMAP functions to scan and iterate over memory chunks, as well as handle memory mappings used during dumping. The PMAP layer can choose to return physical memory chunks or virtual memory chunks. For minidumps, the chunks should be virtual. The default MMU I/F implementation for the scan_md() method returns NULL. Thus, when a PMAP implementation does not implement the required methods, an empty core file is created. Here, empty means having an ELF header only. Obtained from: Juniper Networks	2009-04-04 02:12:37 +00:00
Andrew Thompson	626fc9fe3d	Add a how argument to root_mount_hold() so it can be passed NOWAIT and be called in situations where sleeping isn't allowed.	2009-04-03 19:46:12 +00:00
Peter Wemm	0e875ecafe	vn_vptocnp() unlocks the name cache and forgets to re-lock it before returning in one error case, and mistakenly unlocks it for the umount -f case.	2009-04-02 21:16:20 +00:00
Christian Brueffer	1fa80eb15c	Fix memory leak in semunload(). PR: 133064 Submitted by: Mateusz Guzik <mjguzik@gmail.com> MFC after: 1 week	2009-03-30 15:01:29 +00:00
Andrew Thompson	46b70f07bc	Further rate limit the root wait status, it will be printed once per root_mount_rel() wakeup.	2009-03-30 05:57:55 +00:00
Alexander Kabaev	607fc40b04	Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache in directory vnodes. Allow namecache dotdot entry to be created pointing from child vnode to parent vnode if no existing links in opposite direction exist. Use direct link from parent to child for dotdot lookups otherwise. This restores more efficient dotdot caching in NFS filesystems which was lost when vnodes stopped being type stable. Reviewed by: kib	2009-03-29 21:25:40 +00:00
Jamie Gritton	8571af59e5	Whitespace/spelling fixes in advance of upcoming functional changes. Approved by: bz (mentor)	2009-03-27 13:13:59 +00:00
Andrew Thompson	d24d45d9a9	Skip the allocation of the root hold token if the mount already happened.	2009-03-27 03:52:08 +00:00
John Baldwin	2401c73637	When looking up the parent devclass of a new devclass, create the parent devclass if it doesn't already exist.	2009-03-25 17:02:05 +00:00
John Baldwin	049ce0934f	When a file lookup fails due to encountering a doomed vnode from a forced unmount, consistently return ENOENT rather than EBADF. Reviewed by: kib MFC after: 1 month	2009-03-24 18:16:42 +00:00
Jung-uk Kim	eae44ae03d	Clean up MI inittodr(9) and kill noop code. It was derived from i386 version long ago but never resync'ed again. Originally, i386 version compared the current time from realtime clock with time_second (which was just `time' in the old days). When this MI version was written, it was wrongly compared against `base' AND never used because of a bug (typo?) in the code. This check was killed in i386 version when home-rolled calendaric calculation was removed. Now, we just remove the code here as well to make the code simpler.	2009-03-23 21:16:21 +00:00
John Baldwin	9b84ba1cbb	Improve the description of a few sysctls. Submitted by: bde (partially) MFC after: 3 days	2009-03-23 20:18:06 +00:00
Alexander Kabaev	9999864a87	Add safety check that does not allow empty strings to be queued to the devctl notification queue. Empty strings cause devctl read call to return 0 and result in devd exiting prematurely. The actual offender (ugen notes for root hubs) will be fixed by separate commit.	2009-03-23 01:13:34 +00:00
Colin Percival	3f935cf342	Correctly sanity-check timer IDs. [SA-09:06] Limit the size of malloced buffer when dumping environment variables. [EN-09:01] Approved by: so (cperciva) Approved by: re (kensmith) Security: FreeBSD-SA-09:06.ktimer Errata: FreeBSD-EN-09:01.kenv	2009-03-23 00:00:50 +00:00
Konstantin Belousov	267c52fc98	Fix several issues with parsing the notes for ELF objects. A badly formed ELF note may cause the calculated pointer to the next note to point either after the note region, which was checked in the code, or before the region, which was not checked [1]. Remember the first note location in note0 and leap out if the note is not between note0 and note_end. In a similar way, a badly formed note may cause an infinite loop by pointing the next note into the same or a previous note. Guard against this by limiting the number of loop iterations to an arbitrarily chosen big number. For clarity, check the calculated note alignment in each iteration. Reported by: Chris Palmer <chris noncombatant org> [1] PR: kern/132886 Reviewed and tested by: dchagin MFC after: 3 days	2009-03-22 13:42:41 +00:00
Konstantin Belousov	15fb32c07d	Do not underflow the buffer and then report the problem. Check for the condition before the buffer write. Also, since buflen is unsigned, previous check was ignored. Reviewed by: marcus Tested by: pho	2009-03-20 11:08:57 +00:00
Konstantin Belousov	83817ce3b1	Remove unneeded braces to reduce used vertical screen space. The location was missed in r190140.	2009-03-20 11:03:55 +00:00
Konstantin Belousov	9194007261	Do not forget to adjust buflen for the first resolution of the path from namecache. While there, compare pointers for equality. Reviewed by: marcus Tested by: pho	2009-03-20 11:00:39 +00:00
Konstantin Belousov	065fc451f8	The nc_nlen member of the struct namecache contains the length of the cached name, not the length + 1. PR: 132620, 132542 Reported by: bf2006a yahoo com Tested by: bf2006a, pho Reviewed by: marcus	2009-03-20 10:59:06 +00:00
Konstantin Belousov	c4a8c2ee24	When ktracing namei operations, log a result of the __getcwd(). MFC after: 1 week	2009-03-20 10:47:16 +00:00
Konstantin Belousov	bf5c835e1c	Remove unneeded braces to reduce used vertical screen space.	2009-03-20 10:04:00 +00:00
Attilio Rao	76ed3c71f1	Fix a long-standing bug that crept in over several revisions: B_DELWRI cleanup and vnode disassociation should happen just before assigning the buffer to a queue. Reported by: miwi, Volker <volker at vwsoft dot com>, Ben Kaduk <minimarmot at gmail dot com>, Christopher Mallon <christoph dot mallon at gmx dot de> Tested by: lulf, miwi	2009-03-17 16:30:49 +00:00
Konstantin Belousov	3ff063577b	Supply AT_EXECPATH auxinfo entry to the interpreter, both for native and compat32 binaries. Tested by: pho Reviewed by: kan	2009-03-17 12:53:28 +00:00
Konstantin Belousov	429f5a589b	Use the properly sized types for ELF object header and program headers. This fixes osrel fetching from the FreeBSD branding note for the 64bit platforms. Reported by: swell.k gmail com Reviewed by: dchagin Tested by: dchagin, swell.k gmail com	2009-03-17 09:50:40 +00:00
Jung-uk Kim	c66d2b38c8	Initial suspend/resume support for amd64. This code is heavily inspired by Takanori Watanabe's experimental SMP patch for i386 and a large portion was shamelessly cut and pasted from Peter Wemm's AP boot code.	2009-03-17 00:48:11 +00:00
Konstantin Belousov	c1d8b5e82c	Fix two issues with bufdaemon, often causing the processes to hang in the "nbufkv" sleep. First, ffs background cg group block write requests a new buffer for the shadow copy. When ffs_bufwrite() is called from the bufdaemon due to a buffer shortage, requesting the buffer deadlocks bufdaemon. Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk to not block while allocating the buffer, and return failure instead. Add a flag argument to geteblk() to allow passing the flags to getblk(). Do not repeat the getnewbuf() call from geteblk() if buffer allocation failed and either GB_NOWAIT_BD is specified, or geteblk() is called from bufdaemon (or its helper, see below). In ffs_bufwrite(), fall back to synchronous cg block write if shadow block allocation failed. Since r107847, buffer write assumes that the vnode owning the buffer is locked. The second problem is that the buffer cache may accumulate many buffers belonging to a limited number of vnodes. With such a workload, quite often threads that own the mentioned vnode locks are trying to read another block from the vnodes, and, due to buffer cache exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make any substantial progress because the vnodes are locked. Allow the threads owning vnode locks to help the bufdaemon by doing the flush pass over the buffer cache before getnewbuf() goes into uninterruptible sleep. Move the flushing code from buf_daemon() to a new helper function buf_do_flush(), which is called from getnewbuf(). The number of buffers flushed by a single call to buf_do_flush() from getnewbuf() is limited by the new sysctl vfs.flushbufqtarget. Prevent recursive calls to buf_do_flush() by marking the bufdaemon and threads that temporarily help bufdaemon with the TDP_BUFNEED flag. In collaboration with: pho Reviewed by: tegge (previous version) Tested by: glebius, yandex ... MFC after: 3 weeks	2009-03-16 15:39:46 +00:00
Robert Watson	e5adda3d51	Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced in FreeBSD 5.x to allow network device drivers to run with Giant despite the network stack being Giant-free. This significantly simplifies calls into ioctl() on network interfaces, especially in the multicast code, as well as eliminates deferred invocation of interface if_start routines. Disable the build on device drivers still depending on IFF_NEEDSGIANT as they no longer compile. They will be removed in a few weeks if they haven't been made MPSAFE in that time. Disabled drivers: if_ar if_axe if_aue if_cdce if_cue if_kue if_ray if_rue if_rum if_sr if_udav if_ural if_zyd Drivers that were already disabled because of tty changes: if_ppp if_sl Discussed on: arch@	2009-03-15 14:21:05 +00:00
Jeff Roberson	1723a06485	- Wrap lock profiling state variables in #ifdef LOCK_PROFILING blocks.	2009-03-15 08:03:54 +00:00
Jeff Roberson	2e6b8de462	- Implement a new mechanism for resetting lock profiling. We now guarantee that all cpus have acknowledged the cleared enable int by scheduling the resetting thread on each cpu in succession. Since all lock profiling happens within a critical section this guarantees that all cpus have left lock profiling before we clear the datastructures. - Assert that the per-thread queue of locks lock profiling is aware of is clear on thread exit. There were several cases where this was not true, which slows lock profiling and leaks information. - Remove all objects from all lists before clearing any per-cpu information in reset. Lock profiling objects can migrate between per-cpu caches and previously these migrated objects could be zero'd before they'd been removed. Discussed with: attilio Sponsored by: Nokia	2009-03-15 06:41:47 +00:00
Jeff Roberson	d3df4af368	- When a mutex is destroyed while locked we need to inform lock profiling that it has been released.	2009-03-14 11:43:38 +00:00
Jeff Roberson	04a2868980	- Call lock_profile_release when we're transitioning a lock to be owned by LK_KERNPROC. Discussed with: attilio	2009-03-14 11:43:02 +00:00
Jeff Roberson	53a6c8b3ac	- Fix an error that occurs when mp_ncpu is an odd number. steal_thresh is calculated as 0 which causes errors elsewhere. Submitted by: KOIE Hidetaka <koie@suri.co.jp> - When sched_affinity() is called with a thread that is not curthread we need to handle the ON_RUNQ() case by adding the thread to the correct run queue. Submitted by: Justin Teller <justin.teller@gmail.com> MFC after: 1 Week	2009-03-14 11:41:36 +00:00
Dmitry Chagin	32c01de21c	Implement a new way of branding ELF binaries by looking at the ".note.ABI-tag" section. The search order of a brand is changed: now the ".note.ABI-tag" section is looked through first. Move the code which fetches osreldate for an ELF binary to the check_note() handler. PR: 118473 Approved by: kib (mentor)	2009-03-13 16:40:51 +00:00
David Xu	326bf9493d	1) Check NULL pointer before calling umtx_pi_adjust_locked(), this avoids a PANIC. 2) Rework locking for POSIX priority-mutex, this fixes a race where a thread may wait there forever even if the mutex is unlocked.	2009-03-13 06:06:20 +00:00
John Baldwin	42dd14bada	Change the sysctls for maxbcache and maxswzone from int to long. I missed this earlier since these sysctls don't exist in 7.x yet.	2009-03-12 17:23:02 +00:00
John Baldwin	b9f2a7da58	Export the current values of nbuf, ncallout, and nswbuf via read-only sysctls that match the tunable names. MFC after: 3 days	2009-03-12 17:21:58 +00:00
Bruce M Simpson	77d8bf9cc7	Ensure that the semaphore value is re-checked after sem_lock is re-acquired, after the condition variable is signalled. PR: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/127545 MFC after: 5 days Reviewed by: attilio	2009-03-12 10:36:39 +00:00
Bruce M Simpson	b2966a5a2f	Make semaphore debugging output more useful. PR: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/127545 MFC after: 5 days Submitted by: Philip Semanchuk	2009-03-12 10:34:16 +00:00
Robert Watson	ae81968fd1	When writing out updated pollfd records when returning from poll(), only copy out the revents field, not the whole pollfd structure. Otherwise, if the events field is updated concurrently by another thread, that update may be lost. This issue apparently causes problems for the JDK on FreeBSD, which expects the Linux behavior of not updating all fields (somewhat oddly, Solaris does not implement the required behavior, but presumably our adaptation of the JDK is based on the Linux port?). MFC after: 2 weeks PR: kern/130924 Submitted by: Kurt Miller <kurt @ intricatesoftware.com> Discussed with: kib	2009-03-11 22:00:03 +00:00
John Baldwin	a56be37e68	Add a new type of KTRACE record for sysctl(3) invocations. It uses the internal sysctl_sysctl_name() handler to map the MIB array to a string name and logs this name in the trace log. This can be useful to see exactly which sysctls a thread is invoking. MFC after: 1 month	2009-03-11 21:48:36 +00:00
John Baldwin	a6b6eb6b6b	Gah, fix the code to match the comment. For non-open lookups use a shared vnode lock for the leaf vnode if LOCKSHARED is set. Submitted by: rdivacky	2009-03-11 14:39:55 +00:00
John Baldwin	33fc362512	Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a filesystem supports additional operations using shared vnode locks. Currently this is used to enable shared locks for open() and close() of read-only file descriptors. - When an ISOPEN namei() request is performed with LOCKSHARED, use a shared vnode lock for the leaf vnode only if the mount point has the extended shared flag set. - Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but not O_CREAT. - Use a shared vnode lock around VOP_CLOSE() if the file was opened with O_RDONLY and the mountpoint has the extended shared flag set. - Adjust md(4) to upgrade the vnode lock on the vnode it gets back from vn_open() since it now may only have a shared vnode lock. - Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since FIFO's require exclusive vnode locks for their open() and close() routines. (My recent MPSAFE patches for UDF and cd9660 already included this change.) - Enable extended shared operations on UFS, cd9660, and UDF. Submitted by: ups Reviewed by: pjd (ZFS bits) MFC after: 1 month	2009-03-11 14:13:47 +00:00
Warner Losh	4782ea6768	Minor nits noticed by jhb@	2009-03-11 08:19:31 +00:00
John Baldwin	64ecd1399f	- Make maxpipekva a signed long rather than an unsigned long as overflow is more likely to be noticed with signed types. - Make amountpipekva a long as well to match maxpipekva. Discussed with: bde	2009-03-10 21:28:43 +00:00
John Baldwin	060e911cf4	In the ABI shim for vfs.bufspace, rather than truncating values larger than INT_MAX to INT_MAX, just go ahead and write out the full long to give an error of ENOMEM to the user process. Requested by: bde	2009-03-10 21:27:15 +00:00
John Baldwin	72150f4cf7	- Remove a recently added comment from kernel_sysctlbyname() that isn't needed. - Move the release of the sysctl sx lock after the vsunlock() in userland_sysctl() to restore the original memlock behavior of minimizing the amount of memory wired to handle sysctl requests. MFC after: 1 week	2009-03-10 17:00:28 +00:00
John Baldwin	38cce81ab3	Add an ABI compat shim for the vfs.bufspace sysctl for sysctl requests that try to fetch it as an int rather than a long. If the current value is greater than INT_MAX it reports a value of INT_MAX.	2009-03-10 15:26:50 +00:00
John Baldwin	5bd65606f4	Adjust some variables (mostly related to the buffer cache) that hold address space sizes to be longs instead of ints. Specifically, the following values are now longs: runningbufspace, bufspace, maxbufspace, bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace, hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a relatively small number (~ 44000) of buffers set in kern.nbuf would result in integer overflows resulting either in hangs or bogus values of hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see such problems. There was a check for a nbuf setting that would cause overflows in the auto-tuning of nbuf. I've changed it to always check and cap nbuf but warn if a user-supplied tunable would cause overflow. Note that this changes the ABI of several sysctls that are used by things like top(1), etc., so any MFC would probably require some gross shims to allow for that. MFC after: 1 month	2009-03-09 19:35:20 +00:00
John Baldwin	4ab2a9a022	Move the debug.hashstat sysctl tree under DIAGNOSTIC. I measured the debug.hashstat.rawnchash sysctl in particular as taking 7 milliseconds on a 3GHz Intel Xeon (4x2) running 7.1. It accounted for almost a quarter of the total runtime of 'sysctl -a'. It also performs lots of copyout's while holding the namecache lock (this does not attempt to fix that). MFC after: 2 weeks	2009-03-09 19:04:53 +00:00
Warner Losh	2a7e13e5ad	Fix a long-standing bug in newbus. It was introduced when subclassing was introduced. If you have a bus, say cardbus, that is derived from a base-bus (say PCI), then ordinarily all PCI drivers would attach to cardbus devices. However, there had been one exception: kldload wouldn't work. The problem is in devclass_add_driver. In this routine, all we did was call to the pci device's BUS_DRIVER_ADDED routine. However, since cardbus bus instances had a different devclass, none of them were called. The solution is to call all subclass devclasses, recursively down the tree, of the class that was loaded. Since we don't have a 'children class' pointer, we search the whole list of devclasses for a class whose parent matches. Since this is only done at kldload time, this isn't as bad as it sounds. In addition, we short-circuit the whole process by marking those classes with subclasses with a flag. We'll likely have to reevaluate this method if the number of devclasses with subclasses gets large. This means we can remove the "cardbus" lines from all the PCI drivers since we have no cardbus specific attach device attachments in the tree. # Also: minor tweak to an error message	2009-03-09 13:20:23 +00:00
Robert Watson	83160d1408	By default, don't compile in counters of calls to various time query functions in the kernel, as these effectively serialize parallel calls to the gettimeofday(2) system call, as well as other kernel services that use timestamps. Use the NetBSD version of the fix (kern_tc.c:1.32 by ad@) as they have picked up our timecounter code and also ran into the same problem. Reported by: kris Obtained from: NetBSD MFC after: 3 days	2009-03-08 22:19:28 +00:00
Robert Watson	3dab55bc86	Decompose the global UNIX domain sockets rwlock into two different locks: a global list/counter/generation counter protected by a new mutex unp_list_lock, and a global linkage rwlock, unp_global_rwlock, which protects the connections between UNIX domain sockets. This eliminates conditional lock acquisition that was previously a property of the global lock being held over sonewconn() leading to a call to uipc_attach(), which also required the global lock, but couldn't rely on it as other paths existed to uipc_attach() that didn't hold it: now uipc_attach() uses only the list lock, which follows the linkage lock in the lock order. It may also reduce contention on the global lock for some workloads. Add global UNIX domain socket locks to hard-coded witness lock order. MFC after: 1 week Discussed with: kris	2009-03-08 21:48:29 +00:00
Joe Marcus Clarke	f8ecc40737	Add a default implementation for VOP_VPTOCNP(9) which scans the parent directory of a vnode to find a dirent with a matching file number. The name from that dirent is then used to provide the component name. Note: if the initial vnode argument is not a directory itself, then the default VOP_VPTOCNP(9) implementation still returns ENOENT. Reviewed by: kib Approved by: kib Tested by: pho	2009-03-08 19:05:53 +00:00
Robert Watson	fefd0ac8a9	Remove 'uio' argument from MAC Framework and MAC policy entry points for extended attribute get/set; in the case of get an uninitialized user buffer was passed before the EA was retrieved, making it of relatively little use; the latter was simply unused by any policies. Obtained from: TrustedBSD Project Sponsored by: Google, Inc.	2009-03-08 12:32:06 +00:00
Robert Watson	6f6174a762	Improve the consistency of MAC Framework and MAC policy entry point naming by renaming certain "proc" entry points to "cred" entry points, reflecting their manipulation of credentials. For some entry points, the process was passed into the framework but not into policies; in these cases, stop passing in the process since we don't need it. mac_proc_check_setaudit -> mac_cred_check_setaudit mac_proc_check_setaudit_addr -> mac_cred_check_setaudit_addr mac_proc_check_setauid -> mac_cred_check_setauid mac_proc_check_setegid -> mac_cred_check_setegid mac_proc_check_seteuid -> mac_cred_check_seteuid mac_proc_check_setgid -> mac_cred_check_setgid mac_proc_check_setgroups -> mac_cred_check_setgroups mac_proc_check_setregid -> mac_cred_check_setregid mac_proc_check_setresgid -> mac_cred_check_setresgid mac_proc_check_setresuid -> mac_cred_check_setresuid mac_proc_check_setreuid -> mac_cred_check_setreuid mac_proc_check_setuid -> mac_cred_check_setuid Obtained from: TrustedBSD Project Sponsored by: Google, Inc.	2009-03-08 10:58:37 +00:00
Konstantin Belousov	125dcf8c7d	Extract the no_poll() and vop_nopoll() code into the common routine poll_no_poll(). Return a poll_no_poll() result from devfs_poll_f() when filedescriptor does not reference the live cdev, instead of ENXIO. Noted and tested by: hps MFC after: 1 week	2009-03-06 15:35:37 +00:00
Konstantin Belousov	45329b60da	Systematically use vm_size_t to specify the size of the segment for VM KPI. Do not overload the local variable size in kern_shmat() due to vm_size_t change. Fix style bug by adding explicit comparison with 0. Discussed with: bde MFC after: 1 week	2009-03-05 11:45:42 +00:00
Dmitry Chagin	b2421c29f6	As suggested by jhb@, panic in case ncpus == 0. It helps to catch bugs in the callers. Approved by: kib (mentor) MFC after: 5 days	2009-03-03 17:34:09 +00:00
Robert Watson	73e416e35d	Reduce the verbosity of SDT trace points for DTrace by defining several wrapper macros that allow trace points and arguments to be declared using a single macro rather than several. This means a lot less repetition and vertical space for each trace point. Use these macros when defining privilege and MAC Framework trace points. Reviewed by: jb MFC after: 1 week	2009-03-03 17:15:05 +00:00
Jamie Gritton	f86bce5ed0	Extend the "vfsopt" mount options for more general use. Make struct vfsopt and the vfs_buildopts function public, and add some new fields to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and vfs_opterror. Further extend the interface to allow reading options from the kernel in addition to sending them to the kernel, with vfs_setopt and related functions. While this allows the "name=value" option interface to be used for more than just FS mounts (planned use is for jails), it retains the current "vfsopt" name and <sys/mount.h> requirement. Approved by: bz (mentor)	2009-03-02 23:26:30 +00:00
Alexander Kabaev	5ab4bb35fb	Change vfs_busy to wait until an outcome of pending unmount operation is known and to retry or fail according to that outcome. This fixes the problem with namespace traversing programs failing with random ENOENT errors if someone just happened to try to unmount that same filesystem at the same time. Reported by: dhw Reviewed by: kib, attilio Sponsored by: Juniper Networks, Inc.	2009-03-02 20:51:39 +00:00
Konstantin Belousov	65067cc8b0	Correct types of variables used to track amount of allocated SysV shared memory from int to size_t. Implement a workaround for the current ABI, which cannot properly save or report the size of a shared memory segment larger than 2 GB. This makes it possible to use shared memory segments larger than 2 GB on 64-bit architectures. Please note the new BUGS section in shmctl(2) and the UPDATING note for the limitations of this temporary solution. Reviewed by: csjp Tested by: Nikolay Dzham <i levsha org ua> MFC after: 2 weeks	2009-03-02 18:53:30 +00:00
Konstantin Belousov	2883703e00	Use the p_sysent->sv_flags flag SV_ILP32 to detect 32bit process executing on 64bit kernel. This eliminates the direct comparisons of p_sysent with &ia32_freebsd_sysvec, that were left intact after r185169.	2009-03-02 18:43:50 +00:00
Dmitry Chagin	6485a22ccb	Fix range-check error introduced in r182292. Also do not do anything if all processors in the map are not available, simply return. Approved by: kib (mentor) MFC after: 1 week	2009-03-01 14:26:24 +00:00
Ed Schouten	c4d4bcdaf6	Improve my previous changes to the TTY code: also remove memcpy(). It's better to just use internal language constructs, because it is likely the compiler has a better opinion on whether to perform inlining, which is very likely to happen to struct winsize. Submitted by: Christoph Mallon <christoph mallon gmx de>	2009-03-01 09:50:13 +00:00
Andrew Thompson	fef11cb704	Move the NORELEASE check to after the recurse count decrement and bailout, this is not counted as actually releasing the lock.	2009-02-28 19:10:43 +00:00
Ed Schouten	4b2d6aaf4b	Replace bcopy() calls inside the TTY layer with memcpy()/strlcpy(). In all these cases the buffers never overlap. Program names are also likely to be shorter, so use a regular strlcpy() to copy p_comm.	2009-02-28 14:20:26 +00:00
Bjoern A. Zeeb	33553d6e99	For all files including net/vnet.h directly include opt_route.h and net/route.h. Remove the hidden include of opt_route.h and net/route.h from net/vnet.h. We need to make sure that both opt_route.h and net/route.h are included before net/vnet.h because of the way MRT figures out the number of FIBs from the kernel option. If we do not, we end up with the default number of 1 when including net/vnet.h and array sizes are wrong. This does not change the list of files which depend on opt_route.h but we can identify them now more easily.	2009-02-27 14:12:05 +00:00
Ed Schouten	91c3cbfe1f	Remove redundant code in printf() and vprintf(). printf() and vprintf() are exactly the same, except the way arguments are passed. Just like we see in other pieces of code (i.e. libc's printf()), implement printf() using vprintf(). Submitted by: Christoph Mallon <christoph mallon gmx de>	2009-02-27 13:28:54 +00:00
Ed Schouten	ff7b7d9039	Revert previous commit to subr_prf.c and make it more tidy. As mentioned by bz and bde, the change I made wasn't the proper way to fix. Inspired by bde's patch, perform some small cleanups to uprintf(). Reviewed by: bz	2009-02-27 12:50:25 +00:00
Ed Schouten	69c9eff894	Remove unneeded pointer `ndp'. Inside do_execve(), we have a pointer `ndp', which always points to `&nd'. I can imagine a primitive (non-optimizing) compiler to really reserve space for such a pointer, so just remove the variable and use `&nd' directly.	2009-02-26 16:32:48 +00:00
Ed Schouten	c90c9021e9	Remove even more unneeded variable assignments. kern_time.c: - Unused variable `p'. kern_thr.c: - Variable `error' is always caught immediately, so no reason to initialize it. There is no way that error != 0 at the end of create_thread(). kern_sig.c: - Unused variable `code'. kern_synch.c: - `rval' is always assigned in all different cases. kern_rwlock.c: - `v' is always overwritten with RW_UNLOCKED further on. kern_malloc.c: - `size' is always initialized with the proper value before being used. kern_exit.c: - `error' is always caught and returned immediately. abort2() never returns a non-zero value. kern_exec.c: - `len' is always assigned inside the if-statement right below it. tty_info.c: - `td' is always overwritten by FOREACH_THREAD_IN_PROC(). Found by: LLVM's scan-build	2009-02-26 15:51:54 +00:00
Ed Schouten	318b1c3fd0	Remove unneeded variable `ocn_mute'. Found by: LLVM's scan-build	2009-02-26 13:01:45 +00:00
Ed Schouten	5225593633	Remove unused variables `p' and unneeded assignments of` rval'. Found by: LLVM's scan-build	2009-02-26 13:00:13 +00:00
Ed Schouten	2bbada90c8	Remove redundant assignment of `p'. `p' is already initialized with `td->td_proc'. Because td is always curthread, it is safe to initialize it without any locks. Found by: LLVM's scan-build	2009-02-26 12:12:34 +00:00
Robert Watson	6efcc2f26a	Add static tracing for privilege checking: priv:kernel:priv_check:priv_ok fires for granted privileges priv:kernel:priv_check:priv_err fires for denied privileges The first argument is the requested privilege number. The naming convention is a little different from the OpenSolaris equivalent because we can't have '-' in probefunc names, and our privilege namespace is different. MFC after: 1 week	2009-02-26 10:56:13 +00:00
Ed Schouten	9e5775857d	Silence compiler warning inside our ^T handler. It turns out we're casting fixpt_t* to int*. Spotted by: clang	2009-02-26 10:38:19 +00:00
Ed Schouten	1d952ed28c	Use unsigned longs for the TTY's sysctl stats. Spotted by: clang	2009-02-26 10:28:32 +00:00
Ed Schouten	1e737f33a0	Don't use PTY name as format string, even though it isn't insecure here. It's guaranteed that the `name' variable always contains a string of the form pty[l‐sL‐S][0‐9a‐v], but I'd rather keep the compiler happy (LLVM).	2009-02-26 10:14:10 +00:00
Jamie Gritton	613042491b	Add support for methods to the OSD subsystem. Each object type has a predefined set of methods, which are set in osd_register() and called via osd_call(). Currently, no methods are defined, though prison objects will have some in the future. Expand the locking from a single per-type mutex to three different kinds of locks (four if you include the requirement that the container (e.g. prison) be locked when getting/setting data). This clears up one existing issue, as well as others added by the method support. Approved by: bz (mentor)	2009-02-21 11:15:38 +00:00
Ed Schouten	0eee862a54	Don't make Linux stat() open character devices to resolve its name. The existing code calls kern_open() to resolve the vnode of a pathname right after a stat(). This is not correct, because it causes random character devices to be opened in /dev. This means ls'ing a tape streamer will cause it to rewind, for example. Changes I have made: - Add kern_statat_vnhook() to allow binary emulators to `post-process' struct stat, using the proper vnode. - Remove unneeded printf's from stat() and statfs(). - Make the Linuxolator use kern_statat_vnhook(), replacing translate_path_major_minor_at(). - Let translate_fd_major_minor() use vp->v_rdev instead of vp->v_un.vu_cdev. Result: crw-rw-rw- 1 root root 0, 14 Feb 20 13:54 /dev/ptmx crw--w---- 1 root adm 136, 0 Feb 20 14:03 /dev/pts/0 crw--w---- 1 root adm 136, 1 Feb 20 14:02 /dev/pts/1 crw--w---- 1 ed tty 136, 2 Feb 20 14:03 /dev/pts/2 Before this commit, ptmx also had a major number of 136, because it silently allocated and deallocated a pseudo-terminal. Device nodes that cannot be opened now have proper major/minor-numbers. Reviewed by: kib, netchild, rdivacky (thanks!)	2009-02-20 13:05:29 +00:00
John Baldwin	03964c8e09	Enable caching of negative pathname lookups in the NFS client. To avoid stale entries, we save a copy of the directory's modification time when the first negative cache entry was added in the directory's NFS node. When a negative cache entry is hit during a pathname lookup, the parent directory's modification time is checked. If it has changed, all of the negative cache entries for that parent are purged and the lookup falls back to using the RPC. This required adding a new cache_purge_negative() method to the name cache to purge only negative cache entries for a given directory. Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp Reviewed by: mohans	2009-02-19 22:28:48 +00:00
Ed Schouten	40d05103d8	Squash some small bugs in pts(4). - Don't return a negative errno when using an unknown ioctl() on a pseudo-terminal master device. Be sure to convert ENOIOCTL to ENOTTY, just like the TTY layer does. - Even though we should return st_rdev of the master device node when emulating pty(4) devices, FIODGNAME should still return the name of the slave device. Otherwise ptsname(3) and ttyname(3) return an invalid device name.	2009-02-19 17:54:42 +00:00
Attilio Rao	f8d9048018	- Add a function (fill_kinfo_aggregate()) which aggregates relevant members for a kinfo entry on a process-wide system. - Use the newly introduced function in order to fix cases like KERN_PROC_PROC where aggregating stats are broken because they just consider the first thread in the pool for each process. (Note, additionally, that KERN_PROC_PROC is rather inaccurate for thread-wide information like the 'state' of the process. Such information should perhaps be invalidated and forcibly discarded by the consumers?). - Simplify the logic of sysctl_out_proc() and adjust the fill_kinfo_thread() accordingly. - Remove checks on the FIRST_THREAD_IN_PROC() being NULL but add assertions. This patch should fix aggregate statistics for KERN_PROC_PROC. This is one of the reasons why top doesn't use this option; now it can use it safely. ps, when launched in order to display just processes, now should report correct cpu utilization percentages and times (as opposed to the old code). Reviewed by: jhb, emaste Sponsored by: Sandvine Incorporated	2009-02-18 21:52:13 +00:00
Joe Marcus Clarke	0618630015	Remove the printf's when the vnode to be exported for procstat is not a VDIR. If the file system backing a process' cwd is removed, and procstat -f PID is called, then these messages would have been printed. The extra verbosity is not required in this situation. Requested by: kib Approved by: kib	2009-02-14 21:55:09 +00:00
Joe Marcus Clarke	03fd9c2092	Change two KASSERTS to printfs and simple returns. Stress testing has revealed that a process' current working directory can be VBAD if the directory is removed. This can trigger a panic when procstat -f PID is run. Tested by: pho Discovered by: phobot Reviewed by: kib Approved by: kib	2009-02-14 21:12:24 +00:00
Andrew Thompson	a1797ef6c8	Remove semicolon left in the last commit Spotted by: csjp	2009-02-13 18:51:39 +00:00
John Baldwin	ea77ff0a15	Use shared vnode locks when invoking VOP_READDIR(). MFC after: 1 month	2009-02-13 18:18:14 +00:00
Luigi Rizzo	d4619572b4	Clarify and reimplement the bioq API so that bioq_disksort() has the correct behaviour (sorting by distance from the current head position in the scan direction) and bioq_insert_head() and bioq_insert_tail() have a well defined (and useful) behaviour, especially when intermixed with calls to bioq_disksort(). In particular: - fix a bug in the existing bioq_disksort() that did not use the current head position correctly; - redefine semantics of bioq_insert_head() and bioq_insert_tail(). bioq_insert_tail() can now be used as a barrier between previous and subsequent calls to bioq_disksort(). The code is heavily documented in the source code so please refer to that for the details. Much of this code comes from Fabio Checconi. Also thanks to Kirk for feedback on the (re)definition of bioq_insert_tail(). NOTE: in the current tree there is only a handful of files which intermix calls to bioq_disksort() with bioq_insert_head() and bioq_insert_tail(). The ordering of the queue in these situations was not specified (nor easy to figure out) before, so I doubt any of that code could be affected by the specification of the API. Also note that the current implementation is significantly simpler than the previous one (also used in ata_sort_queue()). It would be useful to reimplement ata_sort_queue() using the same code used in bioq_disksort(). MFC after: 1 week	2009-02-13 11:36:32 +00:00
Andrew Thompson	24ef070126	Check the exit flag at the start of the taskqueue loop rather than the end. It is possible to tear down the taskqueue before the thread has run and the taskqueue loop would sleep forever. Reviewed by: sam MFC after: 1 week	2009-02-13 01:16:51 +00:00
Ed Schouten	c0086bf202	Serialize write() calls on TTYs. Just like the old TTY layer, the current MPSAFE TTY layer does not make any attempt to serialize calls of write(). Data is copied into the kernel in 256 (TTY_STACKBUF) byte chunks. If a write() call occurs at the same time, the data may interleave. This is especially likely when the TTY starts blocking, because the output queue reaches the high watermark. I've implemented this by adding a new flag, TTY_BUSY_OUT, which is used to mark a TTY as having a thread stuck in write(). Because I don't want non-blocking processes to be possibly blocked by a sleeping thread, I'm still allowing it to bypass the protection. According to this message, the Linux kernel returns EAGAIN in such cases, but I think that's a little too restrictive: http://kerneltrap.org/index.php?q=mailarchive/linux-kernel/2007/5/2/85418/thread PR: kern/118287	2009-02-11 16:28:49 +00:00
Robert Watson	54fffe2d67	Modify fdcopy() so that, during fork(2), it won't copy file descriptors from the parent to the child process if they have an operation vector of &badfileops. This narrows a set of races involving system calls that allocate a new file descriptor, potentially block for some extended period, and then return the file descriptor, when invoked by a threaded program that concurrently invokes fork(2). Similar approaches are used in both Solaris and Linux, and the wideness of this race was introduced in FreeBSD when we moved to a more optimistic implementation of accept(2) in order to simplify locking. A small race necessarily remains because the fork(2) might occur after the finit() in accept(2) but before the system call has returned, but that appears unavoidable using current APIs. However, this race is vastly narrower. The fix can be validated using the newfileops_on_fork regression test. PR: kern/130348 Reported by: Ivan Shcheklein <shcheklein at gmail dot com> Reviewed by: jhb, kib MFC after: 1 week	2009-02-11 15:22:01 +00:00
Warner Losh	c9584ebe61	o Use NULL in preference to 0 in pointer contexts. o Use newly minted KOBJMETHOD_END as appropriate o fix prototype for root_setup_intr.	2009-02-11 04:54:02 +00:00
Alexander Motin	e05e00bcae	Check for device_set_devclass() errors and skip driver probe/attach if any. Attach call without devclass set crashes the system. On resume AHCI driver sometimes tries to create duplicate adX device. It is surely his own problem, but IMHO it is not a reason to crash here. Other reasons are also possible.	2009-02-10 23:22:29 +00:00
Attilio Rao	a1d7ce03ea	Scanning all the formats for binary translation of modules loading can result in errors for a format loading but subsequent correct recognizing for another format. File format loading functions should avoid printing any additional informations but just returning appropriate (and different between each other) error condition, characterizing different informations. Additively, the linker should handle appropriately different format loading errors. While a general mechanism is desired, fix a simple and common case on amd64: file type is not recognized for link elf and confuses the linker. Printout an error if all the registered linker classes can't recognize and load the module. Reviewed by: jhb Sponsored by: Sandvine Incorporated	2009-02-10 15:50:19 +00:00
Robert Watson	e2757609ec	Remove extra 'comma = 0' in socket state printing code, which otherwise could lead to an extra comma in output. Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de>	2009-02-09 18:19:58 +00:00
Martin Blapp	37e399b26e	s/SS_FDREF/SS_NOFDREF/	2009-02-09 13:29:01 +00:00
Ed Schouten	89d647cb30	Remove a stale comment from the clists code. We don't support quote bits.	2009-02-09 11:27:56 +00:00
John Baldwin	8941aad19b	Tweak the output of VOP_PRINT/vn_printf() some. - Align the fifo output in fifo_print() with other vn_printf() output. - Remove the leading space from lockmgr_printinfo() so its output lines up in vn_printf(). - lockmgr_printinfo() now ends with a newline, so remove an extra newline from vn_printf().	2009-02-06 20:06:48 +00:00
Edward Tomasz Napierala	ec48c16f14	Add KASSERTs to make it easier to debug problems like the one fixed in r188141. Reviewed by: kib,attilio Approved by: rwatson (mentor) Tested by: pho Sponsored by: FreeBSD Foundation	2009-02-06 18:16:01 +00:00
John Baldwin	875b66a05b	Expand the scope of the sysctllock sx lock to protect the sysctl tree itself. Back in 1.1 of kern_sysctl.c the sysctl() routine wired the "old" userland buffer for most sysctls (everything except kern.vnode.*). I think to prevent issues with wiring too much memory it used a 'memlock' to serialize all sysctl(2) invocations, meaning that only one user buffer could be wired at a time. In 5.0 the 'memlock' was converted to an sx lock and renamed to 'sysctl lock'. However, it still only served the purpose of serializing sysctls to avoid wiring too much memory and didn't actually protect the sysctl tree as its name suggested. These changes expand the lock to actually protect the tree. Later on in 5.0, sysctl was changed to not wire buffers for requests by default (sysctl_handle_opaque() will still wire buffers larger than a single page, however). As a result, user buffers are no longer wired as often. However, many sysctl handlers still wire user buffers, so it is still desirable to serialize userland sysctl requests. Kernel sysctl requests are allowed to run in parallel, however. - Expose sysctl_lock()/sysctl_unlock() routines to exclusively lock the sysctl tree for a few places outside of kern_sysctl.c that manipulate the sysctl tree directly including the kernel linker and vfs_register(). - sysctl_register() and sysctl_unregister() require the caller to lock the sysctl lock using sysctl_lock() and sysctl_unlock(). The rest of the public sysctl API manage the locking internally. - Add a locked variant of sysctl_remove_oid() for internal use so that external uses of the API do not need to be aware of locking requirements. - The kernel linker no longer needs Giant when manipulating the sysctl tree. - Add a missing break to the loop in vfs_register() so that we stop looking at the sysctl MIB once we have changed it. MFC after: 1 month	2009-02-06 14:51:32 +00:00
John Baldwin	e4d9b9eb18	Drop the kernel linker lock while running SYSUNINIT routines and removing sysctls during a linker file unload. We drop the lock when doing similar operations during a linker file load. To close races, clear the LINKED flag before dropping the lock so that the linker file is no longer visible to userland. MFC after: 1 week	2009-02-05 23:01:36 +00:00
Attilio Rao	feabc903d9	Add more KTR_VFS logging point in order to have a more effective tracing. Reviewed by: brueffer, kib Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>	2009-02-05 15:03:35 +00:00
Ed Schouten	c3328b2ab8	Don't leave the console TTY constantly open. When we leave the console TTY constantly open, we never reset the termios attributes. This causes output processing, echoing, etc. not to be reset to the proper values when going into single user mode after the system has booted. It also causes nl-to-crnl-conversion not to take place during shutdown, which causes a `staircase effect'. This patch adds a new TTY flag, TF_OPENED_CONS, which is set when the TTY is opened through /dev/console. Because the flags are only used by the kernel and the pstat(8) utility, I've decided to renumber the TTY flags. This shouldn't be an issue, because the TTY layer is not yet part of a stable release. Reported by: Mark Atkinson <atkin901 yahoo com> Tested by: sepotvin	2009-02-05 14:21:09 +00:00
Jamie Gritton	ca04ba6430	Don't allow creating a socket with a protocol family that the current jail doesn't support. This involves a new function prison_check_af, like prison_check_ip[46] but that checks only the family. With this change, most of the errors generated by jailed sockets shouldn't ever occur, at least until jails are changeable. Approved by: bz (mentor)	2009-02-05 14:15:18 +00:00
Jamie Gritton	b89e82dd87	Standardize the various prison_foo_ip[46] functions and prison_if to return zero on success and an error code otherwise. The possible errors are EADDRNOTAVAIL if an address being checked for doesn't match the prison, and EAFNOSUPPORT if the prison doesn't have any addresses in that address family. For most callers of these functions, use the returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or EINVAL. Always include a jailed() check in these functions, where a non-jailed cred always returns success (and makes no changes). Remove the explicit jailed() checks that preceded many of the function calls. Approved by: bz (mentor)	2009-02-05 14:06:09 +00:00
Edward Tomasz Napierala	27dd8057d3	In some situations, mnt_lockref could go negative due to vfs_unbusy() being called without calling vfs_busy() first. This made umount(8) hang waiting for mnt_lockref to become zero, which would never happen. Reviewed by: kib Approved by: rwatson (mentor) Reported by: pho Found with: stress2 Sponsored by: FreeBSD Foundation	2009-02-05 08:46:18 +00:00
Robert Watson	fd4f1ebdfe	Remove written-to but never read local variable 'offset' from soreceive_dgram(). Submitted by: Christoph Mallon <christoph dot mallon at gmx dot de> MFC after: 1 week	2009-02-04 20:00:17 +00:00
Ed Schouten	f98f752202	Remove slush space from clists. Right now we only have a very small amount of drivers that use clists, but we still allocate 50 cblocks as slush space, which allows drivers to temporarily overcommit their storage. Most of the drivers don't allow this anyway. I've performed the following changes: - We don't allocate any cblocks on startup. - I've removed the DDB command, because it has nothing useful to print now. You can obtain the amount of allocated blocks by running `vmstat -m \| grep clist'. - I've removed cfreecount, which is now unused. - The old code first tries to allocate using M_NOWAIT, followed by M_WAITOK. This doesn't make any sense, so just remove this logic. It seems the drivers allow us to sleep anyway. We can even remove ccmax from clist_alloc_cblocks and c_cbmax from struct clist, but this breaks binary compatibility. This reduces the amount of allocated cblocks on my system from 54 to 4.	2009-02-04 17:10:01 +00:00
Ed Schouten	41ba7e9b13	Slightly improve the design of the TTY buffer. The TTY buffers used the standard <sys/queue.h> lists. Unfortunately they have a big shortcoming. If you want to have a double linked list, but no tail pointer, it's still not possible to obtain the previous element in the list. Inside the buffers we don't need them. This is why I switched to custom linked list macros. The macros will also keep track of the amount of items in the list. Because it doesn't use a sentinel, we can just initialize the queues with zero. In its simplest form (the output queue), we will only keep two references to blocks in the queue, namely the head of the list and the last block in use. All free blocks are stored behind the last block in use. I noticed there was a very subtle bug in the previous code: in a very uncommon corner case, it would uma_zfree() a block in the queue before calling memcpy() to extract the data from the block.	2009-02-03 19:58:28 +00:00
Warner Losh	2c204a1631	Use NULL in preference to 0 in pointer contexts.	2009-02-03 07:54:42 +00:00
Warner Losh	13b4c4c3a3	Make bioq_disksort have an ANSI-C definition rather than a K&R definition.	2009-02-03 07:53:51 +00:00
Warner Losh	8ed4d9c970	rman_debug should be static, so make it static.	2009-02-03 07:53:08 +00:00
Warner Losh	bada728732	Use ANSI function definition for profil.	2009-02-03 07:52:36 +00:00
Warner Losh	04d17b6283	Prefer ANSI function definitions to K&R ones.	2009-02-03 07:52:07 +00:00
Warner Losh	d710cae75a	Use NULL in preference to 0 for pointers.	2009-02-03 07:51:41 +00:00
Warner Losh	4592c621f3	Use NULL in preference to 0 for pointers.	2009-02-03 07:51:11 +00:00
Warner Losh	8260e3a4c0	o Use unsigned for bit fields. o Use NULL for pointers in preference to 0.	2009-02-03 07:50:41 +00:00
Warner Losh	9483543dfc	int foo(void) is the proper ANSI function definition when there are no parameters. Use it for resettodr().	2009-02-03 07:50:01 +00:00
Warner Losh	bdf331d450	Declare bus_data_devices to be static: it isn't used elsewhere. Use NULL in a couple of places rather than 0 in the context of pointers to be consistent with the rest of the file.	2009-02-03 00:10:21 +00:00
Stephane E. Potvin	60b7f468da	Fix select on platforms where sizeof(long) != sizeof(int). This used to work by accident before the cleanup done in revision 187693. Approved by: kan (mentor)	2009-02-02 03:34:40 +00:00
Robert Watson	ad765b0945	If a process is a zombie and we couldn't identify another useful state, print out the state as "zombie" in preference to "unknown" when ^T is pressed. MFC after: 3 days Sponsored by: Google, Inc.	2009-01-29 09:32:56 +00:00
Ed Schouten	f3b86a5fd7	Mark most often used sysctl's as MPSAFE. After running a `make buildkernel', I noticed most of the Giant locks in sysctl are only caused by a very small amount of sysctl's: - sysctl.name2oid. This one is locked by SYSCTL_LOCK, just like sysctl.oidfmt. - kern.ident, kern.osrelease, kern.version, etc. These are just constant strings. - kern.arandom, used by the stack protector. It is already protected by arc4_mtx. I also saw the following sysctl's show up. Not as often as the ones above, but still quite often: - security.jail.jailed. Also mark security.jail.list as MPSAFE. They don't need locking or already use allprison_lock. - kern.devname, used by devname(3), ttyname(3), etc. This seems to reduce Giant locking inside sysctl by ~75% in my primitive test setup.	2009-01-28 19:58:05 +00:00
John Baldwin	9078981ab1	Convert the global mutex protecting the directory lookup name cache from a mutex to a reader/writer lock. Lookup operations first grab a read lock and perform the lookup. If the operation results in a need to modify the cache, then it tries to do an upgrade. If that fails, it drops the read lock, obtains a write lock, and redoes the lookup.	2009-01-28 19:05:18 +00:00
Ed Schouten	8e700fb80c	Use the proper flag to let kern.ttys be executed without Giant. Pointed out by: jhb	2009-01-26 16:43:18 +00:00
John Baldwin	4e30a2db51	Whitespace tweak.	2009-01-26 15:32:39 +00:00
Jeff Roberson	9cdacff1d3	- bit has to be fd_mask to work properly on 64bit platforms. Constants must also be cast even though the result ultimately is promoted to 64bit. - Correct a loop index upper bound in selscan().	2009-01-25 18:38:42 +00:00
Robert Watson	95c807cf5e	When a statically linked binary is executed (or at least, one without an interpreter definition in its program header), set the auxiliary ELF argument AT_BASE to 0 rather than to the address that we would have mapped the interpreter at if there had been one. The ELF ABI specifications appear to be ambiguous as to the desired behavior in this situation, as they define AT_BASE as the base address of the interpreter, but do not mention what to do if there is none. On Solaris, AT_BASE will be set to the base address of the static binary if there is no interpreter, and on Linux, AT_BASE is set to 0. We go with the Linux semantics as they are of more immediate utility and allow the early runtime environment to know that the kernel has not mapped an interpreter, but because AT_PHDR points at the ELF header for the running binary, it is still possible to retrieve all required mapping information when the process starts should it be required. Either approach would be preferable to our current behavior of passing a pointer to an unmapped region of user memory as AT_BASE. MFC after: 3 weeks	2009-01-25 12:07:43 +00:00
Bjoern A. Zeeb	1cecba0fcd	For consistency with prison_{local,remote,check}_ipN rename prison_getipN to prison_get_ipN. Submitted by: jamie (as part of a larger patch) MFC after: 1 week	2009-01-25 10:11:58 +00:00
Jeff Roberson	748b9df687	- Correct a typo in a comment. Noticed by: danger	2009-01-25 09:17:16 +00:00
Jeff Roberson	e20a199fd5	- Make the keg abstraction more complete. Permit a zone to have multiple backend kegs so it may source compatible memory from multiple backends. This is useful for cases such as NUMA or different layouts for the same memory type. - Provide a new api for adding new backend kegs to secondary zones. - Provide a new flag for adjusting the layout of zones to stagger allocations better across cache lines. Sponsored by: Nokia	2009-01-25 09:11:24 +00:00
Ed Schouten	30bf032c76	Remove unneeded use of device unit numbers from pty(4). A much simpler approach to generating the slave device name is to obtain the device name of the master and replace 'p' by 't'.	2009-01-25 08:27:11 +00:00
Jeff Roberson	0d2cf8374a	- Use __XSTRING where I want the define to be expanded. This resulted in sizeof("MAXCPU") being used to calculate a string length rather than something more reasonable such as sizeof("32"). This shouldn't have caused any ill effect until we run on machines with 1000000 or more cpus.	2009-01-25 07:35:10 +00:00
Jeff Roberson	11b763df19	Fix errors introduced when I rewrote select. - Restructure selscan() and selrescan() to avoid producing extra selfds when we have a fd in multiple sets. As described below multiple selfds may still exist for other reasons. - Make selrescan() tolerate multiple selfds for a given descriptor set since sockets use two selinfos per fd. If an event on each selinfo fires selrescan() will see the descriptor twice. This could result in select() returning 2x the number of fds actually existing in fd sets. Reported by: mgleason@ncftp.com	2009-01-25 07:24:34 +00:00
Ed Schouten	bfcbfff0c7	Mark kern.ttys as MPSAFE. sysctl now allows Giantless calls, so make kern.ttys use this. If it needs Giant, it locks the proper TTY anyway.	2009-01-24 18:20:15 +00:00
Robert Watson	91dd9aae1a	Add explicit static DTrace tracing to the callout mechanism, capturing pointers to the callout handler just before and just after the callout is invoked. I attempted to do this in a manner congruent to tracing in Solaris's callout mechanism, but couldn't quite use the same names due to convention and syntax differences. Example DTrace script to generate a distribution graph of callout execution times: callout_execute:::callout_start { self->cstart = timestamp; } callout_execute:::callout_end { @length = quantize(timestamp - self->cstart); } Reviewed by: jb MFC after: 3 days	2009-01-24 10:22:49 +00:00
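
Commit 91c3cbfe1f in the list above describes implementing printf() on top of vprintf(), since the two differ only in how arguments are passed. The following is a minimal userland sketch of that pattern, not the kernel code from subr_prf.c. The names log_printf(), log_vprintf(), and log_buf are hypothetical and exist only for this illustration; the sketch formats into a fixed buffer with vsnprintf() instead of writing to the console.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical output buffer standing in for the real output device. */
static char log_buf[256];

/*
 * Hypothetical va_list variant: all formatting work lives here.
 * Returns the length vsnprintf() reports for the formatted string.
 */
static int
log_vprintf(const char *fmt, va_list ap)
{
	return (vsnprintf(log_buf, sizeof(log_buf), fmt, ap));
}

/*
 * Hypothetical variadic variant: it only collects its arguments into a
 * va_list and forwards them, so no formatting logic is duplicated.
 */
static int
log_printf(const char *fmt, ...)
{
	va_list ap;
	int retval;

	va_start(ap, fmt);
	retval = log_vprintf(fmt, ap);
	va_end(ap);
	return (retval);
}
```

A caller would use log_printf("%s=%d", "ncpus", 4) and then read the formatted text from log_buf. Keeping the logic in the va_list variant means any other wrapper can reuse it without a second copy of the formatting code.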

11133 commits