rid of the MTX_DUPOK flag on channel mutexes, which allows witness to
do a better job of lock order checking. Nuke snd_chnmtxcreate() since
it is no longer needed.
Tested by: matk
channel at a time unless it is actually necessary to lock both.
This avoids problems with lock order reversal and malloc() calls
with a mutex held when lower level code unlocks a channel, calls malloc(),
and relocks the channel. This also avoids the cost of some unnecessary
locking and unlocking.
Tested by: matk
Testing on cluster ref machine with just delaying make_dev() seems
to work, and results in printf() output appearing sooner in boot
cycle instead of going to /dev/null.
Caught by: bde
Pointy hat: kensmith
Approved by: rwatson (mentor)
patterns. (These lines are correct the other two times they appear.)
Reported by: "Ted Unangst" <tedu@coverity.com>
Approved by: rwatson (mentor), ken (scsi)
remove unused pid field of file context struct
map nfs4 error codes to errnos
eliminate redundant code from nfs4_request
use zero stateid on setattr that doesn't set file size
use same clientid on all mounts until reboot
invalidate dirty bufs in nfs4_close, to play it safe
open file for writing if truncating and it's not already open
Approved by: alfred
subtle problems with how alpha was handling the promcons device. This
moves the call to make_dev() for the promcons device to a later point of
the boot-up sequence than where promcons initially gets attached, make_dev()
called during the first attach crashes due to kernel stack issues.
Reviewed by: gallatin, marcel, phk
Discussed on: -current@, -alpha@
Approved by: rwatson (mentor)
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
the process state to zombie when a process exits to avoid a lock order
reversal with the sleepqueue locks. This appears to be the only place
that we call wakeup() with sched_lock held.
to queue threads sleeping on a wait channel similar to how turnstiles are
used to queue threads waiting for a lock. This subsystem will be used as
the backend for sleep/wakeup and condition variables initially. Eventually
it will also be used to replace the ithread-specific iwait thread
inhibitor.
Sleep queues are also not locked by sched_lock, so this splits sched_lock
up a bit further increasing concurrency within the scheduler. Sleep queues
also natively support timeouts on sleeps and interruptible sleeps allowing
for the reduction of a lot of duplicated code between the sleep/wakeup and
condition variable implementations. For more details on the sleep queue
implementation, check the comments in sys/sleepqueue.h and
kern/subr_sleepqueue.c.
statements and nowhere else in the kernel seems to use them for single
statements. Also, all other users of do { } while(0) use multiple lines
rather than cramming it all onto one line.
work. This is odd because loader(8) doesn't suffer from this problem.
Perhaps pxeboot bootstrap can be fixed to handle this better.
Anyway, PXE booting should work again.
are employed in entry points later in the same include file.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Air Force Research Laboratory, McAfee Research
struct vattr in mac_policy.h. This permits policies not
implementing entry points using these types to compile without
including include files with these types.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Air Force Research Laboratory
This enables pf to track dynamic address changes on interfaces (dailup) with
the "on (<ifname>)"-syntax. This also brings hooks in anticipation of
tracking cloned interfaces, which will be in future versions of pf.
Approved by: bms(mentor)
pf/pflog/pfsync as modules. Do not list them in NOTES or modules/Makefile
(i.e. do not connect it to any (automatic) builds - yet).
Approved by: bms(mentor)
to a new mac_inet.c. This code is now conditionally compiled based
on inet support being compiled into the kernel.
Move socket related MAC Framework entry points from mac_net.c to a new
mac_socket.c.
To do this, some additional _enforce MIB variables are now non-static.
In addition, mbuf_to_label() is now mac_mbuf_to_label() and non-static.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research
for a long time and is run in production use. This is the code present in
portversion 2.03 with some additional tweaks.
The rather extensive diff accounts for:
- locking (to enable pf to work with a giant-free netstack)
- byte order difference between OpenBSD and FreeBSD for ip_len/ip_off
- conversion from pool(9) to zone(9)
- api differences etc.
Approved by: bms(mentor) (in general)
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no
longer used.
Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the
error return.
Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.
Reviewed by: bms
increased <netinet/tcp_var>'s already large set of prerequisites, and
this was handled badly. Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.
Approved by: jlemon (years ago, for a more invasive fix)
also prints the actual numerical value of the symbol in question.
Users of addr2line(1) will be less proficient in hex arithmetic as a
consequence.
This amongst other things means that traceback lines change from:
siointr1(c4016800,c073bda0,0,c06b699c,69f) at siointr1+0xc5
to
siointr1(c4016800,c073bda0,0,c06b699c,69f) at 0xc062b0bd = siointr1+0xc5
I made this an option to avoid bikesheds.
~
~
~
amount of segments it will hold.
The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:
net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).
net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).
net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.
net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.
Tested by: bms, silby
Reviewed by: bms, silby
address, even if we subsequently ignore its value by applying a >>8
to it.
Reported by: "Ted Unangst" <tedu@coverity.com>
Approved by: rwatson (mentor), {ume, suz} (KAME)
The nonstandard formatting made my mega-patch scripts miss it.
Retire the static major number while we're here anyway.
Reported by: Niels Chr. Bank-Pedersen <ncbp@bank-pedersen.dk>
AFTER the call to vn_start_write(), not before it. Otherwise, it is
possible to unlock it multiple times if the vn_start_write() fails.
Submitted by: Juergen Hannken-Illjes <hannken@eis.cs.tu-bs.de>
In ufs_lock, check for attempts to acquire shared locks on
snapshot files and change them to be exclusive locks. This
change eliminates deadlocks and machine lockups reported in
-current since most read requests started using shared lock
requests.
Submitted by: Jun Kuriyama <kuriyama@imgsrc.co.jp>
swap_pager_putpages()'s buffer completion code. Note: the only
difference between swp_pager_sync_iodone() and bdone(), aside from
the locking in the latter, was the unnecessary clearing of B_ASYNC.
- Remove an unnecessary pmap_page_protect() from
swp_pager_async_iodone().
Reviewed by: tegge
of all, PIPE_EOF is not checked pervasively after everything that can drop
the pipe mutex and msleep(), so fix. Additionally, though it might not
harm anything, pipelock() and pipeunlock() are not used consistently.
Third, the kqueue support functions do not use the pipe mutex correctly.
Last, but absolutely not least, is a race: if pipe_busy is not set on
the closing side of the pipe, the other side that is trying to write to
that will crash BECAUSE PIPE_EOF IS NOT SET! Unconditionally set
PIPE_EOF, and get rid of all the lockups/crashes I have seen trying
to build ports.
an integer type and the a cast to (void *) was added in the
definition of NULL for the kernel, we need to use 0 here instead.
Partly submitted by: cperciva
Now I believe it is done in the right way.
Removed some XXMAC cases, we now assume 'high' integrity level for all
sysctls, except those with CTLFLAG_ANYBODY flag set. No more magic.
Reviewed by: rwatson
Approved by: rwatson, scottl (mentor)
Tested with: LINT (compilation), mac_biba(4) (functionality)
with a memory mapped I/O range that's immediately before it and is
not 256MB aligned. As a result, when an address is accessed in the
memory mapped range and a direct mapping is added for it, it overlaps
with the pre-mapped I/O port space and causes a machine check.
Based on a patch from: arun@
to use the "year1-year3" format, as opposed to "year1, year2, year3".
This seems to make lawyers more happy, but also prevents the
lines from getting excessively long as the years start to add up.
Suggested by: imp
idmap_add failure case (found by Ted Unangst via Colin Percival)
also convert idmap_hashf to return void, since it can't fail
also change some panics to error returns
by 1 u_int if the number of clusters was 1 more than a multiple of
(8 * sizeof(u_int)). The bitmap is malloced and large (often huge), so
fatal overrun probably only occurred if the number of clusters was 1
more than 1 multiple of PAGE_SIZE/8.
This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:
Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.
Hold a dev_t reference while the device is open.
KASSERT that we do not enter the driver on a non-referenced dev_t.
Remove old D_NAG code, anonymous dev_t's are not a problem now.
When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.
Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.
Remove the unused second argument from udev2dev().
Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.
Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.
Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.
Add missing D_TTY flags to various drivers.
Complete asserts that dev_t's passed to ttyread(), ttywrite(),
ttypoll() and ttykqwrite() have (d_flags & D_TTY) and a struct tty
pointer.
Make ttyread(), ttywrite(), ttypoll() and ttykqwrite() the default
cdevsw methods for D_TTY drivers and remove the explicit initializations
in various drivers cdevsw structures.
This commit adds a couple of functions for pseudodrivers to use for
implementing cloning in a manner we will be able to lock down (shortly).
Basically what happens is that pseudo drivers get a way to ask for
"give me the dev_t with this unit number" or alternatively "give
me a dev_t with the lowest guaranteed free unit number" (there is
unfortunately a lot of non-POLA in the exact numeric value of this
number, just live with it for now)
Managing the unit number space this way removes the need to use
rman(9) to do so in the drivers this greatly simplifies the code in
the drivers because even using rman(9) they still needed to manage
their dev_t's anyway.
I have taken the if_tun, if_tap, snp and nmdm drivers through the
mill, partly because they (ab)used makedev(), but mostly because
together they represent three different problems for device-cloning:
if_tun and snp is the plain case: just give me a device.
if_tap has two kinds of devices, with a flag for device type.
nmdm has paired devices (ala pty) can you can clone either of them.
Free approx 86 major numbers with a mostly automatically generated patch.
A number of strategic drivers have been left behind by caution, and a few
because they still (ab)use their major number.
This removes the packet header in certain cases which later on
will give panic. Clarify what the atm_intr expects in the comment
and de-obscurify the code a little bit by replacing the portability
macros with the BSD names. The code isn't maintained externally anymore
so there's no point in keeping the extra level of obscurity.
- allow for ifp->if_ioctl being NULL, as the rest of ifioctl() does;
- give the interface driver a chance to report a error to the caller;
- don't forget to update ifp->if_lastchange upon successful modification
of interface operation parameters.
layering violation. As pointed out, there is much better way to do this.
Sorry guys, I need to find a better way to force reviews.
Requested by: harti, julian, scottl (mentor)
Pointy hat to: pjd
Instead of creating a mutex that we msleep on but don't actually lock when
doing the corresponding wakeup(), in the kthread, lock the mutex associated
with our taskqueue and msleep while the queue is empty. Assert that the
queue is locked when the callback function is called to wake the kthread.
return events on the fixed handler even after defining a duplicate in the
AML. While this violates the spec, hopefully we can get by with leaving
both installed.
It works as follows:
In every 'interval' seconds defined links are checked.
If they are non-active they will not be used by to data transfer.
No response from: julian, archie
Silent on: net@
Approved by: scottl (mentor)
mode is applied, since tunneled packets are considered to be
generated packets from a tunnel encapsulating node.
- tunnel mode may not be applied if SA mode is ANY and policy
does not say "tunnel it". check if we have extra IPv6 header
on the packet after ipsec6_output_tunnel() and call ip6_output()
only if additional IPv6 header is added.
- free the copyed packet before returning.
Obtained from: KAME
It returns 1 is process is inside of jail and 0 if it is not.
Information if we are in jail or not is not a secret, there is plenty of
ways to discover it. Many people are using own hack to check this and
this will be a legal way from now on.
It will be great if our starting scripts will take advantage of this sysctl
to allow clean "boot" inside jail.
Approved by: rwatson, scottl (mentor)
to size_t *, which is incorrect because they may have different widths.
This caused some subtle forms of corruption, the mostly frequently
reported one being that the last character of a filename was sometimes
duplicated on amd64.
stopped returning events. Don't disable the event when removing
the handler because it still needs to be enabled for the other
handler. Also, remove duplicate AcpiEnableEvent calls since the
install function now does this for us.
into its own file:
- All of the $PIR interrupt routing is now done in a link-centric fashion.
When a host-PCI bridge that uses the $PIR attaches, it calls pir_parse()
to parse the table. This scans for link devices and merges all the masks
for each link device from the table entries. It then looks at the intline
register of PCI devices connected to a link to figure out if the BIOS has
routed this link and if so to which IRQ.
- The IRQ for any given link can be overridden via a hint like so:
'hw.pci.link.0x62.irq=10' Any IRQ set in this matter is treated as if it
were set that way by the BIOS.
- We only call the BIOS to route each link device once.
- When a PCI device wants to route an interrupt, we look it up in the $PIR
to find the associated link. If the link is routed, we simply return the
IRQ it is using. If it is not routed, we have to pick one. This uses a
different algorithm from the old code. First off, when we try to pick
an interrupt from a mask of possible interrupts, we try to pick the one
that is least loaded as far as PCI devices. We maintain this weight based
on the number of devices attached to each link device. When choosing an
IRQ, we first attempt to route using any PCI only interrupts (the old
code did this as well). If that doesn't work, we try to use the list of
IRQs that the BIOS has used. This is a new step that the new code didn't
do and avoids using IRQ 3 or 4 for every virgin interrupt routing. If
none of the IRQs that the BIOS used worked, then we fall back to trying
anything.
- The fallback mask for !PC98 was fixed to include IRQ 3 and not allow IRQ
2.
- We don't use the $PIR to route interrupts on a PCI-PCI bridge unless it
has already been used to route on at least one Host-PCI bridge. This
helps to avoid mixing and matching x86 firmware PCI interrupt routing
methods (which is a Bad Thing(tm)).
Silence on: current@
Previously the "struct disk" were owned by the device driver and this
gave us problems when the device disappared and the users of that device
were not immediately disappearing.
Now the struct disk is allocate with a new call, disk_alloc() and owned
by geom_disk and just abandonned by the device driver when disk_create()
is called.
Unfortunately, this results in a ton of "s/\./->/" changes to device
drivers.
Since I'm doing the sweep anyway, a couple of other API improvements
have been carried out at the same time:
The Giant awareness flag has been flipped from DISKFLAG_NOGIANT to
DISKFLAG_NEEDSGIANT
A version number have been added to disk_create() so that we can detect,
report and ignore binary drivers with old ABI in the future.
Manual page update to follow shortly.
support is partial in that it will refuse to create large files on
filesystems that haven't been upgraded to EXT2_DYN_REV or that don't
have the EXT2_FEATURE_RO_COMPAT_LARGE_FILE flag set in the superblock.
MFC after: 2 weeks
kernel. I'm not happy with it yet - refinements are to come.
This hack allows the kern.ps_strings and kern.usrstack sysctls to respond
to a 32 bit request, such as those coming from emulated i386 binaries.
value for MSGBUF_SIZE is configured. MSGBUF_SIZE =
(32768 * bootverbose ? 2 : 1) is always 1 or 2, so there is not enough space
in the buffer for metadata, and blindly using the nonexistent space tends
to cause fatal pagefaults. I think
MSGBUF_SIZE = (32768 * (bootverbose ? 2 : 1)) would be always 32768 since
bootverbose is only statically initialized to 0 early when MSGBUF_SIZE is
used. MSGBUF_SIZE = (32768 * ((boothowto & RB_VERBOSE) ? 2 : 1)) should
work, but this belongs in <sys/msgbuf.h> even less than previous versions.
MSGBUF_SIZE shouldn't be a macro.
it means that the correct value is unknown. Since this value is just
a hint to improve performance, initially assume that the first non-reserved
cluster is free, then correct this assumption if necessary before writing
the FSInfo block back to disk.
PR: 62826
MFC after: 2 weeks
result in a panic "vm_page_cache: caching a dirty page, ...": Access to the
page must be restricted or removed before calling vm_page_cache(). This
race condition is identical in nature to that which was addressed by
vm_pageout.c's revision 1.251 and vm_page.c's revision 1.275.
MFC after: 7 days
- When adding new waiting threads to the waitlist for an object,
use INSERT_LIST_TAIL() instead of INSERT_LIST_HEAD() so that new
waiters go at the end of the list instead of the beginning. When we
wake up a synchronization object, only the first waiter is awakened,
and this needs to be the first thread that actually waited on the object.
- Correct missing semicolon in INSERT_LIST_TAIL() macro.
- Implement lookaside lists correctly. Note that the Am1771 driver
uses lookaside lists to manage shared memory (i.e. DMAable) buffers
by specifying its own alloc and free routines. The Microsoft documentation
says you should avoid doing this, but apparently this did not deter
the developers at AMD from doing it anyway.
With these changes (which are the result of two straight days of almost
non-stop debugging), I think I finally have the object/thread handling
semantics implemented correctly. The Am1771 driver no longer crashes
unexpectedly during association or bringing the interface up.
resources. (Note that the correct range is 0x3f7,0x3f0-0x3f5.) Such
devices will be detected as follows:
fdc0: <Enhanced floppy controller (i82077, NE72065 or clone)> port
0x3f7,0x3f4-0x3f5,0x3f2-0x3f3,0x3f0-0x3f1 irq 6 drq 2 on acpi0
To do this, we find the minimum and maximum start addresses for the
resources and use them as the base for the IO and control ports.
Help from: jhb
missing parentheses). Use default handling (trap to debugger) for
udev2dev(x, 1) since it is an error and doesn't happen anywhere in
the sys tree except in bogusly commented out code in coda.
and given a value, but never used. This has no effect on the
resulting binaries, since gcc optimizes the variable away anyway.
PR: kern/62684
Approved by: rwatson (mentor)
panic "vm_page_cache: caching a dirty page, ...": Access to the page must
be restricted or removed before calling vm_page_cache(). This race
condition is identical in nature to that which was addressed by
vm_pageout.c's revision 1.251 and vm_page.c's revision 1.275.
Reviewed by: tegge
MFC after: 7 days
The Am1771 driver will sometimes do the following:
- Some thread-> NdisScheduleWorkItem(some work)
- Worker thread -> do some work, KeWaitForSingleObject(some event)
- Some other thread -> NdisScheduleWorkItem(some other work)
When the second call to NdisScheduleWorkItem() occurs, the NDIS worker
thread (in our case ndis taskqueue) is suspended in KeWaitForSingleObject()
and waiting for an event to be signaled. This is different from when
the worker thread is idle and waiting on NdisScheduleWorkItem() to
send it more jobs. However, the ndis_sched() function in kern_ndis.c
always calls kthread_resume() when queueing a new job. Normally this
would be ok, but here this causes KeWaitForSingleObject() to return
prematurely, which is not what we want.
To fix this, the NDIS threads created by kern_ndis.c maintain a state
variable to indicate whether they are running (scanning the job list
and executing jobs) or sleeping (blocked on kthread_suspend() in
ndis_runq()), and ndis_sched() will only call kthread_resume() if
the thread is in the sleeping state.
Note that we can't just check to see if the thread is on the run queue:
in both cases, the thread is sleeping, but it's sleeping for different
reasons.
This stops the Am1771 driver from emitting various "NDIS ERROR" messages
and fixes some cases where it crashes.
data for the file system on which the jail's root vnode is located.
Previous behavior (show data for all mountpoints) can be restored
by setting security.jail.getfsstatroot_only to 0. Note: this also
has the effect of hiding other mounts inside a jail, such as /dev,
/tmp, and /proc, but errs on the side of leaking less information.
called until DEVFS had a chance to initialize. Since DEVFS is mandatory
and things over in that department coincidentally works from without
any initialization now, this is safe.
could result in a panic "vm_page_cache: caching a dirty page, ...":
Access to the page must be restricted or removed before calling
vm_page_cache(). This race condition is identical in nature to that
which was addressed by vm_pageout.c's revision 1.251.
- Simplify the code surrounding the fix to this same race condition
in vm_pageout.c's revision 1.251. There should be no behavioral
change. Reviewed by: tegge
MFC after: 7 days
- don't unlock the vnode after vinvalbuf() only to have to relock it
almost immediately.
- don't refer to devices classified by vn_isdisk() as block devices.
is reserved by the loader, and thus any tunable name with that suffix will
be silently discarded.
Document this in the header and man page so that other developers do not
develop so many bumps on the head after banging it against the wall.
Detective work by: Mark Santcroos, grehan
the vnode around calls to vinvalbuf()). Apparently no one has tested
ext2fs with DEBUG_VOP_LOCKS. Vnode locking for vinvalbuf() might not
be required in non-soft-updates cases, but it is now asserted.
MFffs (uncommitted related and nearby cleanups: don't unlock the vnode
after vinvalbuf() only to have to relock it almost immediately; don't
refer to devices classified by vn_isdisk() as block devices).
them mostly with packet tags (one case is handled by using an mbuf flag
since the linkage between "caller" and "callee" is direct and there's no
need to incur the overhead of a packet tag).
This is (mostly) work from: sam
Silence from: -arch
Approved by: bms(mentor), sam, rwatson
ISA. npx has few isa dependencies, but it does unconditional outb()'s to
the isa bus in the !SMP case, and it attaches to isa if "device isa" is
configured in order to support PNP-ISA. The ifdef for the latter was
misplaced.
PR: 62595
dishonored in rev.1.1 by commenting out the code that honored it. This
gave the worst disadvantages of async mounts in an uncontrollable way.
Honoring the flag costs about 50% in real time in worst cases on a new
but not very fast ATA drive with write caching (probably more on drives
without write caching). The old misbehavior can be recovered using
async mounts after implementing them in mount_ext2fs(8) (just put the
MNT_ASYNC flag in mount_ext2fs's table of supported options like it
is in mount's table).
- rejects IPv6 packet toward IPv4-mapped address if its source address
is not an IPv4-mapped IPv6 address, since the converted IPv4 packets
would have an unexpected IPv4 source address.
- when V6ONLY socket option is set, discard packets destined to a
v4/ipv4 mapped ipv6 address.
- have PULLDOWN_TEST codepath.
- get rid of in6_mcmatch().
Obtained from: KAME
to one, DEBUG_FLAGS, which is also compatible with <bsd.prog.mk>.
Previously one had to set both DEBUG and DEBUG_FLAGS to build the
.ko.debug with debugging symbols which was boring when doing this
manually.
versions of the modules, and unconditionally putting -g in CFLAGS
has negative impact on the size of the resulting .ko object, even
now that debugging symbols are always stripped.
shown that it is not useful.
Rename the relative count g_access_rel() function to g_access(), only
the name has changed.
Change all g_access_rel() calls in our CVS tree to call g_access() instead.
Add an #ifndef BURN_BRIDGES #define of g_access_rel() for source
code compatibility.
operators) in and near revs.1.169-1.170 (open mode bandaid). This
(or better a proper fix) should have been done before cloning the
bandaid to many other file systems.
set to SIGCHLD. This avoids the creation of orphaned Linux-threaded
zombies that init is unable to reap. This can occur when the parent
process sets its SIGCHLD to SIG_IGN. Fix a similar situation in the
PT_DETACH code.
Tested by: "Steven Hartland" <killing AT multiplay.co.uk>
- rev.1.42 of ffs_readwrite.c added a special case in ffs_read() for reads
that are initially at EOF, and rev.1.62 of ufs_readwrite.c fixed
timestamp bugs in it. Removal of most of vfs_ioopt made it just and
optimization, and removal of the vm object reference calls made it less
than an optimization. It was cloned in rev.1.94 of ufs_readwrite.c as
part of cloning ffs_extwrite() although it was always less than an
optimization in ffs_extwrite().
- some comments, compound statements and vertical whitespace were vestiges
of dead code.
- culled long-dead #define's
- segment register defs moved to sr.h
- NPMAPS moved to pmap.h
- KERNBASE moved to vmparam.h
- removed include of <machine/cpu.h> and fixed src files that
relied on this.
Modifying segment register code no longer causes gcc rebuilds :-)
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
systems define power/sleep buttons in both places but only deliver
notifies to the ones defined in the AML.
Also, reduce length of various function handler names.
PR:
Submitted by:
Reviewed by:
Approved by:
Obtained from:
MFC after:
pci-hi/med/lo + node 'interrupts' property. This worked by
accident until recent notebooks required correct operation.
Tested by: Suleiman Souhlal <refugee@segfaulted.com>
routines to do anything except return error if the miniport adapter context
is not set (meaning we either having init'ed the driver yet, or the
initialization failed).
Also, be sure to NULL out the adapter context along with the
miniport characteristics pointers if calling the MiniportInitialize()
method fails.
the added comment for low-level details.) The effect of this race
condition is a panic "vm_page_cache: caching a dirty page, ..."
Reviewed by: tegge
MFC after: 7 days
created with the same name, and vice versa:
- Immediately recycle vnodes of files & directories that have been deleted
or renamed.
- When looking an entry in the VFS name cache or smbfs's private
cache, make sure the vnode type is consistent with the type of file
the server thinks it is, and re-create the vnode if it isn't.
The alternative to this is to recycle vnodes unconditionally when their
use count drops to 0, but this would make all the caching we do
mostly useless.
PR: 62342
MFC after: 2 weeks
- Factor out common settings and put them in an upper level Makefile.inc.
- Properly use PROG for real programs, not their products.
- Further reduce diffs to i386 versions.
Tested on: sparc64 (panther)
- Now that bsd.prog.mk deals with programs linked with -nostdlib
better, and has a notion of an "internal" program, use PROG
where possible. This has a good impact on the contents of
.depend files and causes programs to be linked with cc(1).
XXX: boot2 couldn't be converted as it's actually two programs.
Tested on: i386, amd64
the "disappearing subdisks" problem when new subdisks can't be created
due to some errors.
This is in fact an ugly hack, but a more elegant solution would probably
require a redesign of vinum in several places.
Approved by: joerg (mentor)
mindful of blocking on disk I/O and instead return EBUSY when such
blocking would occur.
Results from the DeBox project indicate that blocking on disk I/O
can slow the performance of a kqueue/poll based webserver. Using
a flag such as SF_NODISKIO and throwing connections that would block
to helper processes/threads helped increase performance.
Currently, only the Flash webserver uses this flag, although it could
probably be applied to thttpd with relative ease.
Idea by: Yaoping Ruan & Vivek Pai
nodes, or if they did, they're now locked away on the Kurt Gdel
memorial home for the numerically confused:
Don't cast a kernel pointer (from makedev(9)) to an integer (maj+minor combo).
on the card, unmap it first. This allows it to be picked up properly when
the queue gets kicked again. This was the root problem for the lost command
(i.e. stuck in getblk/vinvalb) problem. While here, panic if commands don't
map correctly instead of just silently ignoring the problem and dropping
command. Also slow down the dynamic allocation of new commands.
It should be safe to go back into the aac waters. Thanks to everyone who
suffered through this and provided good feedback.
802.11b chipset work. This chip is present on the SMC2602W version 3
NIC, which is what was used for testing. This driver creates kernel
threads (12 of them!) for various purposes, and required the following
routines:
PsCreateSystemThread()
PsTerminateSystemThread()
KeInitializeEvent()
KeSetEvent()
KeResetEvent()
KeInitializeMutex()
KeReleaseMutex()
KeWaitForSingleObject()
KeWaitForMultipleObjects()
IoGetDeviceProperty()
and several more. Also, this driver abuses the fact that NDIS events
and timers are actually Windows events and timers, and uses NDIS events
with KeWaitForSingleObject(). The NDIS event routines have been rewritten
to interface with the ntoskrnl module. Many routines with incorrect
prototypes have been cleaned up.
Also, this driver puts jobs on the NDIS taskqueue (via NdisScheduleWorkItem())
which block on events, and this interferes with the operation of
NdisMAllocateSharedMemoryAsync(), which was also being put on the
NDIS taskqueue. To avoid the deadlock, NdisMAllocateSharedMemoryAsync()
is now performed in the NDIS SWI thread instead.
There's still room for some cleanups here, and I really should implement
KeInitializeTimer() and friends.
seems to work well in RELENG_4. However, 5.X locking foo means that I'll
have to do some quick redesign.
Add ioctl handlers for ISP_GETROLE and ISP_SETROLE ioctls.
handling of resources shortages. The driver is now so fast that it can
completely fill all 512 slots on the card, but for some reason only 511
slots are being allocated. Anything that tries to go into the 512th
slot gets silently lost. Both bugs need to be fixed at a later date,
but this should fix the reports of hangs in getblk and vinvalb.
edge cases in the loop.
- Try to grab a command before dequeueing the bio from the bioq. The old
behaviour of requeuing deferred bios to the end of the bioq is arguably
wrong. This should be fixed in the future to check the bioq head without
automatically dequeueing the bio.
- do not use PROG for what's not a real C program,
- use sys.mk transformation rules where possible,
- only create the "machine" symlink on AMD64,
- removed MAINTAINER lines in individual makefiles,
- added the LIBSTAND defitinion to <bsd.libnames.mk>,
- somewhat better contents in .depend files.
Tested on: i386, amd64
Prodded by: bde
to catch are already nicely caught by trapping the null pointer derefs.
Remove no-longer-used noswitch/nothrow strings. They were referenced
by the stub cpu_switch() etc functions before they were implemented.
Try something a little different for the lock prefixes.
Prompted by: bde (the first two items anyway)
RLIM_INFINITY case for ogetrlimit().
- Use %jd and intmax_t to output negative time in usec in calcru().
- Rework getrusage() to make a copy of the rusage struct into a local
variable while holding Giant and then do the copyout from the local
variable to avoid having to have the original process rusage struct
locked while doing the copyout (which would not be safe). This also
includes a few style fixes from Bruce to getrusage().
Submitted by: bde (1, parts of 3)
Suggested by: bde (2)
delete it each time its run and have it regenerated each time by make.
I used a quick hackish script rather than putting it in the files file
and used the before-depend rule to avoid the depend/no-depend hacks.
failed, the reference count for the virtual memory object referenced
by the specified shared memory segment would have been erroneously
incremented.
Reported by: Joost Pol <joost@pine.nl>
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
TIOCMSET before FreeBSD existed and have never been implemented by any
FreeBSD serial driver (not even as aliases).
Moved the TIOCM bit definitions to be with TIOCMGET.
Added more comments gaps between ioctl numbers. There are now a large
number of conflicts with ppp, slip, tap and tun ioctls (especially ppp
ones) from closing gaps that weren't there. This mainly breaks decoding
of ioctl numbers in kdump, but there are some serious conflicts where
the interpretation of a tty ioctl depends on the line discipline.
aic79xx.seq:
Convert the COMPLETE_DMA_SCB list to an "stailq". This allows us to
safely keep the SCB that is currently being DMA'ed back the host on
the head of the list while processing completions off of the bus. The
newly completed SCBs are appended to the tail of the queue. In the
past, we just dequeued the SCB that was in flight from the list, but
this could result in a lost completion should the host perform certain
types of error recovery that must cancel all in-flight SCB DMA operations.
Switch from using a 16bit completion entry, holding just the tag and the
completion valid bit, to a 64bit completion entry that also contains a
"status packet valid" indicator. This solves two problems:
o The SCB DMA engine on at least Rev B. silicon does not properly deal
with a PCI disconnect that occurs at a non-64bit aligned offset in the
chips "source buffer". When the transfer is resumed, the DMA engine
continues at the correct offset, but may wrap to the head of the buffer
causing duplicate completions to be reported to the host. By using a
completion buffer in host memory that is 64bit aligned and using 64bit
completion entries, such disconnects should only occur at aligned addresses.
This assumes that the host bridge will only disconnect on cache-line
boundaries and that cache-lines are multpiles of 64bits.
o By embedding the status information in the completion entry we can avoid
an extra memory reference to the HSCB for commands that complete without
error.
Use the comparison of a "host freeze count" and a "sequencer freeze count"
to allow the host to process most SCBs that complete with non-zero status
without having to clear critical sections. Instead the host can just pause the
sequencer, performs any necessary cleanup in the waiting for selection list,
increments its freeze count on the controller, and unpauses. This is only
possible because the sequencer defers completions of SCBs with bad status
until after all pending selections have completed. The sequencer then avoids
referencing any data structures the host may touch during completion of the
SCB until the freeze counts match.
aic79xx.c:
Change the strategy for allocating our sentinal HSCB for the QINFIFO. In
the past, this allocation was tacked onto the QOUTFIFO allocation. Now that
the qoutfifo has grown to accomodate larger completion entries, the old
approach will result in a 64byte allocation that costs an extra page of
coherent memory. We now do this extra allocation via ahd_alloc_scbs()
where the "unused space" can be used to allocate "normal" HSCBs.
In our packetized busfree handler, use the ENSELO bit to differentiate
between packetized and non-packetized unexpected busfree events that
occur just after selection, but before the sequencer has had the oportunity
to service the selection.
When cleaning out the waiting for selection list, use the SCSI mode
instead of the command channel mode. The SCB pointer in the command
channel mode may be referenced by the SCB dma engine even while the
sequencer is paused, whereas the SCSI mode SCB pointer is only accessed
by the sequencer.
Print the "complete on qfreeze" sequencer SCB completion list in
ahd_dump_card_state(). This list holds all SCB completions that are deferred
until a pending select-out qfreeze event has taken effect.
aic79xx.h:
Add definitions and structures to handle the new SCB completion scheme.
Add a controller flag that indicates if the controller is in HostRAID
mode.
aic79xx.reg:
Remove macros used for toggling from one data fifo mode to the other.
They have not been in use for some time.
Add scratch ram fields for our new qfreeze count scheme, converting
the complete dma list into an "stailq", and providing for the "complete
on qfreeze" SCB completion list. Some other fields were moved to retain
proper field alignment (alignment >= field size in bytes).
aic79xx.seq:
Add code to our idle loop to:
o Process deferred completions once a qfreeze event has taken full
effect.
o Thaw the queue once the sequencer and host qfreeze counts match.
Generate 64bit completion entries passing the SCB_SGPTR field as the
"good status" indicator. The first bit in this field is only set if
we have a valid status packet to send to the host.
Convert the COMPLETE_DMA_SCB list to an "stailq".
When using "setjmp" to register an idle loop handler, do not combine
the "ret" with the block move to pop the stack address in the same
instruction. At least on the A, this results in a return to the setjmp
caller, not to the new address at the top of the stack. Since we want
the latter (we want the newly registered handler to only be invoked from
the idle loop), we must use a separate ret instruction.
Add a few missing critical sections.
Close a race condition that can occur on Rev A. silicon. If both FIFOs
happen to be allocated before the sequencer has a chance to service the
FIFO that was allocated first, we must take special care to service the
FIFO that is not active on the SCSI bus first. This guarantees that a
FIFO will be freed to handle any snapshot requests for the FIFO that is
still on the bus. Chosing the incorrect FIFO will result in deadlock.
Update comments.
aic79xx_inline.h
Correct the offset calculation for the syncing of our qoutfifo.
Update ahd_check_cmdcmpltqueues() for the larger completion entries.
aic79xx_pci.c:
Attach to HostRAID controllers by default. In the future I may add a
sysctl to modify the behavior, but since FreeBSD does not have any
HostRAID drivers, failing to attach just results in more email and
bug reports for the author.
MFC After: 1week
namespace pollution in other headers. <sys/types.h> is now the only
prerequisite for <sys/sx.h>.
Fixed some style bugs:
- removed bogus LOCORE ifdef. Including this C header in assembler
sources is just nonsense.
- removed unused include of <sys/_mutex.h>. It finished rotting when
the mutex in struct sx became indirect in rev.1.15.
- removed most comments on #else and #endif's and cleaned up the others.
All were misindented...
- add an option for the output device in the hope that this can
be made non-blocking at some stage.
- define an alias for the disk device, required by dev/ofw/ofw_disk.c
- shift iobus to 0x9000000 so as not to clash with the OpenFirmware
entry point of 0x8000400 when address decoding.
- down-tone comments about the disk dev config :-)
using the direct-mapping of physmem to force PTE data structures
to be physically addressable so the interrupt-time real-mode
DSI trap handler could perform PTE spills. However, the memory
may have been > 256Mb, which would have caused a BAT spill and
double-interrupt.
The new trap code no longer handles PTE spills, so the requirement
that these pages be direct-mapped no longer applies. The irony is
UMA_MD_SMALL_ALLOC will return direct mappings for these structs :-)
- remove unused 601 and tlb exception code
- remove interrupt-time PTE spill code. The pmap code
will now take care of pinning kernel PTEs, and there
are no longer issues about physical mapping of PTE
data structures
- All segment registers are switched on kernel entry/exit,
allowing the kernel to have more virtual space and for
user virtual space to extend to 4G.
- The temporary register save area has been shifted from
unused exception vector space to the per-cpu data area.
This allows interrupts to be delivered to multiple CPUs
- ISI traps no longer spill to BAT tables. It is assumed
that all of kernel instruction memory is pinned.
- shift from 'ldmw/stmw' instructions to individual register
loads/stores when saving context. All PPC manuals indicate
this should be much faster.
- use '%r' for register names throughout.
TODO: need to test if DSI traps were the result of kernel stack
guard-page hits.
Reworked from: NetBSD
- Rename temporary variable names ("tmp", "tmp2") to more informative
names ("load", "pctcpu", "rss", ...)
- Unclutter indentation and return paths: rather than lots of nested
ifs, simply return earlier if it's not going to work out. Simplify
general structure and avoid "deep" code.
- Comment on the thread/process selection and locking.
- Correct handling of "running"/"runnable" states, avoid "unknown"
that people were seeing for running processes. This was due to
a misunderstanding of the more complex state machine / inhibitors
behavior of KSE.
- Do perform ttyinfo() printing on KSE (P_SA) processes, it seems
generally to work.
While I initially attempted to formulate this as two commits (one
layout, the other content), I concluded that the layout changes were
really structural changes.
Many elements submitted by: bde
Since we have a worker thread now, we can actually do the allocation
asynchronously in that thread's context. Also, we need to return a
status value: if we're unable to queue up the async allocation, we
return NDIS_STATUS_FAILURE, otherwise we return NDIS_STATUS_PENDING
to indicate the allocation has been queued and will occur later.
This replaces the kludge where we just invoked the callback routine
right away in the current context.
The basic process is to send a routing socket announcement that the
interface has departed, change if_xname, update the sockaddr_dl
associated with the interface, and announce the arrival of the interface
on the routing socket.
As part of this change, ifunit() is greatly simplified by testing
if_xname directly. if_clone_destroy() now uses if_dname to look up the
cloner for the interface and if_dunit to identify the unit number.
Reviewed by: ru, sam (concept)
Vincent Jardin <vjardin AT free.fr>
Max Laier <max AT love2party.net>
that Asus provides on its CDs has both a MiniportSend() routine
and a MiniportSendPackets() function. The Microsoft NDIS docs say
that if a driver has both, only the MiniportSendPackets() routine
will be used. Although I think I implemented the support correctly,
calling the MiniportSend() routine seems to result in no packets going
out on the air, even though no error status is returned. The
MiniportSendPackets() function does work though, so at least in
this case it doesn't matter.
In if_ndis.c:ndis_getstate_80211(), if ndis_get_assoc() returns
an error, don't bother trying to obtain any other state since the
calls may fail, or worse cause the underlying driver to crash.
(The above two changes make the Asus-supplied Centrino work.)
Also, when calling the OID_802_11_CONFIGURATION OID, remember
to initialize the structure lengths correctly.
In subr_ndis.c:ndis_open_file(), set the current working directory
to rootvnode if we're in a thread that doesn't have a current
working directory set.
machdep.c fixed the missing early initialization of curpcb, so curpcb
is now always set together with curthread and it cannot be NULL except
before the IDT has been set up (so trap() is unreachable) or after a
memory error. In any case, it was often used without checking.
curcpb shouldn't exist anyway. It doesn't exist for most non-i386 arches.
It just caches curthread->td_pcb in a global. This was a better idea
before it was per-cpu. trap() and some other places can get at it more
efficiently using td->td_pcb instead of PCPU_GET(curpcb). The main
exception is support.s which mostly wants only curpcb->pcb_onfault.
instead, just dec/inc in the ctor/dtor. For now, increment/decrement
in two's, since we're now performing the operation once per pair,
not once per pipe. Not really any measurable performance change
in my micro-benchmarks, but doing less work is good, especially when
it comes to atomic operations.
Suggested by: alc
it is still above the critical temperature on the next poll cycle. This
is a 10 second advance notice by default. Document the private
(non-standard) notify we will be using with devd(8).
changes to jointly allocated pipe pairs. Replace these checks
with pipe_present checks. This avoids a NULL pointer dereference
when a pipe is half-closed.
Submitted by: Peter Edwards <peter.edwards@openet-telecom.com>
used for the ICMP reply source in reponse to packets which are not
directly addressed to us. By default continue with with normal
source selection.
Reviewed by: bms
1. Root from inside a jail was able to unmount any file system
(except /).
2. Unprivileged root was able to unmount file systems mounted by
privileged root (execpt /).
3. User from inside a jail was able to mount file system when
sysctl vfs.usermount was set to 1.
4. User was able to mount file system when vfs.usermount was set to 1
(that's ok) and unmount it even if vfs.usermount was equal to 0
(that's not correct).
Possibility from point 1 was reported by: Dariusz Kowalski <darek@76.pl>
Only a part of this fix will be MFC'ed (if approved).
PR: kern/60149
Reviewed by: rwatson
Approved by: scottl (mentor)
MFC after: 3 days
the system. Also, decrease the poll interval to 10 seconds from 30
seconds. This is needed because some systems will report an invalid high
temperature for one poll cycle. It is suspected this is due to the
embedded controller timing out. A typical value is 138C for one cycle on a
system that is otherwise 65C. This prevents the system from prematurely
shutting down after one invalid reading. It will still shut down after 30
seconds of high temperature, which is the same as previous default
behavior.
Tested by: Scott Lambert <lambert AT lambertfam.org>
is for an 802.11 device or not. At least one driver I have does not
support the OID_802_11_NETWORK_TYPES_SUPPORTED OID.
Also, for now, don't do anything special in the ndis_suspend() method.
I originally wanted to shut down the NIC but leave the IFF_UP flag alone
since technically the interface is meant to remain up, but an interrupt
may be delivered to the ISR on suspend, and if this happens while the
NIC is halted, we will crash, since none of the miniport driver methods
will function.
This needs to be dealt with properly later, but for now this prevents
a panic, and the resume method properly re-inits the NIC.
the thread that calls pmap_pte_quick() and by virtue of the page queues
lock being held, we can manage PADDR1/PMAP1 as a CPU private mapping.
The most common effect of this change is to reduce the overhead of the page
daemon on multiprocessors.
In collaboration with: tegge
packet along with data, instead of in their own packet. When serving files
of size (packetsize - headersize) or smaller, this will result in one less
packet crossing the network. Quick testing with thttpd and http_load has
shown a noticeable performance improvement in this case (350 vs 330 fetches
per second.)
Included in this commit are two support routines, iov_to_uio, and m_uiotombuf;
these routines are used by sendfile to construct the header mbuf chain that
will be linked to the rest of the data in the socket buffer.
sense with sched_4bsd as it does with sched_ule.
- Use P_NOLOAD instead of the absence of td->td_ithd to determine whether or
not a thread should be accounted for in sched_tdcnt.
when uma_reclaim() was called. This was introduced when the zone
working-set algorithm was removed in favor of using the per cpu caches
as the working set.
would allocate two 'struct pipe's from the pipe zone, and malloc a
mutex.
- Create a new "struct pipepair" object holding the two 'struct
pipe' instances, struct mutex, and struct label reference. Pipe
structures now have a back-pointer to the pipe pair, and a
'pipe_present' flag to indicate whether the half has been
closed.
- Perform mutex init/destroy in zone init/destroy, avoiding
reallocating the mutex for each pipe. Perform most pipe structure
setup in zone constructor.
- VM memory mappings for pageable buffers are still done outside of
the UMA zone.
- Change MAC API to speak 'struct pipepair' instead of 'struct pipe',
update many policies. MAC labels are also handled outside of the
UMA zone for now. Label-only policy modules don't have to be
recompiled, but if a module is recompiled, its pipe entry points
will need to be updated. If a module actually reached into the
pipe structures (unlikely), that would also need to be modified.
These changes substantially simplify failure handling in the pipe
code as there are many fewer possible failure modes.
On half-close, pipes no longer free the 'struct pipe' for the closed
half until a full-close takes place. However, VM mapped buffers
are still released on half-close.
Some code refactoring is now possible to clean up some of the back
references, etc; this patch attempts not to change the structure
of most of the pipe implementation, only allocation/free code
paths, so as to avoid introducing bugs (hopefully).
This cuts about 8%-9% off the cost of sequential pipe allocation
and free in system call tests on UP and SMP in my micro-benchmarks.
May or may not make a difference in macro-benchmarks, but doing
less work is good.
Reviewed by: juli, tjr
Testing help: dwhite, fenestro, scottl, et al
track the load for the sched_load() function. In the SMP case this member
is not defined because it would be redundant with the ksg_load member
which already tracks the non ithd load.
- For sched_load() in the UP case simply return ksq_sysload. In the SMP
case traverse the list of kseq groups and sum up their ksg_load fields.
of sched_load(). This variable tracks the number of running and runnable
non ithd threads. This removes the need to traverse the proc table and
discover how many threads are runnable.
at packet arrival.
For benchmarking purposes SO_BINTIME is preferable to SO_TIMEVAL
since it has higher resolution and lower overhead. Simultaneous
use of the two options is possible and they will return consistent
timestamps.
This introduces an extra test and a function call for SO_TIMEVAL, but I have
not been able to measure that.
and ffs_write(). These calls trace their origins to the dead vfs_ioopt
code, first appearing in revision 1.39 of ufs_readwrite.c.
Observed by: bde
Discussed with: tegge
interrupt handler so that no locks are needed, and schedules the
command completion routine with a taskqueue_fast. This also corrects the
locking in the command thread and removes the need for operation flags.
Simple load tests show that this is now considerably faster than FreeBSD 4.x
in the SMP case when multiple i/o tasks are running.
"scheduler" here has very little to do with scheduling. It is actually
the swapper, and it really must be the last SYSINIT'ed item like its
comment says, since proc0 metamorphoses into swapper by calling
scheduler() last in mi_start(), and scheduler() never returns.. Rev.1.29
of subr_4bsd.c broke this by adding another SI_ORDER_FIRST item
(kproc_start() for schedcpu_thread() onto the SI_SUB_RUN_SCHEDULER_LIST.
The sorting of SYSINITs with identical orders (at all levels) is
apparently nondeterministic, so this resulted in schedule() sometimes
being called second last and schedcpu_thread() not being called at all.
This quick fix just changes the code to almost match the comment
(SI_ORDER_FIRST -> SI_ORDER_ANY). "LAST" is misspelled "ANY", and
there is no way to ensure that there is only 1 very lst SYSINIT.
A more complete fix would remove the SYSINIT obfuscation.
won't associate in BSS mode if you use AUTHMODE_SHARED. I probably don't
understand enough to know when SHARED should be used vs. OPEN or WPA.
For now, go back to what works.
bit for this being the last CTIO2. It didn't matter since it really was the
last CTIO2 and the resources recycled, but still....
Add in CTIO3 define for future DAC work.
for direct-mapped addresses. Assume that any address less than KVA
is one of these and return it. Also assert that an address is KVA
does have a valid mapping - callers of pmap_kextract don't check
the return value, since they assume that they have a valid virtual
address.
addressing of memory. Makes a substantial improvement for apps that
stress the limited amount of KVM on PPC (e.g. untarring the ports tree).
uma_machdep.c stolen from amd64/ia64.
updated for the regparm ABI on amd64.
Context switch debug regs.
Update for fpu simplification
Don't needlessly reload %cr3, in case the cpu has the tlb flush filter
turned off. Re-add LAZY_SWITCH stubs.
profiling buffers and hash table. This makes it a lot easier to
do multiple profiling runs without rebooting or performing
gratuitous arithmetic. Sysctl is named debug.mutex.prof.reset.
Reviewed by: jake
- witness_lock() is split into two pieces: witness_checkorder() and
witness_lock(). Witness_checkorder() determines if acquiring a specified
lock at the time it is called would result in a lock order. It
optionally adds a new lock order relationship as well. witness_lock()
updates witness's data structures to assume that a lock has been acquired
by stick a new lock instance in the appropriate lock instance list.
- The mutex and sx lock functions now call checkorder() prior to trying to
acquire a lock and continue to call witness_lock() after the acquire is
completed. This will let witness catch a deadlock before it happens
rather than trying to do so after the threads have deadlocked (i.e. never
actually report it).
- A new function witness_defineorder() has been added that adds a lock
order between two locks at runtime without having to acquire the locks.
If the lock order cannot be added it will return an error. This function
is available to programmers via the WITNESS_DEFINEORDER() macro which
accepts either two mutexes or two sx locks as its arguments.
- A few simple wrapper macros were added to allow developers to call
witness_checkorder() anywhere as a way of enforcing locking assertions
in code that might acquire a certain lock in some situations. The
macros are: witness_check_{mutex,shared_sx,exclusive_sx} and take an
appropriate lock as the sole argument.
- The code to remove a lock instance from a lock list in witness_unlock()
was unnested by using a goto to vastly improve the readability of this
function.
instead of taskqueue_swi. This shaves from 1 to 10% of the overhead.
Overhaul the locking once more, there was a few possible races that
are now closed.
panic() so that the buffer overflow just beyond this point is always
caught, even when the code is not compiled with INVARIANTS.
Change chn_setblocksize() buffer reallocation code to attempt to avoid
the feed_vchan16() buffer overflow by attempting to always keep the
bufsoft buffer at least as large as the bufhard buffer.
Print a diagnositic message
Danger! %s bufsoft size increasing from %d to %d after CHANNEL_SETBLOCKSIZE()
if our best attempts fail. If feed_vchan16() were to be called by
the interrupt handler while locks are dropped in chn_setblocksize()
to increase the size bufsoft to match the size of bufhard, the panic()
code in feed_vchan16() will be triggered. If the diagnostic message
is printed, it is a warning that a panic is possible if the system
were to see events in an "unlucky" order.
Change the locking code to avoid the need for MTX_RECURSIVE mutexes.
Add the MTX_DUPOK option to the channel mutexes and change the locking
sequence to always lock the parent channel before its children to avoid
the possibility of deadlock.
Actually implement locking assertions for the channel mutexes and fix
the problems found by the resulting assertion violations.
Clean up the locking code in dsp_ioctl().
Allocate the channel buffers using the malloc() M_WAITOK option instead
of M_NOWAIT so that buffer allocation won't fail. Drop locks across
the malloc() calls.
Add/modify KASSERTS() in attempt to detect problems early.
Abuse layering by adding a pointer to the snd_dbuf structure that points
back to the pcm_channel that owns it. This allows sndbuf_resize() to do
proper locking without having to change the its API, which is used by
the hardware drivers.
Don't dereference a NULL pointer when setting hw.snd.maxautovchans
if a hardware driver is not loaded. Noticed by Ryan Sommers
<ryans at gamersimpact.com>.
Tested by: Stefan Ehmann <shoesoft AT gmx.net>
Tested by: matk (Mathew Kanner)
Tested by: Gordon Bergling <gbergling AT 0xfce3.net>
debug.ddb_use_printf sysctl, output kernel debugger data to both the
console and kernel message buffer via printf. This fixes the case where
backtrace() went directly to the console and should help debugging greatly.
Thanks to Ian Dowse for the work, minor edits or any bugs are by myself.
Submitted by: iedowse