sockets into machine-dependent files. The rationale for this
migration is illustrated by the modified amd64 allocator. It uses the
amd64's direct map to avoid emphemeral mappings in the kernel's
address space. On an SMP, the emphemeral mappings result in an IPI
for TLB shootdown for each transmitted page. Yuck.
Maintainers of other 64-bit platforms with direct maps should be able
to use the amd64 allocator as a reference implementation.
ultimate trigger for the follow-up fixes in revisions 1.78, 1.80,
1.81 and 1.82 of trap.c. I was simply too pre-occupied with the
gateway page and how it blurs kernel space with user space and
vice versa that I couldn't see that it was all a load of bollocks.
It's not the IP address that matters, it's the privilege level that
counts. We never run in user space with lifted permissions and we
sure can not run in kernel space without it. Sure, the gateway page
is the exception, but not if you look at the privilege level. It's
user space if you run with user permissions and kernel space otherwise.
So, we're back to looking at the privilege level like it should be.
There's no other way.
Pointy hat: marcel
prototypes of cpu_halt(), cpu_reset() and swi_vm() from md_var.h to
cpu.h. This affects db_command.c and kern_shutdown.c.
ia64: move all MD prototypes from cpu.h to md_var.h. This affects
madt.c, interrupt.c and mp_machdep.c. Remove is_physical_memory().
It's not used (vm_machdep.c).
alpha: the MD prototypes have been left in cpu.h with a comment
that they should be there. Moving them is left for later. It was
expected that the impact would be significant enough to be done in
a seperate commit.
powerpc: MD prototypes left in cpu.h. Comment added.
Suggested by: bde
Tested with: make universe (pc98 incomplete)
Sign extension happens after the shift, not before so that boundary
cases like 0x40000000 will not be caught properly.
Instead, right shift ndirty. It is guaranteed to be a multiple of 8.
While here, do some manual code motion and code commoning.
Range check bug pointed out by: iedowse
that were on the kernel stack into account. For now we write them
out to the register stack of the process before creating the dump.
This however is not the final solution. The problem is that we may
invalidate the coredump by overwriting vital information due to an
invalid backing store pointer. Instead we need to write the dirty
registers to an unused region of VM which will result in a seperate
segment in the coredump. For now we can at least get to all the
registers from a coredump.
and the move to control register to avoid dependency violations when
these functions are used. Note that explicit data and instruction
serialization also need to be in a subsequent instruction group.
This too requires that we have an igrp break here.
PT_SETKSTACK. These requests allow the tracing process to access the
dirty registers of the traced process that are on the kernel stack.
Note that there's currently no way to access the rnat register for
those dirty registers that are not (yet) covered by a nat collection
point. The interface for this is still being slept on.
Also note that implied by these requests is the division of work:
The tracing process has to keep track of where registers are spilled
and is responsible to figure out where the NaT bit of the stacked
registers are at any time during the execution of the traced process.
The kernel provides the interfaces but will not abstract the fact
that the register stack can be split. This model does not follow
the approach taken in Linux where PT_PEEK and PT_POKE deals with
this automagically.
in user space or kernel space. VM_MIN_KERNEL_ADDRESS starts after the
gateway page, which means that improper memory accesses to the gateway
page while in user mode would panic the kernel. Use VM_MAX_ADDRESS
instead. It ends before the gateway page. The difference between
VM_MIN_KERNEL_ADDRESS and VM_MAX_ADDRESS is exactly the gateway page.
move to ar.rsc. The RSE must be in enforced lazy mode when writing
to RSE modifyable registers. In this case we restore the RSE NaT
collection register ar.rnat. I have seen 2 general exception faults
on pluto1 now that indicate that the move to ar.rsc has already
happened prior to the move to ar.rnat, meaning that the RSE is not
in enforced lazy mode anymore. The ia64 dependency and instruction
ordering rules seem to allow having both registers written to in
the same instruction group, provided ar.rsc is written to later than
ar.rnat (based on the ordering semantics). It appears that we may
be pushing our luck. For now, put them in seperate cycles (by means
of the instruction group break). If we ever get a general exception
fault on the move to ar.rnat again, we have definite proof that
something else is fishy.
o Differentiate between CPU family and CPU model. There are multiple
Itanium 2 models and it's nice to differentiate between them.
o Seperately export the CPU family and CPU model with sysctl.
o Merced is the only model in the Itanium family.
o Add Madison to the Itanium 2 family. We already knew about McKinley.
o Print the CPU family between parenthesis, like we do with the i386
CPU class.
My prototype now identifies itself as:
CPU: Merced (800.03-Mhz Itanium)
pluto1 and pluto2 will eventually identify themselves as:
CPU: McKinley (900.00-Mhz Itanium 2)
magic from exec_setregs(). In set_mcontext() we now also don't have
to worry that we entered the kernel with more that 512 bytes of
dirty registers on the kernel stack. Note that we cannot make any
assumptions anymore WRT to NaT collection points in exec_setregs(),
so we have to deal with them now.
when we create contexts. The meaning of the flags are documented in
<machine/ucontext.h>. I only list them here to help browsing the
commit logs:
_MC_FLAGS_ASYNC_CONTEXT
_MC_FLAGS_HIGHFP_VALID
_MC_FLAGS_KSE_SET_MBOX
_MC_FLAGS_RETURN_VALID
_MC_FLAGS_SCRATCH_VALID
Yes, _MC_FLAGS_KSE_SET_MBOX is a hack and I'm proud of it :-)
o For trap-based upcalls the argument (the kse_mailbox) to
the UTS must be written onto the kernel stack, not the
user stack. While here, deal with the fact that we may
be at a NaT collection point.
path into the kernel. Normally it's due to a syscall, but one can
also be created as the result of a clock interrupt (for example).
This now even more looks like exec_setregs().
While here, add an assert that we don't expect more than 8KB of
dirty registers on the kernel stack.
unconditionally restore ar.k7 (kernel memory stack) and ar.k6
(kernel register stack). I don't know what I was smoking then,
but if you unconditionally restore ar.k6, you also want to
compute its value unconditionally. By having the computation
predicated and dependent on whether we return to user mode, we
would end up writing junk (= invalid value for ar.bspstore) if
we would return to kernel mode. But the whole point of the
unconditional restoration was that there is a grey area where
we still need to have ar.k6 restored. If we restore with a junk
value, we would end up wedging the machine on the next interrupt.
So, unconditionally calculate the value we unconditionally write
to ar.k6.
o The previous braino was found while making the following change:
We used to clear the lower 9 bits of the value we write to ar.k6.
The meaning being that we know that the kernel register stack is
at least 512 byte aligned and simply clearing the lower 9 bits
allows us to return to a context of which we don't have dirty
registers on the kernel stack, even though the context that
entered the kernel does have dirty registers on the kernel stack.
By masking-off the lower bits, we correctly obtain the base of
the register stack without having to worry that we didn't actually
reached the base while unwinding it.
The change is to mask off the lower 13 bits, knowing that the
kernel register stack is always 8KB aligned. The advantage is that
we don't have to worry anymore if there's more than 512 bytes of
dirty registers on the kernel stack. A situation that frequently
occurs. In exec_setregs() in machdep.c:1.147 or older, we had to
deal with that situation by copying the active portion of the
register stack down in multiples of 512 bytes. Now that we mask off
the lower 13 bits we don't have to do that at all. Contemporary
IPF processors have a register file that can hold up to 96 stacked
registers (=784 bytes [incl. 2 NaT collections]). With no indication
that register files grow beyond a couple of hundred registers, we
should not have to worry about it anymore... and yes, 640KB is
enough for everybody :-)
This change helps setcontext(2) and cpu_set_upcall_kse() in that
they can return to completely different contexts without having to
mess with the kernel stack. Of course exec_setregs() doesn't need
to do that anymore as well.
need this for swapcontext(), KSE upcalls initiated from ast()
also need to save them so that we properly return the syscall
results after having had a context switch. Note that we don't
use r11 in the kernel. However, the runtime specification has
defined r8-r11 as return registers, so we put r11 in the context
as well. I think deischen@ was trying to tell me that we should
save the return registers before. I just wasn't ready for it :-)
o The EPC syscall code has 2 return registers and 2 frame markers
to save. The first (rp/pfs) belongs to the syscall stub itself.
The second (iip/cfm) belongs to the caller of the syscall stub.
We want to put the second in the context (note that iip and cfm
relate to interrupts. They are only being misused by the syscall
code, but are not part of a regular context).
This way, when the context is switched to again, we return to
the caller of setcontext(2) as one would expect.
o Deal with dirty registers on the kernel stack. The getcontext()
syscall will flush the RSE, so we don't expect any dirty registers
in that case. However, in thread_userret() we also need to save
the context in certain cases. When that happens, we are sure that
there are dirty registers on the kernel stack.
This implementation simply copies the registers, one at a time,
from the kernel stack to the user stack. NAT collections are not
dealt with. Hence we don't preserve NaT bits. A better solution
needs to be found at some later time.
We also don't deal with this in all cases in set_mcontext. No
temporay solution is implemented because it's not a showstopper.
The problem is that we need to ignore the dirty registers and we
automaticly do that for at most 62 registers. When there are more
than 62 dirty registers we have a memory "leak".
This commit is fundamental for KSE support.
user space region. Hence, we need to test if 5 is greater than the
region; not greater equal.
This bug caused us to call ast() while interrupting kernel mode.
set in cpu_critical_fork_exit() anymore.
- As far as I can tell, cpu_thread_link() has never been used, not even
when it was originally added, so remove it.
o Remove alpha specific timer code (mc146818A) and compiled-out
calibration of said timer.
o Remove i386 inherited timer code (i8253) and related acquire and
release functions.
o Move sysbeep() from clock.c to machdep.c and have it return
ENODEV. Console beeps should be implemented using ACPI or if no
such device is described, using the sound driver.
o Move the sysctls related to adjkerntz, disable_rtc_set and
wall_cmos_clock from machdep.c to clock.c, where the variables
are.
o Don't hardcode a hz value of 1024 in cpu_initclocks() and don't
bother faking a stathz that's 1/8 of that. Keep it simple: hz
defaults to HZ and stathz equals hz. This is also how it's done
for sparc64.
o Keep a per-CPU ITC counter (pc_clock) and adjustment (pc_clockadj)
to calculate ITC skew and corrections. On average, we adjust the
ITC match register once every ~1500 interrupts for a duration of
2 consequtive interruprs. This is to correct the non-deterministic
behaviour of the ITC interrupt (there's a delay between the match
and the raising of the interrupt).
o Add 4 debugging sysctls to monitor clock behaviour. Those are
debug.clock_adjust_edges, debug.clock_adjust_excess,
debug.clock_adjust_lost and debug.clock_adjust_ticks. The first
counts the individual adjustment cycles (when the skew first
crosses the threshold), the second counts the number of times the
adjustment was excessive (any non-zero value is to be considered
a bug), the third counts lost clock interrupts and the last counts
the number of interrupts for which we applied an adjustment
(debug.clock_adjust_ticks / debug.clock_adjust_edges gives the
avarage duration of an individual adjustment -- should be ~2).
While here, remove some nearby (trivial) left-overs from alpha and
other cleanups.
interrupting user mode. The net effect of this bug is that a clock
interrupt does not cause rescheduling and processes are not
preempted. It only takes a "while (1);" to render the machine
useless.
This bug was introduced by the context changes and EPC syscall code.
Handling of ASTs was moved to C for clarity and ease of maintenance,
but was not added for the external interrupt case.
This needs to be revisited. We now have calls to do_ast() in trap(),
break_syscall() and ivt_External_Interrupt(). A single call in
exception_restore covers these 3 places without duplication. This
is where we handled ASTs prior to the overhaul, except that the
meat has been moved to do_ast(), a C function. This was the goal
to begin with.
Pointy hat: marcel
created not only with UMA_ZONE_VM but also with UMA_ZONE_NOFREE. In
the i386 case in particular, the pmap code would hook a special
page allocation routine that allocated from kernel_map and not kmem_map,
and so when/if the pageout daemon drained the zones, it could actually
push out slabs from the PV ENTRY zone but call UMA's default page_free,
which resulted in pages allocated from kernel_map being freed to
kmem_map; bad. kmem_free() ignores the return value of the
vm_map_delete and just returns. I'm not sure what the exact
repercussions could be, but it doesn't look good.
In the PAE case on i386, we also set-up a zone in pmap, so be
conservative for now and make that zone also ZONE_NOFREE and
ZONE_VM. Do this for the pmap zones for the other archs too,
although in some cases it may not be entirely necessarily. We'd
rather be safe than sorry at this point.
Perhaps all UMA_ZONE_VM zones should by default be also
UMA_ZONE_NOFREE?
May fix some of silby's crashes on the PV ENTRY zone.
memory in bus_dmamem_alloc(). This is possible now that
contigmalloc() supports the M_ZERO flag.
- Remove the locking of Giant around calls to contigmalloc() since
contigmalloc() now grabs Giant itself.
switching anymore, so there's no need to save and restore GP. This
change breaks threaded applications linked against libc_r. Pull the
tier 2 card again: relink. This will link against libthr instead.
a non-standard construct. Instead, redefine struct _ia64_fpreg as a
union and put a long double in it. On ia64 and for LP64, this is
defined by the ABI to have 16-byte alignment. For ILP32 a long double
has 4-byte alignment, but we don't support ILP32.
Note that the in-memory image of a long double does not match the in-
memory image of spilled FP registers. This means that one cannot use
the fpr_flt field to interpet the bits. For this reason we continue
to use an aggregate type.
but this just created a weird inconsistency when porting gdb(1).
Instead, we name each high FP register seperately, like we do for
all the other registers.
them again afterwards. This fixes a disabled FP fault while in the FPSWA
handler.
While here, merge the FP fault and FP trap handling code to reduce code
duplication. Where code was different, it was not sure it should be.
Trigger case: ports/math/atlas
our unwind information for functions that are entry points into the
kernel. When stepping to the next frame, the unwinder will let us
know when sych a marker was encountered. We use this to stop the
current unwind session, query the trapframe and restart a new
unwind session based on the new trapframe.
The implementation is a bit sloppy, but at this time there are
bigger fish to fry.
to get a stacktrace. This does not work even with M_NOWAIT when we
have WITNESS and is generally a bad idea (pointed out by bde@). We
allocate an 8K heap for use by the unwinder when ddb is active. A
stack trace roughly takes up half of that in any case, so we have
some room for complex unwind situations. We don't want to waste too
much space though. Due to the nature of unwinding, we don't worry
too much about fragmentation or performance of unwinding while in
the debugger. For now we have our own heap management, but we may
be able to leverage from existing code at some later time.
While here:
o Make sure we actually free the unwind environment after unwinding.
This fixes a memory leak.
o Replace Doug's license with mine in unwind.c and unwind.h. Both
files don't have much, if any, of Doug's code left since the EPC
syscall overhaul and the import of the unwinder.
o Remove dead code.
o Replace M_NOWAIT with M_WAITOK for all remaining malloc() calls.
order to avoid the overhead of later page faults. In general, it
implements two cases: one for vnode-backed objects and one for
device-backed objects. Only the device-backed case is really
machine-dependent, belonging in the pmap.
This commit moves the vnode-backed case into the (relatively) new
function vm_map_pmap_enter(). On amd64 and i386, this commit only
amounts to code rearrangement. On alpha and ia64, the new machine
independent (MI) implementation of the vnode case is smaller and more
efficient than their pmap-based implementations. (The MI
implementation takes advantage of the fact that objects in -CURRENT
are ordered collections of pages.) On sparc64, pmap_object_init_pt()
hadn't (yet) been implemented.
o use a mutex to protect the bounce pages structure.
o use a SYSINIT function to initialize the bounce pages structures
and thus avoid a race condition in alloc_bounce_pages().
o add support for the BUS_DMA_NOWAIT flag in bus_dmamap_load().
o remove obsolete splhigh()/splx() calls.
o remove printf() about incorrect locking in busdma_swi() and sync
busdma_swi() with the one of the alpha backend.
o use __FBSDID.
Add two new arguments to bus_dma_tag_create(): lockfunc and lockfuncarg.
Lockfunc allows a driver to provide a function for managing its locking
semantics while using busdma. At the moment, this is used for the
asynchronous busdma_swi and callback mechanism. Two lockfunc implementations
are provided: busdma_lock_mutex() performs standard mutex operations on the
mutex that is specified from lockfuncarg. dftl_lock() is a panic
implementation and is defaulted to when NULL, NULL are passed to
bus_dma_tag_create(). The only time that NULL, NULL should ever be used is
when the driver ensures that bus_dmamap_load() will not be deferred.
Drivers that do not provide their own locking can pass
busdma_lock_mutex,&Giant args in order to preserve the former behaviour.
sparc64 and powerpc do not provide real busdma_swi functions, so this is
largely a noop on those platforms. The busdma_swi on is64 is not properly
locked yet, so warnings will be emitted on this platform when busdma
callback deferrals happen.
If anyone gets panics or warnings from dflt_lock() being called, please
let me know right away.
Reviewed by: tmm, gibbs
implementation of a largely MI pmap_object_init_pt() for vnode-backed
objects. pmap_enter_quick() is implemented via pmap_enter() on sparc64
and powerpc.
- Correct a mismatch between pmap_object_init_pt()'s prototype and its
various implementations. (I plan to keep pmap_object_init_pt() as
the MD hook for device-backed objects on i386 and amd64.)
- Correct an error in ia64's pmap_enter_quick() and adjust its interface
to match the other versions. Discussed with: marcel
function behaves correctly in principle, but is not expected to be
100% complete. In any case, with this commit we have KSE ported
enough to start runtime testing with threaded applications and fix
whatever bugs or omissions we encounter. Yay!
bus_dma async callback scheme. Note that sparc64 does not seem to do
async callbacks. Note that ia64 callbacks might not be MPSAFE at the
moment. Note that powerpc doesn't seem to do async callbacks due to
the implementation being incomplete.
Reviewed by: mostly silence on arch@
to the machine-independent parts of the VM. At the same time, this
introduces vm object locking for the non-i386 platforms.
Two details:
1. KSTACK_GUARD has been removed in favor of KSTACK_GUARD_PAGES. The
different machine-dependent implementations used various combinations
of KSTACK_GUARD and KSTACK_GUARD_PAGES. To disable guard page, set
KSTACK_GUARD_PAGES to 0.
2. Remove the (unnecessary) clearing of PG_ZERO in vm_thread_new. In
5.x, (but not 4.x,) PG_ZERO can only be set if VM_ALLOC_ZERO is passed
to vm_page_alloc() or vm_page_grab().
PCB contains FP registers, whose alignment must be 16 bytes at least.
Since the PCB pointed to by pc_pcb is immediately after the PCPU
itself, round-up the size of thge PCPU to a multiple of 16 bytes. The
PCPU is page aligned.
This fixes a misalignment trap caused by stopping a CPU in a SMP
kernel, such as been done when entering the debugger.
Reported by: Alan Robinson <alan.robinson@fujitsu-siemens.com>
the caller assumes this to not happen by means of performing an
indirection without checking the return value. Add KASSERTs to
force a kernel with INVARIANTS to panic. This is a short-term
measure. The pmap code is scheduled to be overhauled.
to deliver a signal and the RSE backing store has been exhausted or
the backing store pointer has been clobbered, we need to make sure
we call userret() and do_ast() when we exit from trap(). Not adjusting
the local variable 'user' in this case will prevent the faulty process
from being terminated and we end up in an infinite fault repetition.
Faulty process provided by: bento
always kernel space. It should be treated as user space when run with
user privileges (which is the case for the signal trampolines). This
fixes its only use in a KASSERT in subr_trap.c.
got fixed two weeks after the ia64 version was copied from the alpha
version (see rev 1.32 of sys/alpha/alpha/mem.c). As such, we were
missing the same continue as on alpha.
While here, add a default case for the device minor switch and do
some general style(9) cleanups.
WARNING: this file still has bugs. When reading from region 6 or
region 7, we don't validate the physical address. One can trivially
cause a machine check by trying to read from address 0xFFFFFFFFFFFFFFF0
or something that uses the unimplemented physical address bits.
Reported by: Alan Robinson <alan.robinson@fujitsu-siemens.com>
we were passing in a void* representing the PCB of the parent thread.
Now we pass a pointer to the parent thread itself.
The prime reason for this change is to allow cpu_set_upcall() to copy
(parts of) the trapframe instead of having it done in MI code in each
caller of cpu_set_upcall(). Copying the trapframe cannot always be
done with a simply bcopy() or may not always be optimal that way. On
ia64 specifically the trapframe contains information that is specific
to an entry into the kernel and can only be used by the corresponding
exit from the kernel. A trapframe copied verbatim from another frame
is in most cases useless without some additional normalization.
Note that this change removes the assignment to td->td_frame in some
implementations of cpu_set_upcall(). The assignment is redundant.
A previous call to cpu_thread_setup() already did the exact same
assignment. An added benefit of removing the redundant assignment is
that we can now change td_pcb without nasty side-effects.
This change officially marks the ability on ia64 for 1:1 threading.
Not tested on: amd64, powerpc
Compile & boot tested on: alpha, sparc64
Functionally tested on: i386, ia64
o Use pcb and tf for the new pcb and the new trapframe and use pcb0
for the old (current) pcb. The mix of pcb, pcb2 and tf was slightly
confusing.
o Don't define td->td_frame here. It has already been set previously
by cpu_thread_setup. Add a KASSERT to make sure pcb and tf are both
non-NULL.
o Make sure the number of dirty registers is 0 for the new thread.
There are no user registers on the backing store because we heven't
enter userland yet.
gateway page is considered kernel space, we can panic when we should
only SIGSEGV. Hence, add the additional constraint that for page
faults we also require running with kernel privileges. The gateway
page is the only kernel code running with user privileges, iso this
is a correct way to exclude the gateway page from kernel land.
We do not currently exclude the gateway page for other faults as it
is not always the right way to do it. Further tuning will happen on
a case by case bases.
thr_create(2). This implementation is so far only compile tested.
But since this is also the last of the functions required to
support libthr, we're now functionally complete (for some weird
definition of functionally; and complete). Runtime testing can
commence.
sigreturn(), we cheat and assume the preserved registers are still
on-chip and unmodified. This is actually the case, but more by accident
than by design. We need to use unwinding eventually or explicitly
compile the kernel in a way that the compiler steers clear from using
the preserved registers completely.
o The SDM states that flushing the RSE in the cycle prior to the
call to ia32 code yields the best performance. We don't really
care to much about performance here, but we do the same anyway.
I'm being paranoia and conservative here.
o Only initialize the ia32 state registers, not the registers used
as scratch by the ia32 engine. This saves a couple of loads from
the trapframe, but also helps debugging: we don't clobber useful
debugging data (engineering hints :-)
o Make sure all general registers constituting ia32 state have been
initialized. If there's no useful to be loaded from the trapframe,
clear the register. This avoids accidentally leaking NaT bits.
o Make sure we set ar.k6 prior to clobbering ar.bspstore and also
set ar.k7 prior to setting sp. This fixes a race seen for ia64
native code as well (and previously fixed too).
backing store before we discard them. It is possible that we
enter the kernel (due to an execve in this case) with a lot of
dirty user registers and that the RSE has only partially spilled
them (to make room for new frames). We cannot move the backing
store pointer down (to discard user registers) when not all of
the user registers are on the backing store.
So, we flush the register stack IFF this happens. Unconditionally
doing the flush is too costly, because the condition in which we
need to flush is very rare.
This change appears to fix the SIGSEGV that sometimes happen for
newly executed processes and so far also appears to fix the last
of the corruption. It is possible, although not likely, that this
change prevents some other bug from happening, even though it is
itself not a fix. Hence the uncertainty. We'll know in a couple
of months I guess :-)
The current name is confusing, because it indicates to
the client that a bus_dmamap_sync() operation is not
necessary when the flag is specified, which is wrong.
The main purpose of this flag is to hint the underlying
architecture that DMA memory should be mapped in a coherent
way, but the architecture can ignore it. But if the
architecture does supports coherent mapping of memory, then
it makes bus_dmamap_sync() calls cheap.
This flag is the same as the one in NetBSD's Bus DMA.
Reviewed by: gibbs, scottl, des (implicitly)
Approved by: re@ (jhb)