for parsing model-specific and other fields in machine check events
including the global machine check capabilities and status registers,
CPU identification, and the FreeBSD CPU ID.
- Report these added fields in the console log of a machine check so that
a record structure can be reconstituted from the console messages.
- Parse new architectural errors including memory controller errors.
MFC after: 1 week
for upcoming 64-bit PowerPC and MIPS support. This renames the COMPAT_IA32
option to COMPAT_FREEBSD32, removes some IA32-specific code from MI parts
of the kernel and enhances the freebsd32 compatibility code to support
big-endian platforms.
Reviewed by: kib, jhb
processors. With this workaround, superpage promotion can be re-enabled
under virtualization. Moreover, machine check exceptions can safely be
enabled when FreeBSD is running natively on Family 10h processors.
Most of the credit should go to Andriy Gapon for diagnosing the error and
working with Borislav Petkov at AMD to document it. Andriy also reviewed
and tested my patches.
Discussed with: jhb
MFC after: 3 weeks
correctly initialized and just then assign to softclock/profclock.
Right now, some atrtc seems reporting strange diagnostic error* making the
current pattern bogus.
In order to do that cleanly, lapic_setup_clock(), on both ia32 and amd64,
now accepts as arguments the desired sources to handle, and returns the
actual ones (LAPIC_CLOCK_NONE is forbidden because otherwise there is no
meaning in calling such function).
This allows to bring out into commont x86 code the handling part for
machdep.lapic_allclocks tunable, which is retained.
Sponsored by: Sandvine Incorporated
Tested by: yongari, Richard Todd
<rmtodd at ichotolot dot servalan dot com>
MFC: 3 weeks
X-MFC: r202387, 204309
This header file uses __packed, without including <sys/cdefs.h>. This
means it cannot be used in the way described in sysarch(3) by only
including <machine/sysarch.h>.
LAPIC may lead to aliasing for softclock and profclock because frequencies
are sized in order to fit mainly hardclock.
atrtc used to take care of the softclock and profclock and it does still
do, if the LAPIC can't handle the clocks properly.
Revert the change when the LAPIC started taking charge of all three of
them and let atrtc handle softclock and profclock if not explicitly
requested. Such request can be made setting != 0 the new tunable
machdep.lapic_allclocks or if the new device ATPIC is not present
within the i386 kernel config (atrtc is linked to atpic presence).
Diagnosed by: Sandvine Incorporated
Reviewed by: jhb, emaste
Sponsored by: Sandvine Incorporated
MFC: 3 weeks
I/O port access is implemented on Itanium by reading and writing to a
special region in memory. To hide details and avoid misaligned memory
accesses, a process did I/O port reads and writes by making a MD system
call. There's one fatal problem with this approach: unprivileged access
was not being prevented. /dev/io serves that purpose on amd64/i386, so
employ it on ia64 as well. Use an ioctl for doing the actual I/O and
remove the sysarch(2) interface.
Backward compatibility is not being considered. The sysarch(2) approach
was added to support X11, but support for FreeBSD/ia64 was never fully
implemented in X11. Thus, nothing gets broken that didn't need more work
to begin with.
MFC after: 1 week
sys/vmmeter.h: warning: shadowed declaration is here
machine/cpufunc.h: In function 'insw':
machine/cpufunc.h: warning: declaration of 'cnt' shadows a global declaration
..snip..
- directly print mca information in case we fail to allocate memory
for a record
- include bank number into mca record
- print raw mca status value for extended information
Reviewed by: jhb
MFC after: 10 days
not properly set up. r199067 added the call to TUNABLE_INT_FETCH() to
initializecpu() that results in hang because AP are started when kernel
environment is already dynamic and thus needs to acquire mutex, that is
too early in AP start sequence to work.
Extract the code that should be executed only once, because it sets
up global variables, from initializecpu() to initializecpucache(),
and call the later only from hammer_time() executed on BSP. Now,
TUNABLE_INT_FETCH() is done only once at BSP at the early boot stage.
In collaboration with: Mykola Dzham <freebsd levsha org ua>
Reviewed by: jhb
Tested by: ed, battlez
handlers. This is primarily intended as a way to allow devices that use
multiple interrupts (e.g. MSI) to meaningfully distinguish the various
interrupt handlers.
- Add a new BUS_DESCRIBE_INTR() method to the bus interface to associate
a description with an active interrupt handler setup by BUS_SETUP_INTR.
It has a default method (bus_generic_describe_intr()) which simply passes
the request up to the parent device.
- Add a bus_describe_intr() wrapper around BUS_DESCRIBE_INTR() that supports
printf(9) style formatting using var args.
- Reserve MAXCOMLEN bytes in the intr_handler structure to hold the name of
an interrupt handler and copy the name passed to intr_event_add_handler()
into that buffer instead of just saving the pointer to the name.
- Add a new intr_event_describe_handler() which appends a description string
to an interrupt handler's name.
- Implement support for interrupt descriptions on amd64 and i386 by having
the nexus(4) driver supply a custom bus_describe_intr method that invokes
a new intr_describe() MD routine which in turn looks up the associated
interrupt event and invokes intr_event_describe_handler().
Requested by: many
Reviewed by: scottl
MFC after: 2 weeks
by looking at the bases used for non-relocatable executables by gnu ld(1),
and adjusting it slightly.
Discussed with: bz
Reviewed by: kan
Tested by: bz (i386, amd64), bsam (linux)
MFC after: some time
specify their own version of atomic_cmpset_* which could have been
different than the membar version.
Right now, however, FreeBSD is bound mostly to GCC-like compilers and
it is desired to add new support and compat shim mostly when there is
a real necessity, in order to avoid too much compatibility bloats.
In this optic, bring back atomic_cmpset_{acq, rel}_* to be the same as
atomic_cmpset_* and unwind the atomic_cmpset_barr_* introduction.
Requested by: jhb
Reviewed by: jhb
Tested by: Giovanni Trematerra <giovanni dot trematerra at
gmail dot com>
not defined through macros or similar) in order to be later compiled in
the kernel and offer this way the support for modules (and
compatibility among the UP case and SMP case).
Fix this for the newly introduced atomic_cmpset_barr_* cases by defining
and specifying a template. Note that the new DEFINE_CMPSET_GEN()
template save more typing on amd64 than the current code. [1]
- Fix the style for memory barriers on amd64.
[1] Reported by: Paul B. Mahol <onemda at gmail dot com>
memory barriers should also ensure that the compiler doesn't reorder paths
where they are used. GCC, however, does that aggressively, even in
presence of volatile operands. The most reliable way GCC offers for avoid
instructions reordering is clobbering "memory" even if that is
theoretically an heavy-weight operation, flushing the content of all
the registers and forcing reload of them (We could rely, however, on
gcc DTRT by just understanding the purpose as this is a well-known
pattern for many modern operating-systems).
Not all our memory barriers, right now, clobber memory for GCC-like
compilers. The most notable cases are IA32 and amd64 where the memory
barrier are treacted the same as normal atomic instructions.
Fix this by offering the possibility to implement atomic instructions
with memory barriers separately from the normal version and implement
the GCC-like specific one using memory clobbering.
Thanks to Chris Lattner (@apple) for his discussion on llvm specifics.
Reported by: jhb
Reviewed by: jhb
Tested by: rdivacky, Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
startup and genericize it so it can be reused to map other tables as well:
- Add a routine to walk a list of ACPI subtables such as those used in the
APIC and SRAT tables in the MI acpi(4) driver.
- Move the routines for mapping and unmapping an ACPI table as well as
mapping the RSDT or XSDT and searching for a table with a given signature
out into acpica_machdep.c for both amd64 and i386.
- Provide lapic_disable_pmc(), lapic_enable_pmc(), and lapic_reenable_pmc()
routines in the local APIC code that the hwpmc(4) driver can use to
manage the local APIC PMC interrupt vector.
- Do not enable the local APIC PMC interrupt vector by default when
HWPMC_HOOKS is enabled. Instead, the hwpmc(4) driver explicitly
enables the interrupt when it is succesfully initialized and disables
the interrupt when it is unloaded. This avoids enabling the interrupt
on unsupported CPUs which may result in spurious NMIs.
Reported by: rnoland
Reviewed by: jkoshy
Approved by: re (kib)
MFC after: 2 weeks
has proven to have a good effect when entering KDB by using a NMI,
but it completely violates all the good rules about interrupts
disabled while holding a spinlock in other occasions. This can be the
cause of deadlocks on events where a normal IPI_STOP is expected.
* Adds an new IPI called IPI_STOP_HARD on all the supported architectures.
This IPI is responsible for sending a stop message among CPUs using a
privileged channel when disponible. In other cases it just does match a
normal IPI_STOP.
Right now the IPI_STOP_HARD functionality uses a NMI on ia32 and amd64
architectures, while on the other has a normal IPI_STOP effect. It is
responsibility of maintainers to eventually implement an hard stop
when necessary and possible.
* Use the new IPI facility in order to implement a new userend SMP kernel
function called stop_cpus_hard(). That is specular to stop_cpu() but
it does use the privileged channel for the stopping facility.
* Let KDB use the newly introduced function stop_cpus_hard() and leave
stop_cpus() for all the other cases
* Disable interrupts on CPU0 when starting the process of APs suspension.
* Style cleanup and comments adding
This patch should fix the reboot/shutdown deadlocks many users are
constantly reporting on mailing lists.
Please don't forget to update your config file with the STOP_NMI
option removal
Reviewed by: jhb
Tested by: pho, bz, rink
Approved by: re (kib)
established, OS shall flush the caches on all processors that may have
used the mapping previously. This operation is not needed if processors
support self-snooping. If not, but clflush instruction is implemented
on the CPU, series of the clflush can be used on the mapping region.
Otherwise, we have to flush the whole cache. The later operation is very
expensive, and AMD-made CPUs do not have self-snooping.
Implement cache flush for remapped region by using clflush for amd64,
when supported by CPU.
Proposed and reviewed by: alc
Approved by: re (kensmith)
dependent memory attributes:
Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.
Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.
Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:
kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.
vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.
Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.
Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.
In collaboration with: jhb
Approved by: re (kib)
return path only when neither thread was context switched while
executing syscall code nor syscall explicitely modified LDT or MSRs.
Save segment registers in trap handlers before interrupts are enabled,
to not allow context switches to happen before registers are saved.
Use separated byte in pcb for indication of fast/full return, since
pcb_flags are not synchronized with context switches.
The change puts back syscall microbenchmark numbers that were slowed
down after commit of the support for LDT on amd64.
Reviewed by: jeff
Tested (and tested, and tested ...) by: pho
Approved by: re (kensmith)
o add to platforms where it was missing (arm, i386, powerpc, sparc64, sun4v)
o define as "1" on amd64 and i386 where there is no restriction
o make the type returned consistent with ALIGN
o remove _ALIGNED_POINTER
o make associated comments consistent
Reviewed by: bde, imp, marcel
Approved by: re (kensmith)
- For x86, change the interrupt source method to assign an interrupt source
to a specific CPU to return an error value instead of void, thus allowing
it to fail.
- If moving an interrupt to a CPU fails due to a lack of IDT vectors in the
destination CPU, fail the request with ENOSPC rather than panicing.
- For MSI interrupts on x86 (but not MSI-X), only allow cpuset to be used
on the first interrupt in a group. Moving the first interrupt in a group
moves the entire group.
- Use the icu_lock to protect intr_next_cpu() on x86 instead of the
intr_table_lock to fix a LOR introduced in the last set of MSI changes.
- Add a new privilege PRIV_SCHED_CPUSET_INTR for using cpuset with
interrupts. Previously, binding an interrupt to a CPU only performed a
privilege check if the interrupt had an interrupt thread. Interrupts
without a thread could be bound by non-root users as a result.
- If an interrupt event's assign_cpu method fails, then restore the original
cpuset mask for the associated interrupt thread.
Approved by: re (kib)
required by video card drivers. Specifically, this change introduces
vm_cache_mode_t with an appropriate VM_CACHE_DEFAULT definition on all
architectures. In addition, this changes adds a vm_cache_mode_t parameter
to kmem_alloc_contig() and vm_phys_alloc_contig(). These will be the
interfaces for allocating mapped kernel memory and physical memory,
respectively, with non-default cache modes.
In collaboration with: jhb
This is mostly important for the multiple MSI message case where the
IDT vectors for the entire group need to be allocated together. This
also restores the assumptions made by the PCI bus code that it could
invoke PCIB_MAP_MSI() once MSI vectors were allocated.
- To avoid whiplash with CPU assignments, change the way that CPUs are
assigned to interrupt sources on activation. Instead of assigning the
CPU via pic_assign_cpu() before calling enable_intr(), allow the
different interrupt source drivers to ask the MD interrupt code which
CPU to use when they allocate an IDT vector. I/O APIC interrupt pins
do this in their pic_enable_intr() routines giving the same behavior as
before. MSI sources do it when the IDT vectors are allocated during
msi_alloc() and msix_alloc().
- Change the intr_table_lock from an sx lock to a mutex.
Tested by: rnoland
With the arrival of 128+ cores it is necessary to handle more than that.
One of the first thing to change is the support for cpumask_t that needs
to handle more than 32 bits masking (which happens now). Some places,
however, still assume that cpumask_t is a 32 bits mask.
Fix that situation by using always correctly cpumask_t when needed.
While here, remove the part under STOP_NMI for the Xen support as it
is broken in any case.
Additively make ipi_nmi_pending as static.
Reviewed by: jhb, kmacy
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
- For CPUs that only support MCE (the machine check exception) but not MCA
(i.e. Pentium), all this does is print out the value of the machine check
registers and then panic when a machine check exception occurs.
- For CPUs that support MCA (the machine check architecture), the support is
a bit more involved.
- First, there is limited support for decoding the CPU-independent MCA
error codes in the kernel, and the kernel uses this to output a short
description of any machine check events that occur.
- When a machine check exception occurs, all of the MCx banks on the
current CPU are scanned and any events are reported to the console
before panic'ing.
- To catch events for correctable errors, a periodic timer kicks off a
task which scans the MCx banks on all CPUs. The frequency of these
checks is controlled via the "hw.mca.interval" sysctl.
- Userland can request an immediate scan of the MCx banks by writing
a non-zero value to "hw.mca.force_scan".
- If any correctable events are encountered, the appropriate details
are stored in a 'struct mca_record' (defined in <machine/mca.h>).
The "hw.mca.count" is a count of such records and each record may
be queried via the "hw.mca.records" tree by specifying the record
index (0 .. count - 1) as the next name in the MIB similar to using
PIDs with the kern.proc.* sysctls. The idea is to export machine
check events to userland for more detailed processing.
- The periodic timer and hw.mca sysctls are only present if the CPU
supports MCA.
Discussed with: emaste (briefly)
MFC after: 1 month
and hide it inside of atrtc driver. Add new tunable hint.atrtc.0.clock
controlling it. Setting it to 0 disables using RTC clock as stat-/
profclock sources.
Teach i386 and amd64 SMP platforms to emulate stat-/profclocks using i8254
hardclock, when LAPIC and RTC clocks are disabled.
This allows to reduce global interrupt rate of idle system down to about
100 interrupts per core, permitting C3 and deeper C-states provide maximum
CPU power efficiency.
topology of nehalem/corei7 based systems.
- Remove the cpu_cores/cpu_logical detection from identcpu.
- Describe the layout of the system in cpu_mp_announce().
Sponsored by: Nokia
a fair number of static data structures, making this an unlikely
option to try to change without also changing source code. [1]
Change default cache line size on ia64, sparc64, and sun4v to 128
bytes, as this was what rtld-elf was already using on those
platforms. [2]
Suggested by: bde [1], jhb [2]
MFC after: 2 weeks
CACHE_LINE_SIZE constant. These constants are intended to
over-estimate the cache line size, and be used at compile-time
when a run-time tuning alternative isn't appropriate or
available.
Defaults for all architectures are 64 bytes, except powerpc
where it is 128 bytes (used on G5 systems).
MFC after: 2 weeks
Discussed on: arch@
- Do not iterate int 15h, function e820h twice. Instead, we use STAILQ to
store each return buffer and copy all at once.
- Export optional extended attributes defined in ACPI 3.0 as separate
metadata. Currently, there are only two bits defined in the specification.
For example, if the descriptor has extended attributes and it is not
enabled, it has to be ignored by OS. We may implement it in the kernel
later if it is necessary and proven correct in reality.
- Check return buffer size strictly as suggested in ACPI 3.0.
Reviewed by: jhb
Remove a hack to generate more efficient code for port numbers below
0x100, which has been obsolete for at least ten years, because GCC has
an asm constraint to specify that.
Submitted by: Christoph Mallon <christoph mallon gmx de>
Most compilers nowadays (including GCC) are smart enough to know what's
going on and generate more efficient code anyway.
Submitted by: Christoph Mallon <christoph.mallon@gmx.de>
Because the "c" input constaint is used, the compiler will already place
the MSR_FSBASE/MSR_GSBASE constants in ecx. Using __asm("ecx") makes
LLVM crash. Even though this is also an LLVM bug, we'd better remove the
unnecessary GCCism as well.
Submitted by: Christoph Mallon <christoph.mallon@gmx.de>
the kernel on amd64. Fill and read segment registers for mcontext and
signals. Handle traps caused by restoration of the
invalidated selectors.
Implement user-mode creation and manipulation of the process-specific
LDT descriptors for amd64, see sysarch(2).
Implement support for TSS i/o port access permission bitmap for amd64.
Context-switch LDT and TSS. Do not save and restore segment registers on
the context switch, that is handled by kernel enter/leave trampolines
now. Remove segment restore code from the signal trampolines for
freebsd/amd64, freebsd/ia32 and linux/i386 for the same reason.
Implement amd64-specific compat shims for sysarch.
Linuxolator (temporary ?) switched to use gsbase for thread_area pointer.
TODO:
Currently, gdb is not adapted to show segment registers from struct reg.
Also, no machine-depended ptrace command is added to set segment
registers for debugged process.
In collaboration with: pho
Discussed with: peter
Reviewed by: jhb
Linuxolator tested by: dchagin
Reorder amd64 gdt descriptors so that user-accessible selectors are the
same as on i386. At least Wine hard-codes this into the binary.
In collaboration with: pho
Reviewed by: jhb
Provides i386/freebsd API-compatible definitions for the argument
structures of the above sysarch commands. struct i386_ioperm_args
definition is ABI-compatible.
In collaboration with: pho
Reviewed by: jhb
To keep these structures ABI-compatible, half the size of r_trapno,
r_err, mc_trapno, mc_flags.
Add fsbase and gsbase to mcontext on both amd64 and i386.
Add flags to amd64 mcontext to indicate that it contains valid segments
or bases.
In collaboration with: pho
Discussed with: peter
Reviewed by: jhb
stored in the pmap is from the direct map region. The two exceptions have
been the kernel pmap and the swapper's pmap. These pmaps have used a
kernel virtual address established by pmap_bootstrap() for their shared
pml4 page table page. However, there is no reason not to use the direct
map for these pmaps as well.
to the full path of the image that is being executed.
Increase AT_COUNT.
Remove no longer true comment about types used in Linux ELF binaries,
listed types contain FreeBSD-specific entries.
Reviewed by: kan
This code is heavily inspired by Takanori Watanabe's experimental SMP patch
for i386 and large portion was shamelessly cut and pasted from Peter Wemm's
AP boot code.
ABIs:
- Store the FPU initial control word in the pcb for each thread.
- When first using the FPU, load the initial control word after restoring
the clean state if it is not the standard control word.
- Provide a correct control word for Linux/i386 binaries under
FreeBSD/amd64.
- Adjust the control word returned for fpugetregs()/npxgetregs() when a
thread hasn't used the FPU yet to reflect the real initial control
word for the current ABI.
- The Linux/i386 ABI for FreeBSD/i386 now properly sets the right control
word instead of trashing whatever the current state of the FPU is.
Reviewed by: bde
- fpudna() always returned 1 since amd64 CPUs always have FPUs. Change
the function to return void and adjust the calling code in trap() to
assume the return 1 case is the only case.
- Remove fpu_cleanstate_ready as it is always true when it is tested.
Also, only initialize fpu_cleanstate when fpuinit() is called on the BSP.
Reviewed by: bde
mode.
- Make the NMI handler run on its own stack (TSS_IST2).
- Store the GSBASE value for each CPU just before the start of
each NMI stack, permitting efficient retrieval using %rsp-relative
addressing.
- For NMIs taken from kernel mode, program MSR_GSBASE explicitly
since one or both of MSR_GSBASE and MSR_KGSBASE can be potentially
invalid. The current contents of MSR_GSBASE are saved and restored
at exit.
- For NMIs handled from user mode, continue to use 'swapgs' to
load the per-CPU GSBASE.
Reviewed by: jeff
Debugging help: jeff
Tested by: gnn, Artem Belevich <artemb at gmail dot com>
more irqs as we have more cpus. This is principally useful on systems
with msi devices which may want many irqs per-cpu.
Discussed with: jhb
Sponsored by: Nokia
32-bit processes. The value matches the initial setting used by
FreeBSD/i386. Otherwise, 32-bit binaries using floating point would use
a slightly different initial state when run on FreeBSD/amd64.
MFC after: 1 week
and Core Duo), models 0xF (Core2), model 0x17 (Core2Extreme) and
model 0x1C (Atom).
In these CPUs, the actual numbers, kinds and widths of PMCs present
need to queried at run time. Support for specific "architectural"
events also needs to be queried at run time.
Model 0xE CPUs support programmable PMCs, subsequent CPUs
additionally support "fixed-function" counters.
- Use event names that are close to vendor documentation, taking in
account that:
- events with identical semantics on two or more CPUs in this family
can have differing names in vendor documentation,
- identical vendor event names may map to differing events across
CPUs,
- each type of CPU supports a different subset of measurable
events.
Fixed-function and programmable counters both use the same vendor
names for events. The use of a class name prefix ("iaf-" or
"iap-" respectively) permits these to be distinguished.
- In libpmc, refactor pmc_name_of_event() into a public interface
and an internal helper function, for use by log handling code.
- Minor code tweaks: staticize a global, freshen a few comments.
Tested by: gnn
and ifnet functions
- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own
- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers
- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
allow drivers to efficiently manage multiple hardware queues
(i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues
This work was supported by Bitgravity Inc. and Chelsio Inc.
dependencies. A 'struct pmc_classdep' structure describes operations
on PMCs; 'struct pmc_mdep' contains one or more 'struct pmc_classdep'
structures depending on the CPU in question.
Inside PMC class dependent code, row indices are relative to the
PMCs supported by the PMC class; MI code in "hwpmc_mod.c" translates
global row indices before invoking class dependent operations.
- Augment the OP_GETCPUINFO request with the number of PMCs present
in a PMC class.
- Move code common to Intel CPUs to file "hwpmc_intel.c".
- Move TSC handling to file "hwpmc_tsc.c".
all to date and the latter also is only used in ia64 and powerpc
code which no longer serves a real purpose after bring-up and just
can be removed as well. Note that architectures like sun4u also
provide no means of implementing IPI'ing a CPU itself natively
in the first place.
Suggested by: jhb
Reviewed by: arch, grehan, jhb
On the i386 architecture, the processor only saves the current value
of `%esp' on stack if a privilege switch is necessary when entering
the interrupt handler. Thus, `frame->tf_esp' is only valid for
an entry from user mode. For interrupts taken in kernel mode, we
need to determine the top-of-stack for the interrupted kernel
procedure by adding the appropriate offset to the current frame
pointer.
Reported by: kris, Fabien Thomas
Tested by: Fabien Thomas <fabien.thomas at netasq dot com>
location in GDT where the segment descriptor from pcb_gs32sd is
copied, and the location is in GDT local to CPU.
Noted and reviewed by: peter
MFC after: 1 week
- Rename pciereg_cfgopen() to pcie_cfgregopen() and expose it to the
rest of the kernel. It now also accepts parameters via function
arguments rather than global variables.
- Add a notion of minimum and maximum bus numbers and reject requests for
an out of range bus.
- Add more range checks on slot/func/reg/bytes parameters to the cfg reg
read/write routines. Don't panic on any invalid parameters, just fail
the request (writes do nothing, reads return -1). This matches the
behavior of the other cfg mechanisms.
- Port the memory mapped configuration space access to amd64. On amd64
we simply use the direct map (via pmap_mapdev()) for the memory mapped
window.
- During acpi_attach() just after loading the ACPI tables, check for a
MCFG table. If it exists, call pciereg_cfgopen() on each subtable
(memory mapped window). For now we only support windows for domain 0
that start with bus 0. This removes the need for more chipset-specific
quirks in the MD code.
- Remove the chipset-specific quirks for the Intel 5000P/V/Z chipsets
since these machines should all have MCFG tables via ACPI.
- Updated pci_cfgregopen() to DTRT if ACPI had invoked pcie_cfgregopen()
earlier.
MFC after: 2 weeks
features of CPUs like reading/writing machine-specific registers,
retrieving cpuid data, and updating microcode.
- Add cpucontrol(8) utility, that provides userland access to
the features of cpuctl(4).
- Add subsequent manpages.
The cpuctl(4) device operates as follows. The pseudo-device node cpuctlX
is created for each cpu present in the systems. The pseudo-device minor
number corresponds to the cpu number in the system. The cpuctl(4) pseudo-
device allows a number of ioctl to be preformed, namely RDMSR/WRMSR/CPUID
and UPDATE. The first pair alows the caller to read/write machine-specific
registers from the correspondent CPU. cpuid data could be retrieved using
the CPUID call, and microcode updates are applied via UPDATE.
The permissions are inforced based on the pseudo-device file permissions.
RDMSR/CPUID will be allowed when the caller has read access to the device
node, while WRMSR/UPDATE will be granted only when the node is opened
for writing. There're also a number of priv(9) checks.
The cpucontrol(8) utility is intened to provide userland access to
the cpuctl(4) device features. The utility also allows one to apply
cpu microcode updates.
Currently only Intel and AMD cpus are supported and were tested.
Approved by: kib
Reviewed by: rpaulo, cokane, Peter Jeremy
MFC after: 1 month
mode changes, and cache and TLB invalidation when some or all of the
specified range is already mapped with the specified cache mode.
Submitted by: Magesh Dhasayyan
the 32bit images on amd64.
Change the semantic of the PCB_32BIT pcb flag to request the context
switch code to operate on the segment registers. Its previous meaning
of saving or restoring the %gs base offset is assigned to the new
PCB_GS32BIT flag.
FreeBSD 32bit image activator sets the PCB_32BIT flag, while Linux 32bit
emulation sets PCB_32BIT | PCB_GS32BIT.
Reviewed by: peter
MFC after: 2 weeks
page directory pages from VM_MIN_KERNEL_ADDRESS through the end of the
kernel's bss. Specifically, the dependence was in pmap_growkernel()'s one-
time initialization of kernel_vm_end, not in its main body. (I could not,
however, resist the urge to optimize the main body.)
Reduce the number of preallocated page directory pages to just those needed
to support NKPT page table pages. (In fact, this allows me to revert a
couple of my earlier changes to create_pagetables().)
ceiling as a fraction of the kernel map's size rather than an absolute
quantity. Thus, scaling of the kmem map's size will be automatic with
changes to the kernel map's size.
in practice, the error (currently) makes no difference because the computation
performed by KVADDR() hides the error. This revision fixes the error.
Also, eliminate a (now) unused definition.
maximum size of the kmem map can be greater than 4GB, there is little point
in making the kernel virtual address space larger than 6GB.
Tested by: kris@
Now that st_rdev is being automatically generated by the kernel, there
is no need to define static major/minor numbers for the iodev and
memdev. We still need the minor numbers for the memdev, however, to
distinguish between /dev/mem and /dev/kmem.
Approved by: philip (mentor)
address space on the amd64 architecture. The amd64 architecture
requires kernel code and global variables to reside in the highest 2GB
of the 64-bit virtual address space. Thus, KERNBASE cannot change.
However, KERNBASE is sometimes used as the start of the kernel virtual
address space. Henceforth, VM_MIN_KERNEL_ADDRESS should be used
instead. Since KERNBASE and VM_MIN_KERNEL_ADDRESS are still the same
address, there should be no visible effect from this change (yet).
from idle over the next tick.
- Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
suspended in cpu specific states. This function can fail and cause the
scheduler to fall back to another mechanism (ipi).
- Implement support for mwait in cpu_idle() on i386/amd64 machines that
support it. mwait is a higher performance way to synchronize cpus
as compared to hlt & ipis.
- Allow selecting the idle routine by name via sysctl machdep.idle. This
replaces machdep.cpu_idle_hlt. Only idle routines supported by the
current machine are permitted.
Sponsored by: Nokia
for better structure.
Much of this is related to <sys/clock.h>, which should really have
been called <sys/calendar.h>, but unless and until we need the name,
the repocopy can wait.
In general the kernel does not know about minutes, hours, days,
timezones, daylight savings time, leap-years and such. All that
is theoretically a matter for userland only.
Parts of kernel code does however care: badly designed filesystems
store timestamps in local time and RTC chips almost universally
track time in a YY-MM-DD HH:MM:SS format, and sometimes in local
timezone instead of UTC. For this we have <sys/clock.h>
<sys/time.h> on the other hand, deals with time_t, timeval, timespec
and so on. These know only seconds and fractions thereof.
Move inittodr() and resettodr() prototypes to <sys/time.h>.
Retain the names as it is one of the few surviving PDP/VAX references.
Move startrtclock() to <machine/clock.h> on relevant platforms, it
is a MD call between machdep.c/clock.c. Remove references to it
elsewhere.
Remove a lot of unnecessary <sys/clock.h> includes.
Move the machdep.disable_rtc_set sysctl to subr_rtc.c where it belongs.
XXX: should be kern.disable_rtc_set really, it's not MD.
these days, so de-generalize the acquire_timer/release_timer api
to just deal with speakers.
The new (optional) MD functions are:
timer_spkr_acquire()
timer_spkr_release()
and
timer_spkr_setfreq()
the last of which configures the timer to generate a tone of a given
frequency, in Hz instead of 1/1193182th of seconds.
Drop entirely timer2 on pc98, it is not used anywhere at all.
Move sysbeep() to kern/tty_cons.c and use the timer_spkr*() if
they exist, and do nothing otherwise.
Remove prototypes and empty acquire-/release-timer() and sysbeep()
functions from the non-beeping archs.
This eliminate the need for the speaker driver to know about
i8254frequency at all. In theory this makes the speaker driver MI,
contingent on the timer_spkr_*() functions existing but the driver
does not know this yet and still attaches to the ISA bus.
Syscons is more tricky, in one function, sc_tone(), it knows the hz
and things are just fine.
In the other function, sc_bell() it seems to get the period from
the KDMKTONE ioctl in terms if 1/1193182th second, so we hardcode
the 1193182 and leave it at that. It's probably not important.
Change a few other sysbeep() uses which obviously knew that the
argument was in terms of i8254 frequency, and leave alone those
that look like people thought sysbeep() took frequency in hertz.
This eliminates the knowledge of i8254_freq from all but the actual
clock.c code and the prof_machdep.c on amd64 and i386, where I think
it would be smart to ask for help from the timecounters anyway [TBD].
- Add a new intr_event method ie_assign_cpu() that is invoked when the MI
code wishes to bind an interrupt source to an individual CPU. The MD
code may reject the binding with an error. If an assign_cpu function
is not provided, then the kernel assumes the platform does not support
binding interrupts to CPUs and fails all requests to do so.
- Bind ithreads to CPUs on their next execution loop once an interrupt
event is bound to a CPU. Only shared ithreads are bound. We currently
leave private ithreads for drivers using filters + ithreads in the
INTR_FILTER case unbound.
- A new intr_event_bind() routine is used to bind an interrupt event to
a CPU.
- Implement binding on amd64 and i386 by way of the existing pic_assign_cpu
PIC method.
- For x86, provide a 'intr_bind(IRQ, cpu)' wrapper routine that looks up
an interrupt source and binds its interrupt event to the specified CPU.
MI code can currently (ab)use this by doing:
intr_bind(rman_get_start(irq_res), cpu);
however, I plan to add a truly MI interface (probably a bus_bind_intr(9))
where the implementation in the x86 nexus(4) driver would end up calling
intr_bind() internally.
Requested by: kmacy, gallatin, jeff
Tested on: {amd64, i386} x {regular, INTR_FILTER}
different "platforms" on x86 machines. The existing code already handles
having two platforms: ACPI and legacy. However, the existing approach was
rather hardcoded and difficult to extend. These changes take the approach
that each x86 hardware platform should provide its own nexus(4) driver (it
can inherit most of its behavior from the default legacy nexus(4) driver)
which is responsible for probing for the platform and performing
appropriate platform-specific setup during attach (such as adding a
platform-specific bus device). This does mean changing the x86 platform
busses to no longer use an identify routine for probing, but to move that
logic into their matching nexus(4) driver instead.
- Make the default nexus(4) driver in nexus.c on i386 and amd64 handle the
legacy platform. It's probe routine now returns BUS_PROBE_GENERIC so it
can be overriden.
- Expose a nexus_init_resources() routine which initializes the various
resource managers so that subclassed nexus(4) drivers can invoke it from
their attach routine.
- The legacy nexus(4) driver explicitly adds a legacy0 device in its
attach routine.
- The ACPI driver no longer contains an new-bus identify method. Instead
it exposes a public function (acpi_identify()) which is a probe routine
that the MD nexus(4) drivers can use to probe for ACPI. All of the
probe logic in acpi_probe() is now moved into acpi_identify() and
acpi_probe() is just a stub.
- On i386 and amd64, an ACPI-specific nexus(4) driver checks for ACPI via
acpi_identify() and claims the nexus0 device if the probe succeeds. It
then explicitly adds an acpi0 device in its attach routine.
- The legacy(4) driver no longer knows anything about the acpi0 device.
- On ia64 if acpi_identify() fails you basically end up with no devices.
This matches the previous behavior where the old acpi_identify() would
fail to add an acpi0 device again leaving you with no devices.
Discussed with: imp
Silence on: arch@
PhysMask fields based on the number of physical address bits supported
by the current CPU. The old code assumed 36 bits on i386 and 40 bits on
amd64. In truth, all Intel CPUs up until recently used 36 bits (a newer
Intel CPU uses 38 bits) and all the Opteron CPUs used 40 bits.
In at least one case (the new Intel CPU) having the size of the mask field
wrong resulted in writing questionable values into the MTRR registers on
the application processors (BSP as well if you modify the MTRRs via
memcontrol or running X, etc.). The result of the questionable physmask
was that all of memory was apparently treated as uncached rather than
write-back resulting in a very significant performance hit.
Fix this by constructing a run-time mask for the PhysBase and PhysMask
fields based on the number of physical address bits supported by the CPU.
All 64-bit capable CPUs provide a count of PA bits supported via the
0x80000008 extended CPUID feature, so use that if it is available. If that
feature is not available, then assume 36 PA bits.
While I'm here, expand the (now-unused) macros for the PhysBase and
PhysMask fields to the current largest possible value (52 PA bits).
MFC after: 1 week
PR: i386/120516
Reported by: Nokia
mappings. Automatic promotion can be enabled by setting the tunable
"vm.pmap.pg_ps_enabled" to a non-zero value. By default, automatic
promotion is disabled. (Expect this to change.)
Reviewed by: ups
Tested by: kris, Peter Holm
tree structure that encodes the level of cache sharing and other
properties.
- Provide several convenience functions for creating one and two level
cpu trees as well as a default flat topology. The system now always
has some topology.
- On i386 and amd64 create a seperate level in the hierarchy for HTT
and multi-core cpus. This will allow the scheduler to intelligently
load balance non-uniform cores. Presently we don't detect what level
of the cache hierarchy is shared at each level in the topology.
- Add a mechanism for testing common topologies that have more information
than the MD code is able to provide via the kern.smp.topology tunable.
This should be considered a debugging tool only and not a stable api.
Sponsored by: Nokia
in the range and precision of their type(s) on amd64, but FLT_EVAL_METHOD
said that they were evalated in the "interesting" (buggy) i387 methods.
float_t was broken compatibly with FLT_EVAL_METHOD.
These definitions seem to be broken on powerpc and possibly on arm.
float_t is float on powerpc with gcc [-notraditional] according to
glibc, and FLT_EVAL_METHOD is marked with XXX on arm.
Unmasked exceptions (which can be fixed up using fpset*() before they
trap) are very rare, especially on amd64 since SSE exceptions trap
synchronously, but I want to merge the faster amd64 implementations of
fpset*() back to i386 without introducing the bug on i386.
The i386 implementation has always avoided the trap automatically by
changing things using load/store of the FP environment, but this is
very slow. Most changes only affect the control word, so they can
usually be done much more efficiently, and amd64 has always done this,
but loading the control word can trap.
This version use the fast method only in the usual case where it will
not trap. This only costs a couple of integer instructions (including
one branch which I haven't optimized carefully yet) in the usual case,
but bloats the inlines a lot. The inlines were already a bit too large
to handle both the FPU and SSE.
- fix a previous style fix: shifts should be in the correct direction even
if they are null.
- restore a comment about namespace pollution from floatingpoint.h 1.12 and
update it.
- remove unused namespace pollution FP_*REG.
- improve some comments.
- sort macro definitions for entry points.
- don't use underscores for macro args.
- fix this to compile with C++ by casting ints to enums in a few places
and by using the correct parameter type for _fpsetprec(). Remove
__cplusplus ifdefs which disabled the buggy code.
- remove __CC_SUPPORTS___INLINE ifdefs. `__inline' vs `inline', and either
of these #defined away, are supposed to be handled by very old ifdefs
in <sys/cdefs.h>. Thus the __CC_SUPPORTS___INLINE macro is not needed
here (or anywhere else that it used). It is less needed here than in
most places, since this file is userland-only and userland is far from
supporting INTEL_COMPILER. The __CC_SUPPORTS___INLINE__ macro which
was used here is even less needed. It is to support spelling `inline'
as `__inline__' instead of the usual spelling `__inline'.
Fix some style bugs that I missed in the previous commit (remove unused
asms and sort more variables).
pv_list_count from struct md_page. Ever since Peter rewrote the pv
entry allocator for amd64 and i386 pv_list_count has been correctly
maintained but otherwise unused.
- Introduce per-architecture stack_machdep.c to hold stack_save(9).
- Introduce per-architecture machine/stack.h to capture any common
definitions required between db_trace.c and stack_machdep.c.
- Add new kernel option "options STACK"; we will build in stack(9) if it is
defined, or also if "options DDB" is defined to provide compatibility
with existing users of stack(9).
Add new stack_save_td(9) function, which allows the capture of a stacktrace
of another thread rather than the current thread, which the existing
stack_save(9) was limited to. It requires that the thread be neither
swapped out nor running, which is the responsibility of the consumer to
enforce.
Update stack(9) man page.
Build tested: amd64, arm, i386, ia64, powerpc, sparc64, sun4v
Runtime tested: amd64 (rwatson), arm (cognet), i386 (rwatson)
- On amd64, just assume type #1 is always used. PCI 2.0 mandated
deprecated type #2 and required type #1 for all future bridges which
was well before amd64 existed.
- For i386, ignore whatever value was in 0xcf8 before testing for type #1
and instead rely on the other tests to determine if type #1 works. Some
newer machines leave garbage in 0xcf8 during boot and as a result the
kernel doesn't find PCI at all (which greatly confuses ACPI which expects
PCI to exist when PCI busses are in the namespace).
MFC after: 3 days
Discussed with: scottl
refactored it to be a generic device.
Instead of being part of the standard kernel, there is now a 'nvram' device
for i386/amd64. It is in DEFAULTS like io and mem, and can be turned off
with 'nodevice nvram'. This matches the previous behavior when it was
first committed.