The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses.
The exceptions are one PDE at the top, used for the trampoline and
per-CPU trampoline stacks, and the system structures that must always be
mapped, namely the IDT, GDT, common TSS and LDT, and the process-private
TSS and LDT if allocated.
By using a 1:1 mapping for the kernel text and data, it became
possible to eliminate the assembler part of locore.S which bootstraps
the initial page table and KPTmap. The code is rewritten in C and moved
into pmap_cold(). The comment in vmparam.h explains the KVA
layout.
There is no PCID mechanism available in protected mode, so each
kernel/user switch, in either direction, completely flushes the TLB,
except for the trampoline PTD region. The TLB invalidations for
userspace become trivial, because IPI handlers switch page tables. On
the other hand, context switches no longer need to reload %cr3.
copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for
the new copyout(9) is compatibility with wiring user buffers around
sysctl handlers. This explains the two kinds of locks for copyout ptes
and the accounting of vslock() calls. The vm_fault_quick_hold() path,
AKA the slow path, is only tried after the 'fast path' has failed. The
fast path temporarily changes the mapping to userspace and copies the
data to/from a small per-CPU buffer in the trampoline. If a page fault
occurs during the copy, it is short-circuited by exception.s so that it
does not even reach C code.
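The fast-path-then-slow-path ordering described above can be modelled in
plain userspace C. This is only a sketch under stated assumptions: the
names fast_copy(), slow_copy(), model_copyout(), BOUNCE_SIZE and the
fault_at parameter are hypothetical and do not appear in the kernel; the
"fault" is simulated by an offset rather than by a real page fault, and
the slow path is a plain memcpy() standing in for a copy through held
pages.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BOUNCE_SIZE	64		/* hypothetical bounce buffer size */
#define NO_FAULT	SIZE_MAX	/* no simulated fault */

enum copy_path { PATH_FAST, PATH_SLOW };

/*
 * Fast path model: move the data in chunks through a small bounce
 * buffer.  If the simulated fault offset falls inside a chunk, give up
 * immediately and report failure, without doing any further work.
 */
static bool
fast_copy(char *dst, const char *src, size_t len, size_t fault_at)
{
	char bounce[BOUNCE_SIZE];
	size_t n, off;

	for (off = 0; off < len; off += n) {
		n = len - off < BOUNCE_SIZE ? len - off : BOUNCE_SIZE;
		if (fault_at >= off && fault_at < off + n)
			return (false);
		memcpy(bounce, src + off, n);
		memcpy(dst + off, bounce, n);
	}
	return (true);
}

/* Slow path model: stands in for copying through held pages. */
static void
slow_copy(char *dst, const char *src, size_t len)
{
	memcpy(dst, src, len);
}

/* Try the fast path first; fall back to the slow path only on failure. */
static enum copy_path
model_copyout(char *dst, const char *src, size_t len, size_t fault_at)
{
	if (fast_copy(dst, src, len, fault_at))
		return (PATH_FAST);
	slow_copy(dst, src, len);
	return (PATH_SLOW);
}
```

In this model the slow path recopies the whole buffer after a failed
fast attempt; whether the real code restarts or resumes is not stated
above and is not implied by the sketch.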
The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done. The i386
architecture already shows sizing problems, in particular, it is
impossible to link clang and lld with debugging. I expect that the
issues due to the virtual address space limits will only get worse,
and the split extends the life of the platform.
Tested by: pho
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D14633
assym is only to be included by other .s files, and should never
actually be assembled by itself.
Reviewed by: imp, bdrewery (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D14180
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability. PTI is turned off by default for now; enable it with the
loader tunable vm.pmap.pti=1.
The pmap page table is split into a kernel-mode table and a user-mode
table. The kernel-mode table is identical to the non-PTI table, while
the user-mode table is obtained from the kernel table by leaving
userspace mappings intact, but only leaving the following parts of the
kernel mapped:
    kernel text (but not modules text)
    PCPU
    GDT/IDT/user LDT/task structures
    IST stacks for NMI and doublefault handlers.
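The derivation of the user-mode table from the kernel-mode table can be
illustrated with a small model. This is a sketch, not pmap code: the
region tags, struct mapping, and both function names are hypothetical,
and real page tables are hierarchical rather than a flat array. It only
shows the filtering rule from the list above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region tags for this model; not kernel identifiers. */
enum region {
	R_USER,		/* userspace mapping */
	R_KTEXT,	/* kernel text */
	R_KMODULE,	/* module text */
	R_PCPU,		/* PCPU */
	R_DESC,		/* GDT/IDT/user LDT/task structures */
	R_IST,		/* IST stacks for NMI and doublefault */
	R_KOTHER	/* any other kernel mapping */
};

struct mapping {
	uint64_t	va;
	enum region	region;
	bool		present;
};

/* True if a mapping of this kind stays visible in the user-mode table. */
static bool
pti_visible_in_user_table(enum region r)
{
	switch (r) {
	case R_USER:
	case R_KTEXT:
	case R_PCPU:
	case R_DESC:
	case R_IST:
		return (true);
	default:
		return (false);
	}
}

/*
 * Build the user-mode table from the kernel-mode table: copy every
 * entry, then clear the present bit of anything that must be hidden.
 * Returns the number of entries still present in the user table.
 */
static size_t
pti_build_user_table(const struct mapping *kern, size_t n,
    struct mapping *user)
{
	size_t i, kept;

	kept = 0;
	for (i = 0; i < n; i++) {
		user[i] = kern[i];
		if (!pti_visible_in_user_table(kern[i].region))
			user[i].present = false;
		if (user[i].present)
			kept++;
	}
	return (kept);
}
```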
The kernel switches to the user page table before returning to
usermode, and restores the full kernel page table on entry. The initial
kernel-mode stack for the PTI trampoline is allocated in PCPU; it is
only 16 qwords. The kernel entry trampoline switches page tables, then
the hardware trap frame is copied to the normal kstack, and execution
continues.
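The frame copy from the trampoline stack to the kstack can be sketched
in C, although the real trampoline is written in assembly. The sketch
assumes the five-qword frame that the CPU pushes in 64-bit mode for an
interrupt without an error code (RIP, CS, RFLAGS, RSP, SS, lowest
address first); exceptions that push an error code carry one extra
qword, which is ignored here. The function name copy_hw_frame() and the
macro names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define TRAMP_STACK_QWORDS	16	/* size stated in the text */
#define HW_FRAME_QWORDS		5	/* rip, cs, rflags, rsp, ss */

/*
 * Copy the hardware frame found at tramp_sp (the stack pointer on the
 * trampoline stack right after the CPU pushed the frame) to the top of
 * the normal kernel stack.  Returns the new stack pointer on the
 * kstack, which points at the copied rip slot.
 */
static uint64_t *
copy_hw_frame(const uint64_t *tramp_sp, uint64_t *kstack_top)
{
	uint64_t *ksp;

	ksp = kstack_top - HW_FRAME_QWORDS;
	memcpy(ksp, tramp_sp, HW_FRAME_QWORDS * sizeof(uint64_t));
	return (ksp);
}
```

With only 16 qwords available, the trampoline stack has room for the
frame and a few scratch values, which is presumably why the frame is
moved to the kstack before any further work is done.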
IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.
On return to usermode, the trampoline is used again: the iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed. The case of iretq faulting due to an invalid usermode
context is tricky, since the frame for the fault is appended to the
trampoline frame. Besides copying the fault frame and the original
(corrupted) frame to the kstack, the fault frame must be patched to make
it look as if the fault occurred on the kstack; see the comment in the
doret_iret detection code in trap().
Currently the kernel pages which are mapped during trampoline operation
are identical for all pmaps. They are registered using
pmap_pti_add_kva(). Besides the initial registrations done during boot,
LDT and non-common TSS segments are registered if the user requested
their use. In principle, they can be installed into the kernel page
table per pmap with some work. Similarly, PCPU can be hidden from the
userspace mapping using a trampoline PCPU page, but again I do not see
much benefit, only added complexity.
PDPE pages for the kernel half of the user page tables are
pre-allocated during boot, because the pml4 entries which are copied to
the top-level paging structure page on new pmap creation must be known
in advance. I enforce this to avoid iterating over all existing pmaps
if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.
The need to flush hidden kernel translations on the switch to user
mode makes global mappings (PG_G) meaningless and even harmful, so PG_G
use is disabled for the PTI case. Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled. PCID can be forced on only for the developer's benefit.
MCE is known to be broken. It requires an IST stack to operate
completely correctly even in the non-PTI case, and absolutely needs a
dedicated IST stack with PTI, because MCE delivery while the trampoline
has not yet switched from the PTI stack is fatal. The fix is pending.
Reviewed by: markj (partially)
Tested by: pho (previous version)
Discussed with: jeff, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Event tasks are pinned to their respective CPU by default, in the same
fashion as they were before.
Unpin the event tasks by setting hw.vmbus.pin_evttask to 0, if certain
CPUs serve a special purpose.
MFC after: 3 days
Sponsored by: Microsoft
For GEN1 Hyper-V, vmbus is attached to pcib0, which contains the
resources for PCI passthrough and SR-IOV. There is no
acpi_syscontainer0 on GEN1 Hyper-V.
For GEN2 Hyper-V, vmbus is attached to acpi_syscontainer0, which
contains the resources for PCI passthrough and SR-IOV. There is
no pcib0 on GEN2 Hyper-V.
The ACPI VMBUS device now only holds its _CRS, which is empty as
of this commit; its existence is mainly for upward compatibility.
Device tree structure is suggested by jhb@.
Tested-by: dexuan@
Collaborated-with: dexuan@
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D10565
This improves gettimeofday performance 6 times, as measured by
tools/tools/syscall_timing
Reviewed by: kib
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8789
This paves the way to implement VDSO for the enlightened time counter.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8768
This makes booting on Hyper-V with a small number of vCPUs work properly.
Reported by: Hongxiong Xian <v-hoxian microsoft com>, Hongjiang Zhang <honzhan microsoft com>
MFC after: 1 week
Sponsored by: Microsoft
Since the hypervisor does not respond to CHOPEN for a revoked channel.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8636
So that the caller can know the channel close error and react accordingly.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8600
So that the callers of vmbus_chan_open_br() could handle the passed in
bufring memory properly.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8569
vmbus_pcib requires NEW_PCIB, but in case that's not defined, we at
least shouldn't break the build.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 3 days
Sponsored by: Microsoft
- Reference count the sub-channel when the channel offer message is
processed, so that an immediate rescind message on the same channel
will not race with the sub-channel open on the driver side.
- Drop the above reference when the sub-channel is closed; this closely
mimics the hypervisor's reaction when the primary channel is closed
on the VM side. No drivers use a sub-channel after the primary channel
is closed.
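The reference counting scheme in the two items above can be sketched
with C11 atomics. This is a model under stated assumptions, not the
vmbus code: struct subchan and the subchan_*() helpers are hypothetical,
the base reference dropped by the rescind path is an assumption of the
model, and "freeing" is represented by a flag so the ordering can be
observed.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sub-channel object for this model. */
struct subchan {
	atomic_int	refs;
	bool		freed;	/* set when the last reference goes away */
};

/* Create the object with one base reference (assumed owner: the bus). */
static void
subchan_init(struct subchan *c)
{
	atomic_init(&c->refs, 1);
	c->freed = false;
}

/* Taken when the channel offer message is processed. */
static void
subchan_hold(struct subchan *c)
{
	atomic_fetch_add(&c->refs, 1);
}

/* Dropped on sub-channel close and, in this model, on rescind. */
static void
subchan_rele(struct subchan *c)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&c->refs, 1) == 1)
		c->freed = true;
}
```

Because the offer path holds its own reference, a rescind that arrives
before the driver finishes opening the sub-channel only drops the count
to one; the object goes away when the close path drops the last
reference.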
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8546
Drivers can now use vmbus_chan_{is_revoked,set_orphan,unset_orphan}() and
vmbus_xact_ctx_orphan() to fix their attach/detach DEVMETHODs for revoked
primary channels.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8545
It will be used to fix the primary channel revocation support.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8525
These functions are only used by management code, so there is
no need to introduce extra complexity.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8524
This will be used to fix device detach DEVMETHOD for revoked primary
channel.
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8522
We'd better add this dependency explicitly, though usually the pci
driver is built into the kernel by default.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 1 week
Sponsored by: Microsoft
The new methods will be used by the coming pcib driver.
Reviewed by: sephe
Approved by: sephe (mentor)
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8409
vcpu_id is the host's representation of a guest CPU.
We get the mapping between vcpu_id and the FreeBSD kernel's cpu id when
the VMBus driver is loaded. Later, when a driver, like the coming pcib
driver, talks to the host and needs to refer to a guest CPU, the driver
must use the vcpu_id.
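The mapping described above amounts to a per-CPU lookup table that is
filled once and consulted later. A minimal sketch follows; struct
vcpu_map, MODEL_MAXCPU and both functions are hypothetical names for
illustration and are not the VMBus driver's interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAXCPU	8	/* hypothetical limit for this model */

/* Table from kernel cpu id to the host's vcpu_id. */
struct vcpu_map {
	uint32_t	vcpu_id[MODEL_MAXCPU];
	bool		valid[MODEL_MAXCPU];
};

/* Record the vcpu_id for a cpu; done once, when the mapping is learned. */
static int
vcpu_map_set(struct vcpu_map *m, int cpu, uint32_t vcpu_id)
{
	if (cpu < 0 || cpu >= MODEL_MAXCPU)
		return (-1);
	m->vcpu_id[cpu] = vcpu_id;
	m->valid[cpu] = true;
	return (0);
}

/*
 * Translate a kernel cpu id into the vcpu_id a driver must use when
 * talking to the host.  Returns 0 on success, -1 if the cpu is unknown.
 */
static int
vcpu_map_lookup(const struct vcpu_map *m, int cpu, uint32_t *vcpu_id)
{
	if (cpu < 0 || cpu >= MODEL_MAXCPU || !m->valid[cpu])
		return (-1);
	*vcpu_id = m->vcpu_id[cpu];
	return (0);
}
```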
Reviewed by: jhb, sephe
Approved by: sephe (mentor)
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D8410