If userspace has a newer bhyve than the kernel, it may be able to decode
and emulate some instructions vmm.ko is unaware of. In this scenario,
reset decoder state and try again.
Reviewed by: grehan
Differential Revision: https://reviews.freebsd.org/D24464
Save and restore (also known as suspend and resume) permits a snapshot
to be taken of a guest's state that can later be resumed. In the
current implementation, bhyve(8) creates a UNIX domain socket that is
used by bhyvectl(8) to send a request to save a snapshot (and
optionally exit after the snapshot has been taken). A snapshot
currently consists of two files: the first holds a copy of guest RAM,
and the second file holds other guest state such as vCPU register
values and device model state.
To resume a guest, bhyve(8) must be started with a matching pair of
command line arguments to instantiate the same set of device models as
well as a pointer to the saved snapshot.
While the current implementation is useful for several uses cases, it
has a few limitations. The file format for saving the guest state is
tied to the ABI of internal bhyve structures and is not
self-describing (in that it does not communicate the set of device
models present in the system). In addition, the state saved for some
device models closely matches the internal data structures which might
prove a challenge for compatibility of snapshot files across a range
of bhyve versions. The file format also does not currently support
versioning of individual chunks of state. As a result, the current
file format is not a fixed binary format and future revisions to save
and restore will break binary compatiblity of snapshot files. The
goal is to move to a more flexible format that adds versioning,
etc. and at that point to commit to providing a reasonable level of
compatibility. As a result, the current implementation is not enabled
by default. It can be enabled via the WITH_BHYVE_SNAPSHOT=yes option
for userland builds, and the kernel option BHYVE_SHAPSHOT.
Submitted by: Mihai Tiganus, Flavius Anton, Darius Mihai
Submitted by: Elena Mihailescu, Mihai Carabas, Sergiu Weisz
Relnotes: yes
Sponsored by: University Politehnica of Bucharest
Sponsored by: Matthew Grooms (student scholarships)
Sponsored by: iXsystems
Differential Revision: https://reviews.freebsd.org/D19495
The static assertions were added (with size and offsets from gdb) and verified
with a build prior to marking the holes explicitly.
This is in preparation for a subsequent revision, pending in phabricator, that
makes use of some of these unused bits without impacting the ABI.
Reviewed by: grehan
Differential Revision: https://reviews.freebsd.org/D24461
- Allow the userland hypervisor to intercept breakpoint exceptions
(BP#) in the guest. A new capability (VM_CAP_BPT_EXIT) is used to
enable this feature. These exceptions are reported to userland via
a new VM_EXITCODE_BPT that includes the length of the original
breakpoint instruction. If userland wishes to pass the exception
through to the guest, it must be explicitly re-injected via
vm_inject_exception().
- Export VMCS_ENTRY_INST_LENGTH as a VM_REG_GUEST_ENTRY_INST_LENGTH
pseudo-register. Injecting a BP# on Intel requires setting this to
the length of the breakpoint instruction. AMD SVM currently ignores
writes to this register (but reports success) and fails to read it.
- Rework the per-vCPU state tracked by the debug server. Rather than
a single 'stepping_vcpu' global, add a structure for each vCPU that
tracks state about that vCPU ('stepping', 'stepped', and
'hit_swbreak'). A global 'stopped_vcpu' tracks which vCPU is
currently reporting an event. Event handlers for MTRAP and
breakpoint exits loop until the associated event is reported to the
debugger.
Breakpoint events are discarded if the breakpoint is not present
when a vCPU resumes in the breakpoint handler to retry submitting
the breakpoint event.
- Maintain a linked-list of active breakpoints in response to the GDB
'Z0' and 'z0' packets.
Reviewed by: markj (earlier version)
MFC after: 2 months
Differential Revision: https://reviews.freebsd.org/D20309
Bhyve's vmm is a self-contained modern component and thus a good
candidate for use of C99 types.
Reviewed by: jhb, kib, markj, Patrick Mooney
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21036
hard-coded value. Don't allocate space for it from the kernel stack.
Account for prefix, suffix, and separator space in the name. This
takes the effective length up to 229 bytes on 13-current, and 37 bytes
on 12-stable. 37 bytes is enough to hold a full GUID string.
PR: 234134
MFC after: 1 week
Differential Revision: http://reviews.freebsd.org/D20924
Replace most VM_MAXCPU constant useses with an accessor function to
vm->maxcpus which for now is initialized and kept at the value of
VM_MAXCPUS.
This is a rework of Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
work from D10070 to adjust it for the cpu topology changes that
occured in r332298
Submitted by: Fabian Freyer (fabian.freyer_physik.tu-berlin.de)
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: bde (mentor), jhb (maintainer)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D18755
undefined instruction exception. Previously we would exit the guest,
however an unprivileged user could execute these.
Found with: syzkaller
Reviewed by: araujo, tychon (previous version)
Approved by: re (kib)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17192
Currently, when using dd(1) to take a VM memory image, the capture never ends,
reading zeroes when it's beyond VM system memory max address.
Return EFAULT when trying to read beyond VM system memory max address.
Reviewed by: imp, grehan, anish
Approved by: grehan
Differential Revision: https://reviews.freebsd.org/D15156
from userland without the need to use sysctls, it allows the old
sysctls to continue to function, but deprecates them at
FreeBSD_version 1200060 (Relnotes for deprecate).
The command line of bhyve is maintained in a backwards compatible way.
The API of libvmmapi is maintained in a backwards compatible way.
The sysctl's are maintained in a backwards compatible way.
Added command option looks like:
bhyve -c [[cpus=]n][,sockets=n][,cores=n][,threads=n][,maxcpus=n]
The optional parts can be specified in any order, but only a single
integer invokes the backwards compatible parse. [,maxcpus=n] is
hidden by #ifdef until kernel support is added, though the api
is put in place.
bhyvectl --get-cpu-topology option added.
Reviewed by: grehan (maintainer, earlier version),
Reviewed by: bcr (manpages)
Approved by: bde (mentor), phk (mentor)
Tested by: Oleg Ginzburg <olevole@olevole.ru> (cbsd)
MFC after: 1 week
Relnotes: Y
Differential Revision: https://reviews.freebsd.org/D9930
This is used as part of implementing run control in bhyve's debug
server. The hypervisor now maintains a set of "debugged" CPUs.
Attempting to run a debugged CPU will fail to execute any guest
instructions and will instead report a VM_EXITCODE_DEBUG exit to
the userland hypervisor. Virtual CPUs are placed into the debugged
state via vm_suspend_cpu() (implemented via a new VM_SUSPEND_CPU ioctl).
Virtual CPUs can be resumed via vm_resume_cpu() (VM_RESUME_CPU ioctl).
The debug server suspends virtual CPUs when it wishes them to stop
executing in the guest (for example, when a debugger attaches to the
server). The debug server can choose to resume only a subset of CPUs
(for example, when single stepping) or it can choose to resume all
CPUs. The debug server must explicitly mark a CPU as resumed via
vm_resume_cpu() before the virtual CPU will successfully execute any
guest instructions.
Reviewed by: avg, grehan
Tested on: Intel (jhb), AMD (avg)
Differential Revision: https://reviews.freebsd.org/D14466
Currently most of the debug registers are not saved and restored
during VM transitions allowing guest and host debug register values to
leak into the opposite context. One result is that hardware
watchpoints do not work reliably within a guest under VT-x.
Due to differences in SVM and VT-x, slightly different approaches are
used.
For VT-x:
- Enable debug register save/restore for VM entry/exit in the VMCS for
DR7 and MSR_DEBUGCTL.
- Explicitly save DR0-3,6 of the guest.
- Explicitly save DR0-3,6-7, MSR_DEBUGCTL, and the trap flag from
%rflags for the host. Note that because DR6 is "software" managed
and not stored in the VMCS a kernel debugger which single steps
through VM entry could corrupt the guest DR6 (since a single step
trap taken after loading the guest DR6 could alter the DR6
register). To avoid this, explicitly disable single-stepping via
the trace flag before loading the guest DR6. A determined debugger
could still defeat this by setting a breakpoint after the guest DR6
was loaded and then single-stepping.
For SVM:
- Enable debug register caching in the VMCB for DR6/DR7.
- Explicitly save DR0-3 of the guest.
- Explicitly save DR0-3,6-7, and MSR_DEBUGCTL for the host. Since SVM
saves the guest DR6 in the VMCB, the race with single-stepping
described for VT-x does not exist.
For both platforms, expose all of the guest DRx values via --get-drX
and --set-drX flags to bhyvectl.
Discussed with: avg, grehan
Tested by: avg (SVM), myself (VT-x)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D13229
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
devmem is used to represent MMIO devices like the boot ROM or a VESA framebuffer
where doing a trap-and-emulate for every access is impractical. devmem is a
hybrid of system memory (sysmem) and emulated device models.
devmem is mapped in the guest address space via nested page tables similar
to sysmem. However the address range where devmem is mapped may be changed
by the guest at runtime (e.g. by reprogramming a PCI BAR). Also devmem is
usually mapped RO or RW as compared to RWX mappings for sysmem.
Each devmem segment is named (e.g. "bootrom") and this name is used to
create a device node for the devmem segment (e.g. /dev/vmm/testvm.bootrom).
The device node supports mmap(2) and this decouples the host mapping of
devmem from its mapping in the guest address space (which can change).
Reviewed by: tychon
Discussed with: grehan
Differential Revision: https://reviews.freebsd.org/D2762
MFC after: 4 weeks
"sleeping" state. This is done by forcing the vcpu to transition to "idle"
by returning to userspace with an exit code of VM_EXITCODE_REQIDLE.
MFC after: 2 weeks
This is done explicitly because a vcpu thread can be in a critical section
for the entire time slice alloted to it. This in turn can delay the handling
of the 'td_owepreempt'.
Reviewed by: jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D2430
Prior to this change both functions returned 0 for success, -1 for failure
and +1 to indicate that an exception was injected into the guest.
The numerical value of ERESTART also happens to be -1 so when these functions
returned -1 it had to be translated to a positive errno value to prevent the
VM_RUN ioctl from being inadvertently restarted. This made it easy to introduce
bugs when writing emulation code.
Fix this by adding an 'int *guest_fault' parameter and setting it to '1' if
an exception was delivered to the guest. The return value is 0 or EFAULT so
no additional translation is needed.
Reviewed by: tychon
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D2428
code segment base address.
Also if an instruction doesn't support a mod R/M (modRM) byte, don't
be concerned if the CPU is in real mode.
Reviewed by: neel
These instructions are emitted by 'bus_space_read_region()' when accessing
MMIO regions.
Since MOVS can be used with a repeat prefix start decoding the REPZ and
REPNZ prefixes. Also start decoding the segment override prefix since MOVS
allows overriding the source operand segment register.
Tested by: tychon
MFC after: 1 week
VM_INJECT_EXCEPTION ioctl. However it morphed into other uses like keeping
track pending exceptions for a vcpu. This in turn causes confusion because
some fields in 'struct vm_exception' like 'vcpuid' make sense only in the
ioctl context. It also makes it harder to add or remove structure fields.
Fix this by using 'struct vm_exception' only to communicate information
from userspace to vmm.ko when injecting an exception.
Also, add a field 'restart_instruction' to 'struct vm_exception'. This
field is set to '1' for exceptions where the faulting instruction is
restarted after the exception is handled.
MFC after: 1 week
The new RTC emulation supports all interrupt modes: periodic, update ended
and alarm. It is also capable of maintaining the date/time and NVRAM contents
across virtual machine reset. Also, the date/time fields can now be modified
by the guest.
Since bhyve now emulates both the PIT and the RTC there is no need for
"Legacy Replacement Routing" in the HPET so get rid of it.
The RTC device state can be inspected via bhyvectl as follows:
bhyvectl --vm=vm --get-rtc-time
bhyvectl --vm=vm --set-rtc-time=<unix_time_secs>
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --get-rtc-nvram
bhyvectl --vm=vm --rtc-nvram-offset=<offset> --set-rtc-nvram=<value>
Reviewed by: tychon
Discussed with: grehan
Differential Revision: https://reviews.freebsd.org/D1385
MFC after: 2 weeks
"hw.vmm.trace_guest_exceptions". To enable this feature set the tunable
to "1" before loading vmm.ko.
Tracing the guest exceptions can be useful when debugging guest triple faults.
Note that there is a performance impact when exception tracing is enabled
since every exception will now trigger a VM-exit.
Also, handle machine check exceptions that happen during guest execution
by vectoring to the host's machine check handler via "int $18".
Discussed with: grehan
MFC after: 2 weeks
This reduces variability during timer calibration by keeping the emulation
"close" to the guest. Additionally having all timer emulations in the kernel
will ease the transition to a per-VM clock source (as opposed to using the
host's uptime keep track of time).
Discussed with: grehan
The hypervisor hides the MONITOR/MWAIT capability by unconditionally setting
CPUID.01H:ECX[3] to 0 so the guest should not expect these instructions to
be present anyways.
Discussed with: grehan
code. There are only a handful of MSRs common between the two so there isn't
too much duplicate functionality.
The VT-x code has the following types of MSRs:
- MSRs that are unconditionally saved/restored on every guest/host context
switch (e.g., MSR_GSBASE).
- MSRs that are restored to guest values on entry to vmx_run() and saved
before returning. This is an optimization for MSRs that are not used in
host kernel context (e.g., MSR_KGSBASE).
- MSRs that are emulated and every access by the guest causes a trap into
the hypervisor (e.g., MSR_IA32_MISC_ENABLE).
Reviewed by: grehan
VM-exit and ultimately on whether nRIP is valid. This allows us to update
the %rip after the emulation is finished so any exceptions triggered during
the emulation will point to the right instruction.
Don't attempt to handle INS/OUTS VM-exits unless the DecodeAssist capability
is available. The effective segment field in EXITINFO1 is not valid without
this capability.
Add VM_EXITCODE_SVM to flag SVM VM-exits that cannot be handled. Provide the
VMCB fields exitinfo1 and exitinfo2 as collateral to help with debugging.
Provide a SVM VM-exit handler to dump the exitcode, exitinfo1 and exitinfo2
fields in bhyve(8).
Reviewed by: Anish Gupta (akgupt3@gmail.com)
Reviewed by: grehan
by explicitly moving it out of the interrupt shadow. The hypervisor is done
"executing" the HLT and by definition this moves the vcpu out of the
1-instruction interrupt shadow.
Prior to this change the interrupt would be held pending because the VMCS
guest-interruptibility-state would indicate that "blocking by STI" was in
effect. This resulted in an unnecessary round trip into the guest before
the pending interrupt could be injected.
Reviewed by: grehan
forever in vm_handle_hlt().
This is usually not an issue as long as one of the other vcpus properly resets
or powers off the virtual machine. However, if the bhyve(8) process is killed
with a signal the halted vcpu cannot be woken up because it's sleep cannot be
interrupted.
Fix this by waking up periodically and returning from vm_handle_hlt() if
TDF_ASTPENDING is set.
Reported by: Leon Dang
Sponsored by: Nahanni Systems
The faulting instruction needs to be restarted when the exception handler
is done handling the fault. bhyve now does this correctly by setting
'vmexit[vcpu].inst_length' to zero so the %rip is not advanced.
A minor complication is that the fault injection APIs are used by instruction
emulation code that is shared by vmm.ko and bhyve. Thus the argument that
refers to 'struct vm *' in kernel or 'struct vmctx *' in userspace needs to
be loosely typed as a 'void *'.
A nested exception condition arises when a second exception is triggered while
delivering the first exception. Most nested exceptions can be handled serially
but some are converted into a double fault. If an exception is generated during
delivery of a double fault then the virtual machine shuts down as a result of
a triple fault.
vm_exit_intinfo() is used to record that a VM-exit happened while an event was
being delivered through the IDT. If an exception is triggered while handling
the VM-exit it will be treated like a nested exception.
vm_entry_intinfo() is used by processor-specific code to get the event to be
injected into the guest on the next VM-entry. This function is responsible for
deciding the disposition of nested exceptions.
instruction emulation [1].
Fix bug in emulation of opcode 0x8A where the destination is a legacy high
byte register and the guest vcpu is in 32-bit mode. Prior to this change
instead of modifying %ah, %bh, %ch or %dh the emulation would end up
modifying %spl, %bpl, %sil or %dil instead.
Add support for moffsets by treating it as a 2, 4 or 8 byte immediate value
during instruction decoding.
Fix bug in verify_gla() where the linear address computed after decoding
the instruction was not being truncated to the effective address size [2].
Tested by: Leon Dang [1]
Reported by: Peter Grehan [2]
Sponsored by: Nahanni Systems
it implicitly in vmm.ko.
Add ioctl VM_GET_CPUS to get the current set of 'active' and 'suspended' cpus
and display them via /usr/sbin/bhyvectl using the "--get-active-cpus" and
"--get-suspended-cpus" options.
This is in preparation for being able to reset virtual machine state without
having to destroy and recreate it.