Testing "iret" tracing on AMD/SVM

Apply the following patch to get the 'sysctl debug.nit' node:
https://people.freebsd.org/~neel/bhyve/tests/nmi_iret_testing.patch

Single step guest kernel and make sure all instructions are traced properly:
----------------------------------------------------------------------------
1. Boot up a FreeBSD 64-bit guest with more than one vcpu.

2. In the guest set 'sysctl machdep.kdb_on_nmi=1'. This will cause the guest
   to enter the debugger when it receives an NMI.

3. When the guest is idle execute the following on the host:
     $ bhyvectl --vm=vm2 --assert-lapic-lvt=1 --cpu=0

   This will inject an NMI into vcpu 0 which will then cause the guest to
   enter the debugger. The debugger will typically stop at the instruction
   following 'hlt' in cpu_idle_acpi().

   Single step the guest using 's' and verify that no instructions are
   omitted.

4. This test will not work with a broadcast NMI (--cpu=-1). This is a
   drawback of the FreeBSD guest because all vcpus try to enter the debugger
   and deadlock waiting for the other vcpus to stop.

Pending interrupt during iret tracing:
--------------------------------------
1. Boot up a FreeBSD 64-bit guest. Recommend using a single vcpu and
   disabling pause exits to de-clutter the KTR log file.

2. Set 'sysctl machdep.kdb_on_nmi=0' in the guest.

3. Set 'sysctl debug.nit.delay_usecs=800000' in the host.
   The guest NMI handler reads i/o port 0x61 and the sysctl above will
   delay it for 800 msec. This should be enough time for the guest timer
   interrupt to fire. This interrupt will be held pending until the NMI
   handler issues an 'iret' to re-enable interrupts.

4. Inject an NMI and capture a trace by executing the following on the host:
     sudo sysctl debug.ktr.clear=1
     sudo bhyvectl --vm=vm2 --cpu=0 --assert-lapic-lvt=1
     sleep 1
     sudo ktrdump -cto /tmp/ktrdump.out

5. Open /tmp/ktrdump.out and look for the following:
   - There should be a pending interrupt after "Delaying 800000 usecs"
   - This interrupt should be delivered only after "vNMIs unblocked precisely"

Pending NMI during iret tracing:
--------------------------------
1. Boot up a FreeBSD 64-bit guest. Recommend using a single vcpu and
   disabling pause exits to de-clutter the KTR log file.

2. Set 'sysctl machdep.kdb_on_nmi=0' in the guest.

3. Set 'sysctl debug.nit.recursive_nmi=1' in the host.
   The guest NMI handler reads i/o port 0x61 and the sysctl above will
   inject an NMI in the i/o handler. This NMI will be held pending until
   the previous NMI handler issues an 'iret'.

4. Execute the following on the host:
     sudo bhyvectl --vm=vm2 --get-stats | grep "vNMI unblocked precisely"

5. Inject an NMI and capture a trace by executing the following on the host:
     sudo sysctl debug.ktr.clear=1
     sudo bhyvectl --vm=vm2 --cpu=0 --assert-lapic-lvt=1
     sleep 1
     sudo ktrdump -cto /tmp/ktrdump.out

6. Execute the following on the host:
     sudo bhyvectl --vm=vm2 --get-stats | grep "vNMI unblocked precisely"

   The difference between the counter values in (6) and (4) should be '2'
   indicating that two NMIs were injected into the guest.

7. Open /tmp/ktrdump.out and look for the following:
   - There should be a pending NMI after "Injecting recursive vNMI"
   - This NMI should be delivered only after "vNMIs unblocked precisely"

Nested Page Fault during iret tracing
-------------------------------------
1. Boot up a FreeBSD 64-bit guest. Recommend using a single vcpu and
   disabling pause exits to de-clutter the KTR log file.

2. Set 'sysctl machdep.kdb_on_nmi=0' in the guest.

3. Set 'sysctl debug.nit.invalidate_nested_mappings=1' on the host.
   This will invalidate all nested page table entries before entering
   the guest with IRET tracing enabled.

4. Inject an NMI and capture a trace by executing the following on the host:
     sudo sysctl debug.ktr.clear=1
     sudo bhyvectl --vm=vm2 --cpu=0 --assert-lapic-lvt=1
     sleep 1
     sudo ktrdump -cto /tmp/ktrdump.out

5. In /tmp/ktrdump.out look for the following:
   - "Removing all nested pages" should appear after
     "vNMI iret tracing enabled"
   - There should be a number of nested page fault VM exits:
     - All these VM exits occur at the %rip of the guest NMI handler's "iret"
   - This should be followed by "vNMIs unblocked precisely"

6. To test behavior when the hypervisor returns to userspace during iret
   tracing also set 'sysctl debug.nit.bounce_to_userspace=1' in step (3)
   - This can be verified by "Bouncing to userspace" in /tmp/ktrdump.out

#GP exception triggered by "iret"
---------------------------------
1. Boot up a FreeBSD 64-bit guest. Recommend using a single vcpu and
   disabling pause exits to de-clutter the KTR log file.

2. Set 'sysctl machdep.kdb_on_nmi=0' in the guest.

3. Set 'sysctl debug.nit.iret_trigger_gpf=1' on the host.
   This will change the return address in the NMI trapframe to a
   non-canonical address. When the "iret" executes this will trigger a
   #GP in the guest which is intercepted by the hypervisor to clear
   NMI blocking.

4. Inject an NMI and capture a trace by executing the following on the host:
     sudo sysctl debug.ktr.clear=1
     sudo bhyvectl --vm=vm2 --cpu=0 --assert-lapic-lvt=1
     sudo ktrdump -cto /tmp/ktrdump.out

5. In /tmp/ktrdump.out look for:
     "updated to non-canonical value 0xdeadbeefbeefdead"
     "vNMIs unblocked precisely"
     "Reflecting exception 13/0 into the guest"

6. If the guest is idle then the #GP will be encountered in kernel mode
   and cause it to enter the debugger.

7. By executing a 'while [ 1 ]; do :; done' script at the guest console
   it is possible to increase the likelihood of the #GP being encountered
   in user mode. In this case the running program will exit with a SIGBUS.
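
The check in step 5 of the #GP test can be scripted. The following is a
minimal sketch, not part of the test patch: the function name
check_gpf_trace is hypothetical, and it only assumes that the three
messages listed in step 5 appear verbatim in the ktrdump output.

```shell
#!/bin/sh
# Hypothetical helper: report which of the expected #GP test messages
# are missing from a ktrdump capture.
# Usage: check_gpf_trace [file]   (defaults to /tmp/ktrdump.out)
check_gpf_trace() {
    trace=${1:-/tmp/ktrdump.out}
    missing=0
    for msg in \
        "updated to non-canonical value 0xdeadbeefbeefdead" \
        "vNMIs unblocked precisely" \
        "Reflecting exception 13/0 into the guest"
    do
        # -F: match a fixed string, -q: quiet, only the exit status is used.
        if ! grep -Fq -- "$msg" "$trace"; then
            echo "missing: $msg"
            missing=1
        fi
    done
    # Exit status 0 when every message was found, 1 otherwise.
    return "$missing"
}
```

Running 'check_gpf_trace' after step 4 prints one "missing:" line per
absent message and returns non-zero if any of them is absent. It does not
check the order in which the messages appear.
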
Miscellaneous
-------------
- Verify that tunable 'hw.vmm.trace_guest_exceptions=1' works as expected
- Set breakpoint in the guest and make sure the debugger works properly
- Inject NMI handled using an interrupt gate
  - Boot an i386 guest and load the tsnmi.ko kernel module
    https://people.freebsd.org/~neel/bhyve/tests/task_switch_nmi.tar.gz
  - Inject an NMI into the guest
  - Verify that debug.tsnmi is incremented.
  - Verify that the bhyve stat "Number of times vNMI unblocked speculatively"
    is incremented.

Combining various test knobs
----------------------------
It is possible to combine the following knobs in any way:
- debug.nit.invalidate_nested_mappings
- debug.nit.bounce_to_userspace
- debug.nit.delay_usecs
- debug.nit.recursive_nmi

Note that 'bounce_to_userspace' only takes effect if
'invalidate_nested_mappings' is set to '1'.

It is possible to combine 'debug.nit.iret_trigger_gpf=1' with all other
knobs except 'debug.nit.recursive_nmi=1'.

This is because FreeBSD uses the 64-bit TSS interrupt stack table to switch
stacks for the NMI handler to execute. When the #GP is triggered it unblocks
NMI and the second NMI is injected into the guest. Note that the NMI stack
still has the hardware stack frames for both the first NMI and the #GP.
The second NMI will use the same stack switching mechanism as the first NMI
and overwrite the stack frames for the first NMI and the #GP.

Implementation details:
-----------------------
When the hypervisor sets RFLAGS.TF and resumes the guest:
- Also set RFLAGS.RF to ensure that the instruction is not subject to any
  instruction breakpoint fault.

- How to handle an exception triggered by "iret"?
  - Restore RFLAGS.TF and RFLAGS.RF
  - Reinject the exception into the guest.
  - Clear NMI blocking
    - From Intel SDM 6.7.1: an execution of the IRET instruction unblocks
      NMIs even if the instruction causes a fault.

- How to handle #DB that occurs when guest successfully executes "iret"?
  - Don't restore RFLAGS.TF and RFLAGS.RF
    - guest has already restored RFLAGS from the NMI stack
  - Don't reinject exception into the guest
  - Clear NMI blocking.
  - KASSERT(GUEST_DR6.BS == 1)
    - APMv2 15.12.2 DR6 and DR7 are updated before the #DB intercept
    - XXX restore DR6.BS

Documentation:
--------------
APMv2 section "Single Stepping"
RFLAGS.TF = 1 will cause #DB after every instruction is executed.
- The instruction that sets the TF bit and the instruction that follows it
  are not single stepped.
- TF is cleared before entering the #DB handler so the debug handler itself
  is not single stepped.
- The RFLAGS on the stack have the TF bit set so single stepping resumes
  when the IRET pops the saved value into RFLAGS.
- The processor also sets DR6.BS = 1 to indicate that the #DB exception
  occurred due to single stepping.
- Single step #DB has higher priority than external interrupts. Control is
  transferred to the #DB handler first causing RFLAGS.TF to be cleared.
  The processor then transfers control to the pending-interrupt handler.
- INTn, INT3 and INTO clear the TF when they are executed.
  - This means that the INTn, INT3 and INTO exception handlers are executed
    with RFLAGS.TF=0.
  - However, RFLAGS.TF=1 on the exception stack so when the handler returns
    it will resume single step mode.
  - Intel SDM Vol 2 "Int n/INTO/INT 3" pseudo-code.

Resume Flag:
RFLAGS.RF is used to prevent an instruction breakpoint from generating a #DB.
The primary use is to prevent the processor from going into a debug
exception loop on an instruction breakpoint.

APMv3 13.1.3:
Instruction breakpoints and general-detect conditions have lower interrupt
priority than the other breakpoint and single-stepping conditions.

Thus a single-step breakpoint on the most recently executed instruction will
happen before the instruction-breakpoint on the next instruction.

APMv3 13.1.1.3:
Single-step mode has the highest priority among debug exceptions.
Thus the single step breakpoint will have higher priority than data or i/o
breakpoints.

APMv3 13.1.3.1
The processor ignores the instruction-breakpoint condition if RFLAGS.RF = 1.
The RFLAGS.RF is automatically cleared by the processor when an instruction
is executed. The exception is when RFLAGS.RF is set to 1 by an IRET in which
case the flag is not cleared to 0 until the next instruction is executed.

The Resume Flag is used when restarting an instruction that triggered an
instruction breakpoint.

APMv3 15.5.1 "VMRUN and TF/RF Bits in EFLAGS"
- EFLAGS.RF=1 takes effect on the first guest instruction executed after
  VMRUN and suppresses an instruction breakpoint (if any) on the first
  instruction.
- EFLAGS.TF=1 takes effect after the completion of the first guest
  instruction.

APMv3 13.1.3.2, 13.1.3.3
Data and I/O breakpoints happen after the instruction that triggered the
breakpoint is executed. The instruction pointer pushed on the #DB exception
stack points to the next instruction.

APMv3 13.1.3.4
If TSS.T=1 then the processor completes loading the new task state and the
#DB exception occurs before the first instruction is executed.

The processor does not clear the TSS.T bit automatically when the #DB
exception occurs. Software must clear this explicitly to disable the task
breakpoint.
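
When reading register values out of a trace it can help to decode the flag
bits discussed in this section by hand. The following is a small sketch for
that purpose; the helper names rflags_decode and dr6_single_step are
hypothetical and not part of bhyve. The bit positions are architectural:
TF is bit 8 and RF is bit 16 of RFLAGS, and BS is bit 14 of DR6.

```shell
#!/bin/sh
# Architectural bit masks.
RFLAGS_TF=$((1 << 8))     # RFLAGS.TF: trap flag (single step)
RFLAGS_RF=$((1 << 16))    # RFLAGS.RF: resume flag
DR6_BS=$((1 << 14))       # DR6.BS: #DB was caused by single stepping

# Hypothetical helper: print the TF and RF state of an RFLAGS value.
# Accepts decimal or 0x-prefixed hexadecimal input.
rflags_decode() {
    val=$(($1))
    tf=$(( (val & RFLAGS_TF) != 0 ))
    rf=$(( (val & RFLAGS_RF) != 0 ))
    echo "TF=$tf RF=$rf"
}

# Hypothetical helper: succeed (exit status 0) if DR6.BS is set.
dr6_single_step() {
    [ $(( $1 & DR6_BS )) -ne 0 ]
}
```

For example, 'rflags_decode 0x10102' reports both flags set, which is the
state the hypervisor establishes before resuming the guest for iret tracing.
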