Testing "iret" tracing on AMD/SVM

Apply the following patch to get the 'sysctl debug.nit' node:
https://people.freebsd.org/~neel/bhyve/tests/nmi_iret_testing.patch

Single step guest kernel and make sure all instructions are traced properly:
----------------------------------------------------------------------------

1. Boot up a FreeBSD 64-bit guest with more than one vcpu.

2. In the guest set 'sysctl machdep.kdb_on_nmi=1'. This will cause the guest
   to enter the debugger when it receives an NMI.

3. When the guest is idle execute the following on the host:
   $ bhyvectl --vm=vm2 --assert-lapic-lvt=1 --cpu=0

   This will inject an NMI into vcpu 0 which will then cause the guest to
   enter the debugger. The debugger will typically stop at the instruction
   following 'hlt' in cpu_idle_acpi().

   Single step the guest using 's' and verify that no instructions are
   omitted.

4. This test will not work with a broadcast NMI (--cpu=-1). This is
   a drawback of the FreeBSD guest because all vcpus try to enter the
   debugger and deadlock waiting for the other vcpus to stop.

Pending interrupt during iret tracing:
--------------------------------------

1. Boot up a FreeBSD 64-bit guest.
   Recommend using a single vcpu and disabling pause exits to de-clutter the
   KTR log file.

2. Set 'sysctl machdep.kdb_on_nmi=0' in the guest.

3. Set 'sysctl debug.nit.delay_usecs=800000' on the host.
   The guest NMI handler reads i/o port 0x61 and the sysctl above will delay
   that read for 800 msec. This should be enough time for the guest timer
   interrupt to fire. This interrupt will be held pending until the NMI
   handler issues an 'iret' to re-enable interrupts.

4. Inject an NMI and capture a trace by executing the following on the host:
	sudo sysctl debug.ktr.clear=1
	sudo bhyvectl --vm=vm2 --cpu=0 --assert-lapic-lvt=1
	sleep 1
	sudo ktrdump -cto /tmp/ktrdump.out

5. Open /tmp/ktrdump.out and look for the following:
   - There should be a pending interrupt after "Delaying 800000 usecs"
   - This interrupt should be delivered only after "vNMIs unblocked precisely"

Pending NMI during iret tracing:
--------------------------------

1. Boot up a FreeBSD 64-bit guest.
   Recommend using a single vcpu and disabling pause exits to de-clutter the
   KTR log file.

2. Set 'sysctl machdep.kdb_on_nmi=0' in the guest.

3. Set 'sysctl debug.nit.recursive_nmi=1' on the host.
   The guest NMI handler reads i/o port 0x61 and the sysctl above will inject
   an NMI in the i/o handler. This NMI will be held pending until the previous
   NMI handler issues an 'iret'.

4. Execute the following on the host:
   sudo bhyvectl --vm=vm2 --get-stats | grep "vNMI unblocked precisely"

5. Inject an NMI and capture a trace by executing the following on the host:
	sudo sysctl debug.ktr.clear=1
	sudo bhyvectl --vm=vm2 --cpu=0 --assert-lapic-lvt=1
	sleep 1
	sudo ktrdump -cto /tmp/ktrdump.out

6. Execute the following on the host:
   sudo bhyvectl --vm=vm2 --get-stats | grep "vNMI unblocked precisely"

   The difference between the counter values in (6) and (4) should be '2'
   indicating that two NMIs were injected into the guest.

7. Open /tmp/ktrdump.out and look for the following:
   - There should be a pending NMI after "Injecting recursive vNMI"
   - This NMI should be delivered only after "vNMIs unblocked precisely"
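
The counter comparison between steps (4) and (6) can be scripted. The sketch
below assumes that each line of 'bhyvectl --get-stats' output ends with the
counter value; 'stat_value' is a hypothetical helper name.

```sh
# stat_value NAME
# Reads 'bhyvectl --get-stats' output on stdin and prints the last
# whitespace-separated field of the first line containing NAME.
# Assumption: the counter value is the last field on its line.
stat_value() {
    awk -v name="$1" 'index($0, name) { print $NF; exit }'
}

# Example (run on the host):
#   before=$(sudo bhyvectl --vm=vm2 --get-stats |
#       stat_value "vNMI unblocked precisely")
#   ... inject the NMI and capture the trace as in step 5 ...
#   after=$(sudo bhyvectl --vm=vm2 --get-stats |
#       stat_value "vNMI unblocked precisely")
#   echo "delta: $((after - before))"
```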

Nested Page Fault during iret tracing
-------------------------------------

1. Boot up a FreeBSD 64-bit guest.
   Recommend using a single vcpu and disabling pause exits to de-clutter the
   KTR log file.

2. Set 'sysctl machdep.kdb_on_nmi=0' in the guest.

3. Set 'sysctl debug.nit.invalidate_nested_mappings=1' on the host.
   This will invalidate all nested page table entries before entering the
   guest with IRET tracing enabled.

4. Inject an NMI and capture a trace by executing the following on the host:
	sudo sysctl debug.ktr.clear=1
	sudo bhyvectl --vm=vm2 --cpu=0 --assert-lapic-lvt=1
	sleep 1
	sudo ktrdump -cto /tmp/ktrdump.out

5. In /tmp/ktrdump.out look for the following:
   - "Removing all nested pages" should appear after "vNMI iret tracing enabled"
   - There should be a number of nested page fault VM exits:
     - All these VM exits occur at the %rip of the guest NMI handler's "iret"
   - This should be followed by "vNMIs unblocked precisely"

6. To test behavior when the hypervisor returns to userspace during iret
   tracing, also set 'sysctl debug.nit.bounce_to_userspace=1' in step (3).
   - This can be verified by "Bouncing to userspace" in /tmp/ktrdump.out.
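
The inject-and-capture sequence in step 4 is the same in every test here, so
it can be wrapped in a function. This is a sketch only: the function name
and the RUN dry-run switch are made up for illustration, and the commands
are the ones listed in step 4.

```sh
# capture_nmi_trace VM CPU OUTFILE
# Clears the KTR buffer, asserts LAPIC LVT pin 1 on the given vcpu, waits
# one second and dumps the trace to OUTFILE. Stops at the first failure.
# Set RUN=echo to print the commands instead of executing them
# (RUN is a hypothetical dry-run switch for this sketch).
capture_nmi_trace() {
    vm=$1 cpu=$2 out=$3
    ${RUN:-} sudo sysctl debug.ktr.clear=1 &&
    ${RUN:-} sudo bhyvectl --vm="$vm" --cpu="$cpu" --assert-lapic-lvt=1 &&
    ${RUN:-} sleep 1 &&
    ${RUN:-} sudo ktrdump -cto "$out"
}

# Example:
#   capture_nmi_trace vm2 0 /tmp/ktrdump.out
```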

#GP exception triggered by "iret"
---------------------------------

1. Boot up a FreeBSD 64-bit guest.
   Recommend using a single vcpu and disabling pause exits to de-clutter the
   KTR log file.

2. Set 'sysctl machdep.kdb_on_nmi=0' in the guest.

3. Set 'sysctl debug.nit.iret_trigger_gpf=1' on the host.
   This will change the return address in the NMI trapframe to a non-canonical
   address. When the "iret" executes this will trigger a #GP in the guest which
   is intercepted by the hypervisor to clear NMI blocking.

4. Inject an NMI and capture a trace by executing the following on the host:
	sudo sysctl debug.ktr.clear=1
	sudo bhyvectl --vm=vm2 --cpu=0 --assert-lapic-lvt=1
	sudo ktrdump -cto /tmp/ktrdump.out

5. In /tmp/ktrdump.out look for:
   "updated to non-canonical value 0xdeadbeefbeefdead"
   "vNMIs unblocked precisely"
   "Reflecting exception 13/0 into the guest"

6. If the guest is idle then the #GP will be encountered in kernel mode and
   cause it to enter the debugger.

7. By executing a 'while [ 1 ]; do :; done' script at the guest console it
   is possible to increase the likelihood that the #GP is encountered in
   user mode. In this case the running program will exit with a SIGBUS.
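
The check in step 5 can be automated with a helper that reports which of the
expected messages are absent from the trace. 'missing_messages' is a
hypothetical name; the strings in the example are the ones listed in step 5.

```sh
# missing_messages FILE MSG...
# Prints each MSG that does not occur literally in FILE and returns
# non-zero if any message is missing.
# (Hypothetical helper, not part of bhyve or ktrdump.)
missing_messages() {
    file=$1; shift
    status=0
    for msg in "$@"; do
        if ! grep -q -F -- "$msg" "$file"; then
            printf 'missing: %s\n' "$msg"
            status=1
        fi
    done
    return "$status"
}

# Example:
#   missing_messages /tmp/ktrdump.out \
#       "updated to non-canonical value 0xdeadbeefbeefdead" \
#       "vNMIs unblocked precisely" \
#       "Reflecting exception 13/0 into the guest"
```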

Miscellaneous
-------------

- Verify that tunable 'hw.vmm.trace_guest_exceptions=1' works as expected

- Set breakpoint in the guest and make sure the debugger works properly

- Inject NMI handled using a task gate
  - Boot an i386 guest and load the tsnmi.ko kernel module
    https://people.freebsd.org/~neel/bhyve/tests/task_switch_nmi.tar.gz
  - Inject an NMI into the guest
  - Verify that debug.tsnmi is incremented.
  - Verify that the bhyve stat "Number of times vNMI unblocked speculatively"
    is incremented.

Combining various test knobs
----------------------------

It is possible to combine the following knobs in any way:
  - debug.nit.invalidate_nested_mappings
    - debug.nit.bounce_to_userspace
  - debug.nit.delay_usecs
  - debug.nit.recursive_nmi
  Note that 'bounce_to_userspace' only takes effect if
  'invalidate_nested_mappings' is set to '1'.

It is possible to combine 'debug.nit.iret_trigger_gpf=1' with all other knobs
except 'debug.nit.recursive_nmi=1'.

This is because FreeBSD uses the 64-bit TSS interrupt stack table to switch
stacks for the NMI handler to execute. When the #GP is triggered it unblocks
NMI and the second NMI is injected into the guest. Note that the NMI stack
still has the hardware stack frames for both the first NMI and the #GP. The
second NMI will use the same stack switching mechanism as the first NMI and
overwrite the stack frames for the first NMI and the #GP.
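
The two combination rules above can be captured in a small checker to run
before setting the sysctls. This is a sketch under the rules as stated in
this section; 'check_nit_knobs' is a hypothetical name and it only validates
knob names, it does not touch any sysctl.

```sh
# check_nit_knobs KNOB...
# Takes knob names without the 'debug.nit.' prefix. Returns 0 if the
# combination is allowed, 1 if it breaks one of the rules above and 2 if a
# knob name is not one of those listed in this document.
check_nit_knobs() {
    inv=0 bounce=0 gpf=0 rec=0
    for k in "$@"; do
        case $k in
            invalidate_nested_mappings) inv=1 ;;
            bounce_to_userspace) bounce=1 ;;
            iret_trigger_gpf) gpf=1 ;;
            recursive_nmi) rec=1 ;;
            delay_usecs) ;;
            *) echo "unknown knob: $k" >&2; return 2 ;;
        esac
    done
    # bounce_to_userspace only takes effect with invalidate_nested_mappings.
    if [ "$bounce" -eq 1 ] && [ "$inv" -eq 0 ]; then
        echo "bounce_to_userspace needs invalidate_nested_mappings" >&2
        return 1
    fi
    # The second NMI would overwrite the stack frames of the first NMI
    # and the #GP.
    if [ "$gpf" -eq 1 ] && [ "$rec" -eq 1 ]; then
        echo "iret_trigger_gpf cannot be combined with recursive_nmi" >&2
        return 1
    fi
    return 0
}

# Example:
#   check_nit_knobs invalidate_nested_mappings bounce_to_userspace delay_usecs
```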

Implementation details:
-----------------------

When the hypervisor sets RFLAGS.TF and resumes the guest:
- Also set RFLAGS.RF to ensure that the instruction is not subject to any
  instruction breakpoint fault.

- How to handle an exception triggered by "iret"?
  - Restore RFLAGS.TF and RFLAGS.RF
  - Reinject the exception into the guest.
  - Clear NMI blocking
    - From Intel SDM 6.7.1: an execution of the IRET instruction unblocks NMIs
      even if the instruction causes a fault.

- How to handle #DB that occurs when guest successfully executes "iret"?
  - Don't restore RFLAGS.TF and RFLAGS.RF
    - guest has already restored RFLAGS from the NMI stack
  - Don't reinject exception into the guest
  - Clear NMI blocking.
  - KASSERT(GUEST_DR6.BS == 1)
    - APMv2 15.12.2 DR6 and DR7 are updated before the #DB intercept
    - XXX restore DR6.BS

Documentation:
--------------

APMv2 section "Single Stepping"
RFLAGS.TF = 1 will cause #DB after every instruction is executed.
- The instruction that sets the TF bit and the instruction that follows it
  are not single stepped.
- TF is cleared before entering the #DB handler so the debug handler itself
  is not single stepped.
- The RFLAGS on the stack have the TF bit set so single stepping resumes
  when the IRET pops the saved value into RFLAGS.
- The processor also sets DR6.BS = 1 to indicate that the #DB exception
  occurred due to single stepping.
- Single step #DB has higher priority than external interrupts. Control is
  transferred to the #DB handler first causing RFLAGS.TF to be cleared. The
  processor then transfers control to the pending-interrupt handler.
- INTn, INT3 and INTO clear the TF when they are executed.
  - This means that the INTn, INT3 and INTO exception handlers are
    executed with RFLAGS.TF=0.
  - However, RFLAGS.TF=1 on the exception stack so when the handler returns
    it will resume single step mode.
  - Intel SDM Vol 2 "Int n/INTO/INT 3" pseudo-code.

Resume Flag:
RFLAGS.RF is used to prevent an instruction breakpoint from generating a #DB.
The primary use is to prevent the processor from going into a debug exception
loop on an instruction breakpoint.

APMv2 13.1.3:
Instruction breakpoints and general-detect conditions have lower interrupt
priority than the other breakpoint and single-stepping conditions.

Thus a single-step breakpoint on the most recently executed instruction
will happen before the instruction-breakpoint on the next instruction.

APMv2 13.1.1.3:
Single-step mode has the highest priority among debug exceptions.

Thus the single step breakpoint will have higher priority than data or i/o
breakpoints.

APMv2 13.1.3.1
The processor ignores the instruction-breakpoint condition if RFLAGS.RF = 1.
The RFLAGS.RF is automatically cleared by the processor when an instruction
is executed. The exception is when RFLAGS.RF is set to 1 by an IRET in which
case the flag is not cleared to 0 until the next instruction is executed.

The Resume Flag is used when restarting an instruction that triggered
an instruction breakpoint.

APMv2 15.5.1 "VMRUN and TF/RF Bits in EFLAGS"
- EFLAGS.RF=1 takes effect on the first guest instruction executed after VMRUN
  and suppresses an instruction breakpoint (if any) on the first instruction.

- EFLAGS.TF=1 takes effect after the completion of the first guest instruction.

APMv2 13.1.3.2, 13.1.3.3
Data and I/O breakpoints happen after the instruction that triggered the
breakpoint is executed. The instruction pointer pushed on the #DB exception
stack points to the next instruction.

APMv2 13.1.3.4
If TSS.T=1 then the processor completes loading new task state and #DB exception
occurs before the first instruction is executed. The processor does not clear
the TSS.T bit automatically when the #DB exception occurs. Software must clear
this explicitly to disable the task breakpoint.