Commit 33fd0ccd2590 for kernel

commit 33fd0ccd2590b470b65adcca288615ad3b5e3e06
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Sun May 3 19:19:32 2026 +0200

    KVM: x86: Do IRR scan in __kvm_apic_update_irr even if PIR is empty

    Fall back to apic_find_highest_vector() when PID.ON is set but PIR
    turns out to be empty, to correctly report the highest pending interrupt
    from the existing IRR.

    In a nested VM stress test, the following WARNING fires in
    vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
    interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
    vmx_sync_pir_to_irr() again) returns -1:

      WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
      Call Trace:
       kvm_check_and_inject_events
       vcpu_enter_guest.constprop.0
       vcpu_run
       kvm_arch_vcpu_ioctl_run
       kvm_vcpu_ioctl
       __x64_sys_ioctl
       do_syscall_64
       entry_SYSCALL_64_after_hwframe

    The root cause is a race between vmx_sync_pir_to_irr() on the target vCPU
    and __vmx_deliver_posted_interrupt() on a sender vCPU.  The sender
    performs two individually-atomic operations that are not a single
    transaction:

      1. pi_test_and_set_pir(vector)  -- sets the PIR bit
      2. pi_test_and_set_on()         -- sets PID.ON

    The following interleaving triggers the bug:

      Sender vCPU (IPI):              Target vCPU (1st sync_pir_to_irr):
      B1: set PIR[vector]
                                      A1: pi_clear_on()
                                      A2: pi_harvest_pir() -> sees B1 bit
                                      A3: xchg() -> consumes bit, PIR=0
                                          (1st sync returns correct max_irr)
      B2: set PID.ON = 1

                                      Target vCPU (2nd sync_pir_to_irr):
                                      C1: pi_test_on() -> TRUE (from B2)
                                      C2: pi_clear_on() -> ON=0
                                      C3: pi_harvest_pir() -> PIR empty
                                      C4: *max_irr = -1, early return
                                          IRR NOT SCANNED

    The interrupt is not lost (it resides in the IRR from the first sync and
    is recovered on the next vcpu_enter_guest() iteration), but the incorrect
    max_irr causes a spurious WARNING and a wasted L2 VM-Enter/VM-Exit cycle.

    Fixes: b41f8638b9d3 ("KVM: VMX: Isolate pure loads from atomic XCHG when processing PIR")
    Reported-by: Farrah Chen <farrah.chen@intel.com>
    Analyzed-by: Chenyi Qiang <chenyi.qiang@intel.com>
    Cc: stable@vger.kernel.org
    Reviewed-by: Sean Christopherson <seanjc@google.com>
    Link: https://lore.kernel.org/kvm/20260428070349.1633238-1-chenyi.qiang@intel.com/T/
    Link: https://patch.msgid.link/20260503201703.108231-2-pbonzini@redhat.com/
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
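
The second sync in the interleaving above can be illustrated with a small user-space model. This is only a sketch under loud assumptions, not the kernel code: model_update_irr(), harvest_pir(), find_highest_vector() and the scan_irr_if_empty flag are hypothetical names, PIR and IRR are plain eight-word arrays rather than the posted-interrupt descriptor and vAPIC page, and nothing here is atomic. The flag selects between the old behaviour (report -1 when PIR is empty) and the patched behaviour (fall back to scanning IRR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_WORDS 8 /* 256 vectors, 32 per word */

/* Highest set vector in an 8-word bitmap, or -1 if the bitmap is empty. */
static int find_highest_vector(const uint32_t *bits)
{
	for (int i = NR_WORDS - 1; i >= 0; i--) {
		uint32_t w = bits[i];
		int bit = -1;

		while (w) {
			bit++;
			w >>= 1;
		}
		if (bit >= 0)
			return i * 32 + bit;
	}
	return -1;
}

/* Hypothetical, non-atomic stand-in for harvesting PIR: copy it, then clear it. */
static bool harvest_pir(uint32_t *pir, uint32_t *vals)
{
	bool found = false;

	for (int i = 0; i < NR_WORDS; i++) {
		vals[i] = pir[i];
		pir[i] = 0;
		found |= vals[i] != 0;
	}
	return found;
}

/*
 * Simplified model of the update: merge harvested PIR bits into IRR and
 * report the highest pending vector.  With scan_irr_if_empty == false an
 * empty PIR yields -1 (pre-patch); with true the existing IRR is scanned.
 */
static bool model_update_irr(uint32_t *pir, uint32_t *irr, int *max_irr,
			     bool scan_irr_if_empty)
{
	uint32_t vals[NR_WORDS];
	int max_updated_irr;

	assert(pir && irr && max_irr);

	if (!harvest_pir(pir, vals)) {
		*max_irr = scan_irr_if_empty ? find_highest_vector(irr) : -1;
		return false;
	}

	for (int i = 0; i < NR_WORDS; i++)
		irr[i] |= vals[i];

	max_updated_irr = find_highest_vector(vals);
	*max_irr = find_highest_vector(irr);
	return max_updated_irr == *max_irr;
}
```

In this model, posting vector 37 (bit 5 of word 1) and syncing once moves the bit into IRR. A second sync with an empty PIR then reports -1 when scan_irr_if_empty is false, and 37 when it is true, which mirrors steps C3 and C4 before and after the fix.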

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e3ec4d8607c1..5ee14d6bc288 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -669,12 +669,14 @@ bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
 	u32 irr_val, prev_irr_val;
 	int max_updated_irr;

+	if (!pi_harvest_pir(pir, pir_vals)) {
+		*max_irr = apic_find_highest_vector(regs + APIC_IRR);
+		return false;
+	}
+
 	max_updated_irr = -1;
 	*max_irr = -1;

-	if (!pi_harvest_pir(pir, pir_vals))
-		return false;
-
 	for (i = vec = 0; i <= 7; i++, vec += 32) {
 		u32 *p_irr = (u32 *)(regs + APIC_IRR + i * 0x10);