Commit 2ae361ef1d for qemu.org

commit 2ae361ef1d7d526b07ff88d854552e2d009bfb1b
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Feb 18 15:09:58 2026 -0500

    aio-posix: notify main loop when SQEs are queued

    When a vCPU thread handles MMIO (holding BQL), aio_co_enter() runs the
    block I/O coroutine inline on the vCPU thread because
    qemu_get_current_aio_context() returns the main AioContext when BQL is
    held. The coroutine calls luring_co_submit() which queues an SQE via
    fdmon_io_uring_add_sqe(), but the actual io_uring_submit() only happens
    in gsource_prepare() on the main loop thread.

    Since the coroutine ran inline (not via aio_co_schedule()), no BH is
    scheduled and aio_notify() is never called. The main loop remains asleep
    in ppoll() with up to a 499ms timeout, leaving the SQE unsubmitted until
    the next timer fires.

    Fix this by calling aio_notify() after queuing the SQE. This wakes the
    main loop via the eventfd so it can run gsource_prepare() and submit the
    pending SQE promptly.

    This is a generic fix that benefits all devices using aio=io_uring.
    Without it, AHCI/SATA devices see MUCH worse I/O latency since they use
    MMIO (not ioeventfd like virtio) and have no other mechanism to wake the
    main loop after queuing block I/O.

    This is usually hard to detect: it relies on the ppoll loop not being
    woken up by other activity, and micro benchmarks tend not to see it
    because they have no real processing time. With a synthetic test case
    that uses a few usleep() calls to simulate processing of read data,
    it's very noticeable. The example below reads 128MB with O_DIRECT in
    128KB chunks in batches of 16, with a 1ms delay before each batch
    submit and a 1ms delay after processing each completion. Running it
    on /dev/sda yields:

    time sudo ./iotest /dev/sda

    ________________________________________________________
    Executed in   25.76 secs          fish           external
       usr time    6.19 millis  783.00 micros    5.41 millis
       sys time   12.43 millis  642.00 micros   11.79 millis

    while on a virtio-blk or NVMe device we get:

    time sudo ./iotest /dev/vdb

    ________________________________________________________
    Executed in    1.25 secs      fish           external
       usr time    1.40 millis    0.30 millis    1.10 millis
       sys time   17.61 millis    1.43 millis   16.18 millis

    time sudo ./iotest /dev/nvme0n1

    ________________________________________________________
    Executed in    1.26 secs      fish           external
       usr time    6.11 millis    0.52 millis    5.59 millis
       sys time   13.94 millis    1.50 millis   12.43 millis

    where the latter results are consistent. If we run the same test
    while keeping the ssh connection's socket active with traffic, then
    the sda test looks as follows:

    time sudo ./iotest /dev/sda

    ________________________________________________________
    Executed in    1.23 secs      fish           external
       usr time    2.70 millis   39.00 micros    2.66 millis
       sys time    4.97 millis  977.00 micros    3.99 millis

    as now the ppoll loop is woken all the time anyway.

    After this fix, on an idle system:

    time sudo ./iotest /dev/sda

    ________________________________________________________
    Executed in    1.30 secs      fish           external
       usr time    2.14 millis    0.14 millis    2.00 millis
       sys time   16.93 millis    1.16 millis   15.76 millis

    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Message-Id: <07d701b9-3039-4f9b-99a2-abeae51146a5@kernel.dk>
    Reviewed-by: Kevin Wolf <kwolf@redhat.com>
    [Generalize the comment since this applies to all vCPU thread activity,
    not just coroutines, as suggested by Kevin Wolf <kwolf@redhat.com>.
    --Stefan]
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

diff --git a/util/aio-posix.c b/util/aio-posix.c
index e24b955fd9..488d964611 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -23,6 +23,7 @@
 #include "qemu/rcu_queue.h"
 #include "qemu/sockets.h"
 #include "qemu/cutils.h"
+#include "system/iothread.h"
 #include "trace.h"
 #include "aio-posix.h"

@@ -813,5 +814,13 @@ void aio_add_sqe(void (*prep_sqe)(struct io_uring_sqe *sqe, void *opaque),
 {
     AioContext *ctx = qemu_get_current_aio_context();
     ctx->fdmon_ops->add_sqe(ctx, prep_sqe, opaque, cqe_handler);
+
+    /*
+     * Wake the main loop if it is sleeping in ppoll().  When a vCPU thread
+     * queues SQEs, the actual io_uring_submit() only happens in
+     * gsource_prepare() in the main loop thread.  Without this notify, the
+     * main loop thread's ppoll() can sleep up to 499ms before submitting.
+     */
+    aio_notify(ctx);
 }
 #endif /* CONFIG_LINUX_IO_URING */