Dev news

Commit 0b47b6c3543e for kernel

commit 0b47b6c3543efd65f2e620e359b05f4938314fbd
Author: Andrea Righi <arighi@nvidia.com>
Date:   Fri Sep 12 18:14:38 2025 +0200

    Revert "sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()"

    scx_bpf_reenqueue_local() can be called from ops.cpu_release() when a
    CPU is taken by a higher scheduling class to give tasks queued to the
    CPU's local DSQ a chance to be migrated somewhere else, instead of
    waiting indefinitely for that CPU to become available again.

    In doing so, we decided to skip migration-disabled tasks, under the
    assumption that they cannot be migrated anyway.

    However, when a higher scheduling class preempts a CPU, the running task
    is always inserted at the head of the local DSQ as a migration-disabled
    task. This means it is always skipped by scx_bpf_reenqueue_local(), and
    ends up being confined to the same CPU even if that CPU is heavily
    contended by other higher scheduling class tasks.

    As an example, let's consider the following scenario:

     $ schedtool -a 0,1, -e yes > /dev/null
     $ sudo schedtool -F -p 99 -a 0, -e \
       stress-ng -c 1 --cpu-load 99 --cpu-load-slice 1000

    The first task (SCHED_EXT) can run on CPU0 or CPU1. The second task
    (SCHED_FIFO) is pinned to CPU0 and consumes ~99% of it. If the SCHED_EXT
    task initially runs on CPU0, it will remain there because it always sees
    CPU0 as "idle" in the short gaps left by the RT task, resulting in ~1%
    utilization while CPU1 stays idle:

        0[||||||||||||||||||||||100.0%]   8[                        0.0%]
        1[                        0.0%]   9[                        0.0%]
        2[                        0.0%]  10[                        0.0%]
        3[                        0.0%]  11[                        0.0%]
        4[                        0.0%]  12[                        0.0%]
        5[                        0.0%]  13[                        0.0%]
        6[                        0.0%]  14[                        0.0%]
        7[                        0.0%]  15[                        0.0%]
      PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
     1067 root        RT   0  R   0  99.0  0.2  0:31.16 stress-ng-cpu [run]
      975 arighi      20   0  R   0   1.0  0.0  0:26.32 yes

    By allowing scx_bpf_reenqueue_local() to re-enqueue migration-disabled
    tasks, the scheduler can choose to migrate them to other CPUs (CPU1 in
    this case) via ops.enqueue(), leading to better CPU utilization:

        0[||||||||||||||||||||||100.0%]   8[                        0.0%]
        1[||||||||||||||||||||||100.0%]   9[                        0.0%]
        2[                        0.0%]  10[                        0.0%]
        3[                        0.0%]  11[                        0.0%]
        4[                        0.0%]  12[                        0.0%]
        5[                        0.0%]  13[                        0.0%]
        6[                        0.0%]  14[                        0.0%]
        7[                        0.0%]  15[                        0.0%]
      PID USER       PRI  NI  S CPU  CPU%▽MEM%   TIME+  Command
      577 root        RT   0  R   0 100.0  0.2  0:23.17 stress-ng-cpu [run]
      555 arighi      20   0  R   1 100.0  0.0  0:28.67 yes

    It's debatable whether per-CPU tasks should be re-enqueued as well, but
    doing so is probably safer: the scheduler can recognize re-enqueued
    tasks through the %SCX_ENQ_REENQ flag, reassess their placement, and
    either put them back at the head of the local DSQ or let another task
    attempt to take the CPU.

    This also prevents giving per-CPU tasks an implicit priority boost,
    which would otherwise make them more likely to reclaim CPUs preempted by
    higher scheduling classes.

    Fixes: 97e13ecb02668 ("sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local()")
    Cc: stable@vger.kernel.org # v6.15+
    Signed-off-by: Andrea Righi <arighi@nvidia.com>
    Acked-by: Changwoo Min <changwoo@igalia.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 4ae32ef179dd..088ceff38c8a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6788,12 +6788,8 @@ __bpf_kfunc u32 scx_bpf_reenqueue_local(void)
 		 * CPUs disagree, they use %ENQUEUE_RESTORE which is bypassed to
 		 * the current local DSQ for running tasks and thus are not
 		 * visible to the BPF scheduler.
-		 *
-		 * Also skip re-enqueueing tasks that can only run on this
-		 * CPU, as they would just be re-added to the same local
-		 * DSQ without any benefit.
 		 */
-		if (p->migration_pending || is_migration_disabled(p) || p->nr_cpus_allowed == 1)
+		if (p->migration_pending)
 			continue;

 		dispatch_dequeue(rq, p);