Optimized NUMA Distances For Intel GNR & CWF, Other Scheduler Improvements In Linux 6.19
([Linux Kernel] 6 Hours Ago
Linux 6.19 Scheduler)
- Reference: 0001596340
- News link: https://www.phoronix.com/news/Linux-6.19-Scheduler
- Source link:
The big set of kernel scheduler changes was merged on Monday for the in-development Linux 6.19 kernel.
New to the kernel scheduler code for Linux 6.19 is the "NEXT_BUDDY" feature, which existed in the kernel before but was ultimately disabled for reasons that were never clearly documented. Mel Gorman explains the NEXT_BUDDY work in the patches as follows:
"The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last wakee to be scheduled sooner on the assumption that the waker/wakee share cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on the assumption that the pair of tasks still share data but also relied on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get good results.
NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465 ("sched/fair: Commit to EEVDF"). The reasoning was not documented but as vruntime spread is mentioned and NEXT_BUDDY cannot, by definition, strictly obey EEVDF principles. It was not noted why LAST_BUDDY was removed but it is assumed that it's very difficult to reason what LAST_BUDDY's correct and effective behaviour should be while still respecting EEVDFs goals. NEXT_BUDDY will still pick an earlier deadline but LAST_BUDDY can pick ineligible tasks. Peter Zijlstra made this comment about NEXT_BUDDY being disabled during review;
"I think I was just struggling to make sense of things and figured less is more and axed it.
I have vague memories trying to work through the dynamics of a wakeup-stack and the EEVDF latency requirements and getting a head-ache."
NEXT_BUDDY is easier to reason about given that it's a point-in-time decision on the wakees deadline and eligibilty relative to the waker. Enable NEXT_BUDDY as a preparation path to document that the decision to ignore the current implementation is deliberate. While not presented, the results were at best neutral and often much more variable."
Now NEXT_BUDDY is back and ready for the modern [1]EEVDF world.
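To make the eligibility and deadline wording in the quoted patch description more concrete, here is a minimal Python sketch of an EEVDF-style pick with an optional "next buddy" hint. It is an illustration under stated assumptions, not the kernel's code: the Task fields, the function names, and the rule used in wakeup_prefers_wakee are all hypothetical, chosen only to mirror the idea that the wakee is favored when it is eligible and has an earlier deadline than the waker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    weight: int     # load weight (hypothetical units)
    vruntime: int   # virtual runtime consumed so far
    deadline: int   # virtual deadline


def is_eligible(task: Task, runnable: List[Task]) -> bool:
    """Eligible means the task has not run ahead of the weighted average.

    Written without a division: sum(w_i * (v_i - v)) >= 0 is equivalent to
    v <= weighted average vruntime of the runnable set.
    """
    return sum(t.weight * (t.vruntime - task.vruntime) for t in runnable) >= 0


def wakeup_prefers_wakee(waker: Task, wakee: Task, runnable: List[Task]) -> bool:
    """Assumed point-in-time rule: favor the wakee only if it is eligible
    and its deadline is earlier than the waker's."""
    return is_eligible(wakee, runnable) and wakee.deadline < waker.deadline


def pick_next(runnable: List[Task], next_buddy: Optional[Task] = None) -> Task:
    """Pick the buddy if it is eligible, else the earliest eligible deadline."""
    eligible = [t for t in runnable if is_eligible(t, runnable)]
    if next_buddy is not None and next_buddy in eligible:
        return next_buddy
    return min(eligible, key=lambda t: t.deadline)
```

In this sketch an ineligible buddy is simply ignored, so the hint can change which eligible task runs next but never lets a task that is ahead of the average jump the queue.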
The scheduler changes also bring other load balancing improvements. For Intel Xeon 6 Granite Rapids (GNR) and next-gen Xeon 6+ Clearwater Forest (CWF) platforms there are now optimized NUMA distances. This addresses a sched domain build error for GNR and CWF processors operating in Sub-NUMA Clustering 3 (SNC-3) mode.
Also new for Linux 6.19 is a proportional newidle balance mode. This work by Intel's Peter Zijlstra is a randomized algorithm that runs newidle balancing in proportion to its success rate. It was found to significantly help the Schbench scheduler benchmark.
Plus there are fair scheduling enhancements, fixes to the deadline scheduler, and other fixes throughout. More details are in [2]this Git pull, which has already been merged to Linux Git.
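The general idea of running a balancing attempt in proportion to how often it has recently succeeded can be sketched in a few lines of Python. This is only an illustration of the technique as described here, not the kernel implementation: the class name, the history window, the halving used to age out old samples, and the floor probability are all assumptions introduced for the example.

```python
import random


class ProportionalNewidle:
    """Toy model: attempt a newidle balance with probability equal to the
    recently observed success rate (all parameters are hypothetical)."""

    def __init__(self, window=64, floor=0.05, rng=None):
        self.window = window    # samples kept before history is aged
        self.floor = floor      # minimum probability, so recovery stays possible
        self.attempts = 0
        self.successes = 0
        self.rng = rng or random.Random()

    def success_rate(self):
        if self.attempts == 0:
            return 1.0          # be optimistic until there is evidence
        return self.successes / self.attempts

    def should_balance(self):
        # Randomized decision: the more often balancing found work to pull,
        # the more often it is attempted.
        p = max(self.floor, self.success_rate())
        return self.rng.random() < p

    def record(self, pulled_task):
        # Call after an attempt; pulled_task is True if a task was migrated.
        self.attempts += 1
        self.successes += bool(pulled_task)
        if self.attempts >= self.window:
            # Halve both counters so old outcomes gradually lose influence.
            self.attempts //= 2
            self.successes //= 2
```

A caller would check should_balance() when a CPU goes idle, perform the (expensive) balance only when it returns True, and then report the outcome through record(). The floor keeps the attempt probability from sticking at zero after a run of failures.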
[1] https://www.phoronix.com/search/EEVDF
[2] https://lore.kernel.org/lkml/aS14ZaStk4Kly1NI@gmail.com/