Linux 6.19 Fixes A Thundering Herd Problem For Big NUMA Servers

([Linux Kernel] 4 Hours Ago Timers Issue)


The "timers/core" pull requests that update the Linux kernel's timer-related code don't tend to be too interesting each kernel cycle, but the one for Linux 6.19 stands out because it addresses a problem HPE discovered on big NUMA servers.

Linux 6.19 fixes a timekeeper CPU issue that could lead to a large number of CPU cores getting stuck on very large NUMA servers. The [1]pull request noted:

"Prevent a thundering herd problem when the timekeeper CPU is delayed and a large number of CPUs compete to acquire jiffies_lock to do the update. Limit it to one CPU with a separate "uncontended" atomic variable."

Steve Wahl of HPE authored the patch after the issue was spotted at the company. He explained further in [2]the patch:

"On large NUMA systems, while running a test program that saturates the inter-processor and inter-NUMA links, acquiring the jiffies_lock can be very expensive. If the cpu designated to do jiffies updates (tick_do_timer_cpu) gets delayed and other cpus decide to do the jiffies update themselves, a large number of them decide to do so at the same time. The inexpensive check against tick_next_period is far quicker than actually acquiring the lock, so most of these get in line to obtain the lock. If obtaining the lock is slow enough, this spirals into the vast majority of CPUs continuously being stuck waiting for this lock, just to obtain it and find out that time has already been updated by another cpu. For example, on one random entry to kdb by manually-injected NMI, I saw 2912 of 3840 cpus stuck here.

To avoid this, allow only one non-timekeeper CPU to call tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies only if the jiffies update function is actually called.

With this change, manually interrupting the test I find at most two CPUs in the tick_do_update_jiffies64 function (the timekeeper and one other)."
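
To make the idea in the patch description concrete, here is a minimal user-space sketch in C11 of the gating pattern it describes: a cheap lockless check first, then an atomic gate that lets only one non-timekeeper caller go on to the expensive lock. This is not the kernel's code. The struct tick_model type, its fields, and the functions tick_model_init(), model_update_jiffies() and model_tick() are hypothetical names made up for illustration, and the spin flag merely stands in for jiffies_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's tick state; not kernel code. */
struct tick_model {
    _Atomic uint64_t next_period; /* time at which the next tick is due */
    _Atomic uint64_t jiffies;     /* tick counter */
    uint64_t period;              /* length of one tick */
    atomic_flag lock;             /* models the expensive jiffies_lock */
    atomic_flag gate;             /* set while one non-timekeeper updates */
};

void tick_model_init(struct tick_model *tm, uint64_t period)
{
    assert(period > 0); /* period is used as a divisor below */
    atomic_init(&tm->next_period, period);
    atomic_init(&tm->jiffies, 0);
    tm->period = period;
    atomic_flag_clear(&tm->lock);
    atomic_flag_clear(&tm->gate);
}

/* The expensive part: take the lock, recheck, and advance the counter. */
void model_update_jiffies(struct tick_model *tm, uint64_t now)
{
    while (atomic_flag_test_and_set_explicit(&tm->lock, memory_order_acquire))
        ; /* spin; on the reported systems this wait is the costly step */

    uint64_t next = atomic_load(&tm->next_period);
    if (now >= next) { /* recheck: another caller may have updated already */
        uint64_t ticks = (now - next) / tm->period + 1;
        atomic_fetch_add(&tm->jiffies, ticks);
        atomic_store(&tm->next_period, next + ticks * tm->period);
    }
    atomic_flag_clear_explicit(&tm->lock, memory_order_release);
}

/*
 * Returns true if this caller went on to the locked update.
 * The timekeeper always may; of the other callers only the one
 * that wins the gate does, and the rest back off immediately
 * instead of queueing on the lock.
 */
bool model_tick(struct tick_model *tm, uint64_t now, bool is_timekeeper)
{
    /* Cheap lockless check: nothing to do before the next period. */
    if (now < atomic_load(&tm->next_period))
        return false;

    if (!is_timekeeper &&
        atomic_flag_test_and_set_explicit(&tm->gate, memory_order_acquire))
        return false; /* someone else is already doing the update */

    model_update_jiffies(tm, now);

    if (!is_timekeeper)
        atomic_flag_clear_explicit(&tm->gate, memory_order_release);
    return true;
}
```

In this model, a caller that loses the gate returns right away rather than waiting its turn, so at most two callers (the timekeeper and one other) can ever be inside model_update_jiffies() at once, which mirrors the "at most two CPUs" observation quoted above. The sketch leaves out the stalled-jiffies accounting mentioned in the patch description.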

This fix was merged this week for Linux 6.19.



[1] https://lore.kernel.org/lkml/176457122251.1888260.91531689314335034.tglx@xen13/

[2] https://lore.kernel.org/lkml/20251027183456.343407-1-steve.wahl@hpe.com/


