
New Linux Patches Enhance Single-Threaded Performance On Many-Core CPUs

([Linux Kernel] 107 Minutes Ago Better Single-Threaded Perf)


In addition to [1]the proposed Hierarchical Queued NUMA-aware spinlocks for better performance, another interesting performance-enhancing patch series posted for the Linux kernel in the past 24 hours aims to improve the performance of single-threaded tasks running on high core count desktops, workstations, and servers.

Gabriel Krisman Bertazi of SUSE posted the request for comments (RFC) patch series to improve the performance of single-threaded tasks on today's many-core CPUs. The optimization centers on the Linux kernel's "rss_stat" structure, which holds the Resident Set Size (RSS) statistics for the process, that is, the amount of memory it has in use.

Gabriel Krisman Bertazi explained how this rss_stat optimization speeds up initialization and teardown for single-threaded tasks:

"The cost of the pcpu memory allocation is non-negligible for systems with many cpus, and it is quite visible when forking a new task, as reported in a few occasions. In particular, Jan Kara reported the commit introducing per-cpu counters for rss_stat caused a 10% regression of system time for gitsource in his system. In that same occasion, Jan suggested we special-cased the single-threaded case: since we know there won't be frequent remote updates of rss_stats for single-threaded applications, we could special case it with a local counter for most updates, and an atomic counter for the infrequent remote updates. This patchset implements this idea."
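The idea in that quote can be illustrated with a small userspace sketch. This is not the code from the patch series: the struct and function names below are hypothetical, and C11 atomics stand in for the kernel's own primitives. It only shows the general shape of "a plain local counter for the common case, plus an atomic counter for rare remote updates", with the two summed on read.

```c
#include <stdatomic.h>

/* Hypothetical userspace analogue, not the kernel's rss_stat layout. */
struct hybrid_counter {
    long local;         /* touched only by the owning thread: no atomics needed */
    atomic_long remote; /* touched by other threads, expected to be infrequent */
};

static void hybrid_init(struct hybrid_counter *c)
{
    c->local = 0;
    atomic_init(&c->remote, 0);
}

/* Fast path: the single owning thread updates a plain variable. */
static void hybrid_add_local(struct hybrid_counter *c, long delta)
{
    c->local += delta;
}

/* Slow path: any other thread goes through the atomic counter. */
static void hybrid_add_remote(struct hybrid_counter *c, long delta)
{
    atomic_fetch_add_explicit(&c->remote, delta, memory_order_relaxed);
}

/* Read from the owning thread: the total is the sum of both parts. */
static long hybrid_read(struct hybrid_counter *c)
{
    return c->local +
           atomic_load_explicit(&c->remote, memory_order_relaxed);
}
```

In this sketch there is nothing to allocate per CPU, so setup and teardown cost does not grow with the core count, which is the cost the quote says is visible at fork time. How the actual patches handle a task that later becomes multi-threaded is not shown here.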

The end result is some nice performance gains for single-threaded tasks running on high core count Linux systems: a 6% to 15% improvement in a synthetic benchmark and around 1.5% in a more realistic one. That is still enough to make pursuing it worthwhile:

"On a 256c system, where the pcpu allocation of the rss_stats is quite noticeable, this has reduced the wall-clock time between 6% - 15% (depending on the number of cores) of an artificial fork-intensive microbenchmark (calling /bin/true in a loop). In a more realistic benchmark, it showed an improvement of 1.5% on kernbench elapsed time."
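For readers who want to try something similar, a fork-intensive loop of the kind described ("calling /bin/true in a loop") might look roughly like the sketch below. The helper name `time_fork_loop` and its parameters are assumptions for illustration; the exact microbenchmark used for the quoted numbers is not given in the article.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for clock_gettime under strict -std=c11 */
#endif

#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork and exec `path` `iterations` times, one child
 * at a time, and return the elapsed wall-clock seconds. Returns -1.0 if
 * fork, waitpid, or the clock fails. A child whose exec fails still counts
 * as an iteration, so make sure `path` exists before trusting the timing.
 */
static double time_fork_loop(const char *path, int iterations)
{
    struct timespec start, end;

    assert(path != NULL && iterations > 0);
    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1.0;
        if (pid == 0) {
            execl(path, path, (char *)NULL);
            _exit(127); /* only reached if exec failed */
        }
        int status;
        if (waitpid(pid, &status, 0) != pid)
            return -1.0;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;
    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling `time_fork_loop("/bin/true", 10000)` before and after applying the patches on the same machine would give a rough comparison; the iteration count is arbitrary, and results will vary with core count and system load.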

Those interested in learning more can do so via [2]this RFC patch series. It will be fun to benchmark these patches on EPYC and Threadripper systems if they look like they'll end up in mainline.



[1] https://www.phoronix.com/news/HQspinlock-RFC-Patches

[2] https://lore.kernel.org/lkml/20251127233635.4170047-1-krisman@suse.de/#t


