
Linux 6.18-rc5 To Cut Down Performance Regression Observed On IBM POWER CPUs



Merged today ahead of the Linux 6.18-rc5 kernel due out on Sunday is a partial fix for a performance regression observed on IBM POWER hardware.

Since the "IMMUTABLE" flag was dropped from the kernel's FUTEX code for the Linux 6.17 cycle, IBM engineers have noted a performance regression primarily affecting their hardware. Now for Linux 6.18-rc5 that performance regression is at least cut in half.
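For readers unfamiliar with the interface in question, here is a minimal userspace sketch of a futex call. glibc provides no wrapper, so the raw syscall is used; `FUTEX_WAIT_PRIVATE` operates on the process-private futex state whose reference counting the patch below optimizes. The helper and demo names here are illustrative, not from the kernel source.

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw futex(2) syscall (no glibc wrapper exists). */
static long futex_wait(uint32_t *uaddr, uint32_t expected)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, expected,
		       NULL, NULL, 0);
}

/* The kernel atomically checks *uaddr == expected before sleeping, so
 * waiting on a word that has already changed fails fast with EAGAIN
 * instead of blocking. */
static int demo_mismatch_returns_eagain(void)
{
	uint32_t word = 1;
	long ret = futex_wait(&word, 0);   /* mismatch: 1 != 0 */
	return (ret == -1 && errno == EAGAIN);
}
```

It is this check-then-sleep path, entered on every contended lock operation, that makes per-call reference-counting overhead visible in benchmarks.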

Intel engineer Peter Zijlstra worked out the partial fix/workaround by optimizing the per-CPU reference counting in the futex code. Zijlstra explained with the now-merged [1]patch:

"Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely).

Further optimize the per-cpu reference counter by:

- switching from RCU to preempt;

- using __this_cpu_*() since we now have preempt disabled;

- switching from smp_load_acquire() to READ_ONCE().

This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock().

Having preemption disabled allows using __this_cpu_*() provided the only access to the variable is in task context -- which is the case here.

Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier().

This is very similar to the percpu_down_read_internal() fast-path.

The reason this is significant for PowerPC is that it uses the generic this_cpu_*() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, not having to use explicit barriers saves a bunch.

Combined this reduces the performance gap by half, down to some 5%."
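The fast-path pattern described in the commit message can be modeled in userspace roughly as follows. This is a sketch, not the kernel's code: the kernel primitives (preempt_disable(), __this_cpu_inc(), READ_ONCE()) are stubbed, and the struct and field names are illustrative assumptions; only the FR_ATOMIC state name is taken from the quoted text.

```c
#include <stdatomic.h>

#define NR_CPUS 4

enum ref_state { FR_PERCPU, FR_ATOMIC };  /* FR_PERCPU name is assumed */

struct priv_hash_ref {                    /* hypothetical stand-in struct */
	_Atomic int state;                /* per-CPU vs. atomic counting */
	long percpu_cnt[NR_CPUS];         /* fast-path per-CPU counters */
	_Atomic long atomic_cnt;          /* slow-path shared counter */
};

static int this_cpu;                      /* single "CPU" in this model */

/* Stubs: in the kernel, disabling preemption also holds off the RCU
 * grace period, exactly like rcu_read_lock(), which is what keeps the
 * plain load of ->state safe here. */
static void preempt_disable(void) {}
static void preempt_enable(void) {}

static void ref_get(struct priv_hash_ref *r)
{
	preempt_disable();
	/* Relaxed load stands in for READ_ONCE(): the full barrier
	 * implied by the grace period on the switch to FR_ATOMIC
	 * replaces the acquire load (and PowerPC's LWSYNC). */
	if (atomic_load_explicit(&r->state, memory_order_relaxed) == FR_PERCPU)
		r->percpu_cnt[this_cpu]++;           /* __this_cpu_inc() */
	else
		atomic_fetch_add(&r->atomic_cnt, 1); /* slow path */
	preempt_enable();
}
```

The win on PowerPC comes from the two substitutions visible above: preemption is toggled instead of IRQ state, and no explicit acquire barrier is emitted on the common path.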

This improvement was merged to the Linux 6.18 Git code today as the sole change of this week's [2]locking/urgent pull request.



[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=locking/urgent&id=4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d

[2] https://lore.kernel.org/lkml/aQ8_0mAj3AUGgguL@gmail.com/


