Linux 6.18-rc5 To Cut Down Performance Regression Observed On IBM POWER CPUs
([Linux Kernel] 3 Hours Ago - Linux 6.18, PowerPC)
- Reference: 0001590184
- News link: https://www.phoronix.com/news/Linux-6.18-rc5-POWER-Regression
Merged today ahead of the Linux 6.18-rc5 kernel due out on Sunday is a partial fix for a performance regression observed on IBM POWER hardware.
Since the "IMMUTABLE" flag was dropped from the kernel's FUTEX code for the Linux 6.17 cycle, IBM engineers have noted a performance regression primarily affecting their hardware. Now for Linux 6.18-rc5 that performance regression is at least cut in half.
Intel engineer Peter Zijlstra worked out the partial fix/workaround by optimizing the per-CPU reference counting in the futex code. Zijlstra explained in the now-merged [1] patch:
"Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely).
Further optimize the per-cpu reference counter by:
- switching from RCU to preempt;
- using __this_cpu_*() since we now have preempt disabled;
- switching from smp_load_acquire() to READ_ONCE().
This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock().
Having preemption disabled allows using __this_cpu_*() provided the only access to the variable is in task context -- which is the case here.
Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier().
This is very similar to the percpu_down_read_internal() fast-path.
The reason this is significant for PowerPC is that it uses the generic this_cpu_*() implementation, which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu_*() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, so not having to use explicit barriers saves a bunch.
Combined this reduces the performance gap by half, down to some 5%."
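To make the three changes Zijlstra lists concrete, here is a minimal, illustrative sketch in kernel-style C of a per-CPU reference-count fast path before and after such a conversion. The struct, field, and function names are made up for illustration (this is not the actual futex code); only the kernel primitives shown (rcu_read_lock(), preempt_disable(), this_cpu_inc(), __this_cpu_inc(), smp_load_acquire(), READ_ONCE()) are real:

#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/rcupdate.h>

/* Hypothetical object with a per-CPU reference counter guarded by a
 * state field; loosely modeled on the idea described in the patch,
 * not on the real struct futex_private_hash. */
enum my_state { MY_LIVE, MY_ATOMIC };

struct my_obj {
	enum my_state state;		/* flips to MY_ATOMIC only after a full RCU grace period */
	unsigned int __percpu *users;	/* per-CPU reference counter */
};

/* Before: RCU read-side critical section plus an acquire load. */
static bool my_obj_get_old(struct my_obj *obj)
{
	bool ok = false;

	rcu_read_lock();
	if (smp_load_acquire(&obj->state) != MY_ATOMIC) {
		this_cpu_inc(*obj->users);
		ok = true;
	}
	rcu_read_unlock();
	return ok;
}

/* After: disabling preemption holds off the RCU grace period just like
 * rcu_read_lock(), so the cheaper __this_cpu_inc() can be used, and the
 * grace period implied by the MY_ATOMIC transition stands in for the
 * acquire barrier, so a plain READ_ONCE() suffices. */
static bool my_obj_get_new(struct my_obj *obj)
{
	bool ok = false;

	preempt_disable();
	if (READ_ONCE(obj->state) != MY_ATOMIC) {
		__this_cpu_inc(*obj->users);
		ok = true;
	}
	preempt_enable();
	return ok;
}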
This improvement was merged into the Linux 6.18 Git tree today as the sole change of this week's [2] locking/urgent pull request.
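As for why the generic this_cpu_*() path is costly on PowerPC: the asm-generic fallback has to remain safe against interrupt-context updates, so it brackets every per-CPU read-modify-write with an IRQ save/restore, whereas __this_cpu_*() assumes the caller already prevents migration (here via preempt_disable()) and only touches the counter from task context. A rough, paraphrased sketch of the two patterns follows; the macro names are simplified stand-ins, not the exact kernel definitions:

/* Generic this_cpu_inc() fallback: must stay correct even if an
 * interrupt handler also touches the counter, so the update is wrapped
 * in an IRQ save/restore. On PowerPC that IRQ state swizzling is the
 * expensive part. */
#define my_this_cpu_inc(pcp)				\
do {							\
	unsigned long __flags;				\
	raw_local_irq_save(__flags);			\
	raw_cpu_inc(pcp);				\
	raw_local_irq_restore(__flags);			\
} while (0)

/* __this_cpu_inc(): the caller guarantees task-context-only access and
 * has already disabled preemption, so the plain per-CPU
 * read-modify-write is enough and no IRQ flag manipulation is needed. */
#define my___this_cpu_inc(pcp)	raw_cpu_inc(pcp)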
[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=locking/urgent&id=4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
[2] https://lore.kernel.org/lkml/aQ8_0mAj3AUGgguL@gmail.com/
Since the "IMMUTABLE" flag was dropped from the kernel's FUTEX code for the Linux 6.17 cycle, IBM engineers have noted a performance regression primarily affecting their hardware. Now for Linux 6.18-rc5 that performance regression is at least cut in half.
Intel engineer Peter Zijlstra worked out the partial fix/workaround by optimizing the per-CPU reference counting in the Futex code. Zijlstra explained with the now-merged [1]patch :
"Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely).
Further optimize the per-cpu reference counter by:
- switching from RCU to preempt;
- using __this_cpu_*() since we now have preempt disabled;
- switching from smp_load_acquire() to READ_ONCE().
This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock().
Having preemption disabled allows using __this_cpu_*() provided the only access to the variable is in task context -- which is the case here.
Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier().
This is very similar to the percpu_down_read_internal() fast-path.
The reason this is significant for PowerPC is that it uses the generic this_cpu_*() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu*() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, not having to use explicit barriers safes a bunch.
Combined this reduces the performance gap by half, down to some 5%."
This improvement was merged to the Linux 6.18 Git code today as the sole change of this week's [2]locking/urgent pull request .
[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=locking/urgent&id=4cb5ac2626b5704ed712ac1d46b9d89fdfc12c5d
[2] https://lore.kernel.org/lkml/aQ8_0mAj3AUGgguL@gmail.com/