ARM NEON Accelerated CRC64 Optimization Shows Nearly 6x Improvement
- News link: https://www.phoronix.com/news/ARM-NEON-CRC64-6x
A patch posted today to the Linux kernel mailing list provides an ARM64-optimized CRC64-NVMe implementation that delivers nearly a 6x improvement on modern Arm SoCs.
Open-source developer Demian Shulhan added this NEON-optimized CRC64 implementation, similar to the existing architecture-specific CRC64 implementations for x86_64 and RISC-V. The speed-up is intended to benefit NVMe and other storage devices where CRC64 computation is a bottleneck.
Shulhan explained in the patch that the nearly 6x gain was measured on an Arm Cortex-A72 SoC. He wrote:
"Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR software implementation is slow, which creates a bottleneck in NVMe and other storage subsystems.
The acceleration is implemented using C intrinsics (arm_neon.h) rather than raw assembly for better readability and maintainability.
Key highlights of this implementation:
- Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency spikes on large buffers.
- Pre-calculates and loads fold constants via vld1q_u64() to minimize register spilling.
- Benchmarks show the break-even point against the generic implementation is around 128 bytes. The PMULL path is enabled only for len >= 128.
- Safely falls back to the generic implementation on Big-Endian systems.
Performance results (kunit crc_benchmark on Cortex-A72):
- Generic (len=4096): ~268 MB/s
- PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)"
It's surprising that it took until now for an ARM64/NEON-optimized CRC64 implementation to land in the Linux kernel, especially at just a little more than one hundred lines of code.
The patch [1] is now out for review on the Linux kernel mailing list.
[1] https://lore.kernel.org/all/20260317065425.2684093-1-demyansh@gmail.com/