ARM NEON Accelerated CRC64 Optimization Shows Nearly 6x Improvement
- News link: https://www.phoronix.com/news/ARM-NEON-CRC64-6x
A patch posted today to the Linux kernel mailing list provides an ARM64-optimized CRC64-NVMe implementation that delivers nearly a 6x improvement on modern Arm SoCs.
Open-source developer Demian Shulhan added this NEON-optimized CRC64 implementation, similar to the existing architecture-specific CRC64 implementations for x86_64 and RISC-V. The speed-up is intended to benefit NVMe and other storage devices where CRC64 computation is a bottleneck.
Shulhan explained in the patch that the nearly 6x gain was measured on an Arm Cortex-A72 SoC. He wrote:
"Implement an optimized CRC64 (NVMe) algorithm for ARM64 using NEON Polynomial Multiply Long (PMULL) instructions. The generic shift-and-XOR software implementation is slow, which creates a bottleneck in NVMe and other storage subsystems.
The acceleration is implemented using C intrinsics (arm_neon.h) rather than raw assembly for better readability and maintainability.
Key highlights of this implementation:
- Uses 4KB chunking inside scoped_ksimd() to avoid preemption latency spikes on large buffers.
- Pre-calculates and loads fold constants via vld1q_u64() to minimize register spilling.
- Benchmarks show the break-even point against the generic implementation is around 128 bytes. The PMULL path is enabled only for len >= 128.
- Safely falls back to the generic implementation on Big-Endian systems.
Performance results (kunit crc_benchmark on Cortex-A72):
- Generic (len=4096): ~268 MB/s
- PMULL (len=4096): ~1556 MB/s (nearly 6x improvement)"
It's surprising that it took until now for an ARM64/NEON-optimized CRC64 implementation to land in the Linux kernel, especially at just a little more than one hundred lines of code.
The patch [1] is now out for review on the Linux kernel mailing list.
[1] https://lore.kernel.org/all/20260317065425.2684093-1-demyansh@gmail.com/