Significant CRC32C Throughput Optimization On The Way To The Linux Kernel
Eric Biggers [3] has patches pending to eliminate the jump table and excessive unrolling found within the CRC32C assembly code used on modern Intel/AMD processors. He explains in this patch [4] within his crypto-pending branch:
"crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations full unrolled and uses a jump table to jump into the correct location. This optimization is misguided, as it bloats the binary code size and introduces an indirect call. x86_64 CPUs can predict loops well, so it is fine to just use a loop instead. Loop bookkeeping instructions can compete with the crc instructions for the ALUs, but this is easily mitigated by unrolling the loop by a smaller amount, such as 4 times.
Therefore, re-roll the loop and make related tweaks to the code.
This reduces the binary code size of crc_pcl() from 4546 bytes to 418 bytes, a 91% reduction. In general it also makes the code faster, with some large improvements seen when retpoline is enabled."
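To make the tradeoff concrete, here is a minimal userspace C sketch of the re-rolled approach; this is an illustration, not code from the patch, and the function name crc32c_words is hypothetical. It deliberately omits the 3-way stream interleaving and PCLMULQDQ recombination that the kernel's crc32c-pcl-intel-asm_64.S actually performs, showing only the idea of a plain loop unrolled 4x with a short remainder loop in place of a jump table:

```c
#include <nmmintrin.h>   /* SSE4.2 intrinsics: _mm_crc32_u64 */
#include <stddef.h>
#include <stdint.h>

/*
 * Fold n 64-bit words into a CRC32C value with the crc32q instruction.
 * The main loop is unrolled 4x: one compare-and-branch per four crc32
 * operations, enough to keep loop bookkeeping off the critical path.
 * The 0-3 leftover words fall through to a short remainder loop, so no
 * jump table is needed. Build with e.g. `gcc -O2 -msse4.2`.
 */
static uint32_t crc32c_words(uint32_t crc, const uint64_t *p, size_t n)
{
    uint64_t c = crc;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        c = _mm_crc32_u64(c, p[i + 0]);
        c = _mm_crc32_u64(c, p[i + 1]);
        c = _mm_crc32_u64(c, p[i + 2]);
        c = _mm_crc32_u64(c, p[i + 3]);
    }
    for (; i < n; i++)
        c = _mm_crc32_u64(c, p[i]);

    return (uint32_t)c;
}
```

The loop's backward branch is highly predictable, so it costs almost nothing, whereas the fully-unrolled version paid for an indirect branch through its jump table on every call.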
With retpoline enabled, the default state for Intel and AMD CPUs, there is as much as a 66% throughput boost on Intel Emerald Rapids, while AMD Zen 2 sees as much as a 29% throughput improvement. Retpoline rewrites indirect branches into return trampolines to mitigate Spectre v2, so eliminating the jump table's indirect branch pays off most in that configuration. Some real nice wins.
Hopefully this new code will be buttoned up in time for the upcoming Linux 6.13 kernel cycle, boosting CRC32C kernel crypto performance on modern Intel and AMD processors.
[1] https://www.phoronix.com/news/AES-GCM-Intel-AMD-Linux-6.11
[2] https://www.phoronix.com/news/Linux-6.10-Crypto
[3] https://www.phoronix.com/search/Eric+Biggers
[4] https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/commit/?h=crypto-pending&id=84004e2996a00fdf527d9269fe33c0b254427f1f