Intel's Zswap IAA Compress Batching Work Is Very Interesting For Linux Performance
([Intel], 4 Hours Ago, Zswap IAA Compress Batching)
- Reference: 0001505522
- News link: https://www.phoronix.com/news/Intel-Zswap-IAA-Compress-Batch
The Intel In-Memory Analytics Accelerator (IAA) found in various Xeon SKUs since Sapphire Rapids can be of big benefit to Linux servers and workstations, thanks to a kernel patch series that has been in the works to provide zswap IAA compress batching.
The Intel accelerator blocks found within recent generations of Xeon processors have overall only seen limited/niche use given the initial lack of broad software support around them. With time we've seen more software adoption around IAA and friends, including by the Linux kernel itself. One of the patch series I've been eagerly monitoring has been Intel's work on zswap IAA compress batching to use the Intel Analytics Accelerator for parallel compression of pages in large folios.
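For context, zswap is steered toward a given compressor through its module parameters; a minimal sketch of selecting the IAA-backed compressor might look like the following (the `deflate-iaa` algorithm name comes from the upstream iaa_crypto driver documentation; whether it is actually available depends on the kernel version and on IAA devices being enabled and configured):

```shell
# Select the IAA-backed deflate compressor for zswap and enable zswap
# (assumes the iaa_crypto driver is loaded and IAA work queues are set up).
echo deflate-iaa > /sys/module/zswap/parameters/compressor
echo 1 > /sys/module/zswap/parameters/enabled

# Verify the active settings.
cat /sys/module/zswap/parameters/compressor
cat /sys/module/zswap/parameters/enabled

# The same can be requested at boot via the kernel command line:
#   zswap.enabled=1 zswap.compressor=deflate-iaa
```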
Benchmarks from Intel engineers of this Zswap IAA compress batching have shown extremely promising results for the latest Linux kernel code atop supported IAA-enabled Xeon processors:
Sent out last week were the v3 patches [1] for using the IAA accelerators for parallel compression of pages in large folios. The performance summary there is:
"The performance testing data with usemem 30 processes and kernel compilation test show throughput gains and elapsed/sys time reduction with zswap_store() large folios using IAA compress batching.
The iaa_crypto wq stats will show almost the same number of compress calls for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively. We see a latency reduction of 2.5% by distributing compress jobs among all IAA devices on the socket (based on v1 data).
We can expect to see even more significant performance and throughput improvements if we use the parallelism offered by IAA to batch compress the pages comprising a batch of 4K (really any-order) folios, not just batching within large folios. This is the reclaim batching patch 13 in v1, which will be submitted in a separate patch-series.
Our internal validation of IAA compress/decompress batching in highly contended Sapphire Rapids server setups with workloads running on 72 cores for ~25 minutes under stringent memory limit constraints have shown up to 50% reduction in sys time and 3.5% reduction in workload run time as compared to software compressors."
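The per-wq compress/decompress call counts referenced in the quote are exposed by the iaa_crypto driver through debugfs. Assuming the file names described in the driver's upstream documentation (and a kernel built with the driver's statistics support), inspecting them might look like:

```shell
# Reset the iaa_crypto statistics counters (requires root).
echo 1 > /sys/kernel/debug/iaa-crypto/stats_reset

# ... run a workload that drives zswap reclaim through IAA ...

# Dump per-workqueue stats: per the quoted summary, compress calls
# should land on wq.1 of each IAA device while wq.0 handles
# decompress calls exclusively.
cat /sys/kernel/debug/iaa-crypto/wq_stats
```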
Fascinating work with significant performance benefits, so hopefully it will land in the mainline Linux kernel sooner rather than later to help make for a more compelling IAA experience.
[1] https://lore.kernel.org/lkml/20241106192105.6731-1-kanchana.p.sridhar@intel.com/