Intel Lands A Nice Memset Performance Optimization In Glibc
([Intel] 3 Hours Ago
Glibc Optimization)
- Reference: 0001467952
- News link: https://www.phoronix.com/news/Intel-Glibc-Memset-Perf
Intel engineer Noah Goldstein has landed another nice performance optimization in the GNU C Library (glibc) that benefits newer Intel processors.
The latest optimization from Noah Goldstein in the open-source toolchain improves large memset performance by using non-temporal stores.
The optimization benefits at least Skylake-X and Ice Lake -- the latter on both client and server processors. Goldstein described the memset change, now in Glibc Git, as follows:
"x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Previously we use `rep stosb` for all medium/large memsets. This is notably worse than non-temporal stores for large (above a few MBs) memsets. See [1] for data using different strategies for large memset on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster on SKX. Historically, these numbers would not have been so good because of the zero-over-zero writeback optimization that `rep stosb` is able to do. But, the zero-over-zero writeback optimization has been removed as a potential side-channel attack, so there is no longer any good reason to only rely on `rep stosb` for large memsets. On the flip side, non-temporal writes can avoid data in their RFO requests saving memory bandwidth.
...
The results on the memset-large benchmark suite on TGL-client for N=20 runs:
Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924
So roughly a 7.5% speedup. This is lower than what we see on servers (likely because clients typically have faster single-core bandwidth so saving bandwidth on RFOs is less impactful), but still advantageous."
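The "roughly 7.5%" figure can be checked with a little arithmetic. The sketch below assumes the New / Old numbers are ratios of run time (lower is better); the commit message does not spell this out, and the helper names are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

/* The four New / Old geometric-mean ratios quoted in the commit message
 * (EVEX256, EVEX512, AVX2, SSE2). */
static const double quoted_ratios[4] = { 0.926, 0.925, 0.928, 0.924 };

/* Assumption: ratio = new_time / old_time.
 * Percentage of run time saved by the new code. */
static double time_reduction_pct(double ratio)
{
    return (1.0 - ratio) * 100.0;
}

/* Same assumption, expressed as a throughput gain:
 * old_time / new_time - 1, in percent. */
static double throughput_gain_pct(double ratio)
{
    return (1.0 / ratio - 1.0) * 100.0;
}

/* Plain arithmetic average of n ratios, used here only to summarize
 * the four quoted figures in one number. */
static double average_ratio(const double *r, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += r[i];
    return sum / (double)n;
}
```

Under that assumption the four ratios average to about 0.926, which is a run-time reduction of about 7.4% or, expressed as throughput, a gain of about 8%. Both readings are consistent with the quoted "roughly 7.5%".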
The patch [2] is now in Glibc Git, yet another performance optimization from Intel's software team and their steady open-source tuning contributions across the stack.
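For readers unfamiliar with non-temporal stores, here is a minimal C sketch of the general idea using SSE2 intrinsics. It is not the glibc code, which is hand-written assembly and is not reproduced here. The function name `memset_nt_sketch` and the 4 MiB cutoff `NT_THRESHOLD` are assumptions made for illustration, loosely based on the "above a few MBs" wording in the commit message. It assumes an x86 compiler that provides `<emmintrin.h>`.

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff for illustration only; not glibc's actual threshold. */
#define NT_THRESHOLD ((size_t)4 << 20)

/* Illustrative sketch: fill n bytes at dst with the byte value c.
 * Small sizes go to the regular memset; large sizes use streaming
 * (non-temporal) stores, which write around the cache hierarchy. */
static void *memset_nt_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;

    if (n < NT_THRESHOLD)
        return memset(dst, c, n);

    /* Head: ordinary stores until p is 16-byte aligned, because
     * _mm_stream_si128 requires an aligned destination. */
    size_t head = (size_t)(-(uintptr_t)p & 15);
    memset(p, c, head);
    p += head;
    n -= head;

    /* Body: one 16-byte non-temporal store per iteration. */
    __m128i v = _mm_set1_epi8((char)c);
    size_t blocks = n / 16;
    for (size_t i = 0; i < blocks; i++)
        _mm_stream_si128((__m128i *)(p + 16 * i), v);

    /* Streaming stores are weakly ordered; the fence orders them
     * before any stores that follow. */
    _mm_sfence();

    /* Tail: the remaining 0..15 bytes with ordinary stores. */
    memset(p + 16 * blocks, c, n % 16);
    return dst;
}
```

Calling `memset_nt_sketch(buf, 0, len)` behaves like `memset(buf, 0, len)` from the caller's point of view. The only difference is how the bytes reach memory once `len` crosses the assumed cutoff.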
[1] https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5d94b094659094336da90f