Intel Lands A Nice Memset Performance Optimization In Glibc

([Intel] 3 Hours Ago Glibc Optimization)


Intel engineer Noah Goldstein has landed another nice performance optimization in the GNU C Library (glibc) that benefits newer Intel processors.

His latest toolchain optimization improves large memset performance by using non-temporal stores.

The optimization benefits at least Skylake-X and Ice Lake, the latter on both client and server processors. Goldstein explained the memset change, now in Glibc Git:

"x86: Improve large memset perf with non-temporal stores [RHEL-29312]

Previously we use `rep stosb` for all medium/large memsets. This is notably worse than non-temporal stores for large (above a few MBs) memsets. See [1]here for data using different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster on SKX. Historically, these numbers would not have been so good because of the zero-over-zero writeback optimization that `rep stosb` is able to do. But, the zero-over-zero writeback optimization has been removed as a potential side-channel attack, so there is no longer any good reason to only rely on `rep stosb` for large memsets. On the flip side, non-temporal writes can avoid data in their RFO requests saving memory bandwidth.

...

The results on the memset-large benchmark suite on TGL-client for N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.926

Geometric Mean across the suite New / Old EVEX512: 0.925

Geometric Mean across the suite New / Old AVX2 : 0.928

Geometric Mean across the suite New / Old SSE2 : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers (likely because clients typically have faster single-core bandwidth so saving bandwidth on RFOs is less impactful), but still advantageous."
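
For readers unfamiliar with the technique, the idea behind the commit message can be roughly sketched in C with SSE2 intrinsics. This is only an illustration, not glibc's implementation, which is hand-written assembly with its own tuned thresholds. The function name `memset_nt_sketch` and the 4 MiB cutoff `NT_THRESHOLD_BYTES` are hypothetical and chosen only for this example. On builds without SSE2 the sketch simply falls back to plain `memset`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical cutoff for this sketch only; glibc computes its own threshold. */
#define NT_THRESHOLD_BYTES ((size_t)4 << 20)

/* Hypothetical helper: fill n bytes at dst with c, using non-temporal
 * (streaming) stores when the region is large. */
static void *memset_nt_sketch(void *dst, int c, size_t n)
{
    assert(dst != NULL);
#if defined(__SSE2__)
    if (n >= NT_THRESHOLD_BYTES) {
        unsigned char *p = dst;

        /* _mm_stream_si128 needs a 16-byte aligned address, so fill the
         * unaligned head (0 to 15 bytes) with ordinary stores first. */
        size_t head = (size_t)(-(uintptr_t)p & 15u);
        memset(p, c, head);
        p += head;
        n -= head;

        /* Broadcast the fill byte to all 16 lanes, then stream it out
         * 16 bytes at a time without pulling the lines into cache. */
        __m128i v = _mm_set1_epi8((char)c);
        for (; n >= 16; p += 16, n -= 16)
            _mm_stream_si128((__m128i *)p, v);

        /* Streaming stores are weakly ordered; fence before returning. */
        _mm_sfence();

        /* Remaining tail (0 to 15 bytes) with ordinary stores. */
        memset(p, c, n);
        return dst;
    }
#endif
    /* Small and medium sizes, or no SSE2: defer to the C library. */
    return memset(dst, c, n);
}
```

A call such as `memset_nt_sketch(buf, 0, 64u << 20)` would take the streaming path, while a 1 KiB fill would go straight to `memset`. Where the crossover actually pays off depends on the microarchitecture and cache sizes, which is why the cutoff here should be read as a placeholder.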

[2]The patch is now in Glibc Git as yet another nice performance optimization thanks to Intel's software team and their relentless open-source tuning contributions across the stack.



[1] https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing

[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=5bf0ab80573d66e4ae5d94b094659094336da90f


