More AMD Zen 5 Tuning/Optimizations Merged For The GCC 15 Compiler
([AMD] 91 Minutes Ago
Znver5 Optimizations & Tuning)
- Reference: 0001489710
- News link: https://www.phoronix.com/news/GCC-15-Lands-More-Zen-5-Tuning
Following yesterday's [1]initial tuning of the "znver5" target for the AMD Zen 5 CPUs with the GCC 15 compiler, several more rounds of compiler tuning/optimizations were merged to benefit the Ryzen AI 300 series, Ryzen 9000 series desktops, and upcoming EPYC Turin processors.
Following the first two Zen 5 tuning patches merged to GCC Git yesterday, SUSE compiler engineer Jan Hubicka has continued with several more patches further enhancing the Znver5 compiler optimizations.
Part 3 landed and provided [2]scheduler tweaks for the AMD Zen 5 target. Hubicka explained there:
"This patch adds support for new [fusion] in znver5 documented in the optimization manual:
The Zen5 microarchitecture adds support to fuse reg-reg MOV Instructions with certain ALU instructions. The following conditions need to be met for fusion to happen:
- The MOV should be reg-reg mov with Opcode 0x89 or 0x8B
- The MOV is followed by an ALU instruction where the MOV and ALU destination register match.
- The ALU instruction may source only registers or immediate data. There cannot be any memory source.
- The ALU instruction sources either the source or dest of MOV instruction.
- If ALU instruction has 2 reg sources, they should be different.
- The following ALU instructions can fuse with an older qualified MOV instruction: ADD ADC AND XOR OP SUB SBB INC DEC NOT SAL / SHL SHR SAR (I assume OP is OR)
I also increased issue rate from 4 to 6. Theoretically znver5 can do more, but with our model we can't realy use it. Increasing issue rate to 8 leads to infinite loop in scheduler.
Finally, I also enabled fuse_alu_and_branch since it is supported by znver5 (I think by earlier zens too)."
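To make the quoted fusion conditions easier to follow, here is a minimal sketch in C that encodes them as a predicate. This is not GCC's scheduler code: the `mov_insn` and `alu_insn` structs, the `NO_REG` marker, and the function names are hypothetical and exist only for illustration. `OP_IMUL` is included purely as an example of an operation that is not on the quoted list, and "OP" is read as OR, as the commit message assumes.

```c
#include <stdbool.h>

#define NO_REG (-1)  /* hypothetical marker for an unused register source */

/* Operations on the quoted list, plus OP_IMUL as a non-fusing example. */
enum alu_op { OP_ADD, OP_ADC, OP_AND, OP_XOR, OP_OR, OP_SUB, OP_SBB,
              OP_INC, OP_DEC, OP_NOT, OP_SHL, OP_SHR, OP_SAR, OP_IMUL };

/* Hypothetical, simplified instruction descriptions (not GCC data structures). */
struct mov_insn {
    int dest;         /* destination register number (>= 0) */
    int src;          /* source register number (>= 0) */
    bool reg_to_reg;  /* true for the register forms of opcode 0x89 / 0x8B */
};

struct alu_insn {
    enum alu_op op;
    int dest;             /* destination register number */
    int src[2];           /* register sources, NO_REG if unused or immediate */
    bool has_mem_source;  /* true if any source operand is in memory */
};

static bool fusible_op(enum alu_op op)
{
    return op != OP_IMUL;  /* every other enumerator is on the quoted list */
}

/* Returns true when the MOV followed by the ALU op meets the quoted conditions. */
bool can_fuse_mov_alu(const struct mov_insn *mov, const struct alu_insn *alu)
{
    if (!mov->reg_to_reg || alu->has_mem_source || !fusible_op(alu->op))
        return false;
    if (alu->dest != mov->dest)          /* destinations must match */
        return false;

    bool reads_mov_reg = false;          /* must read the MOV's source or dest */
    for (int i = 0; i < 2; i++)
        if (alu->src[i] == mov->src || alu->src[i] == mov->dest)
            reads_mov_reg = true;
    if (!reads_mov_reg)
        return false;

    /* With two register sources, they have to be different registers. */
    if (alu->src[0] != NO_REG && alu->src[1] != NO_REG &&
        alu->src[0] == alu->src[1])
        return false;
    return true;
}
```

With register 0 standing in for rax, 1 for rbx and 2 for rcx, a `mov rax, rbx` followed by `add rax, rcx` passes the check, while `add rax, rax` is rejected because both register sources are the same.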
That was followed by [3]updating the reassociation width:
"Zen5 has 6 instead of 4 ALUs and the integer multiplication can now execute in 3 of them. FP units can do 2 additions and 2 multiplications with latency 2 and 3. This patch updates reassociation width accordingly. This has potential of increasing register pressure but unlike while benchmarking znver1 tuning I did not noticed this actually causing problem on spec, so this patch bumps up reassociation width to 6 for everything except for integer vectors, where there are 4 units with typical latency of 1."
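As a rough illustration of what reassociation is about, the sketch below writes the same eight-term sum in two shapes: a left-to-right chain where every add depends on the previous one, and a balanced tree where several adds are independent and could issue to different ALUs in the same cycle. The function names are hypothetical. The compiler is free to transform either form itself, so this shows the dependency shapes rather than guaranteed code generation. Unsigned integer addition is used because it is exactly associative; for floating point, GCC only reassociates when options such as -ffast-math permit it.

```c
#include <stdint.h>

/* Serial chain: seven adds, each waiting on the result of the one before. */
uint32_t sum8_chain(const uint32_t v[8])
{
    return ((((((v[0] + v[1]) + v[2]) + v[3]) + v[4]) + v[5]) + v[6]) + v[7];
}

/* Balanced tree: four independent adds, then two, then one (depth of three). */
uint32_t sum8_tree(const uint32_t v[8])
{
    uint32_t s01 = v[0] + v[1];
    uint32_t s23 = v[2] + v[3];
    uint32_t s45 = v[4] + v[5];
    uint32_t s67 = v[6] + v[7];
    return (s01 + s23) + (s45 + s67);
}
```

Both functions return the same value for any input, including when the sum wraps around. A wider reassociation width lets the compiler favor the tree-like shape at the cost of keeping more temporaries live, which is the register pressure trade-off mentioned in the commit message.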
As of writing, the fifth and latest part of the Zen 5 tuning for the GCC compiler is [4]updating the instruction latencies for Zen 5 processors:
"There is nothing exciting in this patch. I measured latencies and also compared them with newly released optimization guide. There are no dramatic changes compared to zen4. One interesting new bit is that addss is faster and can be 2 cycles when fed by another addss.
I also increased the large insn bound since decoders seems no longer require instructions to be 8 bytes or less."
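The "addss fed by another addss" case is what a plain scalar float reduction looks like. The sketch below is illustrative only: `sum_chain` and `chain_latency_bound` are hypothetical names, and the bound is a simple back-of-the-envelope model (number of dependent adds times the add latency) that ignores loads, loop overhead, and everything else the core overlaps. The 2-cycle figure comes from the commit message; any other latency passed in for comparison is the caller's assumption.

```c
#include <stddef.h>

/* Scalar float sum. Without options like -ffast-math the additions must stay
 * in order, so each add consumes the result of the previous one. */
float sum_chain(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Rough lower bound on cycles for a chain of dependent adds. */
unsigned long chain_latency_bound(size_t n_adds, unsigned latency_cycles)
{
    return (unsigned long)n_adds * latency_cycles;
}
```

Under this model a chain of 1000 dependent adds is bounded by about 2000 cycles at a 2-cycle latency, so a lower add latency pays off directly in latency-bound loops of this kind.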
That's it so far for this round of AMD Zen 5 tuning for the GNU Compiler Collection thanks to SUSE engineering.
These patches are all in GCC Git for the GCC 15.1 compiler, due for release around March or April 2025. They might also be picked up for the next GCC 14 point release in the coming months so they reach a stable compiler version sooner. In an ideal world this tuning would have all happened pre-launch, given GCC's annual release cycle and the time it typically takes Linux distributions to adopt new GCC releases.
Once the Znver5 optimizations settle down I'll run a fresh round of GCC compiler benchmarks to look at the performance impact on current Ryzen 9000 series desktop processors.
[1] https://www.phoronix.com/news/AMD-Zen-5-Tuning-Part-2-GCC
[2] https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=e2125a600552bc6e0329e3f1224eea14804db8d3
[3] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=f0ab3de6ec0e3540f2e57f3f5628005f0a4e3fa5
[4] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=4292297a0f938ffc953422fa246ff00fe345fe3d