Glibc Lands A Big Optimization For LoongArch CPUs
([Hardware] 3 Hours Ago, Faster LoongArch)
- Reference: 0001627695
- News link: https://www.phoronix.com/news/Faster-LoongArch-glibc-THP
- Source link:
Loongson's LoongArch processors performed decently in our recent [1]Loongson 3B6000 benchmarks, but even better performance is on the way with the next GNU C Library (glibc) release.
Merged yesterday to Glibc Git is a LoongArch-specific change that enables transparent hugepage (THP) aligned load segments by default on LoongArch64. Aligning ELF load segments to THP boundaries provides a consistent performance win for large binaries by reducing translation lookaside buffer (TLB) pressure and improving instruction fetch efficiency.
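To illustrate why the alignment matters, here is a minimal Python sketch of the arithmetic. It is not glibc's code: the helper names are hypothetical, and the 2 MiB huge page size is an assumption for illustration only. The real size depends on the kernel's base page size and can be read from /sys/kernel/mm/transparent_hugepage/hpage_pmd_size.

```python
HUGE_PAGE = 2 * 1024 * 1024  # assumed huge page size, for illustration only


def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (addr + alignment - 1) & ~(alignment - 1)


def huge_pages_inside(start: int, size: int, huge_page: int = HUGE_PAGE) -> int:
    """Count whole huge pages lying fully inside [start, start + size)."""
    first = align_up(start, huge_page)
    end = start + size
    if end <= first:
        return 0
    return (end - first) // huge_page


# A hypothetical 5 MiB code segment mapped at two different addresses.
segment_size = 5 * 1024 * 1024
print(huge_pages_inside(0x10000, segment_size))   # unaligned start: 1
print(huge_pages_inside(0x200000, segment_size))  # aligned start: 2
```

Under these assumptions the same segment can be backed by two huge pages instead of one once its start is aligned, so fewer TLB entries are needed to cover it.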
Benchmarks for compiling Rust's Cargo on a Loongson 3A6000 show instruction TLB misses dropping by 72%, CPU cycles falling by about 4.7%, and wall time savings of around 4.2%. Compiling the Linux kernel with LLVM yielded a wall time reduction of about 12%. It's quite a big performance win from [2]this patch to enable THP-aligned load segments by default for LoongArch.
That patch is part of a series that also introduced the [3]glibc.elf.thp tunable for THP-aware segment alignment and the new [4]alignment code.
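One way to check whether an installed glibc exposes that tunable is to ask the dynamic loader for its tunable list. The sketch below is a rough illustration rather than an official procedure: the function names are hypothetical, the loader path in the usage comment is an assumption for a LoongArch64 system, and glibc.elf.thp will only appear on builds that include this patch series.

```python
import subprocess
from typing import Optional


def list_tunables(loader: str) -> str:
    """Run the dynamic loader with --list-tunables and return its stdout."""
    result = subprocess.run(
        [loader, "--list-tunables"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_tunable(listing: str, name: str) -> Optional[str]:
    """Return the text following 'name:' in the listing, or None if absent."""
    prefix = name + ":"
    for line in listing.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


# Usage (loader path is an assumption; adjust it for your system):
#   listing = list_tunables("/lib64/ld-linux-loongarch-lp64d.so.1")
#   print(find_tunable(listing, "glibc.elf.thp"))
```

A result of None simply means the running glibc does not list the tunable.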
Will be fun to benchmark these LoongArch improvements soon to see how much better the 3B6000 is looking across a range of workloads.
[1] https://www.phoronix.com/review/loongson-3b6000-loongarch
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=ef044cc6d79c5646b521569417da890154ea1813
[3] https://sourceware.org/git/?p=glibc.git;a=commit;h=f9933bf832d4ea4bb1b21db435e61324c05b7b10
[4] https://sourceware.org/git/?p=glibc.git;a=commit;h=2f9fc3fba6562c888204159c09d654cb3499e38b