Amazon primed to fuse Nvidia's NVLink into 4th-gen Trainium accelerators
(2025/12/02)
- News link: https://www.theregister.co.uk/2025/12/02/amazon_nvidia_trainium/
Re:Invent: Amazon says that its next generation of homegrown silicon will deliver up to 6x higher performance, thanks to a little help from its buddy Nvidia.
At its Re:Invent convention in Las Vegas on Tuesday, Amazon Web Services (AWS) teased its Trainium4 accelerators, which will be among the first to embrace Nvidia's NVLink Fusion interconnect tech for chip-to-chip communications.
NVLink is a high-speed interconnect that allows multiple GPUs spanning multiple systems to pool resources and behave like a single accelerator. Previously, the technology was limited to Nvidia's own CPUs and GPUs, but back in May the AI infrastructure giant announced it was opening the tech to others with the [1]introduction of NVLink Fusion at Computex.
Amazon claims that the technology will allow its Trainium4 accelerators, Graviton CPUs, and EFA networking tech to communicate seamlessly across Nvidia's MGX racks.
In its current form, Nvidia's fifth-gen NVLink fabrics support up to 1.8 TB/s of bandwidth (900 GB/s in each direction) per GPU, but the company is on track to double that to 3.6 TB/s by next year.
Beyond Nvidia's interconnect tech, details are somewhat vague. We're told that the new chips will deliver 3x more FLOPS at FP8, 6x the performance at FP4, and 4x the memory bandwidth. Whether those claims pertain to the individual chips or its UltraServer rack systems, Amazon hasn't said.
Assuming it's the rack systems, as was the case with Trainium3, that suggests AWS's Trainium4 UltraServers could deliver upwards of 2 exaFLOPS of dense FP4 performance and 2.8 petabytes a second of memory bandwidth.
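For the curious, here's the back-of-the-envelope math behind those figures as a minimal Python sketch. It assumes the advertised multipliers apply rack-to-rack against the Trainium3 UltraServer figures quoted later in this piece; Amazon hasn't confirmed that baseline.

```python
# Back-of-the-envelope Trainium4 UltraServer projection, assuming the
# advertised multipliers apply rack-to-rack against the Trainium3
# UltraServer figures cited later in this piece.
t3_dense_fp8_pflops = 363   # Trainium3 UltraServer, dense FP8
t3_mem_bw_tbps = 706        # Trainium3 UltraServer memory bandwidth

fp4_pflops = t3_dense_fp8_pflops * 6   # "6x the performance at FP4"
mem_bw_tbps = t3_mem_bw_tbps * 4       # "4x the memory bandwidth"

print(f"Projected FP4 compute:      ~{fp4_pflops / 1000:.1f} exaFLOPS")  # ~2.2
print(f"Projected memory bandwidth: ~{mem_bw_tbps / 1000:.1f} PB/s")     # ~2.8
```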
That latter point is likely to be a major boon for bandwidth-bound inference workloads. Despite the rather confusing naming convention, AWS actually employs Trainium for inference as well as training, both internally and for customers.
Of course, the devil is in the details and we simply don't have all of them yet. Amazon made [6]similar claims about its Trainium3 UltraServers this time last year, boasting a 4.4x uplift in compute over its Trainium2 racks. But while technically true, what we didn't know at the time was that roughly half of that gain would come from more than doubling the number of chips per rack, from 64 to 144.
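To see how that gain splits, here's the arithmetic as a quick Python sketch; it assumes, as the original claim did, a rack-to-rack comparison:

```python
# Splitting Trainium3's claimed 4.4x rack-level uplift into its parts,
# assuming (as the claim did) a rack-to-rack comparison.
rack_uplift = 4.4
chips_t2, chips_t3 = 64, 144

from_scale_out = chips_t3 / chips_t2          # 2.25x from adding chips
from_silicon = rack_uplift / from_scale_out   # ~1.96x from the chip itself

print(f"From chip count:     {from_scale_out:.2f}x")
print(f"From faster silicon: {from_silicon:.2f}x")
```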
Trainium3 arrives on EC2
Speaking of Trainium3, a year after first teasing the chips, Amazon is finally ready to bring its third generation of Trainium accelerators to the general market.
According to AWS, each chip is equipped with 144 GB of HBM3E memory, good for 4.9 TB/s of memory bandwidth, and is capable of churning out just over 2.5 petaFLOPS of dense FP8 performance.
However, for jobs that benefit from sparsity, like training, the chips are even more potent. Trainium3 features 16:4 structured sparsity, which effectively quadruples the chip's performance to 10 petaFLOPS for supported workloads.
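To make that concrete, here's a minimal NumPy sketch of a 4-out-of-16 structured-sparsity pattern (at most four nonzero weights in every block of 16, the reading that yields the 4x speedup). The magnitude-based pruning rule shown is a common convention, not something Amazon has disclosed:

```python
import numpy as np

def prune_4_of_16(weights: np.ndarray) -> np.ndarray:
    """Zero all but the 4 largest-magnitude values in each group of 16.

    Illustrates the pruning pattern behind 4-out-of-16 structured
    sparsity; real hardware also stores each block in compressed form
    so the zeros are never actually multiplied.
    """
    w = weights.reshape(-1, 16).copy()
    # Indices of the 12 smallest-magnitude entries per group of 16.
    drop = np.argsort(np.abs(w), axis=1)[:, :12]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(2, 32).astype(np.float32)
sparse_w = prune_4_of_16(w)
# Only 4 of every 16 weights survive, so hardware can skip 12/16 of
# the math: a 4x effective speedup on supported workloads.
assert np.count_nonzero(sparse_w) == w.size // 4
print(f"Dense 2.5 PFLOPS x 4 = {2.5 * 4:.0f} PFLOPS effective")
```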
Amazon's Trainium3 UltraServers cram in 144 of these chips, connected in an all-to-all fabric using its NeuronSwitch-v1 interconnect tech, which Amazon says offers twice the chip-to-chip bandwidth of its previous interconnect.
This is a marked change from Amazon's Trainium2 UltraServers, which featured 64 accelerators arranged in a 4x4x4 3D torus topology.
Amazon declined to comment on how the 144 Trainium3 accelerators are connected to one another, but if we had to guess, it likely resembles the flat switched topology used in Nvidia's NVL72 or AMD's Helios rack systems.
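If that guess is right, the practical difference is easy to illustrate: in a torus, traffic may have to traverse several intermediate chips, while a switched fabric reaches any peer in two hops. A minimal Python sketch (the topology parameters are ours, not Amazon's):

```python
# Worst-case hop counts: Trainium2's 4x4x4 3D torus versus a flat
# switched fabric like the one we suspect Trainium3 uses. In a torus,
# each hop moves one step (with wraparound) along one axis; through a
# switch, any chip reaches any other in two hops (chip -> switch -> chip).

def torus_hops(a: tuple, b: tuple, dims=(4, 4, 4)) -> int:
    """Shortest path between two nodes of a 3D torus, in hops."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

# Farthest apart two chips can be in a 4x4x4 torus: 2 hops per axis.
print(torus_hops((0, 0, 0), (2, 2, 2)))  # 6 hops, worst case
print("Flat switched fabric: 2 hops, regardless of placement")
```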
Such a move should ease the transition to NVLink Fusion in the next generation, but it leaves Google as one of the few chip designers still using mesh topologies in large-scale AI training and inference clusters.
In any case, Amazon seems confident that its new interconnect tech and EFA networking will enable it to support production deployments containing up to a million accelerators, compared to the 500,000 Trainium2 chips found in [12]Project Rainier.
Combined, each Trainium3 UltraServer features 20.7 TB of HBM3E, 706 TB/s of memory bandwidth, and between 363 and 1,452 petaFLOPS of FP8 compute, depending on whether your workload actually benefits from sparsity.
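Those rack-level figures follow directly from the per-chip specs above; a quick sanity check in Python (the 2.52 petaFLOPS per-chip figure is our inference from "just over 2.5"):

```python
# Sanity-checking the rack-level figures against the per-chip specs.
chips = 144
hbm_gb, bw_tbps, dense_pflops = 144, 4.9, 2.52  # "just over 2.5" per chip

print(f"HBM3E:     {chips * hbm_gb / 1000:.1f} TB")         # ~20.7 TB
print(f"Bandwidth: {chips * bw_tbps:.0f} TB/s")             # ~706 TB/s
print(f"Dense FP8: {chips * dense_pflops:.0f} PFLOPS")      # ~363 PFLOPS
print(f"Sparse:    {chips * dense_pflops * 4:.0f} PFLOPS")  # ~1,452 PFLOPS
```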
This puts the systems roughly on par with Nvidia's latest Blackwell Ultra-based GB300 NVL72 systems – at least at FP8. At FP4, the gap widens considerably, with the Nvidia system delivering more than 3x the performance.
With that said, FP4 is still primarily used in inference, while higher-precision datatypes like BF16 and FP8 are preferred for training.
Despite Trainium's advancements in performance, some customers still aren't ready to abandon Nvidia just yet. Because of this, Amazon has also announced the availability of new compute offerings based on Nvidia's GB300 NVL72, which join the company's existing GB200 instances. ®
[1] https://www.theregister.com/2025/05/19/nvidia_nvlink_fusion/
[6] https://www.theregister.com/2024/12/03/amazon_ai_chip/
[12] https://www.theregister.com/2025/07/04/project_rainier_deep_dive/