Every conference is an AI conference as Nvidia unpacks its Vera Rubin CPUs and GPUs at CES
(2026/01/05)
- Reference: 1767652764
- News link: https://www.theregister.co.uk/2026/01/05/ces_rubin_nvidia/
- Source link:
CES used to be all about consumer electronics, TVs, smartphones, tablets, PCs, and – over the last few years – automobiles. Now, it's just another opportunity for Nvidia to peddle its AI hardware and software — in particular its next-gen Vera Rubin architecture.
The AI arms dealer boasts that, compared to Blackwell, the chips will deliver up to 5x higher floating point performance for inference, 3.5x for training, along with 2.8x more memory bandwidth and an NvLink interconnect that's now twice as fast.
But don't get too excited just yet. It's not like the chips are launching earlier than previously expected. They're still expected to arrive in the second half of the year, just like Blackwell and Blackwell Ultra did.
Nvidia normally holds off until GTC in March to reveal its next-gen chips. Perhaps AMD's aggressive rack-scale roadmap has Nvidia CEO Jensen Huang nervous. Announced at [2]Advancing AI late last spring and expected later this year, AMD's double-wide Helios racks promise to deliver performance on par with Vera Rubin NVL72 while offering customers 50 percent more HBM4.
Nvidia has also been [5]teasing the Vera Rubin platform for nearly a year now, to the point where there's not much we didn't already know about it.
But even though you won't be able to get your hands on Rubin for a few more months, it's never too early for a closer look at what the multi-million dollar machines will buy you.
NVL72 refined
The flagship system for Nvidia's Vera Rubin CPU and GPU architectures is once again an NVL72 rack. At first blush, the machine doesn't look all that different from its Blackwell and Blackwell Ultra-based siblings. But under the hood, Nvidia has been hard at work refining the architecture for better serviceability and telemetry.
Switch trays can now be serviced without taking down the machine first. Nvidia also has new reliability, availability, and serviceability features, which enable customers to check in on the health of the GPUs without dropping them from the cluster first. These health checks can now run between training checkpoints or jobs, Ian Buck, Nvidia's VP and General Manager of Hyperscale and HPC, tells El Reg.
At the heart of the rack is the Vera Rubin superchip, which, if history tells us anything, should bear the VR200 code name.
Much like Blackwell, the Vera Rubin superchip features two dual-die Rubin GPUs, each capable of churning out 50 petaFLOPS of inference performance or 35 petaFLOPS for training. Both of those numbers refer to peak performance achievable when using the NVFP4 data type.
[7]
Here's a quick overview of Rubin's speeds and feeds
According to Buck, for this generation Nvidia is achieving the 50 petaFLOPS claim using a new adaptive compression technique better suited to generative AI and mixture of experts (MoE) model inference, rather than structured sparsity. As you may recall, while structured sparsity did have benefits for certain workloads, it offered few, if any, advantages for LLM inference.
We've asked Nvidia about higher precision data types, like FP8 and BF16, which remain relevant for vision language model inference, image generation, fine-tuning, and training workloads; we'll let you know if we hear back.
The GPUs are fed by 288 GB of HBM4 memory — 576 GB per superchip — which, despite delivering the same capacity as the Blackwell Ultra-based GB300, is now 2.8x faster at 22 TB/s per socket (44 TB/s per superchip). If that number seems a little high, that's because Nvidia initially targeted 13 TB/s of HBM4 bandwidth when it first teased Rubin last year. Buck tells us that the jump to 22 TB/s was attained entirely through silicon and doesn't rely on techniques like memory compression.
[8]
Nvidia's latest Arm CPU features 88 custom Olympus cores with SMT
The two Rubin GPUs are paired to Nvidia's new Vera CPU via a 1.8 TB/s NvLink-C2C interconnect. The CPU contains 88 of Nvidia's custom Arm-based Olympus cores and is paired with 1.5 TB of LPDDR5x memory — 3x that of the GB200. We guess we know why memory is in such short supply these days. Actually it's more complicated than that, but this certainly isn't helping the situation.
However, one of the most important features Vera brings to the table is support for confidential computing across the system's NvLink domain, something that previously was only available on x86-based HGX systems.
Nvidia's Vera Rubin NVL72 racks feature 72 Rubin GPUs, 20.7 TB of HBM4, 36 Vera CPUs, and 54 TB of LPDDR5x, spread across 18 compute blades interconnected by nine NvSwitch 6 blades that deliver 3.6 TB/s of bandwidth to each GPU — twice that of last gen.
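Those rack-level totals follow directly from the per-socket figures quoted above; here's a quick back-of-the-envelope check (our arithmetic, not Nvidia's):

```python
# Sanity check of the Vera Rubin NVL72 rack totals, derived from the
# per-socket figures quoted in the article.
GPUS_PER_RACK = 72
HBM4_PER_GPU_TB = 0.288        # 288 GB of HBM4 per Rubin GPU package
CPUS_PER_RACK = 36
LPDDR_PER_CPU_TB = 1.5         # 1.5 TB of LPDDR5x per Vera CPU
HBM_BW_PER_GPU_TBS = 22        # 22 TB/s of HBM4 bandwidth per socket

hbm_total = GPUS_PER_RACK * HBM4_PER_GPU_TB      # ~20.7 TB of HBM4
lpddr_total = CPUS_PER_RACK * LPDDR_PER_CPU_TB   # 54 TB of LPDDR5x
agg_bw = GPUS_PER_RACK * HBM_BW_PER_GPU_TBS      # 1,584 TB/s aggregate

print(f"HBM4: {hbm_total:.1f} TB, LPDDR5x: {lpddr_total:.0f} TB, "
      f"aggregate HBM bandwidth: {agg_bw} TB/s")
```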
Nvidia isn't ready to say how much power that additional compute and bandwidth will require. However, Buck tells us that while it will be higher, we shouldn't expect power to double.
Wasn't it supposed to be 144?
[10]
Since announcing Rubin, Nvidia has decided not to use the NVL144 naming convention and to stick to counting SXM modules as GPUs rather than dies. (Image from GTC 2025)
If you're scratching your head wondering "didn't Nvidia say this thing was supposed to have 144 GPUs?" you wouldn't be the only one. At GTC 2025, Huang announced that Nvidia was changing the way it counted GPUs, from packages to the dies on board. In that sense, the Blackwell-based NVL72s also had 144 GPUs, but Nvidia was going to wait for Vera Rubin to make the switch to the new convention.
It seems Nvidia has since changed its mind and is sticking with the established naming convention. Having said that, we may yet see Nvidia racks with at least 144 GPUs on board before long.
Then there's CPX
The Rubin GPUs we've talked about up to this point are actually one of two accelerators announced so far. Rubin CPX is [11]the other.
Unveiled in September, the chip is a more niche product, designed specifically to accelerate the compute-intense prefill phase of LLM inference. Since prefill isn't bandwidth-bound, CPX doesn't need HBM and can instead make do with slower DRAM.
Each CPX accelerator will be capable of churning out 30 petaFLOPS of NVFP4 compute and will sport 128 GB of GDDR7 memory.
[12]
Nvidia's Vera Rubin NVL144 CPX compute trays will pack 12 GPUs: four with HBM and another eight context-optimized ones using GDDR7
In a graphic shared this summer, Nvidia showed an NVL144 CPX blade with four 288 GB Rubin SXM modules and eight Rubin CPX prefill accelerators for a total of 12 GPUs per node.
The complete rack system would only need 12 compute blades for the thing to have 144 GPUs, though only 48 of them would be connected via NVLink.
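The GPU-count arithmetic here is straightforward, based on the blade configuration Nvidia showed:

```python
# How the NVL144 CPX GPU count breaks down, per the blade configuration
# in Nvidia's graphic: four HBM-equipped Rubin SXM modules plus eight
# GDDR7-based CPX prefill accelerators per compute blade.
BLADES = 12
SXM_PER_BLADE = 4    # NVLink-connected Rubin GPUs
CPX_PER_BLADE = 8    # context-optimized prefill accelerators

total_gpus = BLADES * (SXM_PER_BLADE + CPX_PER_BLADE)
nvlink_gpus = BLADES * SXM_PER_BLADE

print(total_gpus, nvlink_gpus)  # 144 48
```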
Stitching together a SuperPOD
As with past Nvidia rack systems, eight NVL72 racks form a SuperPOD, with the GPU slinger's Spectrum-X Ethernet and/or Quantum-X InfiniBand serving as the glue used to stitch them together. Multiple SuperPODs can then be combined to form larger compute environments for training or distributed inference.
If you aren't ready to make the switch to Nvidia's rack-scale kit, don't worry. Eight-way (NVL8) HGX systems based around the Rubin platform are still available, but we're told liquid cooling is no longer a suggestion, but a requirement. Sixty-four of these smaller systems can also be combined to form a SuperPOD with 512 GPUs — just shy of the more powerful NVL72 SuperPOD's 576.
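The SuperPOD GPU counts above are simple multiplication, if you want to sanity-check them:

```python
# GPU counts for the two SuperPOD configurations described above.
nvl72_pod = 8 * 72   # eight NVL72 racks of 72 GPUs each
hgx_pod = 64 * 8     # sixty-four eight-way NVL8 HGX systems

print(nvl72_pod, hgx_pod)  # 576 512
```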
For this generation, Nvidia also has two new NICs, which it teased on a few occasions over the last year. At [13]GTC DC , Nvidia showed off the ConnectX-9, a 1.6 Tbps "superNIC" designed for high-speed distributed computing, which we sometimes call the backend network.
[14]
Here's a closer look at Nvidia's 1.6 Tbps ConnectX-9 superNIC
For storage, management, and security, Nvidia is pushing its BlueField-4 data processing units (DPUs), which feature an integrated 800 Gbps ConnectX-9 NIC and a 64-core Grace CPU on board. This, we should note, isn't the same Grace CPU found in the GB200, but a newer version based on Arm's Neoverse V3 core architecture.
[15]
Nvidia's BlueField-4 DPUs feature an 800 Gbps ConnectX-9 NIC along with a 64-core Grace CPU for software-defined networking, security, and storage offload
The beefier CPU is designed to offload software-defined networking, storage, and security functions, and can also run hypervisors for virtualized environments.
Playing Whac-A-Mole with inference bottlenecks
Cramming 64 Grace cores onto a NIC might seem like overkill, but Nvidia has a specific reason for wanting that much compute hanging off the machine like a computer in front of a computer.
Alongside all its shiny new hardware, Nvidia showed off what it's describing as a "new class of memory between the GPU and storage," designed to offload key value (KV) caches.
The basic idea isn't new. KV caches store the model's state; you can think of this as its short-term memory. Calculating the key-value vectors is one of the more compute-intensive aspects of inference.
Because inference workloads often involve passing over the same info multiple times, it makes sense to cache the computed vectors in memory. By doing this, only changes need to be computed and data in the cache can be reused. This sounds simple, but, in practice, KV caches can be quite large, easily consuming tens of gigabytes in order to keep track of 100,000 or so tokens. That might sound like a lot, but a single user running a code assistant or agent can blow through that rather quickly.
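To see why KV caches balloon like this, a rough sizing sketch helps. The formula is the standard one (a key and a value vector per layer, per KV head, per token), but the model dimensions below are illustrative assumptions for a GQA-style model, not any specific product:

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    """Rough KV cache footprint: one key and one value vector per
    layer, per KV head, for every token in the context window."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# Illustrative dimensions (assumed): a large model with 80 layers,
# 8 KV heads (grouped-query attention), 128-dim heads, FP16 cache.
size = kv_cache_bytes(tokens=100_000, layers=80, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.0f} GB")  # ~33 GB for a single 100k-token context
```

A dense model without grouped-query attention would multiply that by the full head count, which is how a handful of long-context users can exhaust a GPU's HBM.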
As we understand it, Nvidia's Inference Context Storage platform will work with storage platforms from multiple partner vendors, taking advantage of the BlueField-4 DPU and NIXL GPU-direct storage libraries to optimize KV cache offloading for maximum performance and efficiency.
Combined with technologies like Rubin CPX, this kind of high-performance KV offloading should allow the GPUs to spend more time generating tokens and less time waiting on data to be shuffled about and recomputed.
The AI infrastructure arena heats up
Nvidia's decision to "launch" Rubin — again it isn't actually shipping in volume yet — betrays an increasingly competitive compute landscape.
As we mentioned earlier, AMD's Helios rack systems promise to deliver floating point performance roughly equivalent to Nvidia's Vera Rubin NVL72: 2.9 exaFLOPS of FP4, versus Rubin's 2.5 exaFLOPS for training and 3.6 exaFLOPS for inference. For applications that can't take advantage of Nvidia's adaptive compression tech, Helios is, at least on paper, faster.
However, with Nvidia planning to ship faster memory on Rubin than initially planned, AMD no longer has a bandwidth advantage. It does still have a capacity lead with 432 GB of HBM4 per GPU socket compared to 288 GB on Rubin. In theory, this should allow the AMD-based system to serve 50 percent larger MoE models on a single double-wide rack.
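The capacity comparison works out like this (vendor-quoted figures, our arithmetic):

```python
# Paper comparison of per-GPU HBM4 capacity, using the figures quoted
# above. These are vendor claims, not benchmark results.
helios_hbm_gb = 432   # HBM4 per MI450-series GPU socket
rubin_hbm_gb = 288    # HBM4 per Rubin GPU package

capacity_lead = helios_hbm_gb / rubin_hbm_gb - 1
print(f"AMD HBM4 capacity lead per socket: {capacity_lead:.0%}")  # 50%
```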
[16]
AMD Helios systems won't exactly fit into a standard 19" rack
In practice, the real-world performance is going to depend heavily on how well tunneling Ultra Accelerator Link (UALink) over Broadcom's Tomahawk 6 Ethernet switches actually works.
AMD's MI450-series GPUs appear very well positioned to compete against Rubin, but as we've seen repeatedly with Amazon and Google, the ability to scale that compute often makes a bigger difference than an individual chip's performance.
AMD is also having to play catch up on the software ecosystem front. The company's HIP and ROCm libraries have certainly come a long way since the MI300X made its debut at the end of 2023, but the company still has a ways to go.
LLMs, robots, and autonomous cars
Nvidia certainly isn't making the situation any easier for AMD. At CES, the GPU giant unveiled a slew of new software frameworks aimed at enterprises, robotics devs, and the automotive industry.
This includes the development of new foundation models for domain-specific applications like retrieval augmented generation, safety, speech, and autonomous driving.
The latter, called Alpamayo, is a relatively small "reasoning vision language action" model designed to help level-4 autonomous vehicles better handle unique and fast-evolving road conditions. Level-4 capable vehicles can drive fully autonomously, without human supervision, in specific environments like highways or urban areas.
Nvidia's autonomous driving stack is due to hit US roads late this year with the level-2++ capable Mercedes-Benz CLA. This class of autonomous vehicle is capable of driving itself in similar conditions to level-4, but requires the supervision of a human operator.
[17]AWS raises GPU prices 15% on a Saturday, hopes you weren't paying attention
[18]Nvidia DMs TSMC: Please sir can I have some more? The Chinese are starved for H200s
[19]When the AI bubble pops, Nvidia becomes the most important software company overnight
[20]Everybody has a theory about why Nvidia dropped $20B on Groq - they're mostly wrong
What about GTC?
With Nvidia kicking off the New Year with Rubin — a chip we hadn't expected to get a good look at for another three months — we're left to wonder what we'll see at GTC, which is slated to run from March 16-19 in San Jose, California.
In addition to the regular mix of software libraries and foundation models, we expect to get a lot more details on the Kyber racks that'll underpin the company's Vera Rubin Ultra platform starting in 2027.
As you might have noticed, Nvidia, AMD, AWS, and others have gotten in the habit of pre-announcing products well in advance of them shipping or becoming generally available. As the saying goes: enterprises don't buy products, they buy roadmaps. In this case, however, it's really about ensuring they have somewhere to put them.
Nvidia's Kyber racks are expected to pull 600 kilowatts of power, which means datacenter operators need to start preparing now if they want to deploy them on day one.
[21]
By 2027 Nvidia CEO Jensen Huang expects racks to surge to 600 kW with the debut of its Rubin Ultra Kyber racks (Image from GTC 2025)
We don't yet have a full picture of what Vera Rubin Ultra will offer, but we know each package will feature four reticle-sized Rubin Ultra GPU dies and 1 TB of HBM4e, and will deliver 100 petaFLOPS of FP4 performance.
As things currently stand, Nvidia plans to cram 144 of these GPU packages (576 GPU dies) into a single NvLink domain which is expected to deliver 15 exaFLOPS of FP4 inference performance or 10 exaFLOPS for training. ®
[2] https://www.theregister.com/2025/06/12/amd_helios_dc/
[5] https://www.theregister.com/2025/03/19/nvidia_charts_course_for_600kw/
[7] https://regmedia.co.uk/2026/01/05/nvidia_rubin_gpu.jpg
[8] https://regmedia.co.uk/2026/01/05/nvidia_vera.jpg
[10] https://regmedia.co.uk/2025/03/18/vera_rubin_144.jpg
[11] https://www.theregister.com/2025/09/10/nvidia_rubin_cpx/
[12] https://regmedia.co.uk/2025/09/09/vera_rubin_nvl144_cpx.jpg
[13] https://www.nextplatform.com/2025/11/18/nvidia-brings-together-quantum-and-ai-for-hpc-centers/
[14] https://regmedia.co.uk/2026/01/05/connectx-9_supernic.jpg
[15] https://regmedia.co.uk/2026/01/05/nvidia_bluefield-4.jpg
[16] https://regmedia.co.uk/2025/06/12/amd_helios_rack_scale_system.jpg
[17] https://www.theregister.com/2026/01/05/aws_price_increase/
[18] https://www.theregister.com/2025/12/31/china_nvidia_h200/
[19] https://www.theregister.com/2025/12/30/how_nvidia_survives_ai_bubble_pop/
[20] https://www.theregister.com/2025/12/31/groq_nvidia_analysis/
[21] https://regmedia.co.uk/2025/03/18/vera_rubin_nvl576.jpg