Nvidia's context-optimized Rubin CPX GPUs were inevitable
(2025/09/10)
- Reference: 1757506515
- News link: https://www.theregister.co.uk/2025/09/10/nvidia_rubin_cpx/
- Source link:
Analysis Nvidia on Tuesday unveiled the Rubin CPX, a GPU designed specifically to accelerate extremely long-context AI workflows like those seen in code assistants such as Microsoft's GitHub Copilot, while simultaneously cutting back on pricey and power-hungry high-bandwidth memory (HBM).
The first indication that Nvidia might be moving in this direction came when CEO Jensen Huang unveiled [1]Dynamo during his GTC keynote in spring. The framework brought mainstream attention to the idea of disaggregated inference.
As you may already be aware, inference on large language models (LLMs) can be broken into two phases: a compute-intensive prefill phase and a memory bandwidth-bound decode phase.
Traditionally, both the prefill and decode have taken place on the same GPU. Disaggregated serving allows different numbers of GPUs to be assigned to each phase of the pipeline, avoiding compute or bandwidth bottlenecks as context sizes grow – and they're certainly growing quickly.
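To make that split concrete, here's a minimal Python sketch of the idea. The pool names, sizes, and stub functions are purely illustrative – this is not Dynamo's API – but it shows why the two phases can be scaled independently once they no longer share the same GPUs.

```python
# Illustrative sketch of disaggregated LLM serving. Pool names, sizes, and
# stub functions are hypothetical, not Dynamo's API.
from dataclasses import dataclass


@dataclass
class GPUPool:
    name: str
    num_gpus: int


# Sizing the pools independently is the whole point: long-context prefill
# may need far more compute-heavy GPUs than bandwidth-heavy decode ones.
PREFILL_POOL = GPUPool("prefill", num_gpus=8)  # e.g. compute-optimized parts
DECODE_POOL = GPUPool("decode", num_gpus=4)    # e.g. HBM-equipped parts


def run_prefill(pool: GPUPool, prompt_tokens: list[int]) -> dict:
    # Stand-in for the compute-bound pass over the whole prompt; in reality
    # this produces the KV cache that decode then reads on every step.
    return {"pool": pool.name, "cached_tokens": len(prompt_tokens)}


def run_decode(pool: GPUPool, kv_cache: dict, max_new_tokens: int) -> list[int]:
    # Stand-in for token-by-token generation, which repeatedly re-reads the
    # KV cache and is therefore memory-bandwidth bound.
    start = kv_cache["cached_tokens"]
    return list(range(start, start + max_new_tokens))


def serve(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    kv_cache = run_prefill(PREFILL_POOL, prompt_tokens)        # phase 1
    return run_decode(DECODE_POOL, kv_cache, max_new_tokens)   # phase 2


if __name__ == "__main__":
    print(serve(prompt_tokens=list(range(1_000)), max_new_tokens=5))
```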
Over the past few years, model context windows have leapt from a mere 4,096 tokens (think word fragments, numbers, and punctuation) on Llama 2 to as many as 10 million with Meta's Llama 4 Scout, released earlier this year.
These large context windows are a bit like the model's short-term memory and dictate how many tokens it can keep track of when processing and generating a response. For the average chatbot, this is relatively small. The ChatGPT Plus plan supports a context length of about 32,000 tokens. It takes a fairly long conversation to exhaust it.
For agentic workloads like code generation, however, making sense of a codebase may require juggling hundreds of thousands, if not millions, of tokens worth of code. In scenarios like this, far more raw compute capacity is required than memory bandwidth.
Dedicating loads of HBM-packed GPUs to the prefill stage is expensive, power-hungry, and inefficient. Instead, Nvidia plans to reserve its HBM-equipped GPUs for decode and is introducing a new GPU, the Rubin CPX, which uses slower, cheaper, but more power-frugal GDDR7 memory instead.
[6]
Nvidia's blog post includes a graphic that illustrates this strategy nicely
The result is a configuration that enables you to do the same amount of work using far less expensive hardware.
Prefilling the Vera Rubin NVL144 CPX
According to Nvidia, each [7]Rubin CPX accelerator will be capable of delivering 30 petaFLOPS of NVFP4 compute (it's unclear whether that figure assumes sparsity), and sport 128 GB of GDDR7 memory with both hardware encode and decode functionality intact.
GDDR7 is a fraction of the speed of HBM. For example, the GDDR7-based RTX Pro 6000 has 96 GB of memory, but it tops out at around 1.6-1.7 TB/s. Compare that to a B300 SXM module, which has 180 GB of HBM3E and can deliver 4 TB/s of bandwidth to each of its two Blackwell GPU dies.
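For a rough sense of the gap, here's the back-of-the-envelope math using the figures above:

```python
# Back-of-the-envelope bandwidth comparison using the figures quoted above.
gddr7_rtx_pro_6000 = 1.7   # TB/s, upper end of the quoted 1.6-1.7 TB/s range
hbm3e_b300_per_die = 4.0   # TB/s delivered to each Blackwell die

print(f"{gddr7_rtx_pro_6000 / hbm3e_b300_per_die:.2f}x")  # ~0.42, well under half
```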
While Nvidia is touting NVFP4 performance, key-value (KV) caches have traditionally stored context at BF16 to preserve model accuracy, so actual prefill performance will likely depend on the precision at which the key and value caches are stored. We're told the chip will also deliver a 3x acceleration to attention, a key mechanism in LLM inference, compared to its GB300 superchips.
For comparison, the version of Rubin [9]revealed at GTC will feature a pair of reticle-sized GPU dies on a single SXM module that'll deliver a combined 50 petaFLOPS of NVFP4 compute, along with 288 GB of HBM4 offering 13 TB/s of memory bandwidth.
Nvidia hasn't said how these chips will be integrated into its rack-scale NVL systems, but we do know that the GPU giant will offer a CPX version of that rack with two Vera CPUs and 16 GPUs – eight Rubin (HBM) and eight Rubin CPX (GDDR7) – per compute tray. In total, Nvidia's NVL144 CPX racks will feature 36 CPU sockets and 288 GPUs.
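Those totals imply 18 compute trays per rack – an inference from the figures above rather than a number Nvidia has stated outright:

```python
# Sanity-checking the rack math from the per-tray figures above.
# 18 trays per rack is inferred from Nvidia's totals, not stated directly.
trays_per_rack = 18
vera_cpus_per_tray = 2
gpus_per_tray = 16   # eight Rubin (HBM) plus eight Rubin CPX (GDDR7)

print(trays_per_rack * vera_cpus_per_tray)  # 36 CPU sockets
print(trays_per_rack * gpus_per_tray)       # 288 GPUs
```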
[10]
Nvidia's Vera Rubin NVL144 CPX compute trays will now pack 16 GPUs – eight with HBM and another eight context-optimized ones using GDDR7
It's not immediately obvious how these CPX GPUs will be tied into the rest of the system, though the NVL144 naming suggests Nvidia won't use its 1.8 TB/s NVLink-C2C interconnect and will instead rely on PCIe 6.0.
[11]Alibaba looks to end reliance on Nvidia for AI inference
[12]AI arms dealer Nvidia laments the many billions lost to US-China trade war
[13]Nvidia details its itty bitty GB10 superchip for local AI development
[14]Nvidia touts Jetson Thor kit for real-time robot reasoning
The problem in context
Model context has become something of a new battleground for infrastructure and software vendors over the past year as models have become more advanced.
In addition to disaggregated serving with frameworks like Nvidia Dynamo or [15]llm-d, the idea of prompt caching and KV cache offload has also been gaining traction.
The idea behind these is that, rather than recomputing KV caches from scratch on every request, the outputs of the prefill phase are cached to system memory. That way, only tokens that haven't already been processed have to be computed.
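Here's a toy sketch of that idea in Python. The names and data structures are ours – real implementations such as LMCache key on hashes of token blocks and store actual attention tensors rather than strings – but the shape of the logic is the same: find the longest cached prefix and only prefill what's left.

```python
# Toy prompt/KV caching sketch: reuse cached prefill work for any prompt
# prefix already seen, and only compute the remaining tokens. Illustrative
# only -- not LMCache's or vLLM's API.

kv_store: dict[tuple[int, ...], str] = {}   # token prefix -> cached KV blocks


def prefill(tokens: list[int]) -> str:
    # Stand-in for the expensive GPU prefill over `tokens`.
    return f"kv[{len(tokens)} tokens]"


def prefill_with_cache(prompt: list[int]) -> str:
    # Find the longest cached prefix of this prompt.
    hit = max(
        (p for p in kv_store if tuple(prompt[: len(p)]) == p),
        key=len,
        default=(),
    )
    # Only the tokens beyond the cached prefix are actually computed.
    fresh = prefill(prompt[len(hit):]) if len(hit) < len(prompt) else ""
    kv = kv_store.get(hit, "") + fresh
    kv_store[tuple(prompt)] = kv   # store for future requests
    return kv


print(prefill_with_cache(list(range(100))))  # prefills all 100 tokens
print(prefill_with_cache(list(range(120))))  # only prefills the 20 new ones
```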
The developers of LMCache, which provides KV cache offload and caching functionality for popular model runners like vLLM, claim the approach can cut the time to first token by as much as 10x.
This approach can be tiered with CXL memory expansion modules, in-memory databases such as Redis, or even storage arrays, so that when a user walks away from a chat or AI coding session, the KV cache can be saved for future use. Even when GPU and system memory have been exhausted, the KV caches only need to be retrieved from slower CXL memory or storage rather than recomputed.
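A minimal sketch of that tiering, with the tier names purely illustrative:

```python
# Illustrative tiered KV-cache store: check the fastest tier first and fall
# back to slower ones; a miss everywhere means prefilling from scratch.
# Tier names are placeholders, not any vendor's actual API.

TIERS = {
    "gpu_hbm": {},         # fastest, smallest
    "host_dram": {},       # system memory
    "cxl_or_storage": {},  # CXL expanders, Redis, or flash/storage arrays
}


def fetch_kv(session_id: str):
    for store in TIERS.values():           # fastest tier first
        if session_id in store:
            return store[session_id]       # retrieved, not recomputed
    return None                            # miss everywhere -> prefill again


def park_kv(session_id: str, kv, tier: str = "cxl_or_storage") -> None:
    # e.g. when a user walks away from a session, spill its KV cache to a
    # cheaper tier so it can be restored later without recomputation.
    TIERS[tier][session_id] = kv
```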
Enfabrica CEO Rochan Sankar, whose company has developed a sort of memory area network aimed at addressing the model context problem, likens this hierarchy to short, medium, and long-term memory.
As we mentioned earlier, larger contexts put a bigger computational burden on accelerators, but they also require more memory. DeepSeek's latest model will use roughly 104 GB or 208 GB of memory for every 128,000-token sequence, depending on whether the KV cache is stored at FP8 or BF16. That means supporting just ten simultaneous requests with a 128K context length would require 1-2 TB of memory.
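Working that arithmetic through with the figures above:

```python
# KV-cache memory arithmetic using the per-sequence figures quoted above.
gb_per_128k_sequence = {"FP8": 104, "BF16": 208}
concurrent_requests = 10

for precision, gb in gb_per_128k_sequence.items():
    total_tb = gb * concurrent_requests / 1000
    print(f"{precision}: {total_tb:.2f} TB for {concurrent_requests} x 128K-token requests")
# FP8:  1.04 TB
# BF16: 2.08 TB
```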
This is one of the challenges Enfabrica hopes will drive adoption of its Emfasys CXL memory [16]appliances , each of which can expose up to 18 TB of memory as an RDMA target for GPU systems to access.
The Pliops XDP LightningAI platform is similar in [17]concept, but relies on RDMA to sidestep CXL's compatibility issues, and on SSDs rather than DRAM. Using NAND flash is an interesting choice. On the one hand, it's substantially less expensive. On the other, KV caching is an inherently write-intensive operation, which could be problematic unless you happen to have hoarded tons of Optane SSDs. ®
[1] https://www.theregister.com/2025/03/23/nvidia_dynamo/
[6] https://regmedia.co.uk/2025/09/09/nvidia_disaggregated_gpu_inference.jpg
[7] https://developer.nvidia.com/blog/nvidia-rubin-cpx-accelerates-inference-performance-and-efficiency-for-1m-token-context-workloads/
[9] https://www.theregister.com/2025/03/19/nvidia_charts_course_for_600kw/
[10] https://regmedia.co.uk/2025/09/09/vera_rubin_nvl144_cpx.jpg
[11] https://www.theregister.com/2025/08/29/china_alibaba_ai_accelerator/
[12] https://www.theregister.com/2025/08/27/nvidia_q2_china/
[13] https://www.theregister.com/2025/08/27/nvidia_blackwell_gb10/
[14] https://www.theregister.com/2025/08/25/nvidia_touts_jetson_thor_kit/
[15] https://llm-d.ai/
[16] https://www.nextplatform.com/2025/07/29/skimpy-hbm-memory-opens-up-the-way-ai-inference-memory-godbox/
[17] https://pliops.com/lightning-ai/