Decoding Nvidia's Groq-powered LPX and the rest of its new rack systems
- News link: https://www.theregister.co.uk/2026/03/19/nvidia_lpx_deep_dive/
As we've [1]said before, if Nvidia wanted to build an SRAM-heavy inference accelerator, it didn't need to buy Groq to do it. The company's newly announced Groq-3 [2]LPX racks, which pack 256 LP30 language processing units (LPUs) into a single system, show that time-to-market was the reason Nvidia bought rather than built.
We're told the chip is based on Groq's second-gen LPU tech with a handful of last-minute tweaks made just before taping out at Samsung's fabs.
The chip doesn't use Nvidia's proprietary NVLink interconnect, it lacks NVFP4 hardware support, and it isn't CUDA-compatible at launch.
We can therefore conclude that the $20 billion paid for Groq's intellectual property rights and engineering staff was the price of getting these chips out the door and into customers' hands this year.
Why the rush?
One of the defining characteristics of SRAM-heavy architectures from Groq and its rival Cerebras is that they are very fast at LLM inference, routinely achieving generation rates exceeding 500 and even 1,000 tokens a second.
The faster Nvidia can generate tokens, the faster code assistants and AI agents can act. But this kind of speed also opens the door to what Huang describes as test-time scaling.
The idea is that by letting "reasoning" models generate more "thinking" tokens, they can produce smarter, more accurate results. So, the faster you can generate tokens, the less of a latency penalty test-time scaling incurs.
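The latency math here is simple enough to sketch. As a back-of-envelope illustration (the token counts and rates below are ours, not Nvidia's):

```python
def thinking_latency_s(reasoning_tokens: int, tokens_per_second: float) -> float:
    """Seconds a user waits while a reasoning model emits its chain of thought."""
    return reasoning_tokens / tokens_per_second

# 10,000 "thinking" tokens at GPU-like vs LPU-like per-user rates:
for rate in (50, 500, 1000):
    print(f"{rate:>5} tok/s -> {thinking_latency_s(10_000, rate):6.1f} s")
```

At 50 tokens a second, a long reasoning pass keeps the user waiting for minutes; at 1,000, the same pass takes seconds, which is what makes test-time scaling palatable.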
On stage at GTC, Huang suggested that providers of this kind of high-performance, low-latency inference could eventually charge as much as $150 per million tokens.
[7]
As you can see, Nvidia's GPUs are great for generating bulk tokens, but as interactivity increases, efficiency drops
Unfortunately for Nvidia, GPUs are great for batch processing but don't scale nearly as efficiently as per-user-output speeds increase. At least not on their own.
By combining its GPUs and Groq's LPU tech, Nvidia aims to deliver the best of both worlds: an inference platform that scales much more efficiently at higher tokens per second per user.
[8]
In this graphic, the faint green and yellow lines show GPU and LPU scaling. By combining the two, Nvidia aims to deliver the best of both worlds: high throughput and interactivity. Image credit: Nvidia
Nvidia is also under some pressure to maintain its dominance of the AI infrastructure market as rival chip designers like AMD close the gap on hardware and software.
Last week, Amazon and Cerebras announced a collaboration to [9]pair AWS' Trainium-3 accelerators with the latter's wafer-scale accelerators for many of the same reasons Nvidia built LPX. Of course, AWS has also announced plans to [10]deploy more than a million Nvidia GPUs in addition to fielding Nvidia-Groq LPUs, so the cloud giant hasn't suddenly started picking sides.
The Groq-3 LPX
The LP30 is very different from Nvidia's GPUs. It's built by Samsung Electronics rather than TSMC and uses only on-chip SRAM. It also ditches the conventional von Neumann architecture for one commonly referred to as dataflow.
Rather than fetching instructions from memory, decoding them, executing, and writing the result back to a register, dataflow architectures process data as it's streamed through the chip. The processor's compute units don't have to wait on a bunch of load and store operations to shuffle data around, which, in theory, results in higher achievable utilization.
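As a loose software analogy (ours, not a model of the actual silicon): values in a dataflow design move straight through a fixed pipeline of operations, rather than bouncing between compute units and a register file between steps.

```python
def dataflow_pipeline(stream, stages):
    """Stream each value through every stage back-to-back, with no
    intermediate loads/stores to a memory hierarchy in between."""
    for x in stream:
        for stage in stages:
            x = stage(x)
        yield x

# Two fused stages, applied as the data streams past: scale, then offset.
print(list(dataflow_pipeline(range(4), [lambda v: v * 2, lambda v: v + 1])))
```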
[11]
Here's a quick overview of Nvidia's Groq-3 LPU
According to Nvidia, each LP30 can deliver 1.2 petaFLOPS of FP8 compute. But, as we mentioned earlier, support for 4-bit block floating point data types, like MXFP4 or NVFP4, won't come until the LP35 arrives sometime next year.
That compute is fed by a relatively large pool of SRAM, which is orders of magnitude faster than the high-bandwidth memory (HBM) found in GPUs today, but is also incredibly inefficient in terms of the die area required.
Each LPU only has enough die space for 500 MB of on-chip memory. For comparison, just one of the eight HBM4 modules on Nvidia's [12]Rubin GPUs contains 36GB of memory. What the LP30 lacks in capacity, it more than makes up for in bandwidth, achieving speeds up to 150 TB/s – nearly 7 times more than Nvidia’s Rubin accelerators.
This makes LPUs ideal for the auto-regressive decode phase of the inference pipeline, during which all of a model's active parameters need to be streamed from memory for every token generated.
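The arithmetic behind that claim is worth spelling out: since every generated token requires streaming all active weights from memory once, memory bandwidth sets a hard ceiling on per-user decode speed. A rough sketch with illustrative figures of our own choosing:

```python
def decode_ceiling_tok_s(active_params_billion: float, bytes_per_param: float,
                         bandwidth_tb_s: float) -> float:
    """Upper bound on single-user decode speed when every generated token
    must stream all active model weights from memory exactly once."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

# A hypothetical model with 50B active parameters at 8 bits per weight:
print(decode_ceiling_tok_s(50, 1, 8))     # HBM-class device, ~8 TB/s
print(decode_ceiling_tok_s(50, 1, 150))   # SRAM-class device, ~150 TB/s
```

Real systems sit well below these ceilings, but the ratio between the two lines is the point: more bandwidth per byte of weights means more tokens per second per user.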
Of course, to do that, you need to fit the model in memory, which is no easy task for the trillion-parameter models Nvidia is targeting. For models this large, multiple racks are required. Because of this, LP30 bristles with interconnects. Each chip features 96 of them – specifically, 112 Gbps SerDes – totalling 2.5 TB/s of bidirectional bandwidth.
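That figure roughly checks out, assuming the 2.5 TB/s quoted is the bidirectional total net of line-coding overhead (our reading of the numbers):

```python
lanes = 96
gbit_per_lane = 112                       # 112 Gbps SerDes per lane
one_way_tb_s = lanes * gbit_per_lane / 8 / 1000
print(one_way_tb_s)                       # 1.344 TB/s per direction, raw
print(2 * one_way_tb_s)                   # ~2.69 TB/s bidirectional, raw
```

Raw signaling rate comes to about 2.69 TB/s both ways, so the 2.5 TB/s Nvidia quotes plausibly reflects encoding overhead.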
Each LPX rack is equipped with 256 LPUs. Those are spread across 32 compute trays, each containing eight LPUs, some fabric expansion logic and DRAM, and the host CPU and a BlueField-4 data processing unit (DPU).
[14]
Each LPU compute tray features eight liquid-cooled Groq-3 LPUs totalling 4GB of SRAM
Some of that network connectivity is funnelled out the back of these blades into a new copper Ethernet backplane Nvidia calls the Oberon ETL256, while the remainder is directed out the front of the system enabling multiple NVL72 and LPX racks to be stitched together.
Not a standalone part
While it's entirely possible to run large language models (LLMs) on an LPX cluster alone, that's not how Nvidia is positioning the product.
[15]
This graphic shows how inference workloads are distributed across GPUs and LPUs. Image credit: Nvidia
Instead, one or more LPX racks are paired with a Vera Rubin NVL72, which we [16]discussed in more detail back when Nvidia showed it off in January, with various parts of the inference stack distributed across the GPUs and LPUs. Nvidia's reference design has a relatively small number of GPUs handling the compute-heavy prompt-processing (prefill) phase, while the bandwidth-intensive decode phase, where tokens are generated, is split between a separate pool of GPUs and the LPUs.
During this decode phase, Nvidia takes advantage of GPUs' comparatively large memory and compute capacity to handle the attention operations, while the bandwidth-constrained feed forward neural network ops are offloaded to LPUs sitting in the LPX rack over Ethernet.
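Schematically, that per-layer division of labor looks something like this (a sketch of the split as described, not Nvidia's code; the stub kernels are ours):

```python
def decode_layer(x, kv_cache, gpu_attention, lpu_ffn):
    """One transformer layer during decode, split across two device pools:
    attention stays on GPUs next to the large KV cache in HBM, while the
    feed-forward block runs on LPUs streaming weights from SRAM."""
    x = x + gpu_attention(x, kv_cache)   # GPU pool: compute + memory capacity
    x = x + lpu_ffn(x)                   # LPU pool: bandwidth-bound matmuls
    return x

# Stub kernels standing in for the real attention and FFN ops:
out = decode_layer(1.0, None, lambda x, kv: 0.5, lambda x: 0.25)
print(out)
```

In the real system each `lpu_ffn` call implies a round trip over Ethernet to the LPX rack, which is why the interconnect budget matters so much.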
Nvidia's Dynamo disaggregated inference platform handles orchestration for all of this.
How many LPUs do you need?
The whole system requires a lot of LPUs.
The exact ratio of GPUs to LPUs depends on the workload. Tasks requiring extremely large contexts, batch sizes, or concurrency may need a larger pool of GPUs. A general-purpose chatbot might run well on a single rack.
This is because longer context windows require more memory for the key-value (KV) caches that store model state (think short-term memory) and attention operations. By keeping these on the GPU, Nvidia is able to get by with fewer LPUs.
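To see why the KV cache stays on the GPU, consider the footprint of a single long-context user. The model shape below is a generic 70B-class example of our own, not anything Nvidia quoted:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Per-user KV cache size: one key and one value vector per token,
    per KV head, per layer."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# 80 layers, 8 KV heads of dim 128, FP16 values, 128K-token context:
print(round(kv_cache_gb(80, 8, 128, 131_072), 1))   # ~42.9 GB for one user
```

One long-context user's cache alone would swamp the SRAM of dozens of LPUs, while it fits comfortably in a GPU's HBM.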
The actual number of required LPUs is directly proportional to the size of a model. For a trillion-parameter model, that translates to between four and eight LPX racks, or 1,024 to 2,048 LPUs, depending on whether the weights are stored in SRAM at 4-bit or 8-bit precision.
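Those figures follow directly from the weight-capacity math, rounding up to whole racks (the quantization step is our assumption):

```python
import math

SRAM_PER_LPU_GB = 0.5      # ~500 MB of on-chip SRAM per LP30
LPUS_PER_RACK = 256

def racks_for_weights(params: float, bits_per_weight: int) -> tuple[int, int]:
    """Whole racks (and LPUs) needed just to hold the weights in SRAM.
    Ignores activations and any replication, so treat it as a floor."""
    weight_gb = params * bits_per_weight / 8 / 1e9
    lpus = math.ceil(weight_gb / SRAM_PER_LPU_GB)
    racks = math.ceil(lpus / LPUS_PER_RACK)
    return racks, racks * LPUS_PER_RACK

print(racks_for_weights(1e12, 4))   # (4, 1024) -- 4-bit weights
print(racks_for_weights(1e12, 8))   # (8, 2048) -- 8-bit weights
```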
Who is LPX for?
If you're not a hyperscaler, neocloud, or model dev, LPX is probably not for you. The sheer number of LPUs required to serve large open models will likely put Nvidia's LPX platform out of reach for most enterprises.
Speaking to press ahead of this week's keynote, Nvidia VP of hyperscale and HPC Ian Buck said the company is focusing primarily on model builders and service providers that need to serve trillion-plus-parameter models with token rates exceeding 500 to 1,000 a second.
Having said that, in a [17]technical blog, Nvidia presented another use case for the LPUs as a speculative decode accelerator, something we suggested the company might do back in December.
Speculative decoding is a [18]method for juicing inference performance by using a smaller, faster "draft" model to predict the outputs of a larger model. When it works, the technique can speed token generation by anywhere from 2x to 3x.
And since the approach falls back to the larger model anytime it guesses wrong, there's no loss in quality or accuracy.
Nvidia proposes hosting the draft model on LPUs and the larger target model on a set of GPUs. Since draft models tend to be fairly small, this might present an opportunity for Nvidia to sell LPUs to enterprise customers.
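Here's the draft-and-verify loop in toy form. This is our sketch with stub "models"; production systems verify draft tokens probabilistically rather than by exact greedy match:

```python
def speculative_decode(draft, target, prompt, k=4, steps=2):
    """Greedy draft-and-verify: the cheap draft speculates k tokens ahead,
    the target checks each position, and the first mismatch is replaced
    with the target's own token -- so output always matches the target."""
    out = list(prompt)
    for _ in range(steps):
        spec = []
        for _ in range(k):                  # draft phase: fast, serial
            spec.append(draft(out + spec))
        for i in range(k):                  # verify phase: one batched pass
            t = target(out + spec[:i])
            out.append(t)
            if t != spec[i]:                # wrong guess: fall back
                break
    return out

target = lambda seq: seq[-1] + 1            # "big model": counts by one
draft = lambda seq: seq[-1] + (1 if seq[-1] < 3 else 2)  # drifts after 3
print(speculative_decode(draft, target, [0]))
```

Note that the output is exactly what the target alone would have produced; a good draft just lets several of those tokens land per target pass.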
What happened to Rubin CPX?
You may be scratching your head, wondering "wasn't there supposed to be some kind of special Rubin chip optimized for large-context prefill processing?" You're not hallucinating.
Back at Computex last northern spring, Nvidia [19]unveiled the Rubin CPX, a version of Rubin that used slower, less expensive GDDR7 memory to speed up the time to first token – how long users or agents have to wait for the model to start generating an output – when working with large inputs.
The idea was that Rubin CPX could cut down on wait times for applications that might involve processing large quantities of documents, freeing up the non-CPX Rubins, and speeding up overall decode times.
[20]
Nvidia's Vera Rubin NVL144 CPX compute trays will now pack 16 GPUs: eight with HBM and another eight context-optimized ones using GDDR7
However, by early 2026, Nvidia stopped mentioning CPX. This week, we learned the project had been put on the back burner so Nvidia can prioritize LPX.
It's important to note that LPX is not a replacement for CPX. The two platforms were designed to accelerate opposite ends of the inference pipeline: LPUs are designed to speed up token generation during the decode phase, while CPX was intended to cut the time users or agents spent waiting for the model to respond during prefill.
Nvidia hasn't given up on the concept either. Buck told press that CPX is still a good idea and that we may see the concept resurface in future generations.
An alphabet soup of rack scale architectures
While LPX is the most interesting addition to Nvidia's rack-scale lineup, it's not the only one.
At GTC, Nvidia also [25]unveiled three more rack-scale designs, one each for networking, storage, and agentic compute.
We looked at Nvidia's new Vera CPU racks in [26]more detail earlier this week. The system uses the same ETL network backplane as the LPX racks and HGX systems, and features 32 compute blades, each with eight 88-core Vera CPUs and up to 12 TB of LPDDR5X SOCAMM memory on board.
In addition to serving as the host processor on Nvidia's latest generation products, the Vera CPU rack is intended as an execution environment for agentic systems, like Open Claw, that require high memory bandwidth and strong single-threaded performance.
Alongside the CPU racks is a new storage rack called the BlueField-4 STX. As the name suggests, the reference design combines Nvidia's BlueField-4 data processing units (DPUs, aka SmartNICs) with a Vera CPU and ConnectX-9 NICs. Nvidia intends this offering to serve as a KV-cache offload target.
Any time an LLM processes a prompt, it generates KV caches that store the model's state as vectors. By keeping those pre-computed vectors in GPU or system memory, or flash storage, only new tokens have to be computed, while repeated ones can be recycled from cache.
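A minimal sketch of the recycling idea (our illustration, not Nvidia's context-memory code; the "KV math" is a stand-in):

```python
cache = {}   # prompt prefix (tuple of tokens) -> its precomputed KV state

def prefill(tokens: tuple):
    """Build KV state for `tokens`, computing only the uncached suffix."""
    best = 0
    for n in range(len(tokens), 0, -1):    # longest cached prefix wins
        if tokens[:n] in cache:
            best = n
            break
    kv = list(cache.get(tokens[:best], []))
    for i in range(best, len(tokens)):     # only new tokens are computed
        kv.append(("kv", tokens[i]))       # stand-in for real attention math
        cache[tokens[:i + 1]] = list(kv)
    return kv, len(tokens) - best          # state + tokens actually computed

_, computed = prefill(("system", "big_doc", "question_1"))
print(computed)                            # 3: cold cache, full prefill
_, computed = prefill(("system", "big_doc", "question_2"))
print(computed)                            # 1: shared prefix recycled
```

A second question over the same document only pays for its own tokens, which is where the claimed GPU savings come from.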
Earlier this year, Nvidia [27]showed off its context-memory storage platform, which is meant to automate the offload of KV caches to compatible storage targets. The AI infrastructure giant claims that this approach can boost token rates by up to 5x by freeing up GPU resources to handle other elements of the inference pipeline.
Finally, there's the Spectrum-6 SPX network rack, which also leverages the MGX ETL reference design to simplify cabling of Spectrum-X and Quantum-X switches.
Together, these rack systems form a sort of assembly line. Think of it this way: Vera CPU racks running AI agents make API calls to models running on Vera Rubin NVL72 systems with Groq LPX decode accelerators. KV caches generated by these agents are offloaded to STX storage, and everything is connected by SPX racks packed with Spectrum or Quantum switches. And as long as the AI boom continues, Nvidia keeps printing money. ®
[1] https://www.theregister.com/2025/12/31/groq_nvidia_analysis/
[2] https://www.theregister.com/2026/03/16/nvidia_lpx_groq_3/
[7] https://regmedia.co.uk/2026/03/06/goldilocks.jpg
[8] https://regmedia.co.uk/2026/03/19/nvidia_gpu_lpu.jpg
[9] https://www.cerebras.ai/blog/cerebras-is-coming-to-aws
[10] https://aws.amazon.com/blogs/machine-learning/aws-and-nvidia-deepen-strategic-collaboration-to-accelerate-ai-from-pilot-to-production/
[11] https://regmedia.co.uk/2026/03/19/groq_3_lpu.jpg
[12] https://www.theregister.com/2026/01/05/ces_rubin_nvidia/
[14] https://regmedia.co.uk/2026/03/19/nvidia_lpu_compute_try.jpg
[15] https://regmedia.co.uk/2026/03/19/nvidia_nvl72_lpx.jpg
[16] https://www.theregister.com/2026/01/05/ces_rubin_nvidia/
[17] https://developer.nvidia.com/blog/inside-nvidia-groq-3-lpx-the-low-latency-inference-accelerator-for-the-nvidia-vera-rubin-platform/
[18] https://www.theregister.com/2024/12/15/speculative_decoding/
[19] https://www.theregister.com/2025/09/10/nvidia_rubin_cpx/
[20] https://regmedia.co.uk/2025/09/09/vera_rubin_nvl144_cpx.jpg
[25] https://developer.nvidia.com/blog/nvidia-vera-rubin-pod-seven-chips-five-rack-scale-systems-one-ai-supercomputer/
[26] https://www.theregister.com/2026/03/16/nvidia_vera_cpu_rack/
[27] https://www.blocksandfiles.com/2026/01/12/nvidias-basic-context-memory-extension-infrastructure/4090541
LPX is neat, but ...
I was also hoping to hear more about CPO at this 'AI Burning Man' shindig, seeing how [1]Ayar Labs seemed to make nice inroads there, and even [2]Lightmatter recently demoed a quite practical near-package version of the tech (NPO). Is Nvidia planning to eventually clean up its (messy) 2 miles of copper wiring per rack by adopting CPO/NPO, or will it let the competition get there first? Any word on this at GTC26, like a timeline?
Also, for KV cache, Penguin Solutions' [3]MemoryAI KV cache server that relies on CXL memory tech sounds quite neat (and nicely standardized). I wonder how the BlueField-4 STX storage rack compares to that, spec-wise, plus what its advantages and drawbacks may be (if any) ... (maybe I missed it?).
[1] https://www.theregister.com/2026/03/11/ayar_labs_wiwynn_photonics/
[2] https://www.theregister.com/2026/03/11/lightmatter_passge_l20_fiber/
[3] https://www.hpcwire.com/off-the-wire/penguin-solutions-introduces-industrys-first-production-ready-cxl-based-kv-cache-server/
Scratching
You may be scratching your head
I like the optimism: that somebody got that far reading the acronym soup disguised as an article.