Everybody has a theory about why Nvidia dropped $20B on Groq - they're mostly wrong
- Reference: 1767178933
- News link: https://www.theregister.co.uk/2025/12/31/groq_nvidia_analysis/
In the days since the deal was announced, the armchair AI gurus of the web have been speculating wildly about how Nvidia can justify spending $20 billion on Groq's tech and people.
Pundits believe Nvidia knows something we don't. Theories run the gamut from Nvidia planning to ditch HBM for SRAM, to a play to secure additional foundry capacity from Samsung, to an attempt to quash a potential competitor. Some hold water better than others, and we certainly have a few of our own.
What we know so far
Nvidia [1]paid $20 billion to non-exclusively license Groq's intellectual property, which includes its language processing units (LPUs) and accompanying software libraries.
Groq's LPUs form the foundation of its high-performance inference-as-a-service offering, which it will keep and continue to operate without interruption after the deal closes.
The arrangement is clearly engineered to avoid regulatory scrutiny. Nvidia isn't buying Groq, it's licensing its tech. Except… it's totally buying Groq.
How else to describe a deal that sees Groq’s CEO Jonathan Ross and president Sunny Madra move to Nvidia, along with most of its engineering talent?
Sure, Groq is technically sticking around as an independent company with Simon Edwards at the helm as its new CEO, but with much of its talent gone, it's hard to see how the chip startup survives long-term.
The argument that Nvidia just wiped a competitor off the board therefore works. Whether that move was worth $20 billion is another matter, given it could provoke an antitrust lawsuit.
It must be for the SRAM, right?
One prominent theory about Nvidia’s motives centers on memory: Groq’s LPUs use static random access memory (SRAM), which is dramatically faster than the high-bandwidth memory (HBM) found in GPUs today.
A single HBM3e stack delivers about 1 TB/s of memory bandwidth, which works out to roughly 8 TB/s across the stacks on a modern GPU. The SRAM in Groq's LPUs can be 10 to 80 times faster.
Since large language model (LLM) inference is predominantly bound by memory bandwidth, Groq can achieve stupendously fast token generation rates. Running Llama 3.3 70B, the benchmarkers at Artificial Analysis [6]report that Groq's chips can churn out 350 tok/s. Performance is even better when running mixture-of-experts models, like gpt-oss 120B, where the chips managed 465 tok/s.
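To see why bandwidth is the limiter, here's a back-of-the-envelope sketch of the best-case single-stream decode rate a memory system can support. The bandwidth figures and 8-bit weight assumption are ours, and real-world numbers (including Groq's) land well below these ceilings because of batching, KV-cache traffic, and interconnect overheads.

```python
# Back-of-the-envelope decode throughput: with batch size 1, every weight must be
# read once per generated token, so tokens/sec is capped at roughly
# memory bandwidth / bytes of weights. Ignores KV-cache reads, batching, and
# interconnect overhead - treat the results as loose upper bounds, not benchmarks.

def max_decode_tok_per_sec(bandwidth_tb_per_s: float,
                           params_billion: float,
                           bytes_per_param: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_per_s * 1e12 / bytes_per_token

# A 70B model with 8-bit weights (~70 GB read per generated token):
print(max_decode_tok_per_sec(8, 70, 1))   # ~114 tok/s at 8 TB/s of HBM3e
print(max_decode_tok_per_sec(80, 70, 1))  # ~1,140 tok/s at the 80 TB/s end of the SRAM range
```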
We're also in the middle of a global memory shortage and demand for HBM has never been higher. So, we understand why some might look at this deal and think Groq could help Nvidia cope with the looming memory crunch.
The simplest answer is often the right one – just not this time.
Sorry to have to tell you this, but there's nothing special about SRAM. It's in basically every modern processor, including Nvidia's chips.
SRAM also has a pretty glaring downside: it's not exactly what you'd call space efficient. We're talking, at most, a few hundred megabytes per chip, compared with 36 GB for a single 12-high HBM3e stack and 288 GB across an entire GPU.
Groq's LPUs have just 230 MB of SRAM each, which means you need hundreds or even thousands of them just to run a modest LLM. At 16-bit precision, a 70-billion-parameter model needs 140 GB of memory to hold the weights alone, plus roughly another 40 GB of KV cache for every 128,000-token sequence.
Groq needed 574 LPUs stitched together using a high-speed interconnect fabric to run Llama 70B.
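The arithmetic behind those chip counts is simple enough to sketch. The figures below count weights only - no KV cache, activations, or duplication imposed by the pipeline layout - so they won't line up exactly with Groq's real 574-LPU deployment, which also depends on the precision Groq actually runs at.

```python
import math

# Weights-only sizing: how many 230 MB LPUs does it take just to hold a
# 70-billion-parameter model's weights?

SRAM_PER_LPU_GB = 0.23      # 230 MB of on-chip SRAM per LPU
PARAMS_BILLION = 70

for bits in (16, 8):
    weights_gb = PARAMS_BILLION * bits / 8                 # GB of weights at this precision
    lpus = math.ceil(weights_gb / SRAM_PER_LPU_GB)
    print(f"{bits}-bit weights: ~{weights_gb:.0f} GB -> at least {lpus} LPUs")

# 16-bit: ~140 GB -> at least 609 LPUs
# 8-bit:  ~70 GB  -> at least 305 LPUs
```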
You can get around this by building a bigger chip – each of Cerebras' WSE-3 wafers features more than 40 GB of SRAM on board – but those chips are the size of a dinner plate and consume 23 kilowatts. Anyway, Groq hasn't gone this route.
Suffice it to say, if Nvidia wanted to make a chip that uses SRAM instead of HBM, it didn't need to buy Groq to do it.
Going with the data flow
So, what did Nvidia throw money at Groq for?
Our best guess is that it was really for Groq's "assembly line architecture." This is essentially a programmable data flow design built with the express purpose of accelerating the linear algebra calculations computed during inference.
Most processors today use a Von Neumann architecture: instructions are fetched from memory, decoded, and executed, and the results are written to a register or stored back in memory. Modern implementations add things like branch prediction, but the principles are largely the same.
Data flow works on a different principle. Rather than a bunch of load-store operations, data flow architectures essentially process data as it's streamed through the chip.
As Groq explains it, these data conveyor belts "move instructions and data between the chip's SIMD (single instruction/multiple data) function units."
"At each step of the assembly process, the function unit receives instructions via the conveyor belt. The instructions inform the function unit where it should go to get the input data (which conveyor belt), which function it should perform with that data, and where it should place the output data."
According to Groq, this architecture effectively eliminates bottlenecks that bog down GPUs, as it means the LPU is never waiting for memory or compute to catch up.
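To make the contrast concrete, here's a toy sketch of the dataflow idea using plain Python generators - nothing to do with Groq's actual ISA or compiler. Each "function unit" consumes values as they stream in from the previous stage and pushes results downstream, rather than issuing its own loads and stores against a shared memory.

```python
from typing import Iterable, Iterator

# Toy dataflow pipeline: stages are chained like conveyor belts, and data is
# transformed as it streams through rather than bouncing off memory between steps.

def scale(tiles: Iterable[list[float]], factor: float) -> Iterator[list[float]]:
    for tile in tiles:                      # function unit 1: elementwise multiply
        yield [x * factor for x in tile]

def accumulate(tiles: Iterable[list[float]]) -> Iterator[float]:
    for tile in tiles:                      # function unit 2: reduce each tile
        yield sum(tile)

tiles = ([float(i + j) for j in range(4)] for i in range(3))  # data streamed in
pipeline = accumulate(scale(tiles, 0.5))                      # units chained together
print(list(pipeline))                                         # [3.0, 5.0, 7.0]
```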
Groq can make this happen both within an LPU and between them, which is good news, as Groq's LPUs aren't that potent on their own. On paper, an LPU's BF16 performance is roughly on par with an RTX 3090's, and its INT8 performance with an L40S's. But remember, that's peak FLOPS under ideal circumstances. In theory, data flow architectures should be able to achieve better real-world performance for the same amount of power.
It's worth pointing out that data flow architectures aren't restricted to SRAM-centric designs. For example, NextSilicon's data flow architecture uses HBM. Groq opted for an SRAM-only design because it kept things simple, but there's no reason Nvidia couldn't build a data flow accelerator based on Groq's IP using SRAM, HBM, or GDDR.
So, if data flow is so much better, why isn't it more common? Because it's a royal pain to get right. But, Groq has managed to make it work, at least for inference.
And, as Ai2's Tim Dettmers recently put it, chipmakers like Nvidia are quickly running out of levers they can pull to juice chip performance. Data flow gives Nvidia new techniques to apply as it seeks extra speed, and the deal with Groq means Jensen Huang’s company is in a better position to commercialize it.
An inference-optimized compute stack?
Groq also provides Nvidia with an inference-optimized compute architecture, something Nvidia has been sorely lacking. Where it fits, though, is a bit of a mystery.
Most of Nvidia’s "inference-optimized" chips, like the H200 or B300, aren't fundamentally different from their "mainstream" siblings. In fact, the only real difference between the H100 and H200 was that the latter used faster, higher-capacity HBM3e, which just happens to benefit inference-heavy workloads.
As a reminder, LLM inference can be broken into two stages: the compute-heavy prefill stage, during which the prompt is processed, and the memory-bandwidth-intensive decode phase, during which the model generates output tokens.
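To see why the two stages stress such different parts of a chip, here's a rough arithmetic-intensity comparison - our own simplification, counting only weight traffic and ignoring attention and KV-cache reads.

```python
# Prefill amortizes one pass over the weights across every prompt token, while
# decode re-reads the full weight set for each generated token. The ratio of
# FLOPs to bytes moved tells you which resource the stage leans on.

PARAMS = 70e9            # 70B-parameter model
BYTES_PER_PARAM = 2      # 16-bit weights
PROMPT_TOKENS = 8192

prefill_flops = 2 * PARAMS * PROMPT_TOKENS   # ~2 FLOPs per parameter per token
prefill_bytes = PARAMS * BYTES_PER_PARAM     # weights streamed in once for the whole prompt
decode_flops = 2 * PARAMS                    # per generated token
decode_bytes = PARAMS * BYTES_PER_PARAM      # weights streamed in again for every token

print(f"prefill: ~{prefill_flops / prefill_bytes:.0f} FLOPs per byte")  # ~8192: compute-bound
print(f"decode:  ~{decode_flops / decode_bytes:.0f} FLOPs per byte")    # ~1: bandwidth-bound
```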
That's changing with Nvidia's Rubin generation of chips in 2026. Announced back in September, the [8]Rubin CPX is designed specifically to accelerate the compute-intensive prefill phase of the inference pipeline, freeing up its HBM-packed Vera Rubin superchips to handle decode.
This disaggregated architecture minimizes resource contention and helps to improve utilization and throughput.
Groq's LPUs are optimized for inference by design, but they don't have enough SRAM to make for a very good decode accelerator. They could, however, be interesting as a speculative decoding part.
If you're not familiar, [9]speculative decoding is a technique which uses a small "draft" model to predict the output of a larger one. When those predictions are correct, system performance can double or triple, driving down cost per token.
These draft models are generally quite small, often just a few billion parameters, which makes Groq's existing chips a plausible fit for the job.
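For the curious, here's a minimal sketch of the accept/reject loop at the heart of speculative decoding. The draft_next and target_next callables are stand-ins we've invented for real model calls, and production systems verify all the draft tokens in a single batched forward pass of the big model rather than one call at a time.

```python
def speculative_step(context, draft_next, target_next, k=4):
    """One round: the draft proposes k tokens, the target verifies them greedily."""
    # Cheap draft model proposes k tokens sequentially.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Target model checks the proposals; keep tokens until the first mismatch.
    accepted, ctx = [], list(context)
    for tok in proposed:
        if target_next(ctx) != tok:
            break
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))   # target always contributes one more token
    return accepted

# Toy usage: the draft guesses "previous token + 1"; the target agrees only up to 3.
draft  = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else 0
print(speculative_step([1], draft, target))   # [2, 3, 0]
```

When the draft and target agree, several tokens land for the cost of roughly one big-model pass, which is where the two-to-three-fold speedups come from.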
Do we need a dedicated accelerator for speculative decoding? Sure, why not. Is it worth $20 billion? Depends on how you measure it. Compared with publicly traded companies valued at around $20 billion, like HP Inc. or Figma, it may seem steep. But for Nvidia, $20 billion is relatively affordable – it recorded $23 billion in cash flow from operations last quarter alone. In the end, it means more chips and accessories for Nvidia to sell.
What about foundry diversification?
Perhaps the least likely take we've seen is the suggestion that Groq somehow opens up additional foundry capacity for Nvidia.
Groq currently uses GlobalFoundries to make its chips, and plans to build its next-gen parts on Samsung's 4 nm process tech. Nvidia, by comparison, does nearly all of its fabrication at TSMC and is heavily reliant on the Taiwanese giant’s advanced packaging tech.
The problem with this theory is that it doesn't actually make any sense. It's not as if Nvidia can't go to Samsung to fab its chips. In fact, it has done so before – the Korean giant made most of Nvidia’s Ampere-generation products. Nvidia needed TSMC's advanced packaging tech for some parts, like the A100, but it doesn’t need the Taiwanese company to make Rubin CPX. Samsung or Intel can probably do the job.
Porting designs to a new foundry takes time in any case, and licensing Groq's IP and hiring its team doesn't change that.
The reality is Nvidia may not do anything with Groq's current generation of LPUs. Jensen might just be playing the long game, as he's been known to do. ®
[1] https://www.cnbc.com/2025/12/24/nvidia-buying-ai-chip-startup-groq-for-about-20-billion-biggest-deal.html
[6] https://artificialanalysis.ai/providers/groq?speed=output-speed&latency=time-to-first-token&endpoints=groq_gpt-oss-120b-low%2Cgroq_gpt-oss-20b-low%2Cgroq_gpt-oss-120b%2Cgroq_gpt-oss-20b%2Cgroq_llama-3-3-instruct-70b%2Cgroq_llama-4-scout-instruct%2Cgroq_llama-4-maverick%2Cgroq_kimi-k2-0905%2Cgroq_llama-3-1-instruct-8b
[8] https://www.theregister.com/2025/09/10/nvidia_rubin_cpx/
[9] https://www.theregister.com/2024/12/15/speculative_decoding/
Re: other options
Yeah ... $20B is a bit much for a reverse acqui-hire when Intel managed to [1]'ambush' SambaNova's [2]unrivaled full-stack AI for [3]just $1.6B (or nearly so) ... suggesting masterful courtship.
And in the same throbbing vein, AMD [4]swallowed up [5]Untether earlier this year (acqui-hired for less than $100M), for what one expects are similar reasons.
Those dataflow moves should help them effectively tame the [6]exponential costs of linear progress in this field imho, leaving [7]Von Neumann in the dust for [8]some workloads (or subsets of workloads -- as discussed in TFA).
Plenty of tasty morsels left in the sea though, not to mention [9]related compilers and orchestration -- but too late for those salivating over [10]Confluent it seems ...
[1] https://www.investing.com/analysis/intel-sambanova-play-isnt-an-acquisition-its-an-ambush-200669623
[2] https://www.intelcapital.com/sambanova-unrivaled-full-stack-ai/
[3] https://www.bloomberg.com/news/articles/2025-12-12/intel-nears-1-6-billion-deal-for-ai-chip-startup-sambanova
[4] https://www.eetimes.com/untether-ai-shuts-down-engineering-team-joins-amd/
[5] https://www.nextplatform.com/2020/10/29/server-inference-chip-startup-untethered-from-ai-data-movement/
[6] https://www.theregister.com/2025/12/11/ai_superintelligence_fantasy/
[7] https://www.theregister.com/2025/11/27/tenstorrent_quietbox_review/
[8] https://www.theregister.com/2025/03/12/training_inference_shift/
[9] https://www.eetimes.com/lemurian-labs-raises-28-million-for-ai-portability-software/
[10] https://newsroom.ibm.com/2025-12-08-ibm-to-acquire-confluent-to-create-smart-data-platform-for-enterprise-generative-ai
How did space Karen get away with calling his sh1tty AI Grok, when there was already a well established AI business using the name Groq set up years prior?
Sure, they aren't spelled exactly the same, but they sound the same when spoken, so it would be easy for a layman to mistake the two as being the same company. As I initially did when I started reading the article.
I'm sure I couldn't set up a business called Happle and start selling phones and tablets without Apple's lawyers coming after me, so how has Musk got away with it for all this time?
@mark l 2
I'm assuming Musk chose his spelling (Grok) based on its usage in the Heinlein novel (Stranger in a Strange Land), long established in computing slang. Not sure where Groq got their name from (the q would imply it's not based on Heinlein) - there is a query language of that name, but it's not related to chip architecture or data flows.
> not sure where Groq got their name from
Just a guess that it's the same word, with the spelling startupified, like a million others have done. Presumably to make their names more searchable.
In any case, a note in the article to the effect that this is nothing to do with xAI's product would have been welcome.
100GW in space!
The xAI crowd’s latest ketamine-hallucination is that they can build a 100GW space-based orbiting super-intelligence cluster thing.
Someone took the fever dream and created a 3D rendering. If solar irradiance is 1kW/m^2, at 100% efficiency you wind up with a square-shape orbiting panel array 10km on a side.
(Whichever homophone “groq” or “grok”, gosh they both certainly bear a resemblance to the old “Gorf” space laser game? Treading on the coattails of greatness, much?)
I used to grok this
Groq's architecture sounds a lot like good old array processing with data pipelines and dedicated processing modules. OK, back in the day the processing was Add and Multiply instead of whatever these things do, but the idea of acceleration with parallel processing has a long history. GPUs are almost certainly using a similar architecture already. The original hardware from Floating Point Systems claimed (somewhat dubiously) 64 MegaFLOPS from a 4 MHz clock. It was a fun thing to work with, extracting the last ounce of efficiency by creatively arranging data flows and operations.
other options
NVIDIA just threw $5-6 billion at Intel as well. Why sell AI to others when you could BE the AI?
Also, could be the money merry go round. "You throw a billion at me, and I'll throw a billion at you."