
Stacking up Huawei’s rack-scale boogeyman against Nvidia’s best

(2025/07/29)


Analysis Nvidia has the green light to resume shipments of its H20 GPUs to China, but while the chip may be plentiful, bit barn operators in the region now have far more capable alternatives at their disposal.

Among the most promising is Huawei's CloudMatrix 384 rack systems, which it [1]teased at the World Artificial Intelligence Conference (WAIC) in Shanghai this week.

The system is powered by the Chinese IT goliath's latest Ascend neural processing unit (NPU), the P910C. Assuming you can get your hands on one, the chip promises more than twice the floating-point performance of the H20, plus more, albeit slower, memory to boot.

However, with its CloudMatrix systems, Huawei is clearly aiming a lot higher than Nvidia's sanctions-compliant silicon. Compared to Nvidia's Blackwell-based [3]GB200 NVL72 rack systems, Huawei's biggest iron boasts about 60 percent higher dense 16-bit floating point performance, roughly twice the memory bandwidth, and just over 3.5x the HBM.

How does a company effectively blacklisted from Western chip tech accomplish that? Simple: the CloudMatrix 384 is enormous, packing more than 5x the accelerators and taking up 16x the floor space of Nvidia's NVL72.

Dissecting the Ascend P910C

At the heart of the CloudMatrix 384 is Huawei's Ascend P910C NPU. Each of these accelerators comes equipped with a pair of compute dies stitched together using a high-speed chip-to-chip interconnect capable of shuttling data around at 540GB/s in aggregate, or 270GB/s in each direction.

Combined, these dies are capable of churning out 752 teraFLOPS of dense FP16/BF16 performance. Feeding all that compute are eight stacks of high-bandwidth memory totaling 128GB, which supplies 1.6TB/s of memory bandwidth to each of the compute dies for a total of 3.2TB/s.
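
If you want to sanity-check those figures, the package-level math is simple enough. The sketch below uses only the numbers quoted above; the even FP16 split across the two dies is our inference, not something Huawei breaks out.

```python
# Ascend P910C package totals, rolled up from the per-die figures above.
dies_per_package = 2
fp16_tflops_per_die = 752 / dies_per_package  # even split assumed: 376/die
hbm_bw_tbps_per_die = 1.6                     # quoted per-die HBM bandwidth
d2d_gbps_total = 540                          # die-to-die interconnect, total

print(f"FP16 dense: {fp16_tflops_per_die * dies_per_package:.0f} TFLOPS")   # 752
print(f"HBM bandwidth: {hbm_bw_tbps_per_die * dies_per_package:.1f} TB/s")  # 3.2
print(f"Die-to-die: {d2d_gbps_total / 2:.0f} GB/s per direction")           # 270
```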

[6]

Here's a breakdown of Huawei's latest NPU, the Ascend P910C

If you've been keeping track of AI chip development, you'll know that's not exactly what you'd call competitive in 2025. For comparison, Nvidia's nearly two-year-old H200 boasts about 83 teraFLOPS more FP16 performance, 13GB more HBM, and 1.6TB/s more memory bandwidth.

Since you can't exactly buy an H200 in China – at least not [7]legally – the better comparison would be to the H20, which Nvidia is set to [8]resume shipping any day now. While the H20 still holds a narrow advantage in memory bandwidth, the Ascend P910C has more HBM (128GB vs 96GB) and more than twice the floating-point performance.
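
To put those chip-level comparisons side by side, here's a minimal sketch built only from figures quoted in this article. The H200 numbers are derived from the stated deltas; the H20's exact FP16 rate and memory bandwidth aren't given here, so they're omitted rather than guessed.

```python
# Chip-level comparison using only figures from this article.
p910c = {"fp16_tflops": 752, "hbm_gb": 128, "bw_tbps": 3.2}
h200 = {
    "fp16_tflops": 752 + 83,  # ~835, from the stated 83 TFLOPS delta
    "hbm_gb": 128 + 13,       # 141, from the stated 13GB delta
    "bw_tbps": 3.2 + 1.6,     # 4.8, from the stated 1.6TB/s delta
}
h20 = {"hbm_gb": 96}          # the article only pins down the H20's capacity

print(f"H200 FP16 lead over P910C: {h200['fp16_tflops'] / p910c['fp16_tflops']:.2f}x")
print(f"P910C HBM lead over H20:   {p910c['hbm_gb'] / h20['hbm_gb']:.2f}x")
```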

The P910C may not support FP8, but Huawei argues that INT8 is nearly as good, at least so far as inference is concerned.

Individually, the P910C presents a compelling alternative to Nvidia's China-spec accelerators, even if it's no match for the GPU giant's latest batch of Blackwell chips.

NPUs together strong

Most cutting-edge large language models aren't being trained or run on a single chip, however. There's simply not enough compute, memory, or bandwidth to make that work. Because of this, the chip's individual performance is less important than how efficiently you can scale it up and out. And that's exactly what Huawei has designed its latest NPUs to do.

Huawei's Ascend P910C features an NVLink-like scale-up interconnect, or unified bus (UB), which allows Huawei to stitch multiple accelerators together into one great big one, just like Nvidia does with its HGX and NVL72 servers and rack systems.

Each P910C accelerator features 14 UB links (seven per compute die), each good for 28GB/s, which connect to seven UB switch ASICs baked into each node to form a fully non-blocking all-to-all mesh across eight NPUs and four Kunpeng CPUs per node.

Unlike Nvidia's H20 or B200 boxes, Huawei's UB switches have a bunch of spare ports that connect up to a second tier of UB spine switches. This is what allows Huawei to scale from eight NPUs per box to 32 per rack or 384 per "supernode" — hence the name CloudMatrix 384.
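
The fabric math works out as shown below. Everything here comes from the counts quoted above, except the four-nodes-per-rack figure, which we've inferred from 32 NPUs per rack at eight per node.

```python
# Back-of-the-envelope view of the CloudMatrix 384's UB scale-up fabric.
ub_links_per_npu = 14   # seven per compute die
ub_link_gbps = 28       # per-link bandwidth
npus_per_node = 8       # alongside four Kunpeng CPUs and seven UB switches
nodes_per_rack = 4      # inferred: 32 NPUs per rack / 8 per node
compute_racks = 12      # plus four racks of networking

scale_up_bw = ub_links_per_npu * ub_link_gbps               # 392 GB/s per NPU
supernode = npus_per_node * nodes_per_rack * compute_racks  # 384 NPUs

print(f"UB bandwidth per NPU: {scale_up_bw} GB/s")
print(f"NPUs per supernode: {supernode}")
```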

[10]

This diagram offers a look at how the CloudMatrix 384's accelerators act like one great big server

From a rack-to-rack standpoint, Nvidia's GB200 NVL72 systems are upwards of 7.5x faster at FP16/BF16, and offer 5.6x the memory bandwidth and 3.4x the memory capacity. However, Nvidia's rack systems support a compute domain of up to 72 GPUs, less than one-fifth as many as Huawei's. That's how the Chinese IT giant can claim greater system-level performance than its Western rival, on paper at least.
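
Those per-rack ratios can be reconstructed from the chip specs above, paired with Nvidia's published NVL72 figures (roughly 180 petaFLOPS of dense FP16, 13.4TB of HBM3e, and 576TB/s of memory bandwidth, which come from Nvidia's datasheet rather than this article).

```python
# Per-rack comparison: 32 P910Cs (figures from this article) vs one
# GB200 NVL72 (Nvidia's published specs).
cm_rack = {"pflops": 32 * 0.752, "hbm_tb": 32 * 0.128, "bw_tbps": 32 * 3.2}
nvl72 = {"pflops": 180, "hbm_tb": 13.4, "bw_tbps": 576}

for metric in cm_rack:
    print(f"{metric}: NVL72 leads by {nvl72[metric] / cm_rack[metric]:.1f}x")
# pflops ~7.5x, hbm_tb ~3.3x, bw_tbps ~5.6x -- in line with the figures
# above, give or take rounding.
```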

As you might expect, with just 32 NPUs per rack, the full CloudMatrix 384 is a lot bigger than Nvidia's NVL72. Huawei's biggest AI iron spans 16 racks: 12 for compute and four for networking.

We'll note that, technically, Nvidia's NVLink switch tech can support scale-up networks with up to 576 GPUs, but we've yet to see such a system in the wild.

For deployments requiring more than 384 NPUs, Huawei's CloudMatrix also sports 400Gbps of scale-out networking per accelerator. This, the company claims, allows for training clusters with up to 165,000 NPUs.

Inference performance

At least for inference, these large-scale compute fabrics present a couple of advantages, particularly when it comes to the flurry of massive mixture-of-experts (MoE) models coming out of China these days.

More chips mean operators can better leverage techniques like tensor, data, and expert parallelism to boost inference throughput and drive down the overall cost per token.

In the case of the CloudMatrix 384, a mixture-of-experts model like DeepSeek R1 could be configured so that each NPU die hosts a single expert, Huawei explained in a paper [12]published last month.
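
As a toy illustration of that placement scheme: DeepSeek R1 is publicly reported to use 256 routed experts per MoE layer (a figure from DeepSeek's papers, not this article), while a CloudMatrix 384 offers 768 compute dies, so every expert can get a die to itself. The round-robin assignment below is ours, not Huawei's actual scheduler.

```python
# Hypothetical one-expert-per-die placement on a CloudMatrix 384.
num_experts = 256   # routed experts per MoE layer, per DeepSeek's papers
num_dies = 384 * 2  # 384 NPUs, two compute dies each

placement = {expert: expert % num_dies for expert in range(num_experts)}

# With more dies than experts, no die hosts two experts, leaving spare
# dies for shared experts, replicas, or other parts of the model.
assert len(set(placement.values())) == num_experts
print(f"{num_dies - num_experts} dies left over")  # 512
```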

To enable this, Huawei has developed an LLM inference serving platform called CloudMatrix-Infer, which disaggregates prefill, decode, and caching. "Unlike existing KV cache-centric architectures, this design enables high-bandwidth, uniform access to cached data via the UB network, thus reducing data locality constraints, simplifying task scheduling, and improving cache efficiency," the researchers wrote.
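
To give a flavor of what that disaggregation means, here's a deliberately simplified sketch, ours rather than Huawei's implementation: prefill workers publish KV caches to a shared pool that any decode worker can read, standing in for the uniform access the UB network provides.

```python
# Toy model of disaggregated prefill/decode with a shared KV-cache pool.
# Illustrative only -- not CloudMatrix-Infer's actual architecture.
kv_pool: dict[str, list[str]] = {}  # request ID -> cached "KV" entries

def prefill(req_id: str, prompt: str) -> None:
    # A prefill worker processes the full prompt once and publishes the
    # resulting cache to the shared pool (the UB fabric, in Huawei's design).
    kv_pool[req_id] = prompt.split()

def decode(req_id: str) -> str:
    # Any decode worker can pick the request up, since the cache is
    # uniformly accessible rather than pinned to one node.
    context = kv_pool[req_id]
    return f"<next token, conditioned on {len(context)} cached entries>"

prefill("req-1", "why is the sky blue")
print(decode("req-1"))
```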

If any of that sounds familiar, that's because Nvidia announced a similar system for its GPUs back at GTC called Dynamo, which we took a deep look at [13]back in March.

Testing on DeepSeek-R1, Huawei showed CloudMatrix-Infer dramatically increased performance, with a single NPU processing 6,688 input tokens a second while generating 1,943 output tokens a second.

That might sound incredible, but it's worth pointing out that this was aggregate throughput at a batch size of 96. Individual performance was closer to 50ms per output token, or 20 tokens a second. Pushing individual performance to around 66 tokens a second, something that's likely to make a noticeable difference for thinking models like R1, cuts the NPU's overall throughput to 538 tokens per second at a batch size of eight.

Under ideal conditions, Huawei says it was able to achieve a prompt-processing efficiency of 4.5 tokens/sec per teraFLOPS, putting it just ahead of Nvidia's H800 at 3.96 tokens/sec per teraFLOPS. Huawei demonstrated similar performance during the decode phase, where the rack system eked out a roughly 10 percent lead over Nvidia's H800. As usual, take these vendor claims with a grain of salt. Inference performance is heavily dependent on your workload.
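
Those numbers can be sanity-checked with a few lines of arithmetic. The inputs all come from the figures above; the final check, which normalizes prefill throughput against INT8 rather than FP16 rates, is our inference rather than something Huawei states outright.

```python
# Sanity-checking Huawei's published inference figures.
print(96 * (1 / 0.050))  # batch 96 x 20 tok/s/stream ~= 1,920 (1,943 reported)
print(8 * 66)            # batch 8 x 66 tok/s/stream  =    528 (538 reported)

# The 4.5 tokens/sec per teraFLOPS prefill figure lines up if you divide
# by INT8 throughput (2x the 752 FP16 TFLOPS) -- our assumption.
print(6688 / (752 * 2))  # ~4.45
```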

Power, density, and cost

While tokens/sec per teraFLOPS may offer some insight into the overall efficiency of the system, the more important metric in practice is how much the system's tokens cost to produce, usually measured in tokens per dollar and tokens per watt.

So while the CloudMatrix 384's sheer scale allows it to compete with and even exceed the performance of Nvidia's much more powerful Blackwell systems, that doesn't matter much if it's more expensive to deploy and operate.

Official power ratings for Huawei's CloudMatrix systems are hard to pin down, but SemiAnalysis has [14]speculated that the complete system is likely pulling somewhere in the neighborhood of 600 kilowatts all in. That's compared to the roughly 120kW of the GB200 NVL72.

Assuming those estimates are accurate, that'd make Nvidia's NVL72 not only several times more compute-dense, but also more than 3x as power-efficient, at 1,500 gigaFLOPS per watt versus Huawei's 460 gigaFLOPS per watt.
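
Using SemiAnalysis' 600kW estimate and the compute figures above, the efficiency gap works out as follows. Our CloudMatrix result lands a touch above the 460 gigaFLOPS-per-watt figure cited, which suggests that number assumes a slightly higher power draw.

```python
# Power efficiency in gigaFLOPS per watt, from the estimates above.
nvl72_pflops, nvl72_kw = 180, 120    # published specs, ~120kW per rack
cm_pflops, cm_kw = 384 * 0.752, 600  # ~289 PFLOPS, SemiAnalysis' estimate

print(nvl72_pflops * 1e6 / (nvl72_kw * 1e3))  # 1,500 GFLOPS/W
print(cm_pflops * 1e6 / (cm_kw * 1e3))        # ~481 GFLOPS/W
```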

Access to cheap power may be a major bottleneck in the West, but it is not necessarily such a big deal in China. Over the past few years, Beijing has [15]invested aggressively in its national grid systems, building out large numbers of solar farms and nuclear reactors to offset its reliance on coal-fired power plants.

[16]Taxman picks up $140M tab after Cadence fined for China export violations

[17]A billion dollars' worth of Nvidia chips fell off a truck and found their way to China, report says

[18]Republican calls out Trump admin's decision to resume GPU sales to China

[19]AMD cleared to join Nvidia and resume selling some underpowered AI chips to China

The bigger issue may be infrastructure cost. Huawei's CloudMatrix 384 will [20]reportedly retail for somewhere in the neighborhood of $8.2 million. Nvidia's NVL72 rack systems are [21]estimated to cost around $3.5 million apiece.
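
On price-performance alone, the gap looks like this, pairing those price tags with the dense FP16 figures above.

```python
# Dollars per dense FP16 petaFLOPS at the reported/estimated prices.
cm_price, cm_pflops = 8.2e6, 384 * 0.752  # ~$8.2M for ~289 PFLOPS
nv_price, nv_pflops = 3.5e6, 180          # ~$3.5M for 180 PFLOPS

print(f"CloudMatrix 384: ${cm_price / cm_pflops:,.0f} per petaFLOPS")  # ~$28.4k
print(f"GB200 NVL72:     ${nv_price / nv_pflops:,.0f} per petaFLOPS")  # ~$19.4k
# A roughly 1.5x premium per FLOP, before power and cooling costs.
```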

But if you happen to be a Chinese model dev, Nvidia's NVL racks aren't even a consideration. Thanks to Uncle Sam's export controls on AI accelerators, Huawei doesn't have much, if any, competition in the rack-scale arena, and its only major bottleneck may be just how many P910Cs China's foundry champion SMIC can pump out.

Lawmakers in the US remain [22]convinced that SMIC lacks the ability to manufacture chips of this complexity in high volumes. Then again, not too many years ago, industry experts believed SMIC [23]lacked the tech necessary to manufacture 7nm and smaller process nodes, which turned out not to be the case.

It remains to be seen in what volumes Huawei will be able to churn out CloudMatrix systems, but in the meantime, Nvidia CEO Jensen Huang would be happy to pack Chinese datacenters with as many H20s as they can handle. Nvidia has [24]reportedly ordered another 300,000 H20 chips from TSMC to meet strong demand from Chinese customers.

The Register reached out to Huawei for comment, but hadn't heard back at the time of publication. ®




[1] https://www.reuters.com/world/china/huawei-shows-off-ai-computing-system-rival-nvidias-top-product-2025-07-26/

[3] https://www.theregister.com/2024/03/21/nvidia_dgx_gb200_nvk72/

[6] https://regmedia.co.uk/2025/07/29/huawei_ascend_p910c.jpg

[7] https://www.theregister.com/2025/07/24/nvidia_chips_china_whoops/

[8] https://www.theregister.com/2025/07/15/us_allows_nvidia_china_sales/

[10] https://regmedia.co.uk/2025/07/29/huawei_cloudmatrix_384.jpg

[12] https://arxiv.org/html/2506.12708v1#S3

[13] https://www.theregister.com/2025/03/23/nvidia_dynamo/

[14] https://semianalysis.com/2025/04/16/huawei-ai-cloudmatrix-384-chinas-answer-to-nvidia-gb200-nvl72/

[15] https://www.economist.com/china/2023/11/30/china-is-building-nuclear-reactors-faster-than-any-other-country

[16] https://www.theregister.com/2025/07/29/cadence_fine_export_violations/

[17] https://www.theregister.com/2025/07/24/nvidia_chips_china_whoops/

[18] https://www.theregister.com/2025/07/18/trump_gpu_china/

[19] https://www.theregister.com/2025/07/16/amd_china_chips/

[20] https://www.ft.com/content/cac568a2-5fd1-455c-b985-f3a8ce31c097

[21] https://www.theregister.com/2025/05/27/oracle_openai_40b/

[22] https://www.theregister.com/2025/06/23/huaweis_foldable_shows_china_years_behind_tsmc/

[23] https://www.theregister.com/2022/07/22/china_smic_7nm_chips/

[24] https://www.reuters.com/world/china/nvidia-orders-300000-h20-chips-tsmc-due-robust-china-demand-sources-say-2025-07-29/
