Nvidia leans on emulation to squeeze more HPC oomph from AI chips in race against AMD
- Reference: 1768738874
- News link: https://www.theregister.co.uk/2026/01/18/nvidia_fp64_emulation/
- Source link:
This emulation, we should note, hasn't replaced hardware FP64 in Nvidia's GPUs. Nvidia's newly [1]unveiled Rubin GPUs still deliver about 33 teraFLOPS of peak FP64 performance, but that's actually one teraFLOP less than the now four-year-old H100.
If you switch on software emulation in Nvidia's [2]CUDA libraries, the chip can purportedly achieve up to [3]200 teraFLOPS of FP64 matrix performance. That's 4.4x what its outgoing Blackwell accelerators could muster in hardware.
On paper, Rubin isn't just Nvidia's most powerful AI accelerator ever; it's also the company's most potent GPU for scientific computing in years.
"What we found is, through many studies with partners and with our own internal investigations, is that the accuracy that we get from emulation is at least as good as what we would get out of a tensor core piece of hardware," Dan Ernst, senior director of supercomputing products at Nvidia, told El Reg .
Emulated FP64, which is not exclusive to Nvidia, has the potential to dramatically improve the throughput and efficiency of modern GPUs. But not everyone is convinced.
"It's quite good in some of the benchmarks, it's not obvious it's good in real, physical scientific simulations," Nicholas Malaya, an AMD fellow, told us. He argued that, while FP64 emulation certainly warrants further research and experimentation, it's not quite ready for prime time.
Why FP64 still matters in the age of AI
Even as chip designs push for ever lower-precision data types, FP64 remains the gold standard for scientific computing for good reason. FP64 is unmatched in its dynamic range, capable of expressing more than 18.44 quintillion (2^64) unique values.
To put that in perspective, modern AI models like DeepSeek R1 are commonly trained at FP8, which can express a paltry 256 unique values. Taking advantage of the general homogeneity of neural networks, block-floating-point data types like MXFP8 or MXFP4 can be used to [8]expand their dynamic range.
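For the curious, here's a minimal NumPy sketch of the shared-exponent idea behind block formats like MXFP8: each small block of values shares one power-of-two scale, so the set of blocks can cover a wide range even though each value keeps only a few explicit bits. The block size and mantissa width below are illustrative assumptions, not the actual OCP Microscaling spec.

```python
import numpy as np

def block_quantize(x, block=32, mant_bits=3):
    # Toy shared-exponent ("block floating point") quantizer: each block of
    # `block` values shares one power-of-two scale picked from the largest
    # magnitude in the block, and each value then keeps only `mant_bits`
    # explicit bits. Illustrative only -- not the real MXFP8/MXFP4 format.
    x = x.reshape(-1, block)
    peak = np.max(np.abs(x), axis=1, keepdims=True) + 1e-300   # avoid log2(0)
    scale = 2.0 ** np.ceil(np.log2(peak))                      # shared per-block exponent
    steps = 2 ** mant_bits
    q = np.round(x / scale * steps) / steps                    # coarse per-value mantissa
    return q * scale

# Two blocks with very different magnitudes: each gets its own scale, so both
# survive quantization despite only a few bits per value.
vals = np.concatenate([np.full(32, 3e-4), np.full(32, 7e2)])
print(block_quantize(vals)[0, 0], block_quantize(vals)[-1, -1])
```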
That's fine for the fuzzy math that defines large language models, but it's no replacement for FP64, particularly when it's the difference between life and death.
Unlike AI workloads, which are highly error-tolerant, HPC simulations rely on fundamental physical principles like conservation of mass and energy. "As soon as you start incurring errors, these finite errors propagate, and they cause things like blow ups," Malaya said.
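A toy illustration of that compounding, and emphatically not a physical simulation: run a deliberately sensitive recurrence in FP32 and FP64 and watch the two trajectories drift apart as rounding differences feed forward step by step.

```python
import numpy as np

def run(dtype, steps=60):
    # a deliberately sensitive recurrence (logistic map), used here only to
    # show how a tiny rounding difference compounds over many iterations
    x = dtype(0.3)
    for _ in range(steps):
        x = dtype(3.9) * x * (dtype(1.0) - x)
    return float(x)

# the FP32 and FP64 trajectories start "identically" but end up far apart
print(run(np.float32), run(np.float64))
```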
Emulated FP64 and the Ozaki scheme
Using lower-precision, often integer, datatypes to emulate FP64 isn't a new idea. "Emulation is old as dirt," Ernst said. "We had emulation in the mid '50s before we had hardware for floating point."
This process required significantly more operations to complete, and often incurred a stiff performance penalty as a result, but enabled floating point mathematics even when hardware lacked a dedicated floating point unit (FPU).
By the 1980s, FPUs were becoming commonplace and the need for emulation largely disappeared. However, in early 2024, researchers at the Tokyo and Shibaura institutes of technology published a [10]paper reviving the concept by showing that FP64 matrix operations could be decomposed into multiple INT8 operations that, when run on Nvidia's tensor cores, achieved higher-than-native performance.
This approach is commonly referred to as the Ozaki scheme, and it's the foundation for Nvidia's own FP64 emulation libraries, which were [11]released late last year. And, as Ernst was quick to point out, "it's still FP64. It's not mixed precision. It's just done and constructed in a different way from the hardware perspective."
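For a flavor of how that construction works, here's a rough NumPy sketch of the slicing idea: each FP64 operand is split into a handful of short-significand slices whose pairwise products are (nearly) exact, and summing those products recovers the full-precision result. The slice count and width are illustrative, and the slice products here run in ordinary FP64 rather than on INT8 tensor cores, so this shows the decomposition rather than Nvidia's actual library.

```python
import numpy as np

def split_slices(x, num_slices=4, bits=12):
    # Split an FP64 matrix into short-significand slices: x ~= s0 + s1 + ...
    # Each slice keeps roughly `bits` significand bits, so pairwise slice
    # products are (nearly) exact. Rough sketch only: the real Ozaki scheme
    # scales rows/columns so slices become small integers for INT8 tensor cores.
    slices, r = [], x.copy()
    for _ in range(num_slices):
        _, ex = np.frexp(r)                        # r = m * 2**ex, |m| in [0.5, 1)
        shift = (bits - ex).astype(int)
        s = np.ldexp(np.round(np.ldexp(r, shift)), -shift)
        slices.append(s)
        r = r - s                                  # carry the remainder to the next slice
    return slices

def emulated_dgemm(A, B, num_slices=4):
    # Accumulate every pairwise slice product; accuracy improves with more slices.
    C = np.zeros((A.shape[0], B.shape[1]))
    for a in split_slices(A, num_slices):
        for b in split_slices(B, num_slices):
            C += a @ b
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
ref = A @ B
print(np.max(np.abs(emulated_dgemm(A, B) - ref)) / np.max(np.abs(ref)))  # tiny, near FP64 rounding
```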
Modern GPUs are packed with low-precision tensor cores. Even without the fancy adaptive compression found in Rubin's tensor cores, the chips are capable of 35 petaFLOPS of dense FP4 compute. By comparison, at FP64, the chips are more than 1,000x slower.
These low-precision tensor cores are really efficient to build and run, so the question became: why not use them to do FP64, Ernst explained. "We have the hardware, let's try to use it. That's the history of supercomputing."
But is it actually accurate?
While Nvidia is keen to highlight the capabilities FP64 emulation enables on its Rubin and even its older Blackwell GPUs, rival AMD doesn't believe the approach is quite ready.
According to Malaya, FP64 emulation works best for well-conditioned numerical systems, with the High Performance Linpack (HPL) benchmark being a prime example. "But when you look at material science, combustion codes, banded linear algebra systems, things like that, they are much less well conditioned systems, and suddenly it starts to break down," he said.
In other words, whether or not FP64 emulation makes sense actually depends on the application in question. For some it's fine, while in others it's not.
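Conditioning is easy to see in miniature: inject a tiny relative perturbation into a linear solve, standing in for the extra, non-IEEE rounding an emulated path might add, and compare how far the answer moves for a well-conditioned versus an ill-conditioned matrix. The matrices below are illustrative textbook examples, not HPC codes.

```python
import numpy as np

rng = np.random.default_rng(1)

def solution_shift(A, eps=1e-13):
    # How far x from A x = b moves when A picks up a tiny relative perturbation.
    b = rng.standard_normal(A.shape[0])
    x = np.linalg.solve(A, b)
    x_pert = np.linalg.solve(A * (1.0 + eps * rng.standard_normal(A.shape)), b)
    return np.linalg.norm(x - x_pert) / np.linalg.norm(x)

n = 10
i, j = np.indices((n, n))
hilbert = 1.0 / (i + j + 1.0)                              # classic ill-conditioned matrix
friendly = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # close to identity, low condition number

for name, A in [("well-conditioned", friendly), ("ill-conditioned", hilbert)]:
    print(name, f"cond={np.linalg.cond(A):.1e}", f"shift={solution_shift(A):.1e}")
```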
One of the major sticking points for AMD is that FP64 emulation isn't exactly IEEE compliant. Nvidia's algorithms don't account for things like positive versus negative zeros, NaN (not-a-number) values, or infinities.
Because of this, small errors in the intermediary operations used to emulate the higher precision can result in perturbations that can throw off the final result, Malaya explained.
One way around this is to increase the number of operations used. However, at a certain point, the sheer number of operations required outweighs any advantage emulation might have provided.
All of those operations also take up memory. "We have data that shows you're using about twice the memory capacity in Ozaki to emulate those FP64 matrices," Malaya said.
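As a back-of-envelope, and with an assumed slice count rather than Nvidia's actual parameters: keeping eight INT8 slices per operand alongside the original FP64 data roughly doubles the working set, and a naive emulated DGEMM needs one low-precision GEMM per pair of slices (implementations can typically skip slice pairs too small to affect the final sum).

```python
# Back-of-envelope for Ozaki-style overheads, using an assumed slice count.
slices = 8                      # assumed INT8 slices per FP64 operand
slice_bytes = slices * 1        # one byte per INT8 slice element
fp64_bytes = 8                  # the original FP64 matrix still has to live somewhere
working_set_vs_native = (fp64_bytes + slice_bytes) / fp64_bytes   # ~2x, in line with the quote above
pairwise_gemms = slices ** 2    # naive count of low-precision GEMMs per emulated DGEMM
print(working_set_vs_native, pairwise_gemms)   # 2.0, 64
```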
For these reasons, the House of Zen is focusing its attention on specialized hardware for applications that rely on double and single precision. Its upcoming MI430X takes advantage of AMD's chiplet architecture to bolster double and single precision hardware performance.
Filling the gaps
The challenges facing FP64 emulation algorithms like the Ozaki scheme aren't lost on Ernst, who is well aware of the gaps in Nvidia's implementation.
Ernst contended that, for most HPC practitioners, things like positive versus negative zeroes aren't that big a deal. Meanwhile, Nvidia has developed supplemental algorithms to detect and mitigate issues like NaNs, infinities, and inefficient emulation operations.
As for memory consumption, Ernst conceded that it can be a bit higher but emphasized that this overhead is relative to the operation, not the application itself. Most of the time, he said, we're talking about matrices that are at most a few gigabytes in size.
So while it's true that FP64 emulation isn't IEEE-compliant, whether this actually matters is heavily dependent on the application in question, Ernst argued. "Most of the use cases where IEEE compliance ordering rules are in play don't come up in matrix, matrix multiplication cases. There's not a DGEMM that tends to actually follow that rule anyway," he said.
Great for matrices, not so much for vectors
Even if Nvidia can overcome the potential pitfalls of FP64 emulation, it doesn't change the fact that the method is only useful for a subset of HPC applications that rely on dense general matrix multiply (DGEMM) operations.
According to Malaya, for somewhere between 60 and 70 percent of HPC workloads, emulation offers little to no benefit.
"In our analysis the vast majority of real HPC workloads rely on vector FMA, not DGEMM," he said. "I wouldn't say it's a tiny fraction of the market, but it's actually a niche piece."
For vector-heavy workloads, like computational fluid dynamics, Nvidia's Rubin GPUs are forced to run on the slower FP64 vector accelerators in the chip's CUDA cores.
However, as Ernst was quick to point out: more FLOPS doesn't always mean useful FLOPS. The same workloads that tend to run on the FP64 vector engines rarely manage to harness more than a fraction of the chip's theoretical performance, all because the memory can't keep up.
We see this quite clearly on the TOP500's vector-heavy High Performance Conjugate Gradient benchmark, where CPUs tend to dominate thanks to the higher ratio of bytes per FLOP afforded by their memory subsystems.
Rubin may not deliver the fastest FP64 vector performance, but with 22 TB/s of HBM4 bandwidth, its real-world performance in these workloads is likely to be much higher than the spec sheet would suggest.
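A rough roofline-style sanity check makes the point: for memory-bound FP64 vector work, achievable throughput is bandwidth times arithmetic intensity, and the intensity assumed below (typical of SpMV- or stencil-like kernels, not a measured figure) caps throughput far below the peak FP64 number on the spec sheet.

```python
# Roofline-style sanity check: memory-bound FP64 vector work is capped by
# bandwidth x arithmetic intensity, not by the peak-FLOPS line on the spec sheet.
# The arithmetic intensity is an assumed SpMV/stencil-like figure, not a measurement.
hbm_bandwidth_tbs = 22.0       # Rubin HBM4 bandwidth, per the article (TB/s)
arithmetic_intensity = 0.125   # assumed FLOP per byte moved
peak_fp64_tflops = 33.0        # Rubin peak FP64, per the article

bandwidth_bound = hbm_bandwidth_tbs * arithmetic_intensity   # TFLOPS
print(f"achievable ~{min(bandwidth_bound, peak_fp64_tflops):.2f} TFLOPS of {peak_fp64_tflops} peak")
```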
[12]AMD threatens to go medieval on Nvidia with Epyc and Instinct: What we know so far
[13]Every conference is an AI conference as Nvidia unpacks its Vera Rubin CPUs and GPUs at CES
[14]Trump's AI 'Genesis Mission' emerges from Land of Confusion
[15]HPC won't be an x86 monoculture forever – and it's starting to show
Ready or not, here FP64 emulation comes
With an influx of new supercomputers [16]powered by Nvidia's Blackwell and Rubin GPUs coming online over the next few years, any questions regarding the viability of FP64 emulation will be put to the test sooner rather than later.
And since this emulation isn't tied to specific hardware, there's the potential for the algorithms to improve over time as researchers uncover scenarios where the technique excels or struggles.
Despite Malaya's concerns, he noted that AMD is also investigating the use of FP64 emulation on chips like the MI355X, through software flags, to see where it may be appropriate.
IEEE compliance, he told us, would go a long way towards validating the approach by ensuring that the results you get from emulation are the same as what you'd get from dedicated silicon.
"If I can go to a partner and say run these two binaries: this one gives you the same answer as the other and is faster, and yeah under the hood we're doing some scheme — think that's a compelling argument that is ready for prime time," Malaya said.
It may turn out that, for some applications, emulation is more reliable than others, he noted. "We should, as a community, build a basket of apps to look at. I think that's the way to progress here." ®
[1] https://www.theregister.com/2026/01/05/ces_rubin_nvidia/
[2] https://developer.nvidia.com/blog/unlocking-tensor-core-performance-with-floating-point-emulation-in-cublas/
[3] https://developer.nvidia.com/blog/inside-the-nvidia-rubin-platform-six-new-chips-one-ai-supercomputer/
[8] https://www.theregister.com/2025/08/10/openai_mxfp4/
[10] https://arxiv.org/abs/2306.11975
[11] https://developer.nvidia.com/blog/unlocking-tensor-core-performance-with-floating-point-emulation-in-cublas/
[12] https://www.theregister.com/2026/01/07/mi500x_amd_ai/
[13] https://www.theregister.com/2026/01/05/ces_rubin_nvidia/
[14] https://www.theregister.com/2025/12/11/doe_genesis_mission_funding/
[15] https://www.theregister.com/2025/11/27/arm_riscv_hpc/
[16] https://www.theregister.com/2025/10/28/nvidia_oracle_supercomputers_doe/
Re: "By the 1980s, FPUs were becoming commonplace"
Intel used to sell a separate math coprocessor floating point accelerator chip which you could buy and install in your PC. So the 8086/8088 for example had a corresponding 8087, and the 80286 and 80386 had corresponding 80287 and 80387 math chips.
The big market for the 8087 was people running Lotus-123. Applications had to be written specifically to recognize that the 8087 was present and to make use of it, and Lotus 123 was one of the few which did. Since Lotus 123 completely dominated the business spreadsheet market, and since spreadsheets were one of the main uses for PCs, the Lotus market was closely associated with 8087 sales. If I recall correctly, you could buy a package which included both Lotus-123 and an 8087 chip together. I don't know if that was direct from Lotus however, or if it was something that distributors put together.
I was under the impression though that the 80486 had floating point math built in as standard. The 486SX actually disabled the on chip floating point unit so they could sell it at a lower price without affecting sales of the higher priced standard 80486. If you then bought a 487SX and installed it later, it actually was a full 486 chip which disabled the 486SX and took over all of the CPU duties.
Re: "By the 1980s, FPUs were becoming commonplace"
Not only 80x87 (and I know that history). There were MANY coprocessor manufacturers around. Did not know Lotus 123 used it too! But to make a difference you surly (Shirley?) had to calculate enough to make it pay for itself. Which is usually scientific/engineering, and to some extent the financial area. For most Lotus-123 users the speed difference did not matter, since typing in the data took more time :D.
Re: "By the 1980s, FPUs were becoming commonplace"
Yes--thank you for mentioning this, I remembered buying a FP add-on chip but couldn't remember which system it was for.
Being compliant with things like NaN (Not a Number) and +/- infinity is actually pretty important with floating point. I have done a fair bit of work with floating point SIMD (CPU based, not GPU) on large arrays of data and the "proper" way to deal with errors in most cases is to let them flow through to the end and check for them then rather than to check as you go along. NaN and infinity handling is designed so that once you get one of them as a result it continues to propagate through the math. Doing the check at the end results in insignificant error checking overhead, while doing it as you go along results in a lot of overhead and a significant performance hit.
What this means is that if you emulate floating point, if you don't handle NaN and infinity the same way as is "normal", people may have to come up with entirely new algorithms at the application level. It also means that well proven math libraries may work most of the time under emulation, but give incorrect results for edge cases. Figuring out what those edge cases are is non-trivial once you are dealing with applications rather than benchmarks.
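A minimal NumPy sketch of the check-at-the-end pattern described here: NaNs produced mid-stream simply propagate through later arithmetic, and one scan at the very end flags the affected elements without per-element branching.

```python
import numpy as np

data = np.array([2.0, -1.0, 0.0, 4.0])
with np.errstate(divide="ignore", invalid="ignore"):
    out = np.sqrt(data) / data      # sqrt(-1) -> NaN, 0/0 -> NaN; both just flow through
    out = out * 3.0 + 1.0           # NaNs keep propagating through later arithmetic

# one cheap check at the very end, instead of branching on every element
if not np.all(np.isfinite(out)):
    print("bad elements at indices:", np.where(~np.isfinite(out))[0])
```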
The performance advantages of having errors flow through in a predictable manner are so great that as I understand it, some of the hardware people at CPU companies are talking about introducing similar features for integer math to deal with overflow, although I have no idea how this would work. Traditional simple overflow trapping apparently is somewhat problematic with instruction pipelines, instruction re-ordering, and SIMD.
As for this emulation system, I can't realistically see it being used outside of very specific hand coded libraries handling very specific algorithms for very specific applications. They're really competing against SIMD, and the latter has been getting better as well.
“NaN and infinity handling is designed so that once you get one of them as a result it continues to propagate through the math.”
With respect, this is a common misconception and a massive problem in numerical modelling codes that I see. In fact, the problem isn’t “dealing with infinity”, the problem is “dealing with near-singular numbers that aren’t infinity (or zero)”. So, if one of your intermediate results is 2.01x10^300, it is perfectly representable, but is it correct and meaningful? Answer: no it is not. What you have found is a data-point within a gnats nadger of a singularity. The issue is that had you been even 1 part in a billion either side on the input data, that 10^300 could have been 10^200 or 10^800. Your intermediate result is completely meaningless. You should not propagate it. There’s always going to be loads more data-points near a singularity than at the singularity itself. By addressing floating-point value infinities, you are patching the symptom, not the underlying cause.
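The point above in miniature, as a toy sketch: a nearly singular 2x2 system where nudging one input by roughly one part in a billion swings the computed answer wildly, even though every intermediate value is perfectly representable.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-12]])   # within a whisker of singular
b = np.array([2.0, 2.0])

x_nominal = np.linalg.solve(A, b)                            # roughly [2, 0]
x_nudged = np.linalg.solve(A, b + np.array([0.0, 2e-9]))     # nudge one input by ~1 part in 1e9
print(x_nominal, x_nudged)   # roughly [2, 0] vs roughly [-2000, 2000]: neither is meaningful
```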
There is no substitute for proper numerical analysis. Double-precision (or even quad-precision) is not a magic wand.
How you fix stuff like this is a big topic. The issue, of course, is that people say stuff like “but it matches the Matlab result” without digging in, because they don’t want to know they have a problem. Matlab can be wrong too… GIGO
Basically, you need to improve your numerical analysis algorithm to avoid ill-conditioning. The underlying solution is recognising that the input data is not perfectly accurate, so you need an MSE estimator which recognises that and propagates it through the full chain of calculations, giving an MMSE estimator at each point. If you do this correctly, you will never have to invert a near-singular matrix.
But even more important, you will gain a lot more physical insight into what the data really mean.
I can count on the fingers on one hand, the number of codes I have seen do this correctly.
"By the 1980s, FPUs were becoming commonplace"
First thanks for the article! It throws me back to the time when 8-bit CPUs did software FP with MS BASIC (which bought its FP logic from somewhere, I forget where). Others were more precise and faster with FP on the same CPU.
About "1980s": Here I thought it was Quake 1 from 1996, which was the first big game that REQUIRED an FPU. There were other games before which could use, but not require an FPU. So it could run on an 486-DX, albeit not fast and therefore Pentium was set to be the minimum. The precision improvement was clearly visible in the world rendering. No more of those known artifacts, which other engines threw, when having the right angle and/or distance from walls. The monsters in Quake 1 however, where still rendered less precise, but that changed soon with 3D acceleration.
In the 1980s it was rather the scientific and engineering environment where FPUs were used, and you had quite a selection to choose from depending on which computation should be the fastest. But my definition of "commonplace" is the normal computer people have at home.
Similar for Quake II: the first game engine to require MMX. A technique the Cray-1 computers used too: one command to do the same operation on a range of input numbers (which later evolved into SIMD for other operations).