How OpenAI used a new data type to cut inference costs by 75%
(2025/08/10)
- Reference: 1754818274
- News link: https://www.theregister.co.uk/2025/08/10/openai_mxfp4/
- Source link:
Analysis Whether or not OpenAI's new open weights models are any good is still up for debate, but their use of a relatively new data type called MXFP4 is arguably more important, especially if it catches on among OpenAI's rivals.
The format promises massive compute savings compared to data types traditionally used by LLMs, allowing cloud providers or enterprises to run them using just a quarter of the hardware.
What the heck is MXFP4?
If you've never heard of MXFP4, that's because, while it's been in [1]development for a while now, OpenAI's gpt-oss [2]models are among the first mainstream LLMs to take advantage of it.
This is going to get really nerdy, really quickly here, so we won't judge if you want to jump straight to the why it matters section.
MXFP4 is a 4-bit floating point data type defined by the Open Compute Project (OCP), the hyperscaler cabal originally kicked off by Facebook in 2011 to try and make datacenter components cheaper and more readily available. Specifically, MXFP4 is a micro-scaling block floating-point format, hence the name MXFP4 rather than just FP4.
This micro-scaling function is kind of important, as FP4 doesn't offer a whole lot of resolution on its own. With just four bits — one for the sign bit, two for the exponent, and one for the mantissa — it can represent 16 distinct values: eight positive and eight negative. That's compared to BF16, which can represent 65,536 values.
If you took these four BF16 values, 0.0625, 0.375, 0.078125, and 0.25, and converted them directly to FP4, their values would now be 0, 0.5, 0, and 0.5 due to what becomes rather aggressive rounding.
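To see that rounding in action, here's a quick Python sketch of our own (purely illustrative, not any library's kernel) that snaps each value to the nearest FP4 (E2M1) code, with ties rounding away from zero so it reproduces the result above.

```python
# Illustrative only: round values to the nearest FP4 (E2M1) code.
# These are the non-negative values E2M1 can encode; the sign bit mirrors them.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(x: float) -> float:
    sign = -1.0 if x < 0 else 1.0
    m = min(abs(x), 6.0)  # clamp to FP4's largest magnitude
    # nearest code; ties break away from zero, matching the example above
    return sign * min(FP4_E2M1_VALUES, key=lambda v: (abs(v - m), -v))

print([round_to_fp4(x) for x in [0.0625, 0.375, 0.078125, 0.25]])
# [0.0, 0.5, 0.0, 0.5]
```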
Through some clever mathematics, MXFP4 is able to represent a much broader range of values. This is where the scaling bit of MX data types comes into play.
[6]
Here's a basic overview of how MX data types work
MXFP4 quantization works by taking a block of higher-precision values (32 by default) and multiplying them by a common scaling factor in the form of an 8-bit binary exponent. Using this approach, our four BF16 values become 1, 6, 1.5, and 4. As you've probably already noticed, that's a big improvement over standard FP4.
This is sort of like how [7]FP8 works, but rather than applying the scaling factor to the entire tensor, MXFP4 applies it to smaller blocks within the tensor, allowing for much greater granularity between values.
During inference, these figures are then de-quantized on the fly by multiplying each 4-bit value by the inverse of the scaling factor, giving us 0.0625, 0.375, 0.09375, and 0.25. We still run into rounding errors (the 0.078125 comes back as 0.09375), but it's still far more precise than 0, 0.5, 0, 0.5.
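If you want the full round trip in code, here's a simplified Python sketch of the idea (ours, not the OCP reference implementation): pick a power-of-two scaling factor from the block's largest magnitude, multiply up, round each element to FP4, then multiply by the inverse of the scale to de-quantize. Real MXFP4 kernels do this over 32-element blocks of bit-packed weights; we reuse our four values as a toy block.

```python
import math

# The non-negative FP4 (E2M1) codes; the sign bit mirrors them.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(x: float) -> float:
    sign = -1.0 if x < 0 else 1.0
    m = min(abs(x), 6.0)
    return sign * min(FP4_E2M1_VALUES, key=lambda v: (abs(v - m), -v))

def mx_quantize(block: list[float]) -> tuple[int, list[float]]:
    """Pick a shared power-of-two scale so the block's largest value lands near FP4's max of 6."""
    max_abs = max(abs(v) for v in block)
    exp = 2 - math.floor(math.log2(max_abs))   # FP4's max, 6, is 1.5 * 2^2
    return exp, [round_to_fp4(v * 2.0 ** exp) for v in block]

def mx_dequantize(exp: int, elements: list[float]) -> list[float]:
    """Multiply each stored FP4 element by the inverse of the scaling factor."""
    return [v / 2.0 ** exp for v in elements]

exp, q = mx_quantize([0.0625, 0.375, 0.078125, 0.25])
print(2.0 ** exp, q)          # 16.0 [1.0, 6.0, 1.5, 4.0]
print(mx_dequantize(exp, q))  # [0.0625, 0.375, 0.09375, 0.25]
```

(The OCP spec stores the shared scale the other way up, as the power-of-two exponent you multiply the stored elements by, but the arithmetic works out the same.)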
MXFP4, we should note, is only one of several micro-scaling data types. There are also MXFP6 and even MXFP8 versions, which function similarly in principle.
Why MXFP4 matters
MXFP4 matters because the smaller the weights are, the less VRAM, memory bandwidth, and potentially compute are required to run the models. In other words, MXFP4 makes genAI a whole lot cheaper.
How much cheaper? Well, that depends on your point of reference. Compared to a model trained at BF16 — the most common data type used for LLMs these days — MXFP4 would cut compute and memory requirements by roughly 75 percent.
We say roughly because, realistically, you won't be quantizing every model weight. According to the gpt-oss [9]model card [PDF], OpenAI applied MXFP4 quantization to about 90 percent of the model's weights. That's how it was able to cram a 120 billion parameter model onto a GPU with just 80GB of VRAM, and run the smaller 20 billion parameter version on one with as little as 16GB of memory.
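For a rough sense of the arithmetic, here's a back-of-the-envelope sketch. The round 120 billion parameter count and the 90/10 split are simplifying assumptions of ours, not OpenAI's accounting, and the shipping checkpoint squeezes in under 80GB, but it shows where the "roughly 75 percent" figure comes from.

```python
# Back-of-the-envelope memory math (illustrative assumptions, not OpenAI's numbers).
PARAMS = 120e9                  # assume a round 120B parameters
BF16_BITS = 16                  # bits per weight at BF16
MXFP4_BITS = 4 + 8 / 32         # 4 bits per weight plus one 8-bit scale per 32-weight block

bf16_only = PARAMS * BF16_BITS / 8 / 1e9                              # everything in BF16
mixed = PARAMS * (0.9 * MXFP4_BITS + 0.1 * BF16_BITS) / 8 / 1e9       # 90% MXFP4, 10% BF16

print(f"All BF16:             {bf16_only:.0f} GB")   # ~240 GB
print(f"90% MXFP4, 10% BF16:  {mixed:.0f} GB")       # ~81 GB
print(f"Per quantized weight: {1 - MXFP4_BITS / BF16_BITS:.0%} smaller")  # ~73%
```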
Want to give gpt-oss a try? Check out our hands-on guide [10]here.
Quantized to MXFP4, gpt-oss doesn't just occupy a quarter of the memory of an equivalently sized model trained at BF16; it can also generate tokens up to 4x faster.
Some of that will depend on the compute. As a general rule, every time you halve the floating point precision, you can double the chip's floating point throughput. A single B200 SXM module offers about 2.2 petaFLOPS of dense BF16 compute. Drop down to FP4, which Nvidia's Blackwell silicon offers hardware acceleration for, and that jumps to nine petaFLOPS.
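That rule of thumb is easy to sanity-check against the figures above (as cited in this article, not a spec sheet):

```python
# Halve the precision, double the throughput: BF16 -> FP8 -> FP4.
bf16_pflops = 2.2               # dense BF16 petaFLOPS cited for a B200 SXM module
fp8_pflops = bf16_pflops * 2    # 4.4 petaFLOPS
fp4_pflops = fp8_pflops * 2     # 8.8 petaFLOPS, in line with the ~9 petaFLOPS quoted for FP4
print(fp8_pflops, fp4_pflops)
```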
While the extra FLOPS may only boost token throughput a little, for inference they mostly mean less time waiting for the model to start generating its answer, since the compute-heavy prompt-processing stage is what benefits most.
To be clear, your hardware doesn't need native FP4 support to work with MXFP4 models. Nvidia's H100s, which were used to train gpt-oss, don't support FP4 natively, yet can run the models just fine. They just don't enjoy all of the data type's benefits.
OpenAI is setting the tone
Quantization isn't a new concept. Model devs have been releasing FP8 and even 4-bit quantized versions of their models for a while now.
However, these quants are often perceived as a compromise, as lower precision inherently comes with a loss in quality. How significant that loss is depends on the specific quantization method, of which there are many.
That said, research has repeatedly shown that the loss in quality going from 16 bits to eight is essentially nil, at least for LLMs. There's still enough information at this precision for the model to work as intended. In fact, some model builders, like DeepSeek, have started training models natively in FP8 for this reason.
While vastly better than standard FP4, MXFP4 isn't necessarily a silver bullet. Nvidia [12]argues the data type can still suffer a loss in quality compared to FP8, in part because its 32-value blocks aren't granular enough. To address this, the GPU giant has introduced its own micro-scaling data type, called NVFP4, which aims to improve quality by using 16-value blocks and an FP8 scaling factor.
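To get an intuition for why block size matters, here's a contrived sketch of our own (it reuses the simple power-of-two scaling from earlier, not NVFP4's FP8 scale): a single outlier drags the whole block onto a coarse shared scale and flattens its small neighbours, while smaller blocks confine the damage to the outlier's own block.

```python
import math

# Contrived illustration of block-size granularity; not NVFP4 itself, which also
# swaps the power-of-two scale used here for an FP8 one.
FP4 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant(x: float, scale: float) -> float:
    """Quantize one value to FP4 against a shared block scale, then dequantize it."""
    sign = -1.0 if x < 0 else 1.0
    m = min(abs(x) * scale, 6.0)                      # clamp to FP4's max magnitude
    return sign * min(FP4, key=lambda v: abs(v - m)) / scale

def block_quant(values: list[float], block_size: int) -> list[float]:
    out = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        scale = 2.0 ** (2 - math.floor(math.log2(max(abs(v) for v in block))))
        out.extend(fake_quant(v, scale) for v in block)
    return out

weights = [0.01, 0.02, 0.015, 48.0, 0.01, 0.02, 0.03, 0.025]   # one big outlier
print(block_quant(weights, block_size=8))  # one coarse scale: the small weights collapse to 0
print(block_quant(weights, block_size=4))  # smaller blocks: the second block's weights survive
```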
Ultimately, however, it's up to the enterprise, API provider, or cloud provider to decide whether to deploy the quant or stick with the original BF16 release.
With gpt-oss, OpenAI has made that choice for them. There is no BF16 or FP8 version of the models; MXFP4 is all we get. Given its outsized position in the market, OpenAI is basically saying: if MXFP4 is good enough for us, it should be good enough for you.
And that's no doubt welcome news for the infrastructure providers tasked with serving these models. Cloud providers in particular don't get much say in what their customers do with the resources they've leased. The more model builders that embrace MXFP4, the more likely folks are to use it.
Until then, OpenAI gets to talk up how much easier its open models are to run than everyone else's and how they can take advantage of newer chips from Nvidia and AMD that support the FP4 data type natively. ®
[1] https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
[2] https://www.theregister.com/2025/08/05/openai_open_gpt/
[6] https://regmedia.co.uk/2025/08/08/mx_block_fp.jpg
[7] https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/
[9] https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
[10] https://www.theregister.com/2025/08/07/run_openai_gpt_oss_locally/
[12] https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/