Boffins detail new algorithms to losslessly boost AI perf by up to 2.8x

(2025/07/17)


We all know that AI is expensive, but a new set of algorithms developed by researchers at the Weizmann Institute of Science, Intel Labs, and d-Matrix could significantly reduce the cost of serving up your favorite large language model (LLM) with just a few lines of code.

Presented at the International Conference on Machine Learning this week and detailed in this [1]paper, the algorithms offer a new spin on speculative decoding that they say can boost token generation rates by as much as 2.8x while also eliminating the need for specialized draft models.

Speculative decoding, if you're not familiar, isn't a new [2]concept. It works by using a small "draft" model ("drafter" for short) to predict the outputs of larger, slower, but higher quality "target" models.

If the draft model can successfully predict, say, the next four tokens in the sequence, that's four tokens the bigger model doesn't have to generate, and so we get a speed-up. If it's wrong, the larger model discards the draft tokens and generates new ones itself. That last bit is important as it means the entire process is lossless — there's no trade-off in quality required to get that speed-up.
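
For the curious, here's a rough sketch of that accept-or-reject loop in Python, using greedy verification and a pair of GPT-2 checkpoints purely as stand-ins; it's illustrative rather than the researchers' code:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")                  # shared vocabulary
    drafter = AutoModelForCausalLM.from_pretrained("gpt2")       # small, fast
    target = AutoModelForCausalLM.from_pretrained("gpt2-large")  # big, slow

    @torch.inference_mode()
    def speculate(prompt, k=4, rounds=8):
        ids = tok(prompt, return_tensors="pt").input_ids
        for _ in range(rounds):
            # 1. The drafter cheaply proposes the next k tokens, one by one.
            draft = drafter.generate(ids, max_new_tokens=k, do_sample=False)
            proposed = draft[0, ids.shape[1]:]
            # 2. The target scores the whole proposal in a single forward pass;
            #    read off its own greedy choice at each drafted position.
            logits = target(draft).logits[0, ids.shape[1] - 1:-1]
            verified = logits.argmax(-1)
            # 3. Keep the longest matching prefix, then substitute the target's
            #    token at the first mismatch -- the output is exactly what the
            #    big model alone would have produced (lossless).
            n = 0
            while n < len(proposed) and proposed[n] == verified[n]:
                n += 1
            accepted = torch.cat([proposed[:n], verified[n:n + 1]])
            ids = torch.cat([ids, accepted.unsqueeze(0)], dim=1)
        return tok.decode(ids[0])

    print(speculate("The quick brown fox"))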

The whole concept is a bit like predictive text on a modern smartphone. As you type, it tries to guess what you're going to say next. When it's right, you can complete the sentence with a single tap; when it's wrong, you just type it out yourself.

In practice, speculative decoding can effectively double or even [6]triple token generation depending on the application. But as wonderful as 3x the tokens for the same amount of compute might sound, the trick is finding a compatible draft model.

One of the challenges to the adoption of speculative decoding is that the two models' vocabularies — i.e. their dictionaries — have to match. Unless the model you're trying to run happens to have a smaller variant, taking advantage of speculative decoding has often required training specialized draft models. Making matters worse, these specialized draft models have to be retrained every time a new target model, say a new version of Llama, comes out, Nadav Timor, a PhD student at the Weizmann Institute, tells El Reg.

Universal draft model

The algorithms aim to overcome this limitation by enabling any model to serve draft duty regardless of whether the vocabularies are the same or not.

To do this, the researchers explored three distinct approaches to the problem. The first of these, called Token-Level-Intersection (TLI), is essentially the equivalent of running diff on the two models' vocabularies to figure out which words the drafter should avoid. This way the draft model only predicts tokens that are also in the target model's vocabulary.
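
A loose sketch of that idea, assuming Hugging Face tokenizers and a simple logit mask; the checkpoints below are arbitrary stand-ins and this isn't the paper's implementation:

    import torch
    from transformers import AutoTokenizer

    draft_tok = AutoTokenizer.from_pretrained("gpt2")
    target_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    # The "diff": which of the drafter's token strings also exist in the
    # target's vocabulary?
    target_strings = set(target_tok.get_vocab())
    shared_ids = [i for s, i in draft_tok.get_vocab().items() if s in target_strings]

    # Mask that sends every non-shared drafter logit to -inf, so the drafter
    # only ever proposes tokens the target could also have produced.
    # (Assumes the drafter's logit dimension matches its tokenizer length.)
    mask = torch.full((len(draft_tok),), float("-inf"))
    mask[shared_ids] = 0.0

    def restrict_drafter_logits(logits):
        # Apply at every drafting step, before picking the next draft token.
        return logits + mask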

So long as there's sufficient overlap in the models' vocabularies, the rate at which the draft model's predictions are accepted stays high. Using this approach, the researchers observed a 1.7x speed-up over conventional autoregressive decoding, where the entirety of the model weights are read from memory every time a token is generated.

The second algorithm, called String-Level Exact Match (SLEM), works more like a translation layer between the draft and target model's tokenizers.

Tokenizers, if you're not familiar, are how large language models break up words, punctuation, and other expressions into chunks they can understand. OpenAI has a great demo showing this in practice, which you can find [9]here.
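
A couple of lines of Python show why vocabularies don't line up across model families; the tokenizers below are just arbitrary examples:

    from transformers import AutoTokenizer

    text = "Speculative decoding is lossless."
    for repo in ("gpt2", "bert-base-uncased"):
        tok = AutoTokenizer.from_pretrained(repo)
        print(repo, tok.tokenize(text))  # each family chunks the text its own way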

Draft predictions using the SLEM algorithm generate a complete string of tokens, which are converted into an intermediary format — in this case, plain text — that both models can understand. The output is then retokenized by the target model for review.
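
Sketched in Python, that round trip looks something like this; the bridge helper is hypothetical, with the real verification living inside the serving framework, and the tokenizer checkpoints are again just examples:

    from transformers import AutoTokenizer

    draft_tok = AutoTokenizer.from_pretrained("gpt2")
    target_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    def bridge(draft_token_ids):
        # Intermediary format: plain text that both models understand.
        text = draft_tok.decode(draft_token_ids)
        # Retokenize with the target's own tokenizer for verification.
        return text, target_tok.encode(text, add_special_tokens=False)

    draft_ids = draft_tok.encode(" for i in range(10):", add_special_tokens=False)
    text, target_ids = bridge(draft_ids)
    # SLEM's stricter test: accept only if the target's tokens decode back to
    # exactly the same string the drafter proposed.
    print(text == target_tok.decode(target_ids))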

This approach, Timor notes, "replaces the standard verification method of speculative decoding with exact string matching, which is an even stricter verification method."

This introduced certain challenges for the team as differences in how the tokenizers handle text could introduce nearly imperceptible changes. "For example, if you have leading white spaces, it might squash them," he explained.

That might not sound like a big deal, but the string must match exactly, or it will be rejected and any potential speedup will be lost. To get around this, SLEM introduced a heuristic function to help smooth out the differences and drive up the acceptance rates. And, at least in long-context tasks like summarization and programming, the improvements can be dramatic: up to 2.8x in the team's testing.

It's a single line change for developers

Neither of these algorithms, Timor emphasizes, is theoretical. Both SLEM and TLI are already part of Hugging Face's Transformers library, which is among the most widely deployed frameworks for running LLMs at scale today. "It's a single line change for developers," he said.
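
In practice, that looks something like the snippet below, assuming a recent Transformers release with universal assisted generation baked in; the checkpoints are placeholders rather than a recommendation:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    target_name, draft_name = "meta-llama/Llama-3.1-8B-Instruct", "double7/vicuna-68m"
    tokenizer = AutoTokenizer.from_pretrained(target_name)
    assistant_tokenizer = AutoTokenizer.from_pretrained(draft_name)
    model = AutoModelForCausalLM.from_pretrained(target_name)
    assistant_model = AutoModelForCausalLM.from_pretrained(draft_name)

    inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        assistant_model=assistant_model,          # the drafter -- the key change
        tokenizer=tokenizer,                      # both tokenizers are needed when
        assistant_tokenizer=assistant_tokenizer,  # the vocabularies don't match
        max_new_tokens=64,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))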

Which of these you should use is going to depend on what exactly you're doing with these models, Timor said. "Sometimes the first one works better, sometimes the second one does. You have to check it on your specific configuration."

In some cases, it may still be worth training a dedicated drafter. But as Timor points out, the algorithms the researchers have developed significantly reduce the barrier to adoption for speculative decoding.

More research to be done

Timor's research into speculative decoding doesn't stop here. As we mentioned earlier, the team developed three algorithms.

The third, called String-Level Rejection Sampling (SLRS), aimed to address the relatively poor acceptance rates associated with string-verification-based approaches.

"It uses a generalized drafter that considers probabilities over strings rather than tokens, and we proved that it boosts acceptance rates," Timor said. "The problem is that computing this generalized drafter in runtime, it's computationally expensive, so you have to redesign vocabularies to make this algorithm practical."

The team is also looking at ways to address the explosive growth of model vocabularies and make the draft models even faster.

"The vocabularies are getting huge. Llama 4 for example, is like 200,000 tokens," Timor said, adding that most of that isn't actually used, driving up latency. "We're currently working on shrinking the vocabulary."

That research, he says, is ongoing. ®



[1] https://arxiv.org/pdf/2502.05202

[2] https://www.theregister.com/2024/12/15/speculative_decoding/

[6] https://research.google/blog/looking-back-at-speculative-decoding/

[9] https://platform.openai.com/tokenizer?view=bpe


So...

Mentat74

LLMs can spit out bullshiat 2.8 times faster now?

Confused

Andy E

I'm having trouble with this analogy "The whole concept is a bit like predictive text on a modern smartphone. As you type, it tries to guess what you're going to say next. When it's right, you can complete the sentence with a single tap; when it's wrong, you just type it out yourself."

I know what the answer should be with predictive text because of prior knowledge. How does the LLM know that the predicted tokens are right or wrong? Does it have prior knowledge?

Re: Confused

FeepingCreature

The trick is that LLMs are a lot faster if they can evaluate multiple queries in parallel. So we use the small model to predict the next ten tokens, then on the *assumption* that those are all going to be correct we feed all ten prefixes into the big model to get ten next tokens.

Optimally, what happens is the big LLM happens to output at each step the token that the small model has guessed. In that case we just got ten tokens for the price of one. But usually, the big model calculates at least one token differently from the small one; in that case we just throw away all the tokens after it and restart from that point.

So you say "Hello " and the small model says "workdays", you complete:

- Hello

- Hello w

- Hello wo

- Hello wor

- Hello work

- Hello workd

and so on, which you can do efficiently in parallel.

The true (big) completions are: 'Hello [w]', 'Hello w[o]', 'Hello wo[r]', 'Hello wor[l]', 'Hello work[d]'.

Now you just go through: "w was right, cool. o was right, cool. r was right, cool. k was wrong, it was l instead, so I got 4 tokens for the price of one" and start over from "Hello worl".

Or...

Steve Foster

...they could just switch the whole thing off and stop wasting energy and time.

Recurse!

Chris Gray 1

If the drafter is just a smaller model, could they have an even smaller drafter for the main drafter?

It's turtles all the way down...

Re: Recurse!

FeepingCreature

You could absolutely do that. But optimally you want to run the draft model on separate hardware (cpu/gpu) anyways so you can keep the big model maximally busy. At that point, so long as the draft model runs faster than the big model, there's no point to speeding it up further, because the big model is what limits your throughput anyway.

"You don't go out and kick a mad dog. If you have a mad dog with rabies, you
take a gun and shoot him."
-- Pat Robertson, TV Evangelist, about Muammar Kadhafy