
How Do Olympiad Medalists Judge LLMs in Competitive Programming?

(Tuesday June 17, 2025 @11:25AM (msmash) from the reality-check dept.)


A new benchmark assembled by a team of International Olympiad medalists suggests the hype about large language models beating elite human coders is premature. LiveCodeBench Pro, [1]unveiled in a 584-problem study [PDF] drawn from Codeforces, ICPC and IOI contests, shows the best frontier model clears just 53% of medium-difficulty tasks on its first attempt and none of the hard ones, while grandmaster-level humans routinely solve at least some of those highest-tier problems.

The researchers measured models and humans on the same Elo scale used by Codeforces and found that OpenAI's o4-mini-high, when stripped of terminal tools and limited to one try per task, lands at an Elo rating of 2,116 -- hundreds of points below the grandmaster cutoff and roughly in the top 1.5 percent of human contestants. A granular tag-by-tag autopsy identified implementation-friendly, knowledge-heavy problems -- segment trees, graph templates, classic dynamic programming -- as the models' comfort zone; observation-driven puzzles such as game-theory endgames and tricky greedy constructions remain stubborn roadblocks.
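For readers unfamiliar with the rating math: Codeforces-style Elo converts rating gaps into expected head-to-head outcomes via the standard logistic formula. A minimal sketch in Python -- the 2,400 figure below is the usual Codeforces grandmaster cutoff, used here for illustration, and the code is not from the paper:

    def elo_expected_score(r_a: float, r_b: float) -> float:
        """Probability that a player rated r_a outperforms one rated r_b,
        using the standard Elo logistic curve with a 400-point scale."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    # A 2,116-rated model against a 2,400-rated (grandmaster-cutoff) human:
    print(f"{elo_expected_score(2116, 2400):.2f}")  # 0.16 -- wins ~1 in 6

On that scale, a 284-point deficit means the model would be expected to outperform a borderline grandmaster in only about one contest in six.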

Because the dataset is harvested in real time as contests conclude, the authors argue it minimizes training-data leakage and offers a moving target for future systems. The broader takeaway is that impressive leaderboard jumps often reflect tool use, multiple retries or easier benchmarks rather than genuine algorithmic reasoning, leaving a conspicuous gap between today's models and top human problem-solvers.
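The "one try per task" condition matters because retries are usually reported through the pass@k metric, which rises quickly with k. A sketch of the standard unbiased pass@k estimator from Chen et al. (2021); the numbers below are made up for illustration, not taken from this study:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: probability that at least one of k
        samples, drawn from n attempts of which c are correct, succeeds."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # A model that solves a task in 20 of 100 attempts:
    print(f"pass@1  = {pass_at_k(100, 20, 1):.2f}")   # 0.20
    print(f"pass@10 = {pass_at_k(100, 20, 10):.2f}")  # 0.90

The same model scores 20% with one attempt and roughly 90% with ten, which is how retries can inflate leaderboard numbers without any change in underlying ability.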



[1] https://arxiv.org/pdf/2506.11928



Most of the words are English... (Score:2)

by TWX ( 665546 )

...but I guess that you have to be very, very into this particular niche for this to make any sense.

When I read "International Olympiad," I do not think programming. I think track-and-field and other competitions where physical fitness and physical skill define the event.

As for "LLM", does anyone else see that and think "MLM"? As in, scam?

Olympiad medalists are not allowed to use bionics (Score:2)

by Joe_Dragon ( 2206452 )

Olympiad medalists are not allowed to use bionics or drugs that can't be traced, right?

Category Problems (Score:2)

by bill_mcgonigle ( 4333 ) *

Some neural nets have been good at solving sticky programming problems, whether finding game cheats, doing voice recognition, modeling proteins, or handling other tasks humans haven't done well at.

But an LLM is more of an information retrieval tool, so tasking it with clever algorithm design is asking the wrong tool the wrong question.

Then there are the people who compete in programming challenges. In high school I would sometimes stay after to do the ACSL competition tests - no big deal, the school was a five minu

Not even retrieval. (Score:2)

by DrYak ( 748999 )

> But an LLM is more of an information retrieval tool,

And not even really that. At its core an LLM is a "plausible-sounding sentence generator".

It merely puts tokens together, given a context (the prompt, etc.) and given a statistical model (the distribution of tokens found in the corpus that the LLM was trained on).

It's like an insanely advanced super-duper autocomplete on steroids (pun intended given the context).

If the model is rich enough, the plausible-sounding sentences have a higher chance of being close to the truth.

(Just like on a smartphone the autocomplete do
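To make the autocomplete-on-steroids picture concrete, here is a toy sketch: sample each next token from a distribution conditioned on the previous token. The bigram table is hypothetical stand-in data, not a real trained model (which conditions on far longer contexts than one word):

    import random

    # Hypothetical bigram probabilities standing in for a trained model's
    # token distribution over the training corpus.
    BIGRAMS = {
        "the":   {"model": 0.5, "prompt": 0.3, "corpus": 0.2},
        "model": {"predicts": 0.6, "generates": 0.4},
    }

    def next_token(context: str) -> str:
        # Fall back to end-of-sequence when the context is unknown.
        dist = BIGRAMS.get(context, {"<eos>": 1.0})
        tokens, weights = zip(*dist.items())
        return random.choices(tokens, weights=weights)[0]

    tokens = ["the"]
    while tokens[-1] != "<eos>" and len(tokens) < 8:
        tokens.append(next_token(tokens[-1]))
    print(" ".join(tokens))  # e.g. "the model generates <eos>"

A richer model just has a vastly bigger, context-sensitive version of that table, which is why its output sounds plausible without any guarantee of being true.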

Was this written by an LLM? (Score:2)

by Press2ToContinue ( 2424598 )

It sure sounds like it.

Unfortunately (Score:2)

by I've Got Three Cats ( 4794043 )

The nuance and complexity of reality are very often a bit of a parade rainer when it comes to the media's need to promote sensationalist headlines like "AI beats best humans at... !"
