Google Unveils Gemini 2.5 Pro, Its Latest AI Reasoning Model With Significant Benchmark Gains (blog.google)

(Tuesday March 25, 2025 @06:20PM (msmash) from the moving-forward dept.)

Google DeepMind has [1]launched Gemini 2.5, a new family of AI models designed to "think" before responding to queries. The initial release, Gemini 2.5 Pro Experimental, tops the LMArena leaderboard by what Google claims is a "significant margin" and demonstrates enhanced reasoning capabilities across technical tasks. The model achieved 18.8% on Humanity's Last Exam without tools, outperforming most competing flagship models. In mathematics, it scored 86.7% on AIME 2025 and 92.0% on AIME 2024 in single attempts, while reaching 84.0% on GPQA's diamond benchmark for scientific reasoning.

For developers, Gemini 2.5 Pro demonstrates improved coding abilities with 63.8% on SWE-Bench Verified using a custom agent setup, though this falls short of Anthropic's Claude 3.7 Sonnet score of 70.3%. On Aider Polyglot for code editing, it scores 68.6%, which Google claims surpasses competing models. The reasoning approach builds on Google's previous experiments with reinforcement learning and chain-of-thought prompting. These techniques allow the model to analyze information, incorporate context, and draw conclusions before delivering responses. Gemini 2.5 Pro ships with a 1 million token context window (approximately 750,000 words). The model is available immediately in Google AI Studio and for Gemini Advanced subscribers, with Vertex AI integration planned in the coming weeks.

[1] https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#gemini-2-5-thinking

Benchmarks are meaningless (Score:2)

by gweihir ( 88907 )

Whenever a peddler of LLM-crap stresses their artificial moron is doing better on benchmarks, that just means they have given up and are cheating now.

Re: (Score:2)

by dinfinity ( 2300094 )

Are you still doing this? Move on, man. The world has.

Your contentless ranting is just noise that pollutes Slashdot.

Re: (Score:2)

by smooth wombat ( 796938 )

You never have benchmarks in your life? When putting together your new system, you don't look at how well the various components perform? When hiring for a position, you don't look at their credentials or what they've done? When judging which car to buy, you don't look at its 0-60 times, its fuel mileage, its reliability?

Explain how one is to gauge the good or bad of something without a consistent benchmark to compare against.

there is that stupid shit again (Score:2)

by dfghjk ( 711126 )

"...designed to "think" before responding to queries..."

Literally every piece of software EVER was "designed to think before responding to queries". It is impossible to do otherwise.

I am so sick of this anthropomorphizing of AI. It is computer software.

"...demonstrates enhanced reasoning capabilities across technical tasks."

Does better than some other things at some tasks.

"For developers, Gemini 2.5 Pro demonstrates improved coding abilities ..."

Not to be confused with "coding abilities" of developers.

"Th

Re: (Score:1)

by CallMeTim ( 6454842 )

In this case 'reasoning' describes a distinct technique used to improve LLMs (https://en.wikipedia.org/wiki/Reasoning_language_model). You may disagree with the name, but it isn't just marketing hype. It is what the technique is called in the industry.

News: 0176814351