
How Google Finally Leapfrogged Rivals With New Gemini Rollout (msn.com)

(Monday November 24, 2025 @11:41AM (msmash) from the closer-look dept.)


An anonymous reader shares a report:

> With the release of its [1]third version last week, Google's Gemini large language model surged past ChatGPT and other competitors to [2]become the most capable AI chatbot, as determined by consensus industry-benchmark tests. [...] Aaron Levie, chief executive of the cloud content management company Box, got early access to Gemini 3 several days ahead of the launch. The company ran its own evaluations of the model over the weekend to see how well it could analyze large sets of complex documents. "At first we kind of had to squint and be like, 'OK, did we do something wrong in our eval?' because the jump was so big," he said. "But every time we tested it, it came out double-digit points ahead."

>

> [...] Google has been scrambling to get an edge in the AI race since the launch of ChatGPT three years ago, which stoked fears among investors that the company's iconic search engine would lose significant traffic to chatbots. The company struggled for months to get traction. Chief Executive Sundar Pichai and other executives have since worked to overhaul the company's AI development strategy by breaking down internal silos, streamlining leadership and consolidating work on its models, employees say. Sergey Brin, one of Google's co-founders, resumed a day-to-day role at the company helping to oversee its AI-development efforts.



[1] https://tech.slashdot.org/story/25/11/18/1634253/google-launches-gemini-3-its-most-intelligent-ai-model-yet

[2] https://www.msn.com/en-us/news/technology/how-google-finally-leapfrogged-rivals-with-new-gemini-rollout/ar-AA1QWgd8



Like GPU benchmarks (Score:5, Insightful)

by FictionPimp ( 712802 )

Eventually the models will be tuned to just perform well on the tests and perform like crap outside of the tests.

Re:Like GPU benchmarks (Score:4, Interesting)

by Junta ( 36770 )

Eventually? We are kind of already there. I recall a question from one of these benchmarks going viral, attracting a lot of actual humans to write up why they felt the AIs struggled with it, including the answer in their writeups. So then their writeups made their way into the RAG inputs of LLMs and also into training material. The AIs suddenly got better at that question, what a surprise...

In the same way, most specific examples of LLM screwups get self-corrected in short order, as the mocking ironically shapes the RAG component to avoid the specific behavior. Suddenly the LLMs got really good at counting the number of 'r's in strawberry, even though they still couldn't actually count letters: the internet now simply said how many 'r's were in strawberry a whole bunch of times...
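For contrast, the counting task itself is a one-line deterministic string operation in ordinary code, which is exactly why the LLM failure was so widely mocked:

```python
# Counting letters is plain string handling, not pattern recall.
word = "strawberry"
r_count = word.count("r")
print(r_count)  # -> 3
```

An LLM that "fixed" this via training data still isn't doing anything like this code path; it just memorized the answer.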

Re: (Score:3)

by Bobartig ( 61456 )

What you're describing isn't an implementation of RAG, but supervised post-training such as Direct Preference Optimization (DPO). In DPO, researchers compile sets of answers from LLMs to the same question, with a human reviewer selecting which one is the better answer. This data is used to fine-tune the model, steering it towards the good answers and away from the bad ones, which is why it is such an effective means of shifting a model's behavior towards a certain style of answer.

It gets pushed way too far, which is why you
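The preference mechanism described above can be sketched as the standard DPO pairwise loss. This is a hedged illustration with made-up log-probability values, not any lab's actual training code:

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Pairwise DPO-style loss for one (chosen, rejected) answer pair.

    Inputs are per-sequence log-probabilities under the policy being
    tuned and under a frozen reference model (hypothetical values here).
    """
    # How much more strongly the policy prefers the chosen answer
    # than the frozen reference model does.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # Logistic loss: small when the policy's preference margin is large,
    # so gradient descent pushes the model towards the chosen answers.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Minimizing this across many human-labeled pairs is what steers the model towards a preferred answer style.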

Have VW entered the market? (Score:3)

by Viol8 ( 599362 )

Just saying...

Re: (Score:2)

by dunkelfalke ( 91624 )

Seems to me that it already is. ChatGPT is significantly better at reading circuit diagrams (although it struggles with them too). Still, in this regard (and generally electronics) ChatGPT performs better than both Claude and Gemini.

New version of Rectum (Score:1)

by i kan reed ( 749298 )

New version of Rectum now produces shit at 3 times closer to shit consistency benchmarks than previous versions of competitors Anus and Colon. Given that we've already decided that you just need more shit constantly and forever and will shove it into every aspect of your life, this must make you very happy.

Good job (Score:5, Interesting)

by coofercat ( 719737 )

I saw this news on LI a few days ago, so I headed over to try my 'stock test'. I have a programming question which isn't easily found in online examples, and which on the face of it is easy but is actually a bit involved. It requires a two-step solution (a sort of parse-and-then-parse-again type thing). I'd say it's maybe 100 lines of Python to solve (at most). I didn't give it any examples of input/output, and my prompt was maybe one or two lines long at most - not a long essay spoon-feeding it the implementation.

I asked ChatGPT and got the usual one-pass not-quite-there-yet answer. It's the sort of answer you'd expect a human junior to give you before you talked it over and explained the deficiencies of it.

I asked Gemini, and it gave me a working program, with a decent example input (with a proper two pass solution). The code had comments which actually explained what was going on, and the code was pretty nice (descriptive variable names and so on). I went on to ask it to make a relatively simple change, and again, same sort of response. I'd say it was close to "commit ready", if it were a task in a ticket I was working on or something. I'd probably do some more testing with a few more inputs (maybe get it to write some unit tests?) and assuming all was well, I'd commit it.

I realise this is just one test of billions of possible ones, but it's one that every AI I've tried it on has failed to answer properly. Since it's the first that actually did answer it, and honestly, answered it really well, I'd say they do really seem to have 'cracked it' somewhat, at least for Python programs. It probably doesn't solve every problem, and it's still prone to making stuff up, but it's definitely got something about it that's good. I am tempted to try and connect it up to my IDE to see what I can do with it, but haven't taken the plunge yet - it's the first AI I've felt is worth using.

What it knows about Lindsay Lohan I couldn't say though ;-)
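The poster doesn't share the actual problem, but a "parse, then parse again" shape typically means a first pass that collects definitions and a second pass that resolves references against them. A purely hypothetical sketch (invented label/goto example, not the poster's test):

```python
def resolve(lines):
    # Hypothetical two-pass resolver: pass 1 records where each label
    # is defined, pass 2 rewrites references using that table. A single
    # pass can't work because references may point forward.
    defs = {}
    for pos, line in enumerate(lines):        # pass 1: collect labels
        if line.endswith(":"):
            defs[line[:-1]] = pos
    out = []
    for line in lines:                        # pass 2: resolve references
        if line.startswith("goto "):
            out.append(f"jump {defs[line[5:]]}")
        elif not line.endswith(":"):
            out.append(line)
    return out
```

One-pass answers to this kind of problem fail on forward references, which matches the "not-quite-there-yet" junior-style answer described above.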

Re: (Score:2)

by Viol8 ( 599362 )

So how long before it can start to rewrite its own code to "improve" itself?

I'm only half joking.

Re: (Score:2)

by swillden ( 191260 )

> So how long before it can start to rewrite its own code to "improve" itself?

> I'm only half joking.

Well, the LLMs don't really consist of "code" per se, but I think the AI labs are already using them to work on improving their own design. How far are they from being able to do this without human oversight and supervision? I have no idea.

Re: (Score:3)

by Viol8 ( 599362 )

"Well, the LLMs don't really consist of "code" per se"

Oh they do, a LOT of it, all the way from the high-level Python libraries such as TensorFlow, via sigmoid or ReLU activation functions, down to the low-level CUDA GPU libs. The only part that is pure data is the weighting of the neurons, and the neurons themselves are code.
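The split the thread is arguing over shows up even in a toy forward pass: the loop and the activation function are code, while the weights are the data fed through it. A minimal sketch with made-up numbers, not any real model:

```python
def relu(x):
    # Activation function: code, applied to whatever data flows through.
    return x if x > 0.0 else 0.0

def forward(weights, bias, inputs):
    # Weighted sum plus activation for a single neuron. The arithmetic
    # here is code; weights and bias are the pure-data part.
    total = bias
    for w, x in zip(weights, inputs):
        total += w * x
    return relu(total)
```

"Improving itself" could mean rewriting this kind of code, or just learning new weights; the two are quite different propositions.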

Re: (Score:2)

by Night Goat ( 18437 )

I just tried my standard "draw me an ASCII middle finger" and it flat-out refused to generate it for me! Grok still sucks at it but at least it tried. And it gave me the finger through emoji afterward. Gemini needs more work, I think.

Re: (Score:2)

by swillden ( 191260 )

> I just tried my standard "draw me an ASCII middle finger" and it flat-out refused to generate it for me!

Yep, the guardrails are also improving.

Re: (Score:2)

by Bobartig ( 61456 )

At some point when I was bored and playing around with LLMs, I had ChatGPT keep making ASCII art over and over. I'd ask for something like a duck holding an umbrella. No matter what blobby garbage it produced (and it was all blobby garbage), I just kept encouraging it and telling it to make more. Add details, refine, make it even duckier, etc. And it just kept taking the input blob and adding more blobs to it, like an infinite ASCII blob spiral.

Re: (Score:2)

by RobinH ( 124750 )

I just asked it a fairly simple (in my opinion) question: "What are the top 3 tier one parts suppliers in the North American automotive market, by revenue?"

It very confidently gave me 3 tier 1 suppliers for the 2022 fiscal year. The top, not surprisingly, was Magna, which is probably true. But it said the revenue was ~$18.9 billion. That doesn't seem to line up with any facts I can find online about Magna. Typical revenue is more like $10 billion per quarter, or $40 billion per year. I can't figure out

Re: Good job (Score:3)

by LindleyF ( 9395567 )

If you're trying to get facts out of an AI, you're doing it wrong. It's not a knowledge base; it's a natural-language search and summarization tool. Its facts are only right if its sources are right, and it may not be able to rank sources by quality yet. (That will come.) And if you're asking an obscure question that its sources don't answer, it will make something up. Don't go to AI for facts. Go to it for its natural-language abilities, which include programming languages.

Re: (Score:2)

by RobinH ( 124750 )

True, when I asked it to generate a spam email campaign, or a deepfake video of a local politician, it did great. I'm glad our society now has access to this wondrous new technology. I can't wait to see what amazing impact it will have on our lives. Too bad it can't, you know, go find me some facts and all, or at least tell me when it can't find any. Actually AI doesn't even go looking for facts. It generates text that looks statistically like text it has seen. So in no way does it do anything related

Re: Good job (Score:2)

by LindleyF ( 9395567 )

That's only sort of true. Here's an example. If you ask it "how do I use feature X" and give it some code and documentation as context, it will find relevant references in the documentation, locate examples of using that feature, adapt the examples to your code (variable names etc.), and optionally update your code if you let it. That's the sort of thing it does well. What it doesn't do well is STOP when it goes off the rails. You have to perform that function.

Re: (Score:2)

by ZipNada ( 10152669 )

Similar results here, Gemini 3 does an outstanding job at coding. Best I have seen.

Re: (Score:2)

by fafalone ( 633739 )

I asked it one of my favorites: how to do something very unusual in a specific programming language. A traditional Google search easily surfaces, on the first page, the one major forum thread demonstrating how it is indeed possible (with a full sample and lengthy explanation, now 10 years old), my two public GitHub projects expanding on the subject (one and two years old), and articles mentioning them.

Gemini failed like all others, falsely claiming it's impossible. I'd say it failed worse than any AI yet, as it offered an mostl

How much better is it *Really* ? (Score:1)

by greytree ( 7124971 )

"Open"AI's last release was said to be way better on the tests the AI companies use, but real users didn't rank it much better than the previous iteration, and some even said they would stick with the older one.

So how much better is Google's new product for real users ?

And are we seeing signs that LLMs are asymptotically approaching a maximum, a hard limit of the technology that throwing more bytes at them will not solve ?

Useless AI benchmarks (Score:2)

by WaffleMonster ( 969671 )

Benchmarks are more marketing tools than accurate reflections of model capability. The only thing that matters is what users think.

Who cares? (Score:2)

by ebunga ( 95613 )

Dead company walking.

Maybe gemini will stop (Score:2)

by wakeboarder ( 2695839 )

making things up. I've searched for products and it has made up entire products, complete with SKUs, out of thin air.
