Mathematicians Find GPT-5 Makes Critical Errors in Original Proof Generation
- Reference: 0179087706
- News link: https://science.slashdot.org/story/25/09/08/165206/mathematicians-find-gpt-5-makes-critical-errors-in-original-proof-generation
- Source link:
GPT-5 overlooked an essential covariance property that was easily deducible from the provided documents. The researchers compared the experience to working with a junior assistant whose work needs careful verification. They warned that relying on AI during doctoral training risks students losing the opportunity to develop fundamental mathematical skills through mistakes and exploration.
[1] https://www.alphaxiv.org/pdf/2509.03065
As expected (Score:3)
Real progress is being made, but LLMs are still imperfect and incomplete
They find unexpected patterns in large sets of words, but have no understanding
Pundits and hypemongers tell fantastic stories to attract investment
To use the tech effectively, skepticism and cross-checking are essential
LLMs Bad At Math (Score:2)
It should be well known that LLMs are bad at math.
LLMs work on tokens (roughly syllable-sized chunks of text). Their number-crunching capabilities are a work in progress, but still fairly good.
Re: (Score:2)
Just a few years ago it would have been inconceivable that we'd be critiquing them at this level.
But anyways, I'm unclear why deriving proofs wouldn't be rooted in a theorem-prover, using the deep net as a search heuristic. (Or is advanced math not rigidly rooted in applying logical deduction to axioms after all?)
Re: (Score:2)
Theorem provers cannot get to the required depth in most cases. And neither can LLMs. This is not a surprise, just one more data point that LLMs fundamentally suck and cannot be trusted.
Re: (Score:3)
This is not a surprise, just one more data point that LLMs fundamentally suck and cannot be trusted.
Huh? LLMs are not perfect and are not expert-level in every single thing ever. But that doesn't mean they suck. Nothing does everything. A great LLM can fail to produce a perfect original proof but still be excellent at helping people adjust the tone of their writing, understand interactions with others, develop communication or coping skills, or learn new subjects quickly. I've u
Re: (Score:2)
To extend the hammer analogy, an LLM is like a hammer that's great at driving nails, except that sometimes it produces a really convincing-looking nail insertion where the nail hasn't actually gone in. You will find that out when your house collapses.
Re:LLMs Bad At Math (Score:5, Insightful)
> Their number crunching capabilities are a work in progress. But, still fairly good.
They still fail at basic arithmetic.
Remember that these do not and cannot reason, analyze, consider, deliberate, or do anything else we'd associate with a complex task. All they do, all they can do, is next-token prediction based on learned relationships between tokens. That's all. This is why they can articulate simple rules but not apply them. It's why they can generate summaries of text that doesn't exist. They're not actually doing the task. They can't. That's simply not how they work.
In cases like TFA, they're not doing advanced mathematics (that's impossible); they're just generating text that looks like other text, the exact same way they generate any other text.
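To make "next-token prediction" concrete, here is a toy Python sketch. The probability table and token names below are invented purely for illustration and have nothing to do with any real model's weights; it only shows the mechanical loop of sampling one token at a time.

```python
import random

# Toy "model": for each context (here just the last token), a made-up
# distribution over possible next tokens. A real LLM learns billions of
# weights that play this role; these numbers are purely illustrative.
NEXT_TOKEN_PROBS = {
    "Socrates": {"is": 0.7, "was": 0.3},
    "is":       {"a": 0.5, "mortal": 0.4, "wise": 0.1},
    "a":        {"man": 0.8, "philosopher": 0.2},
}

def generate(start, steps=3):
    tokens = [start]
    for _ in range(steps):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:
            break
        # Sample the next token in proportion to its probability.
        choices, weights = zip(*dist.items())
        tokens.append(random.choices(choices, weights=weights)[0])
    return " ".join(tokens)

print(generate("Socrates"))  # e.g. "Socrates is a man"
```

Nothing in the loop checks whether the emitted sentence is true; it only continues the sequence plausibly, which is the point the comment above is making.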
Re: (Score:2)
> All they do, all they can do, is next-token prediction based on learned relationships between tokens. That's all.
Humans could arguably also be described as continuously making a decision on what word to say next, but it would be misleading.
"easily deducible" (Score:5, Informative)
Yeah, GPT doesn't "deduce" anything, it predicts the most probable next word.
Re: (Score:2)
They do quite a bit more than that. There's a good bit of reasoning that comes into play and newer models (really beginning with o3 on the ChatGPT side) can do multi-step reasoning where it'll first determine what the user is actually seeking, then determine what it needs to provide that, then begin the process of response generation based on all of that.
Re: (Score:3)
Too funny! Once you guess wrong, it's a quick trip to hallucination.
"There's a good bit of reasoning that comes into play and newer models (really beginning with o3 on the ChatGPT side) can do multi-step reasoning"
Re: (Score:2)
Chained prediction, not deduction. Deduction is the process of inferring facts; LLMs do not do that at all.
True deduction requires actual understanding, not mere prediction.
The classic example is "All men are mortal. Socrates is a man. Therefore Socrates is mortal." That's deduction. But an LLM does not go through the process you go through in your head of forming those categories when you work it through. Instead it creates relationships and probabilities. Not deduction.
Figures! LLMs are bad @ math (Score:3)
LLMs are just advanced pattern-matching-based guessing, with no ability to determine accuracy, due to the lack of intelligence or reasoning.
Re: (Score:2)
Some (but not me) have argued that human cognition is little more than advanced pattern matching.
Re: (Score:2)
Interesting! And the decisions many people make do look like LLM output!
Newsflash: AI good at making stuff up (Score:2)
not so good when asked to produce exact information.
That's why it's great to make deepfakes of Natalie Portman covered in hot grits, but not so great at coming up with real, existing law cases or math proofs.
Re: (Score:2)
I can't think of a single more important thing for it to do. Except maybe substituting oatmeal for grits. Unless we're talking about yellow grits, the sexiest of the hot cereals.
Re: Newsflash: AI good at making stuff up (Score:2)
I feel old. I've been reading this site for so long, I remember when Natalie Portman was always the go-to reference for an attractive woman here. Along with "BSD is dying" and "in Soviet Russia" comments.
Re: (Score:2)
Hey, but at least you can take solace in the fact that UTF-8 characters are still fucked up. ®
Re: (Score:2)
And so of course, it doesn't mess up ®.
Wait... does UTF-8 work now??? Høly smökes!!!
Re: Newsflash: AI good at making stuff up (Score:2)
Can someone please help me, and not mock my uid number vs my ability to grok the UTF-8 problem. What am I doing wrong? I'd gladly set something. I'm using Safari on an iPhone.
I was amused by the link. (Score:2)
I opened it and was presented with the paper taking up one half of the page, and Gemini taking up the other half. I tried to ask it why it was there next to a paper about its flaws, but I would have had to sign in to get an answer.
Emphasis is interesting (Score:5, Informative)
The emphasis here is interesting. The users were impressed by it, and comparing it to a junior researcher puts it wildly ahead of earlier systems. And the warning at the end is precisely relevant because the systems are starting to have some potential usefulness in research. I'm a mathematician, and this sort of very mixed experience with ChatGPT in its GPT5 form is close to my own experience. Relevant recent anecdote:
Relevant math background: the Gaussian integers are the complex numbers of the form a+bi where a and b are good, old-fashioned integers. For example, 2+3i or -1+2i are Gaussian integers. Any integer n is a Gaussian integer since you can write it as n+0i. But, say, 3 - 0.5i would not be a Gaussian integer. Also notation: we write x|y to mean y is a multiple of x. We can use this notation in our regular integers (so for example 2|8, but it is not true that 3|8) or in the Gaussian integers, where we are then allowed to multiply by another Gaussian integer. For example, (2+i) | (2+i)(3-i). A good exercise if you have not seen the Gaussian integers before: convince yourself that 1+i | 1+3i.
It also turns out that the Gaussian integers have an analog of unique prime factorization, just as in the usual integers. The Gaussian integers also have a notion of size called the norm. For a given Gaussian integer a+bi, the norm is a^2 + b^2.
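If you want to check the exercise and the norm numerically, here is a minimal Python sketch. The helper functions are my own, written for illustration; it just uses the fact that x | y in the Gaussian integers exactly when y/x is again a Gaussian integer.

```python
def is_gaussian_integer(z, eps=1e-9):
    """True if z is (numerically) of the form a+bi with integer a and b."""
    return abs(z.real - round(z.real)) < eps and abs(z.imag - round(z.imag)) < eps

def divides(x, y):
    """x | y in the Gaussian integers: y/x must itself be a Gaussian integer."""
    return is_gaussian_integer(y / x)

def norm(z):
    """Norm of a Gaussian integer a+bi: a^2 + b^2."""
    return int(round(z.real)) ** 2 + int(round(z.imag)) ** 2

print(divides(1 + 1j, 1 + 3j))               # True: (1+3i)/(1+i) = 2+i
print(divides(2 + 1j, (2 + 1j) * (3 - 1j)))  # True, by construction
print(divides(3, 8))                          # False: 3 does not divide 8
print(norm(1 + 3j))                           # 10
```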
Recently I had to prove a specific Lemma where I needed to find all Gaussian integers a and b where both are Gaussian primes, b | a^2 + a + 1, and a | b+1. I had as a template a very similar Lemma in the integers, which said exactly which integers a and b satisfy b | a^2 + a + 1 and a | b+1. I worked out the proof, essentially modifying the version in the integers. Then I did something I've often been doing after I've completed a small Lemma: giving the task to ChatGPT or another system and seeing how it does. For prior iterations (GPT3, ChatGPT, GPT4, 4o) this has almost universally been a disaster. But this time I gave the task to GPT5, and gave it the integer version to start with. It tried to do the same basic task and produced a result pretty close to mine, but it had multiple small errors in the process, to the point where I'm unsure if using it would have sped things up. But at the same time, the errors were genuinely small. For example, in one subcase the system claimed that a specific number's norm needed to be at most 9, when it needed to be at most 10. These are not the sort of large jumps in reasoning that one saw with GPT4 or 4o. It might have been the case that if I had given this to GPT5 before proving it myself and then corrected its errors, I would have saved time. I generally doubt it, but the fact that it is close to the point where that's now plausible is striking.
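As an illustration of the two divisibility conditions in the Lemma (not the actual proof, and not how GPT5 approached it), here is a brute-force Python search over Gaussian integers in a small window. The search bound, the exact-arithmetic divisibility test, and the Gaussian-primality test are all my own choices for the sketch; pairs are represented as (real, imaginary) integer tuples to avoid floating point.

```python
from itertools import product

def is_prime(n):
    """Trial-division primality test for ordinary integers."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def norm(z):
    a, b = z
    return a * a + b * b

def divides(x, y):
    """x | y in the Gaussian integers: N(x) must divide both parts of y * conj(x)."""
    x1, x2 = x
    y1, y2 = y
    n = norm(x)
    if n == 0:
        return False
    return (y1 * x1 + y2 * x2) % n == 0 and (y2 * x1 - y1 * x2) % n == 0

def is_gaussian_prime(z):
    """a+bi is a Gaussian prime if N(z) is prime (a, b both nonzero),
    or if one part is zero and the other is +/- p with p prime and p = 3 mod 4."""
    a, b = z
    if a and b:
        return is_prime(a * a + b * b)
    p = abs(a or b)
    return is_prime(p) and p % 4 == 3

BOUND = 8  # arbitrary search window for the sketch
hits = []
for a1, a2, b1, b2 in product(range(-BOUND, BOUND + 1), repeat=4):
    a, b = (a1, a2), (b1, b2)
    if not (is_gaussian_prime(a) and is_gaussian_prime(b)):
        continue
    # a^2 + a + 1 and b + 1, computed componentwise.
    a_sq_plus_a_plus_1 = (a1 * a1 - a2 * a2 + a1 + 1, 2 * a1 * a2 + a2)
    b_plus_1 = (b1 + 1, b2)
    if divides(b, a_sq_plus_a_plus_1) and divides(a, b_plus_1):
        hits.append((a, b))

print(hits)
```

A search like this only covers a finite window, of course; the Lemma itself requires an argument (e.g. via the norm bounds mentioned above) that nothing outside the window can work.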
Re: (Score:2)
It's important to note the skills of these AI math solvers don't come from the stochastic transformer networks for the most part. Instead they come from logic engines with entirely predictable output steps based on inputs, for which the transformers only assist by trying to (appropriately enough) transform the word/notation parts into the appropriate inputs for the logic engines. So the math that comes out of these can be entirely correct, assuming the inputs are correct.
However that part, transforming th
Humanity is slowly learning... (Score:5, Insightful)
...the difference between appearing intelligent and being intelligent.
We're so good at recognizing patterns that we see some patterns when they're not really there. Faces on toast, intelligence in writing, etc, etc.
Predicting the next word in a sequence is shockingly useful, but it's not a substitute for symbolic reasoning.
Luxury problems (Score:2)
Only a few years ago we were happy when transformers made it possible for a text generator to stay consistent for more than two sentences. Now we're picky when an LLM makes mistakes in "PhD level" problems. I think we shouldn't be too harsh on the tech; it does way more than one would ever have expected.
Details are not LLM's strong suit (Score:2)
Ask an LLM to paint a picture of a kitchen in Greece. It will generate something nice and pretty, with some Greek touches. But will the cooktop have the right number of control knobs? Will the electric outlets have the correct configuration, or will the painting even show electrical outlets? Will the faucets have knobs with a plausibly working design? Will the cabinet doors below the sink extend only to the bottom of the sink? There are SO many errors such a request is likely to generate.
An LLM's ability is
through mistakes and exploration (Score:2)
Doesn't guiding AI count as mistakes and exploration? Or are these guys afraid their expertise might be in jeopardy?
Re: (Score:2)
No, the concern is that as these systems get better, they will be substitutes for the sort of exploration and connection making that a mathematician needs to learn as a fundamental foundation for doing research. Using an LLM to suggest approaches and connections can be useful, but it doesn't give the same underlying basic skill set to do deeply original work.
THIS JUST IN (Score:2)
A glorified auto-complete system, with a random number generator at its heart, sometimes makes mistakes. Film at 11.
I'm surprised! (Score:5, Informative)
Actually no, I'm not. Anyone who has used it, likewise will not be surprised.
Re: Where are all the pet projects at, then, right (Score:2)
I'm sure that vibe coding is real and that it works, but nobody ever said the code will do what you think it does. The best thing it could do is crash and burn, and I'm pretty sure vibe coders can figure out fixes for it...provided they do more vibe coding. The second best thing it could do is fail in a really obvious way without actually crashing. And again, more vibe coding to the rescue.
But what happens when it fails in very subtle ways? Particularly on unknown edge cases. That's where fluffernutter or a
Re: (Score:2)
About the only thing I've gotten to work is reading a user manual to someone without making stuff up. Or writing short code blocks for a language I don't know well (which is most of them, if we're being honest).
Re: (Score:3)
I'm finding more and more hallucinations coming from AI every day. I asked multiple LLMs for a configuration file for some software, and they all made stuff up that didn't exist. For example, one result told me to use a specific plugin to achieve what I wanted because it was designed just for that purpose. Problem was, that plugin doesn't exist. Even the same LLM would come back and tell me there was no such thing.
Re: (Score:2)
I had my first LLM hallucination last week. When I asked how to perform a very specific task for an embedded Linux system, it suggested a specific package but hallucinated a command within that package that didn't exist. I thought maybe I just had an older version of the package, but nope, the latest version had no such feature. Looking at the man page for the utility in question, I was able to see how it jumped to that conclusion.
Re: (Score:2)
I've been using it to write grant applications, and I share your opinion. It frequently makes mistakes (and 5 is worse in many ways than 4o). While it can certainly be used to create a rough draft of a document, the result is similar to what you would expect from a junior associate, with the same kinds of mistakes that create an "OMG, no" response in the reader when it starts to make things up.
There was a lot of talk about how rapidly it would accelerate in performance. That progress seems to have stal
Re: (Score:2)
Agreed. Another thing not included in all these efficiency calculations: the endless talk about AI at the start of every meeting, during meetings, at the end of every meeting, in workplace chat groups, in personal one-on-ones, in mandated workplace training, etc. Is a technology really that productive if, say, 5% of every day consists of non-productive AI-related discussion, predictions, criticisms, etc.?