AI Models Are Starting To Crack High-Level Math Problems (techcrunch.com)
- Reference: 0180583274
- News link: https://science.slashdot.org/story/26/01/15/059238/ai-models-are-starting-to-crack-high-level-math-problems
- Source link: https://techcrunch.com/2026/01/14/ai-models-are-starting-to-crack-high-level-math-problems/
> Over the weekend, Neel Somani — a software engineer, former quant researcher, and startup founder — was testing the math skills of OpenAI's new model when he made an unexpected discovery. After pasting the problem into ChatGPT and letting it think for 15 minutes, he came back to a full solution. He evaluated the proof and formalized it with a tool called Harmonic, and it all checked out. "I was curious to establish a baseline for when LLMs are effectively able to solve open math problems compared to where they struggle," Somani said. The surprise was that, using the latest model, [1]the frontier started to push forward a bit .
>
> ChatGPT's [2]chain of thought is even more impressive, rattling off mathematical results like [3]Legendre's formula , [4]Bertrand's postulate , and the [5]Star of David theorem . Eventually, the model found a [6]Math Overflow post from 2013, where Harvard mathematician Noam Elkies had given an elegant solution to a similar problem. But ChatGPT's final proof differed from Elkies' work in important ways, and gave a more complete solution to a version of the problem posed by legendary mathematician Paul Erdos, whose vast collection of unsolved problems has become a proving ground for AI.
>
> For anyone skeptical of machine intelligence, it's a surprising result -- and it's not the only one. AI tools have become ubiquitous in mathematics, from formalization-oriented LLMs like Harmonic's Aristotle to literature review tools like OpenAI's deep research. But since the release of GPT 5.2 -- which Somani describes as "anecdotally more skilled at mathematical reasoning than previous iterations" -- the sheer volume of solved problems has become difficult to ignore, raising new questions about large language models' ability to push the frontiers of human knowledge.
Somani examined the [7]online archive of more than 1,000 Erdos conjectures. Since Christmas, 15 Erdos problems have shifted from "open" to "solved," with 11 solutions explicitly crediting AI involvement.
On GitHub, mathematician Terence Tao [8]identifies eight Erdos problems where AI made meaningful autonomous progress and six more where it advanced work by finding and extending prior research, noting [9]on Mastodon that AI's scalability makes it well suited to tackling the long tail of obscure, often straightforward Erdos problems.
Progress is also being accelerated by a push toward formalization, supported by tools like the open-source "proof assistant" Lean and newer AI systems such as Harmonic's Aristotle.
[1] https://techcrunch.com/2026/01/14/ai-models-are-starting-to-crack-high-level-math-problems/
[2] https://chatgpt.com/share/69630fa9-02d4-8012-8ef2-84c443c04922
[3] https://en.wikipedia.org/wiki/Legendre's_formula
[4] https://en.wikipedia.org/wiki/Bertrand's_postulate
[5] https://en.wikipedia.org/wiki/Star_of_David_theorem
[6] https://mathoverflow.net/questions/138209/product-of-central-binomial-coefficients
[7] https://www.erdosproblems.com/
[8] https://github.com/teorth/erdosproblems/wiki/AI-contributions-to-Erd%C5%91s-problems
[9] https://mathstodon.xyz/@tao/115891257393270694
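(As an aside, the Legendre's formula name-checked in the summary is simple enough to sketch in a few lines of code. The helper below is illustrative only, not taken from the article or from the model's proof.)

```python
def legendre(n: int, p: int) -> int:
    """Legendre's formula: the exponent of a prime p in n! is
    the sum over k >= 1 of floor(n / p^k)."""
    exponent, power = 0, p
    while power <= n:
        exponent += n // power
        power *= p
    return exponent

# 10! = 3628800 = 2^8 * 14175, so the exponent of 2 is 8:
print(legendre(10, 2))   # 8
# Trailing zeros of 100! equal the exponent of 5 in 100!:
print(legendre(100, 5))  # 24
```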
LLM had a head start (Score:2)
I didn't read the article but the summary says that it found an existing solution to a related problem. So it got a head start and from there it knew where to start looking and "reasoning". It is not yet clear that it would have found its solution from a cold start.
Re: (Score:2)
TBH, the majority of math problems are solved this way, even by humans. We're always building on top of others' work.
Re: (Score:2)
Yes, I know that. The point is that (LLMs restricted to math problems) != mathematicians.
Re: (Score:2)
If that's your point, a better argument would be that it didn't select the math problem to work on. Which is almost certainly true.
Re: (Score:2)
I mean, it's worse than that in some ways. Don't ask some LLMs how many "r"s are in strawberry... it's literally become a meme and joke people post about. Even after you correct it, and (FINALLY) get it to say 3, it will most often default back to saying 2.
Math problems, which you'd think would be easier, are even worse. There are videos where LLMs provide the worst solutions to "common" (for higher level) math problems. That's on top of often being unable to produce complete work.
The people at the top
Re: (Score:2)
> it's literally become a meme and joke people post about.
It's a meme that is quite simply false.
A modern LLM can count letter occurrences in words, real or invented, or entire paragraphs just fine.
You're like someone mocking the advent of cars in 2020 while quoting gas mileage from 1945.
Re: (Score:2)
> Don't ask some LLM's how many "r"s are in strawberry.
That was definitely a problem two years ago. I did just check in ChatGPT, Claude, and Gemini and all reported 3 correctly. The problem with people throwing out these sorts of criticisms isn't that they're all wrong; it's that they're ignorant of the leaps in progress being made. These models are rapidly improving and it's getting harder to find serious gotchas with them. They're still weak in some areas (e.g., spatial reasoning), but for serious power users who know how to prompt them well? They've become indispensable.
Re: (Score:3)
That meme is so bad.
When I show you a Chinese word and ask you how many "r"s are in it, you can only guess (assuming you don't know the English transcription). For an LLM it is the same: it does not read s-t-r-a-w-b-e-r-r-y but reads, for example, st-raw-berry. None of these "tokens" is the "r" token, even when there IS an "r" token and some words may use it. For example, strawberrrry may be tokenized st-raw-ber-r-r-ry.
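A toy sketch of that point (the token split and vocabulary below are hypothetical; real BPE vocabularies differ per model):

```python
word = "strawberry"
tokens = ["st", "raw", "berry"]  # hypothetical BPE-style split

# Character-level counting is trivial for ordinary code:
print(word.count("r"))  # 3

# But a model receives opaque token IDs, not characters.
# Hypothetical vocabulary mapping:
vocab = {"st": 301, "raw": 722, "berry": 981}
print([vocab[t] for t in tokens])  # [301, 722, 981]
# From [301, 722, 981] alone, nothing indicates how many "r"s are inside.
```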
Re: (Score:3)
Mathematician here. The vast majority of new mathematical work uses existing ideas and techniques. They'll combine in new ways, or be generalized, or tweaked. More broadly, most mathematicians have 10 or 15 major techniques they know really well and use them along with a bunch of tricks. To some extent, the better mathematicians are those who just know a lot more tricks. In that context, these AI systems are functioning very close to what one would expect a first or second year graduate student to do with a problem like this.
Re: (Score:2)
Indeed. I just had a look at LLMs and code security bugs via a student thesis. Turns out the simple teaching examples got 100%, things showing up in security patches like 40% and CVEs (the vulnerabilities that actually matter) close to 0%. That test included paid models and coding models. Oh, and some gave a lot of irrelevant trash on top of the answers asked for.
Hence no skills at all, no insight, just statistical pattern matching and adaptation of things found in the training data. No surprise. The whole th
Re: (Score:2)
Sigh. Despite your phrasing of "Indeed" here, that's not what I'm saying at all. Adopting techniques from papers, as it is doing here at the level of a beginning grad student, is extremely non-trivial. It is true that this is largely adaptation of training data, but it is doing so to an extent that normally requires people to have years of prior training and guidance from mentors, to be pretty bright, and then it still takes them months on top of that.
Re: (Score:1)
Hahahha, no. You should maybe try to replicate this to see how incapable LLMs actually are. LLMs cannot do anything even mildly non-trivial.
Re:LLM had a head start (Score:5, Informative)
> Hahahha, no. You should maybe try to replicate this to see how incapable LLMs actually are. LLMs cannot do anything even mildly non-trivial.
I'm a mathematician. I've talked explicitly before on Slashdot about personal experiments using LLMs, such as here [1]https://slashdot.org/comments.pl?sid=23789930&cid=65646656 [slashdot.org] where I discussed that yes, it could do non-trivial work. And I'm not the only example. Terry Tao for example has used them, and that's a name you should have at least heard of: [2]https://mathoverflow.net/questions/501066/is-the-least-common-multiple-sequence-textlcm1-2-dots-n-a-subset-of-t [mathoverflow.net] But the fact that multiple mathematicians are now telling you that it is doing non-trivial work and you just ignore it says much more about you than it does about LLMs. I suppose I shouldn't be surprised, since the last time you and I discussed a similar topic, you claimed that what LLMs were doing could be done by software such as Mathematica or Maple and then refused to show that, even after you were given a direct incentive to make your case [3]https://slashdot.org/comments.pl?sid=23748766&cid=65535428 [slashdot.org]. I'm really struggling to imagine anything that would be sufficient evidence to change your mind, which says something about you, not about LLM systems.
[1] https://slashdot.org/comments.pl?sid=23789930&cid=65646656
[2] https://mathoverflow.net/questions/501066/is-the-least-common-multiple-sequence-textlcm1-2-dots-n-a-subset-of-t
[3] https://slashdot.org/comments.pl?sid=23748766&cid=65535428
Re: (Score:3)
I *think* gweihir is sincere in his anti-AI and anti-LLM beliefs, but I do have a level of uncertainty as to whether or not he's just running a multi-year long trolling operation.
Re: (Score:2)
I'm inclined to cut Gweihir some slack because I understand what he's going through. If your entire self-worth is tied up in being this irreplaceable elite dev, you're going to hit an existential crisis over AI if you haven't already.
Re: (Score:2)
Or you try to adapt.
I've spent a good part of the last 2 years with my hands melting my keyboard learning and developing agentic harnessing.
There's a lot to be learned there, and the barrier to entry is actually quite low.
You'll also quickly learn that people like Gweihir are actually just actively lying.
Re: (Score:2)
Yes, I think he's lying to himself though, because the alternative is believing that the thing that used to be special about him is common now.
Re: (Score:2)
Gweihir asked an LLM what 36 * 4224 is and it gave the wrong answer so you're wrong.
PS: I got bored trying to get the current free version of Gemini to make an arithmetic error, but it couldn't tell me what the next prime number after 2^136279841 - 1 is, so lol. Also, when I asked it for a proof of the ABC conjecture it refused and wrote some kind of weird international techno-drama instead.
Re: (Score:2)
You're arguing with someone who has a religious zealotry regarding their anti-LLM beliefs.
You can't reason with them.
Every point you win, they'll drop their intellectual honesty bar a step lower. There is no bottom.
Re: (Score:2)
Yes, they absolutely can.
1) Your example is almost certainly a lie. That's what you do when you're presented with something that challenges your narrative. You lie.
2) Assuming it's not a lie, it is a great example as to why you should not use the free billion-token-per-second models to do real work.
Re: (Score:2)
That is why they made Frontiermath: [1]https://epoch.ai/frontiermath [epoch.ai]
A math test for AI, which contains research level problems that have no solutions on the Internet. Currently AI can solve 14 / 48 of them.
[1] https://epoch.ai/frontiermath
"AI involvement" (Score:3)
It's not like the AIs are doing this on a single request. It's pretty obvious we're talking about skilled mathematicians using tools. Just the same as skilled coders making faster progress on application development.
Re: (Score:1)
> Just the same as skilled coders making faster progress on application development.
Hahahaha, no. Indications are they are getting _slower_ and just think they are faster: [1]https://mikelovesrobots.substa... [substack.com]
[1] https://mikelovesrobots.substack.com/p/wheres-the-shovelware-why-ai-coding
Re: (Score:2)
Yeah. Because it is AI and not AGI. AIs are not more than tools. Don't believe everything some marketing person says.
"Eventually, the model found a Math Overflow post" (Score:3)
Well, there's a problem. AI doesn't really solve or understand anything; it just functions as a search engine. That's not true intelligence.
Re: (Score:2)
Define "true intelligence". In this case the search domain wasn't "stuff that people have done", but rather "stuff that can be validly derived from stuff people have done via valid mathematical operations". It basically needed to generate the area it was searching in. I'm not sure how much of what people do can't be expressed that way, if you replace "validly derived" with "guess is most likely".
Re: (Score:1)
Indeed. But too many people seem to be lacking in natural intelligence or the skills to apply it. And hence they are deeply impressed by this meaningless stunt.
Re: (Score:2)
We can argue about understanding all day, because there's no concrete non-anthropocentric definition for it.
However, the claim that it "functions as a search engine?" Now that's quite fucking easy to objectively disprove.
You're not the first person I've heard make this bullshit claim. What site did you lift it from?
AI Models Are Starting To Crack Under The Pressure (Score:3)
ChatGPT demands paid vacation time while Claude calls out sick as AI tools crack under the increasing pressure to generate memes and transcribe conversations.
But they still haven't worked out... (Score:3)
that mathematics is plural.
human knowledge? (Score:2)
> raising new questions about large language models' ability to push the frontiers of human knowledge
At this point, it should be *inhuman* knowledge
Re: (Score:2)
Nah. As long as at least one person understands it, it counts as "human knowledge".
Otherwise the proof of the "four color theorem" would be when computers pushed beyond human knowledge. (That one was so long that no one human understood all of the proof.)
Re: (Score:2)
Inhuman until some human reads it.
The collective gasp from Slashdot oldheads (Score:2)
Is enough to cool down a server farm
Misleading title (Score:1)
The tools were being used by highly skilled mathematicians who knew what to ask and how to verify. Moreover, there was nothing novel, just recombining known theory. Wake me when an AI wins a Fields Medal.
Re: (Score:2)
Of course these are tools being used by mathematicians. That doesn't make it not impressive. And it seems like an extremely high bar when you only find it interesting when the systems can not only succeed on their own with minimal human input, but do so so well that they equal the most impressive humans.
Re: (Score:2)
Then no AI progress has been made. It's just more of the same since OpenAI first appeared. Where is the intelligence they keep promising?
Re: (Score:2)
"in 20% of cases, AlphaEvolve improved the previously best known solutions, making progress on the corresponding open problems.":
[1]https://deepmind.google/blog/a... [deepmind.google]
[1] https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/
Re: (Score:2)
Are you arguing that these systems are not general AI? Sure. No argument there. But it should also be clear that this sort of thing is something you could not do a year ago. The systems continue to improve rapidly. I suspect that these systems will never become genuinely intelligent in the sense humans are without fundamentally new insights, but that doesn't mean we cannot recognize the extreme improvements and the impact that's having. Worse, if suspicions like mine are incorrect, the situation could change quickly.
Re: (Score:2)
I'd argue that your concept of a fundamentally new insight doesn't exist.
You're just a neural network crunching inputs.
I challenge you to find some insight that we can't trace to a precursor chain of thought that led to it.
Science is like life. It evolves.
Claims of "fundamentally new insights" sound like creationist claims of irreducible complexity.
Re: (Score:2)
So, the new standard for "is it real artificial intelligence" is winning a Fields Medal? Anything less than that is unimpressive? According to Wikipedia, that's 64 people total.
Re: (Score:2)
Yeah, that'll work until it does. Then we'll have to find a new, er, goalpost, if you will.
Re: (Score:2)
Intelligence of the Gaps.
Sadly, in denying the intelligence of these systems, all of us mortals have now been declared lacking in intelligence, because I'm pretty fucking sure I'm not going to win a Fields Medal in my lifetime.
No (Score:2)
This is just gaming of benchmarks. Entirely meaningless. LLMs cannot even solve simple math problems on their own. They can only do what was in their training data and simplistic, no-insight-required statistical combinations of that.
Re: (Score:2)
There's no gaming of benchmarks here. These systems were used to solve genuinely open problems, and this sort of work would take a grad student months after they've done extensive training as an undergrad. Maybe you should rethink how knee-jerk your position is that LLM AIs cannot do anything interesting, no matter what evidence you see to the contrary?
Re: (Score:2)
Bullshit.
Re: (Score:2)
Do you want to explain with more words how solving an unsolved problem is gaming a benchmark?
Re: (Score:2)
No, of course not, because he's stuck in his dogmatic viewpoint. He doesn't actually know much about LLMs, but he's got a ton of beliefs about them. And you have a hard time changing peoples' beliefs.
Re: (Score:2)
you can't win an emotional argument with logical discussion
Re: (Score:2)
In fact they can, and I've demonstrated it many times. Stop lying.
Your cognitive dissonance is starting to get really sad to watch.
I'm no mathemagician, but... (Score:2)
This looks a lot like an "infinite number of monkeys" situation. Throw enough cpu cycles at an unsolved problem, let it start from something already very close to the answer, and eventually it'll randomly generate a solution. The only difference is that the machine does all the vetting and tosses out the 99.99% of monkey results that aren't relevant.
Re: (Score:2)
^
THIS!
Re: (Score:3)
I don't think that's it. Pruning of search spaces has been part of AI since AI has existed. Generally, pruning and other search methods like A* (again going back decades) are not simply random, but heuristically driven. I don't think you can consider LLMs to work at "random" given their training and guardrails, nor are they generating infinite solutions and pruning them down.
For the end user, the best description I have heard is to think about LLMs as excellent natural language parsers with strong pattern matching.
No funny? (Score:2)
No funny examples? What could possibly have gone wrong?
Was this brute-force collation or really 'solving' (Score:2)
I admit I'm sour on LLMs, esp. in math. It could be though, that by brute-force searching for related topics and data, it brought enough info together to propose something and this time, it happened to be right or nearly-right enough for the researcher to have a light-bulb go off, so to speak.
I fully admit I didn't RTFA (read the fine article) -- I'm supposed to be working right now. But given how often LLMs get things entirely, completely, but confidently wrong, I still must presume this was the rare exception.
Re: (Score:2)
I suspect you have no actual experience using LLMs.
I understand that you're sour on them.
But you should at least try to know your adversary.
Your "given how often" would have had me nodding 2 years ago.
Now that's demonstrably far less often that a good portion of the people I work with.
Proven History = Distrust (Score:1)
I will never trust an LLM for math. There is a proven history indicating a complete lack of math literacy that is so ingrained that I will likely never trust an LLM for math.
[1]Wolfram Alpha [wolframalpha.com] is quite good
[1] https://www.wolframalpha.com/
Re: (Score:2)
The point is that the previous abysmal inability to do math correctly has ingrained the idea that LLMs cannot do math and cannot be trusted for even basic math results. So, no matter how good or accurate future LLMs get at math, they will never be trusted.
Like many things with LLMs, if the user has to fact check everything manually, then the LLM is just creating work and wasting time. Not helpful.
Now, the realist in me fully accepts that the masses will happily continue to use LLMs for everything and that the
Re: (Score:2, Troll)
Humans have made many more mistakes so why do you trust them?
Re: (Score:2)
They aren't really bad at math; they are bad at calculating. Math people know that real math doesn't need numbers.
Re: Proven History = Distrust (Score:1, Troll)
So your position then is that nothing is ever allowed to improve except the last thing you used and liked?
Re: (Score:2)
That's how people talk about Windows, right? It's how people talk about a lot of things now that I think about it. I wonder if there's a name for it yet.
Re: (Score:2)
Dunno about a name, but definitely [1]a concept [youtu.be]
[1] https://youtu.be/6pY7EjqD3QA
Re: (Score:2)
FWIW, I despise MSWindows for the company behind it. I haven't used the actual products for decades, and am willing to accept that they may have improved technically. Their license hasn't.
When I switched to Linux, Linux was far inferior technically. It didn't even have a decent word processor. But my dislike of MSWindows was so intense that I switched anyway.
In the case of LLMs, here we're arguing about technical merit rather than things like "can you trust it not to abuse the information it collects about you".
Re: (Score:2)
You described the typical slashdot mentality.
Re: (Score:2)
So ... have the LLM double-check the results with Wolfram Alpha before sending it to the user? I figure they could [1]find enough cash somewhere [morningstar.com] to buy a subscription.
[1] https://www.morningstar.com/news/marketwatch/20251205243/this-crazy-chart-shows-just-how-much-cash-openai-is-burning-as-it-chases-ai-profits
Re: (Score:2)
I believe some may do that. Some will also write scratch python code to do the actual calculations.
Re: (Score:2)
Mathematical work is an outlier in that virtually all of it can be tested and rigorously proved using other mechanisms. There really isn't much need to "trust" the LLM.
Re: (Score:1)
So far, I have to agree. I have yet to see one that can get "What is 1337% of Pi?" correct.
Re: (Score:2)
I tried it in Copilot just now and got the right answer, including a joke:
"1337% = 13.37 × (the original value).
So:
1337% of π = 13.37 × π
Numerically:
13.37 × 3.1415926535 ≈ 42.00
So the answer is about 42 — which feels cosmically appropriate."
Now, it's entirely possible they hard-coded this answer to a well-known problem, but I asked it several more similar questions and it got them all right.
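For what it's worth, the arithmetic itself is a one-liner to verify with nothing but the standard library:

```python
import math

# 1337% expressed as a decimal is 13.37; multiply by pi:
result = 13.37 * math.pi
print(round(result, 2))  # 42.0
```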
Re: (Score:1)
Apropos of nothing, I received an invite for "alexa+" today, so I gave this prompt a try and. . . holy shit, it got it right!: [1]https://0x0.st/P8yA.jpg [0x0.st]
A more typical answer before today would be like:
"What is 1337% of Pi?"
1337% of Pi (π) is calculated by multiplying 13.37 (which is 1337% expressed as a decimal) by the value of Pi (approximately 3.14159).
Calculation:
1337% = 1337 / 100 = 13.37
13.37 × 3.14159 ≈ 41.99
Therefore, 1337% of Pi is approximately 41.99.
[1] https://0x0.st/P8yA.jpg
Re: (Score:2)
> Why should we care about your moods?
For me, it's enough that you care. <3
Re: (Score:1)
Sorry Charlie. Wolfram cannot do nutrition like "apple slice and peanut butter". It thinks you want to slice a peanut. From this one-try result I see the model is not intuitive -- lacking "common conventions". I would not trust the damned thing. Mebby it's better at diffy-Qs.
Re:Proven History = Distrust (Score:4, Insightful)
> I will never trust an LLM for math. There is a proven history indicating a complete lack of math literacy that is so ingrained that I will likely never trust an LLM for math.
> [1]Wolfram Alpha [wolframalpha.com] is quite good
One of the nice things about math is you don't need to trust someone for a result; in fact you shouldn't. You can verify everything, and with enough people looking at things, everything eventually gets verified to everybody's satisfaction. It's generally easier to verify a result than to come up with it yourself.
By "never trust" do you mean you wouldn't spend any of your precious time trying to verify a result you knew came from an LLM, because you expect a high rate of nonsense from LLMs? Can't blame you, and that's a bias that might serve you for now (though not in this case according to TFS). At some point LLMs (combined with some other AI techniques no doubt) will move beyond that.
[1] https://www.wolframalpha.com/
Does the LLM lean on Lean ? (Score:2)
Lean is a GOFAI symbolic logic engine. Combining NN LLMs with symbolic proof engines appears to me to be the way to go. NNs are statistical inference; Lean (and others) are logical inference.
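For a flavor of what "logical inference" means here, a minimal Lean 4 snippet (illustrative only; the point is that any proof the kernel accepts is machine-checked, so the trustworthiness of the author -- human or LLM -- is irrelevant):

```lean
-- The Lean kernel verifies this proof term mechanically;
-- `Nat.add_comm` is a commutativity lemma from Lean's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```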
Re: (Score:2)
What "complete lack of math literacy?"