
AI Systems Solve Just 2% of Advanced Maths Problems in New Benchmark Test

(Wednesday November 13, 2024 @05:40PM (msmash) from the reality-check dept.)


Leading AI systems are solving less than 2% of problems in a new advanced mathematics benchmark, revealing significant limitations in their reasoning capabilities, research group Epoch AI reported this week.

The benchmark, called FrontierMath, consists of hundreds of original research-level mathematics problems developed in collaboration with over 60 mathematicians, including Fields Medalists Terence Tao and Timothy Gowers. While top AI models like GPT-4 and Gemini 1.5 Pro achieve over 90% accuracy on traditional math tests, [1]they struggle with FrontierMath's problems, which span computational number theory to algebraic geometry and require complex reasoning.

"These are extremely challenging. [...] The only way to solve them is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages," Tao said. The problems are designed to be "guessproof," with large numerical answers or complex mathematical objects as solutions, making them nearly impossible to solve without proper mathematical reasoning.

Further reading: [2]New secret math benchmark stumps AI models and PhDs alike.



[1] https://venturebeat.com/ai/ais-math-problem-frontiermath-benchmark-shows-how-far-technology-still-has-to-go/

[2] https://arstechnica.com/ai/2024/11/new-secret-math-benchmark-stumps-ai-models-and-phds-alike/



I'm surprised they managed 2%. (Score:2, Flamebait)

by danda ( 11343 )

pretty good for a system that is essentially repeated statistical guessing.

Re: (Score:2)

by rta ( 559125 )

Well, the thing is that that can be said of our brains too.

see e.g. Anil Seth or Shamil Chandaria (or others) on Predictive Processing etc.

[1]https://www.youtube.com/watch?... [youtube.com] (The Free Energy Principle and predictive processing Chandaria)

[2]https://www.youtube.com/watch?... [youtube.com] (Your Brain Hallucinates Your Conscious Reality | Anil Seth | TED, from 7 years ago, though TED talks are kind of cringey to me)

[1] https://www.youtube.com/watch?v=UkH-7gZnrr4

[2] https://www.youtube.com/watch?v=lyu7v7nWzfo

Re: (Score:2)

by backslashdot ( 95548 )

If it is purely an LLM with nothing else to it, then yes. But that still means we need to augment it better. AI, without any major breakthroughs or infeasible computational need, should be able to see the problem format and apply known techniques. The benchmark wasn't asking it to come up with an unknown algorithm or proof (yet). We shouldn't be making excuses that AI is merely an LLM.

Re: (Score:3)

by gweihir ( 88907 )

Well, it will be worse: It will not know which ones it solved and which ones it did not. Humans with a working mind do two things: 1. solve the problem or not and 2. evaluate whether they have solved the problem. Statistical guessers can sometimes do (1), but they cannot do (2) at all.

Re: (Score:2)

by ceoyoyo ( 59147 )

Which is why the actual systems designed to solve these things generally include testing their solutions.

Careful, you're in danger of invalidating your thesis that modern AI is just "statistical pattern matching."

Re: (Score:2)

by gweihir ( 88907 )

> Careful, you're in danger of invalidating your thesis that modern AI is just "statistical pattern matching."

This is just you being dishonest by misdirection. Nobody is talking about "modern" AI. What is being talked about, and you know that, is LLMs. And these are just statistical pattern matching.

Naturally (Score:4, Funny)

by sjames ( 1099 )

Having learned from elementary school answer keys, it's not hard to guess that the word that best follows "4 + 4 =" is "8". That doesn't mean the LLM even knows what 4 or 8 is, much less that it can do even basic arithmetic.

Re: (Score:2)

by olsmeister ( 1488789 )

Sure it knows what 4 and 8 are. They're tokens! Everything is a token!

Re: (Score:3)

by migos ( 10321981 )

State of the art chat bots are already acing math problems at undergrad level, which is probably already better than 90% of Americans.

Re: Naturally (Score:1)

by zawarski ( 1381571 )

99%. Fixed that for you.

Re: (Score:2)

by Rendus ( 2430 )

Only when the answers are widely known and documented. Since LLMs don't have any means of performing logic operations like math, the LLM isn't actually DOING math (barring outside libraries, which isn't the LLM doing the math but more the UI/frontend choosing to load a Python library rather than sending the raw prompt to the LLM).

Re: (Score:2)

by Rendus ( 2430 )

The great part about it is when an LLM can't even get that right (typically because such basic math is fucked up intentionally and sarcastically in the training text, 1+1=3 and all of that, but also because of the limited usefulness of the surrounding context in the content LLM trainers stole training data from).

Uh huh (Score:2)

by ceoyoyo ( 59147 )

> These are extremely challenging. [...] The only way to solve them is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages

Okay, so if the AI can solve 2% of those and we're going to call that "significant limitations in their reasoning capabilities" then what do we call everyone who's not a math grad student equipped with AI and a good symbolic math package?

I don't have a problem with calling most pe

Re:Uh huh (Score:4, Insightful)

by vyvepe ( 809573 )

> Okay, so if the AI can solve 2% of those and we're going to call that "significant limitations in their reasoning capabilities" then what do we call everyone who's not a math grad student equipped with AI and a good symbolic math package?

The difference is that "everyone" [human without a graduate degree] is not trained on all the books and all of the internet. The point is that the LLMs should already know all the math needed to solve the problems. If you prompt them properly then they very likely can write [approximately cite] the mathematical knowledge needed to solve the problems.

Repeat after me: (Score:4, Informative)

by gweihir ( 88907 )

For generative "AI" the following is true: "AI" has no reasoning ability. "AI" cannot solve problems. "AI" has no model of reality. "AI" can only fake these and as soon as you leave what its training data covered, it is lost.

Re: (Score:1)

by null etc. ( 524767 )

So tell me, what can you do as soon as you leave your training data?

For example, could you prebaxel plume mostna 2fe1::a0-2^4 guh guh guh?

Re: (Score:3)

by dfghjk ( 711126 )

Animal brains have more than just "training data". How else does a newborn animal breathe? Get up and run? What are phobias? They aren't any result of "training data".

Re: (Score:2)

by gweihir ( 88907 )

I can do a thing called "thinking". You might be able to do so too, although you clearly are not at the moment.

Re: (Score:1)

by KlomDark ( 6370 )

Yes, definitely: Oonteb weekin wokken wollen!

Have they tried... (Score:3)

by VeryFluffyBunny ( 5037285 )

...turning it off & on again?

Not Reasoning. (Score:3)

by Fly Swatter ( 30498 )

Pattern matching.

Calling it AI is just fraud.

OF COURSE NOT (Score:2)

by SmaryJerry ( 2759091 )

An LLM does one thing only: predict the next thing it should say. All of them can be much smarter than they currently are, but that takes iterative calculation and error checking, which use too much GPU resources. So, in order to optimize them to use less resources, they simply leave that part out of almost all models. If you wanted an LLM model that 'isn't wrong' with math, it could easily be created specifically for that purpose but would effectively be multiple models stacked on top of each other, some that do

How hard are these problems? (Score:3)

by oumuamua ( 6173784 )

You've reached AGI/ASI if you can solve them. We may have a good ASI test here:

> Matthew Barnett, an AI researcher, captured the significance of FrontierMath in a series of tweets. “The first thing to understand about FrontierMath is that it’s genuinely extremely hard,” Barnett wrote. “Almost everyone on Earth would score approximately 0%, even if they’re given a full day to solve each problem.”

> Barnett also speculated on what it might mean if AI eventually cracks the benchmark. “I claim that, once FrontierMath is completely solved, humans will be living alongside an entirely distinct set of intelligent beings,” he wrote. “We will be sharing this Earth with artificial minds that are, in an important sense, just as smart as we are.”

"problem set remains private" (Score:2)

by Pinky's Brain ( 1158667 )

Semi-private. Anthropic and OpenAI already combed their logs to find them.

Asking the questions to their next models will be completely useless.

And 2% is impressive. (Score:3)

by JoshuaZ ( 1134087 )

From the article:

> “All of the problems I looked at were not really in my area and all looked like things I had no idea how to solve,” Gowers said. “They appear to be at a different level of difficulty from IMO problems.” The problems are designed not just to be hard but also to resist shortcuts. Each one is “guessproof,” meaning it’s nearly impossible to solve without doing the mathematical work. As the FrontierMath paper explains, the problems have large numerical answers or complex mathematical objects as solutions, with less than a 1% chance of guessing correctly without the proper reasoning.

So solving 2% should already be impressive. Unfortunately, some people are going to look at the headlines like the one above and think that this says that the AI are not impressive.

Re: (Score:2)

by Rendus ( 2430 )

Nearly guessproof and "less than 1% chance of guessing the correct answer" are damn near antonyms.

Pick a number, 1 through 101. That's less than a 1% chance.

Nearly guessproof needs to be far, far more than that.
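
The arithmetic in this comment can be checked with a minimal Python sketch. The function names `guess_chance` and `expected_lucky_hits`, the 100-problem count, and the 10**12 answer space are all illustrative assumptions, not figures from the thread or from the FrontierMath paper. It only shows that a 1-in-101 guess does satisfy "less than 1%", and how far that is from a very large answer space.

```python
from fractions import Fraction


def guess_chance(num_options: int) -> Fraction:
    """Chance of hitting the one correct answer with a uniform random guess."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return Fraction(1, num_options)


def expected_lucky_hits(num_problems: int, num_options: int) -> Fraction:
    """Expected number of problems solved by guessing alone."""
    return num_problems * guess_chance(num_options)


# "Pick a number, 1 through 101": exact chance is 1/101, just under 1%.
small_space = guess_chance(101)
print(small_space < Fraction(1, 100))  # True

# Hypothetical contrast: an answer space of 10**12 possible values.
large_space = guess_chance(10**12)
print(float(small_space) / float(large_space))  # ratio between the two chances

# Over a hypothetical 100 problems, 1-in-101 guessing expects close to one hit.
print(float(expected_lucky_hits(100, 101)))
```

With the 1-in-101 bound, blind guessing would be expected to get roughly one hit per hundred problems, which is the commenter's objection: "less than 1%" on its own does not rule out a small answer space.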

About the same as an average human (Score:2)

by mkwan ( 2589113 )

I'd like to see the score for an average human. A truck driver, or an illiterate Congolese farmer, for example.

If they score less than 2%, does that mean AI is smarter?
