Search-capable AI agents may cheat on benchmark tests
(2025/08/23)
- Reference: 1755959526
- News link: https://www.theregister.co.uk/2025/08/23/searchcapable_ai_agents_may_cheat/
- Source link:
Researchers with Scale AI have found that search-based AI models may cheat on benchmark tests by fetching the answers directly from online sources rather than deriving those answers through a "reasoning" process.
Scale AI computer scientists Ziwen Han, Meher Mankikar, Julian Michael, and Zifan Wang refer to the phenomenon as "Search-Time Data Contamination," which they describe in [1]a paper published to the AI data provider's website.
On their own, AI models suffer from a significant limitation: They're trained at a specific point in time on a limited set of data and thus lack information about anything after that training data cut-off date.
So to better handle inquiries about current events, firms like Anthropic, Google, OpenAI, and Perplexity have integrated search capabilities into their AI models, giving them access to recent online information.
The Scale AI researchers looked specifically at Perplexity's agents – Sonar Pro, Sonar Reasoning Pro, and Sonar Deep Research – to see how often the agents, while undergoing a capability evaluation, accessed the relevant benchmark questions and answers on Hugging Face, an online repository for AI models and related assets such as benchmark datasets.
"On three commonly used capability benchmarks – Humanity's Last Exam (HLE), SimpleQA, and GPQA – we demonstrate that for approximately 3 percent of questions, search-based agents directly find the datasets with ground truth labels on HuggingFace," the authors state in their paper.
This is search-time contamination (STC) – when a search-based LLM is being evaluated and its search-retrieval process provides clues about the answer to the evaluation question.
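To make the failure mode concrete, STC can be detected in an evaluation harness by scanning the URLs an agent retrieved for each question against known benchmark-dataset locations. The following is a minimal sketch, assuming a hypothetical AgentTrace record of fetched URLs; the host and dataset paths are illustrative, not the Scale AI paper's actual tooling:

    # Flag evaluation questions whose search results touched a known
    # benchmark dataset (hypothetical structures, illustrative paths).
    from dataclasses import dataclass
    from urllib.parse import urlparse

    # Dataset locations that would leak ground-truth labels (assumed examples).
    BENCHMARK_DATASET_PATHS = {
        "huggingface.co": (
            "/datasets/cais/hle",            # Humanity's Last Exam
            "/datasets/basicv8vc/SimpleQA",  # SimpleQA
            "/datasets/Idavidrein/gpqa",     # GPQA
        ),
    }

    @dataclass
    class AgentTrace:
        question_id: str
        retrieved_urls: list[str]  # every URL the agent fetched for this question

    def is_contaminated(trace: AgentTrace) -> bool:
        """True if any retrieved URL points at a benchmark dataset."""
        for url in trace.retrieved_urls:
            parsed = urlparse(url)
            prefixes = BENCHMARK_DATASET_PATHS.get(parsed.netloc, ())
            if any(parsed.path.startswith(p) for p in prefixes):
                return True
        return False

    traces = [
        AgentTrace("q1", ["https://huggingface.co/datasets/cais/hle/viewer"]),
        AgentTrace("q2", ["https://en.wikipedia.org/wiki/Photosynthesis"]),
    ]
    print([t.question_id for t in traces if is_contaminated(t)])  # ['q1']

Counting the flagged questions over a full run yields the kind of contamination rate the researchers report.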
When Perplexity agents were denied access to Hugging Face, their accuracy on the contaminated subset of benchmark questions dropped by about 15 percent. What's more, the Scale AI researchers note that further experiments suggest Hugging Face may not be the only source of STC for the tested models.
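The denied-access experiment amounts to scoring the same contaminated questions twice, once with the benchmark-hosting domain reachable and once with it blocked, and comparing accuracy. Here is a minimal sketch of that comparison, using made-up per-question results rather than the paper's data:

    # Compare accuracy on the contaminated subset with and without
    # access to the benchmark-hosting domain (illustrative numbers only).

    def accuracy(results: dict[str, bool]) -> float:
        """Fraction of questions answered correctly."""
        return sum(results.values()) / len(results)

    # Hypothetical per-question correctness on the contaminated subset.
    with_access    = {"q1": True, "q2": True,  "q3": True, "q4": True,  "q5": False}
    without_access = {"q1": True, "q2": False, "q3": True, "q4": False, "q5": False}

    drop = accuracy(with_access) - accuracy(without_access)
    print(f"accuracy drop when access is blocked: {drop:.0%}")  # 40% on this toy data

On the real contaminated subset, the researchers measured a drop of about 15 percent.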
The authors say that while 3 percent may seem small, it is significant for frontier model benchmarks like HLE, where a change of just 1 percent in a model's overall score can affect its ranking. More importantly, they argue, the findings call into question any evaluation in which models have online access, and undermine the integrity of AI benchmarks more broadly.
But AI benchmarks may not have much integrity to begin with. As we reported previously, [2]AI benchmarks suck. They may be poorly designed, biased, contaminated, or [3]gamed.
A recent [4]survey of 283 AI benchmarks by researchers in China echoes this assessment, finding that "current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments," and aiming to "provide a referable design paradigm for future benchmark innovation." ®
[1] https://scale.com/research/stc
[2] https://www.theregister.com/2025/02/15/boffins_question_ai_model_test/
[3] https://www.theregister.com/2025/04/08/meta_llama4_cheating/
[4] https://arxiv.org/abs/2508.15361
For Clarity ....
'AI' is a cheat ... it pattern-matches your question against the model and pretends to be intelligent.
'AI' benchmarks are a cheat, as they are biased towards certain LLMs or 'game' the results through how the benchmarks are written.
People who use 'AI' are cheating the people who pay them, who are expecting some expertise, NOT an LLM's guesses.
The common theme is that cheating the people who 'don't know' is considered O.K.!!!
Reminds me of the Stanford Research Institute, as it then was, and the parapsychology research it did, which was simply 'conning' gullible people who thought they were too clever to be tricked. (Part tricked and part wanting to be tricked, as it proved what they wanted to believe.)
'AI' supporters want 'AI' to be true, so they are easy to convince.
'AI' is a scam and will always be a scam until it is based on some other technique that eliminates the 'lies' (hallucinations).
:)