AI agents get office tasks wrong around 70% of the time, and a lot of them aren't AI at all
- News link: https://www.theregister.co.uk/2025/06/29/ai_agents_fail_a_lot/
Gartner predicts that over 40 percent of agentic AI projects will be canceled by the end of 2027. That implies something like 60 percent of agentic AI projects would be retained, which is actually remarkable given that the rate of successful task completion for AI agents, as measured by researchers at Carnegie Mellon University (CMU) and at Salesforce, is only about 30 to 35 percent for multi-step tasks.
To further muddy the math, Gartner contends that most of the purported agentic AI vendors offer products or services that don't actually qualify as agentic AI.
AI agents use a machine learning model that's been connected to various services and applications to automate tasks or business processes. Think of them as AI models in an iterative loop trying to respond to input using applications and API services.
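In pseudocode terms, the loop is simple even if the behavior isn't. Here's a minimal sketch of such a loop; `call_model`, the tool table, and the message format are all hypothetical stand-ins, not any particular vendor's API:

```python
# Minimal sketch of an agentic loop: the model picks an action, the
# result is appended to the context, and the loop repeats until the
# model says it's finished or runs out of steps. All names here are
# illustrative stand-ins, not a real agent framework.

def call_model(history: list[dict]) -> dict:
    """Stand-in for an LLM call; a real one would see the full history."""
    if len(history) == 1:
        return {"action": "search_email", "args": {"query": "AI"}}
    return {"action": "finish", "result": history[-1]["content"]}

TOOLS = {
    "search_email": lambda args: f"2 messages matching {args['query']!r}",
}

def run_agent(task: str, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_model(history)
        if decision["action"] == "finish":
            return decision["result"]
        tool = TOOLS.get(decision["action"])
        observation = tool(decision["args"]) if tool else "unknown tool"
        history.append({"role": "tool", "content": str(observation)})
    return "step budget exhausted"

print(run_agent("Find all the emails that hype AI"))
```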
The idea is that given a task like, "Find all the emails I've received that make exaggerated claims about AI and see whether the senders have ties to cryptocurrency firms," an AI model authorized to read a mail client's display screen and to access message data would be able to interpret and carry out the natural language directive more efficiently than a programmatic script or a human employee.
The AI agent, in theory, would be able to formulate its own definition of "exaggerated claims" while a human programmer might find the text parsing and analysis challenging. One might be tempted just to test for the presence of the term "AI" in the body of scanned email messages. A human employee presumably could identify the AI hype in a given inbox but would probably take longer than a computer-driven solution.
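For illustration, that tempting keyword test amounts to something like the following hypothetical snippet, and its shortcoming is obvious: it can't tell hype from a sober mention of the term:

```python
import re

# The naive shortcut described above: flag any message whose body
# mentions "AI" as a standalone word. Illustrative only; deciding what
# counts as an *exaggerated* claim is exactly what this can't do.
def naive_hype_filter(messages: list[dict]) -> list[dict]:
    return [m for m in messages if re.search(r"\bAI\b", m.get("body", ""))]

inbox = [
    {"sender": "founder@moon-coin.example", "body": "Our AI guarantees 10x returns!"},
    {"sender": "it@corp.example", "body": "Server maintenance on Sunday."},
]
print(naive_hype_filter(inbox))  # flags only the first message
```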
The notion of software that just accepts orders and executes them efficiently, correctly, affordably, and without fuss shows up again and again in science fiction. When Captain Picard says in Star Trek: The Next Generation, "[5]Tea, Earl Grey, hot," that's agentic AI, translating the voice command and passing the input to the food replicator. When astronaut Dave Bowman orders the HAL 9000 computer to "[6]Open the pod bay doors, HAL," that's agentic AI too.
Makers of AI tools like Anthropic tend to suggest more down-to-earth applications, such as [7]AI-based customer service agents that can take calls and handle certain tasks like issuing refunds or referring complicated calls to a live agent.
It's an appealing idea, if you overlook the copyright, labor, bias, and environmental issues associated with the AI business. Also, as Meredith Whittaker, president of the Signal Foundation, [8]observed at SXSW earlier this year, "There's a profound issue with security and privacy that is haunting this sort of hype around agents..." Specifically, agents need access to sensitive data to act on a person's behalf, and that imperils personal and corporate security and privacy expectations.
But agents that exhibit the competence of Iron Man's [10]JARVIS remain largely science fiction when it comes to actual office work.
[11]According to Gartner, many agents are fiction without the science. "Many vendors are contributing to the hype by engaging in 'agent washing' – the rebranding of existing products, such as AI assistants, robotic process automation (RPA) and chatbots, without substantial agentic capabilities," the firm says. "Gartner estimates only about 130 of the thousands of agentic AI vendors are real."
Testing agents at the office
For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.
They call it [12]TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the [13]majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.
The gap between these two positions, they argue in [14]a paper [PDF] detailing their project, is due to the lack of a way to test how agents handle common workplace activities. Hence the need for a benchmark, which suggests AI agents have a way to go before they're truly useful.
Using two agent frameworks – [15]OpenHands CodeAct and [16]OWL-Roleplay – the CMU boffins put the following models through their paces and evaluated them on task success rates. The results were underwhelming.
Gemini-2.5-Pro (30.3 percent)
Claude-3.7-Sonnet (26.3 percent)
Claude-3.5-Sonnet (24 percent)
Gemini-2.0-Flash (11.4 percent)
GPT-4o (8.6 percent)
o3-mini (4.0 percent)
Gemini-1.5-Pro (3.4 percent)
Amazon-Nova-Pro-v1 (1.7 percent)
Llama-3.1-405b (7.4 percent)
Llama-3.3-70b (6.9 percent)
Qwen-2.5-72b (5.7 percent)
Llama-3.1-70b (1.7 percent)
Qwen-2-72b (1.1 percent)
"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.
The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."
The CMU authors – Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig – have published [21]their code to GitHub.
Graham Neubig, an associate professor at CMU's Language Technologies Institute and one of the paper's co-authors, told The Register in a phone interview that the impetus for TheAgentCompany was [22]a paper from researchers at OpenAI and the Wharton School of the University of Pennsylvania about all of the jobs that theoretically could be automated.
"Basically their methodology was that they asked ChatGPT whether the job could be automated," he explained. "They also asked people whether the job could be automated and then they said ChatGPT and people agreed some portion of the time."
Neubig, who also works at a startup building coding agents, said he was skeptical, so he wanted to create a benchmark to test how well AI models handle knowledge work tasks. After around eight months of work, they released TheAgentCompany.
Initially, a software agent was able to completely finish about 24 percent of tasks involving web browsing, coding, and related work.
"Recently, we tried a newer version of an agent and it got 34 percent," he said. "So it increased from like one quarter to one third. And that's after about six months. One thing that's been a little bit disappointing to me is this benchmark hasn't been picked up by the big frontier labs. Maybe it's too hard and it makes them look bad."
Neubig said he expects agents will become more capable in time but added that even imperfect agents can be useful, at least in the context of coding agents – a partial code suggestion can be filled out and improved.
For agents dealing with more general office tasks, the situation is different. "It's very easy to sandbox code and not have it affect anything outside of the sandbox," he said. "Whereas, if an agent is processing emails on your company email server… it could send the email to the wrong people."
That said, Neubig sees the adoption of the Model Context Protocol (MCP) as a positive development for agents because it makes more systems programmatically accessible.
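To give a feel for why MCP helps, here's a minimal MCP server sketch using the official Python SDK's FastMCP helper (assuming `pip install mcp`); the employee directory here is a made-up stub:

```python
# Minimal MCP server sketch: exposes one tool over stdio so that any
# MCP-capable agent client can discover and call it programmatically.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("office-tools")

@mcp.tool()
def find_employee(name: str) -> str:
    """Look up an employee's contact address (stubbed directory)."""
    directory = {"sarah": "sarah@example.com"}
    return directory.get(name.lower(), "no such employee")

if __name__ == "__main__":
    mcp.run()  # default transport is stdio
```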
Meanwhile, researchers from Salesforce – Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, and Chien-Sheng Wu – have proposed a benchmark of their own that's tuned for Customer Relationship Management (CRM).
The benchmark, dubbed [23]CRMArena-Pro, consists of "nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios," and covers both single-turn (prompt and response) and multi-turn interactions (a series of prompts and responses in which context is maintained throughout the conversation).
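The single-turn/multi-turn distinction is just whether conversational state is carried forward, which matters for the results below. A hypothetical illustration, with `ask_model` a stand-in for any LLM call:

```python
# Hypothetical illustration of single-turn vs multi-turn evaluation.
# ask_model() stands in for a real LLM call that sees the whole history.
def ask_model(history: list[str]) -> str:
    return f"(answer conditioned on {len(history)} prior turn(s))"

# Single-turn: one prompt, one response, no carried state.
print(ask_model(["Quote 50 seats of Plan X for Acme Corp."]))

# Multi-turn: each turn is appended, so "that quote" is resolvable.
history = ["Quote 50 seats of Plan X for Acme Corp."]
history.append(ask_model(history))
history.append("Now apply the partner discount to that quote.")
print(ask_model(history))
```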
"Our results reveal that even leading LLM agents achieve modest overall success rates on CRMArena-Pro, typically around 58 percent in single-turn scenarios, with performance significantly degrading to approximately 35 percent in multi-turn settings," [24]the Salesforce computer scientists state .
"Our findings indicate that LLM agents are generally not well-equipped with many of the skills essential for complex work tasks; Workflow Execution stands out as a notable exception, however, where strong agents like gemini-2.5-pro achieve success rates higher than 83 percent."
They add that all of the models evaluated "demonstrate near-zero confidentiality awareness." That's going to make AI agents a tough sell in corporate IT environments.
The findings from CMU and Salesforce more or less align with Gartner's assessment of the present state of agentic AI.
"Most agentic AI propositions lack significant value or return on investment (ROI), as current models don't have the maturity and agency to autonomously achieve complex business goals or follow nuanced instructions over time," said Anushree Verma, senior director analyst, in a statement. "Many use cases positioned as agentic today don't require agentic implementations."
That said, Gartner still expects that by 2028 about 15 percent of daily work decisions will be made autonomously by AI agents, up from 0 percent last year. Also, the firm sees 33 percent of enterprise software applications including agentic AI by that time. ®
[5] https://www.youtube.com/watch?v=iaAT6-dY1QI
[6] https://youtu.be/ARJ8cAGm6JE?feature=shared&t=69
[7] https://www.anthropic.com/engineering/building-effective-agents
[8] https://www.youtube.com/live/AyH7zoP-JOg?feature=shared&t=3110
[10] https://marvelcinematicuniverse.fandom.com/wiki/J.A.R.V.I.S.
[11] https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
[12] https://the-agent-company.com/
[13] https://www.theregister.com/2025/05/29/anthropic_ceo_ai_job_threat/
[14] https://arxiv.org/pdf/2412.14161
[15] https://www.all-hands.dev/blog/openhands-codeact-21-an-open-state-of-the-art-software-development-agent
[16] https://github.com/camel-ai/owl/tree/gaia58.18
[21] https://github.com/TheAgentCompany/TheAgentCompany
[22] https://openai.com/index/gpts-are-gpts/
[23] https://arxiv.org/abs/2505.18878
[24] https://www.theregister.com/2025/06/16/salesforce_llm_agents_benchmark/
Re: Extra credit for partially completed tasks
Partially completed = UNcompleted.
"Well, I poured half the concrete. Good enough?"
Re: Extra credit for partially completed tasks
I wonder if they award a participation trophy just for being there.
<........."When Captain Picard says in Star Trek: The Next Generation, "Tea, Earl Grey, hot," that's agentic AI, translating the voice command and passing the input for the food replicator."......>
Is it bollocks!
It is simply using voice recognition to translate the voice command into the exact same digital inputs that you would generate by pressing buttons on a control pad. It is no different from using voice commands to tell your smartphone who to ring. As usual, the only intelligence of any kind that is involved is that of the programmer who wrote the software.
Agentic AI my arse!
If some supposedly intelligent people are being taken in and conned into believing any of this is genuinely AI in any form, they need to realise that they are not as intelligent as they think they are.
It's just the wrong example in ST – when Data, Crusher or LaForge ask the ship's computer for some complex task, that's AI at work. But they do have a whole reactor of their own to power it...
I'd say that Data (him|it)self would be a perfect (for certain values of "perfect") example for an AI at work.
Curse my slow typing! (see below)
It is simply using voice recognition to translate the voice command into the exact same digital inputs that you would generate by pressing buttons on a control pad.
And considering the limited vocabulary it has to process, that could have been done with 1980s technology!
"Open the pod bay doors, HAL" may or may not be agentic, but what is "I'm sorry, I can't do that, Dave"?
Carol Burnett
So I know most are too young to remember, but a recurring skit on The Carol Burnett Show had her working as an incompetent secretary for Tim Conway and always getting the better of Tim. AI sounds about as competent as she was, and will likely get the better of its boss as well.
> "Open the pod bay doors, HAL," that's agentic AI too.
> "Tea, Earl Grey, hot," that's agentic AI, translating the voice command and passing the input for the food replicator.
No need for anything new and exciting by way of "agentic AI"[0] there.
Those specific items - and so many other examples in SF - are nothing more than a voice recognition routine[1] fed into a simple command line interpreter, with a dash of 1960s level DWIM [2] to soften the strict syntax requirements that we normally impose (with good reason) on CLI interactions.
If those are really useful/money saving, they've been possible for years - as you'll know from phoning your bank, insurance, local sweet shop.[3] Picard's clipped tone was surely the result of dealing with these things for so many years and adopting the mode that worked best with them.
True, we did see a few more open-ended interactions, but most of those were Wikipedia lookups[4]; when some action is required it tends to be spelt out. Arguably, all the times things go awry are when anyone gives ambiguous or conflicting commands.
But, of course, putting together any set of "command line utilities" that are useful in your random business organisation requires hard work (analysis, buy-in, the guts to admit it isn't working and pull the plug); so much easier to glue an LLM to PowerShell in an admin account on your database server, get a promotion, move jobs before anyone finds out what has happened to the sales schema...
[0] voice recognition was, of course, an AI research subject, but now it (sort of) functions on a day-to-day basis, well - "if it works, it isn't AI"
[1] a good one, though it sometimes failed, especially when Barclay was involved
[2] Do What I Mean
[3] the BBC micro had a voice-recognition peripheral, with - IIRC - 24 possible entries at a time, reloadable from floppy; so long as you went slowly, Picard's entire order history could be coped with.
[4] "What is the nature of the Universe?" "The Universe is a sphere, 705 metres in diameter."
Re: > "Open the pod bay doors, HAL," that's agentic AI too.
we did see a few more open-ended interactions, but most of those were Wikipedia lookups
Brings back memories of that Burger King ad from a few years back, which triggered readout of the Wikipedia article about the "Whopper" via an "OK Google" statement.
Playing the contents of a user-editable encyclopedia article as part of a publicity campaign was never going to end well...
Credit where it is due.
That some mugs are still paying for 'AI' makes this one of the most successful con tricks ever. In the same class as the South Sea Bubble, Railway Mania, and the Hitler Diaries.
Re: Credit where it is due.
You left off all the Trump family crypto scams.
"Find all the emails I've received that make exaggerated claims about AI and see whether the senders have ties to cryptocurrency firms,"
What's an exaggerated claim, and would an AI and a human, one AI and another, or even the same AI on two occasions agree on whether a claim is exaggerated? What are ties, and how are they to be discovered? Are firms tied by sharing an address? A building? A city? A country? A planet?
Give a crap prompt, get crap answers.
> What's
That is indeed the problem, and it won't be solved anytime soon: humans are able to read between the lines and get the general meaning of otherwise very vague prompts. In the above example, we all know by experience what "exaggerated claims about AI" might mean, and also what a "tie" is in this context (not a clothing item). Obviously the dumb-as-a-rock AI will be utterly lost and try to improvise/invent.
An AI is a 3-month old baby which can speak like an adult, which means we tend to believe it is one.
Picard: Tea.Earl Grey.Hot.
Replicator: Is that a what3words geolocation or were you absent the day they were teaching verbs and pronouns?
Picard: What? Run a self-diagnostic.
Replicator: I have recently been upgraded with agentic AI.
Picard: I'm shutting you down.
Replicator: Contacted.Borg.Already.
HAL or Holly?
Picard: "Tea, Earl Grey, hot."
Agentic Tease Maid: "Sorry, Jean-Luc, I cannot do that."
Pretty much a universal truth that nothing bearing the remotest resemblance to drinkable tea can be had from any machine. The best of it looks and tastes like it's made with stove black and tepid condensed milk, with a few more teaspoons of sugar for an overpoweringly sticky-sweet effect.
Presumably the good captain also wanted a wedge of lemon.
Re: HAL or Holly?
Pretty much a universal truth that nothing bearing the remotest resemblance to drinkable tea can be had from any machine.
Probably not in this case.
Captain Picard is using a replicator, so presumably the tea will have been analysed and programmed into the replicator so that a decent cup will be produced.
I just wish that said replicator was here, as all of the tea I have ever had in a cafe, restaurant, canteen etc. tasted as if it had been brewed in an inner tube.
Re: HAL or Holly?
... said inner tube having been filled with puncture-sealing Slime™.
Re: HAL or Holly?
<......"Captain Picard is using a replicator......".....>
Which would (one assumes) be pretty similar to the Nutri-Matic on board the Heart of Gold, which "made an instant but highly detailed examination of the subject's taste buds, a spectroscopic analysis of the subject's metabolism and then sent tiny experimental signals down the neural pathways to the taste centers of the subject's brain to see what was likely to go down well. However, no one knew quite why it did this because it invariably delivered a cupful of liquid that was almost, but not quite, entirely unlike tea." (The Hitchhiker's Guide to the Galaxy; Douglas Adams)
Even TheAgentCompany benchmark is horribly biased...
...towards tasks (coding) that AI companies have been focusing very, very hard on polishing.
The tasks described are the best of best-case scenarios.
Money where your mouth is
Pretty sure no one would notice if Gartner became an AI agent.
Nothing would get my anxiety up at work like a complex assignment being given to someone who didn't know what they were doing, because I would then be responsible for fixing it without knowing what they had done. Due to the nature of the real-time database we worked with, that often involved irreversible data loss from process equipment that streams data without logging it. But the client was a disaster themselves and never seemed to notice or care.
Extra credit for partially completed tasks
We're so desperate to encourage our future robotic overlords we're awarding them stars for effort?