
If you want a picture of the future, imagine humans checking AI didn't make a mistake – forever

(2025/07/16)


Column Agentic AI will make jobs – but many will involve picking its failures off automated conveyor belts.

Barely two-and-a-half years into the modern era of AI, we're stuck in a hype cycle that promises all our productivity Christmases will soon come at once. People want to believe the magic is real.

It will remain cheaper to let professionals do the work than to check agentic outputs

Surprisingly little progress has been made on the harder problems in artificial intelligence – the problems that involve actual intelligence, such as the reflective capacity to understand the intent behind an action and thereby stay on task.

In retrospect, the first of the modern agents, [1]AutoGPT, arrived before its time: March 2023, when the API for GPT-4 became publicly available.

I used it. Lots of people used it. We gave AutoGPT a goal, then watched it methodically work its way toward that goal. Sometimes. Other times, it might fail in ways both small and spectacular.


AutoGPT remains a tantalizing demo – but far from useful tech. You'd never deploy it in production. I came to think of it more like a malfunctioning magic wand: When grasped, you never quite knew whether it would run its course to a mind-blowing transformation – or just sputter out after a few weak sparks.


Two-and-a-bit years hasn't changed that outlook very much. A [2]recent essay by Oxford AI Governance Initiative Senior Researcher Toby Ord lays out the math in terms of "half-lives": an agent's chance of completing a task decays exponentially with the task's length, so each agent can be characterized by the task duration at which its success rate falls to 50 percent.

Individual task completion has become measurably better since 2022 – but [3]remains far from perfect, and for that reason, far from production. When tasks are chained together, tying one task's output to the next task's input – as all agents do – failure becomes a game of compounding probabilities. Say the first task has a 90 percent completion rate, the next one 75 percent, and the third 95 percent: the chain succeeds just 64 percent of the time (0.90 × 0.75 × 0.95 ≈ 0.64) – and that's only the first three tasks of a business process likely decomposed into hundreds of individual tasks.
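
A back-of-the-envelope sketch of that compounding, in Python – the per-task rates are the column's illustrative figures, not measurements:

```python
# Chained-agent success: every task in the chain must succeed,
# so per-task completion rates multiply.
from math import prod

rates = [0.90, 0.75, 0.95]                  # the column's three example tasks
print(f"3-task chain:  {prod(rates):.0%}")  # ~64%

# Extend the same mix of per-task rates across a longer process.
chain = rates * 10                          # 30 tasks
print(f"30-task chain: {prod(chain):.1%}")  # ~1.2%
```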

[4]AI's the end of the Shell as we know it and I feel fine … but insecure

[5]AI can't replace devs until it understands office politics

[6]Meta's AI, built on ill-gotten content, can probably build a digital you

[7]Apple has locked me in the same monopolistic cage Microsoft's built for Windows 10 users

The likelihood that a moderately sophisticated agent will successfully progress from goal setting to outcome appears to be roughly the same as the probability you'll see a unicorn.

Quelle surprise: The only possible remediation remains a healthy input of human intelligence. Rather than performing tasks directly, humans need to monitor agents’ outputs for accuracy, instruct them to repeat a task when they fail, or direct them to tackle the next task when they meet spec.


This is the AI-era version of [8]human quality control at the end of a production line on which most things are automated, leaving humans to pick out defective products.

That's dull work – the sort of work organizations will demand in ever-increasing volume as we go all-in on agents.

Depending on the level of expertise required to assess agent outputs, that work will either fall into the class of exploitation wages – the $2/hour paid to workers in developing economies to perform our [9]onerous digital tasks – or it will command the full rate that highly qualified professionals receive for being highly qualified professionals.


The amount of highly qualified labor needed to maintain the quality of agentic AI determines its first-order cost to the business. In many cases, it looks as though it will remain cheaper to let the professional do the work than to check agentic outputs. Any forecast savings from AI automation get consumed by a new class of highly intense, highly paid professional labor.

Still, even the idea of a magic wand inspires an unending hope. "It'll be better in six months," we hear. "And in six years – who knows?"

Sure, the successful task completion rate will increase for agents, as it has for the last two years. Simultaneously, over the next months and years, work process automation of business logic will continue decomposing office work into long chains of agentic tasks. Even at 95 percent completion rates – which could well mark the point of diminishing returns on investment – the probabilities argue against any but fairly simple agents being practical.
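
Ord's half-life framing shows why. A quick sketch, assuming a constant 95 percent per-task rate (the task counts are illustrative):

```python
# At a constant 95% per-task success rate, chain success decays
# exponentially: p(n) = 0.95 ** n. The "half-life" is the chain
# length at which overall success drops below 50 percent.
from math import log

p = 0.95
half_life = log(0.5) / log(p)               # ~13.5 tasks
print(f"Success halves every ~{half_life:.1f} tasks")

for n in (10, 50, 100):                     # illustrative process lengths
    print(f"{n:>3} tasks: {p ** n:.1%}")    # 59.9%, 7.7%, 0.6%
```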

Do these calculations mean we'll give up on agents? So long as CEOs can daydream of a future leading "one-man bands" or "tiny teams" that orchestrate profits into their own (and shareholders') pockets, the push will continue – even though such agents remain unlikely to ever reach production, at least not without a lot of expensive human oversight. ®

Get our [10]Tech Resources



[1] https://www.theregister.com/2023/06/15/gpts_in_the_real_world/

[2] https://www.tobyord.com/writing/half-life

[3] https://www.theregister.com/2025/06/29/ai_agents_fail_a_lot/

[4] https://www.theregister.com/2025/06/11/opinion_column_mcp_von_neumann_machine/

[5] https://www.theregister.com/2025/05/21/opinion_column_ai_cant_replace_developers/

[6] https://www.theregister.com/2025/04/10/meta_copyright_digital_you/

[7] https://www.theregister.com/2025/03/12/hardware_os_lockin_monopolies/

[8] https://youtu.be/A2x8N4DjxnE

[9] https://www.theregister.com/2025/01/24/scale_ai_outlier_sued_over/

[10] https://whitepapers.theregister.com/



Orwell

Forget It

a good name for an LLM?

FOR EVER.

Isn't that how software developers have worked for years?

DS999

Whether it was a bunch of fresh-faced new college grads starting in the summer or, more recently, outsourced coders halfway across the world, the more experienced developers have had to find/fix the bugs, refactor bad code, and so forth to fix the mistakes of others.

From the perspective of the experienced developer does it matter if they are fixing AI mistakes rather than the mistakes of inexperienced and/or incompetent people? Who knows, maybe it'll turn out that the AI is more trainable so at least it'll come up with new mistakes rather than making the same ones over and over again?

Re: Isn't that how software developers have worked for years?

Ken Hagan

Yes, but it's a question of scale. For anything beyond pretty tiny values of n and e, the value of (1-e)^n is so much less than 1 that (as the article states) you are better off skipping the agent step and just asking your "checker" to do the whole job.

Novice programmers obviously vary, but half-decent ones at least know how to check their own output and will soon drive their "e" down to a fairly low value. (Certainly much better than the tens of percent error rates mentioned in the article.) That makes them useful.
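
A quick numerical illustration of the (1-e)^n point above – the e and n values are illustrative, not measured:

```python
# Chain success (1 - e) ** n for per-task error rate e over n tasks.
# "Tens of percent" error rates collapse almost immediately; a
# half-decent human's low e holds up across far longer chains.
for e in (0.25, 0.10, 0.01):                # agent-ish rates vs. a careful human
    for n in (3, 10, 100):
        print(f"e={e:.2f}, n={n:>3}: {(1 - e) ** n:.1%}")
```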

Re: Isn't that how software developers have worked for years?

Sam not the Viking

It's an interesting scenario.

I've always felt that the further you go up the knowledge tree, or the longer in the tooth you get, the more you realise what you don't know. We also assume an awareness of other, outside elements that contribute to 'completion'. We have all been in situations where the customer's specification was wrong or incomplete and you were still tasked with the job. It's that other knowledge, which includes previous experience, that enables you to provide a working solution.

I don't see how AI can make judgements based on the things it hasn't been told. If it doesn't know (I mean: it's not in its reference data), it makes it up. Until corrected – then it apologises and either agrees with you or invents more lies. Humans do this as well, but experience helps us identify the bullshitters.

Re: Isn't that how software developers have worked for years?

otinokyad

From John Whiles…

“…the real product when we write software is our mental model of the program we've created. This model is what allowed us to build the software, and in future is what allows us to understand the system, diagnose problems within it, and work on it effectively. If you agree with this theory, which I do, then it explains things like why everyone hates legacy code, why small teams can outperform larger ones, why outsourcing generally goes badly, etc.”

— https://johnwhiles.com/posts/mental-models-vs-ai-tools

This says to me that (1) devs will hate agentic code as much as code from any new or outsourced coder; but more importantly, (2) agents can *never* have such a model.

Ian Johnston

Quelle surprise: The only possible remediation remains a healthy input of human intelligence. Rather than performing tasks directly, humans need to monitor agents’ outputs for accuracy, instruct them to repeat a task when they fail, or direct them to tackle the next task when they meet spec.

In a similar way, the NTSB has suggested that the main reason for the reasonably good safety record of autonomous cars is that the overwhelming majority of human drivers around them are good at dealing with the stupid and dangerous things the robots frequently do. To repeat an analogy I have used before: put one toddler on a busy dance floor and it will be fine, put toddlers in the majority and there will be multiple collisions and tears before bedtime.

A phrase to remember for your managers

DrStrangeLug

"If you can replace me with AI today then they can replace you with it tomorrow"

Radgie Gadgie

I find it much, much harder to comb through the reams of data created, looking for the one – small, but lethal – error embedded in the output.

Lost: gray and white female cat. Answers to electric can opener.