
Fixing Claude with Claude: Anthropic reports on AI site reliability engineering

(2026/03/19)


QCon London A member of Anthropic's AI reliability engineering team spoke at [1]QCon London on why Claude excels at finding issues but still makes a poor substitute for a site reliability engineer (SRE), constantly mistaking correlation for causation.

Alex Palcuie was formerly an SRE for Google Cloud Platform. "My job is keeping Claude up," Palcuie said, adding: "I've been using LLMs for actual incident response." Since January, he's been reaching for Claude before looking at other monitoring tools.

[2]

Alex Palcuie speaks at QCon London 2026

His team is busy. "Claude goes down more often than any of us would like. Earlier today, I was involved in an incident, even if I'm at a conference."

Is Palcuie automating himself out of a job? No, he said. "It would be hypocritical to say that Claude fixes everything. My team exists, we're hiring for many positions, this should show you that no, it doesn't work."

However, he said "many of us would not be surprised" if it did work in future, and his talk demonstrated that AI is already helpful.


Speaking of his career in incident response, Palcuie reflected that having engineers on call is "a tax on humans because our systems are not good enough to look after themselves." Palcuie spoke of the stress of being on call. "Your phone buzzes, there's half a second where you go from asleep, to incident commander mode... then at 9:00 am you show up at work and have to look professional and presentable."


Incident response, he said, can loosely be broken down into a loop of four phases: observe, orient, decide, act.
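The four phases are the classic OODA loop from incident-response practice. A minimal sketch of one pass through it, with placeholder thresholds and actions rather than any real runbook:

```python
# Minimal sketch of the observe-orient-decide-act loop; the metric name,
# threshold, and actions here are illustrative placeholders.
def triage(telemetry):
    error_rate = telemetry.get("http_500_per_min", 0)                      # observe
    hypothesis = "elevated-errors" if error_rate > 10 else "healthy"       # orient
    action = "page-oncall" if hypothesis == "elevated-errors" else "noop"  # decide
    return action                                                          # act

print(triage({"http_500_per_min": 42}))  # page-oncall
```

In a real system the loop repeats: the action changes the system, new telemetry comes in, and the responder observes again.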

AI, he said, is fantastic for the observation part. "It reads the logs at the speed of I/O, it doesn't get bored, this at scale is something no human can match."


He recounted a real incident: on New Year's Eve, Claude Opus 4.5 was returning HTTP 500 errors. "I open Claude Code and ask it to have a look." The AI wrote a SQL query and "within seconds it has the answer, an unhandled exception in the image processing class." It posted the Python stack trace but "it doesn't stop there." Claude identified the failing requests, checked the accounts that sent them, and found 200 accounts "all sending 22 images at the same time." That looked suspicious. Claude dug further and found 4,000 accounts all created at the same time, most sitting dormant. The AI said: "Stop looking at the 500s, this is fraud."

Without AI, "I would have marked this as a bug, I would not have paged account abuse," Palcuie said.
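Anthropic did not publish the query Claude wrote, but the aggregation it describes, grouping failing requests by account to surface a coordinated burst, can be sketched roughly like this (account IDs and log shape are invented for illustration):

```python
# Illustrative only: hypothetical log records of (account_id, http_status,
# images_in_request), mimicking the burst pattern described in the talk.
logs = [
    ("acct-1001", 500, 22),
    ("acct-1002", 500, 22),
    ("acct-1003", 500, 22),
    ("acct-9000", 200, 1),
    ("acct-9001", 200, 2),
]

def burst_accounts(logs, min_images=20):
    """Accounts whose failing requests all carry an unusually large payload."""
    return sorted({acct for acct, status, imgs in logs
                   if status >= 500 and imgs >= min_images})

print(burst_accounts(logs))  # ['acct-1001', 'acct-1002', 'acct-1003']
```

The point of the anecdote is the pivot: once every failing account shows the same oversized batch, the question stops being "what broke?" and becomes "who is doing this?"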

His next anecdote was less positive. AI processing relies on a key-value (KV) cache for performance. "This KV cache can be gigabytes in size, it's really easy to break it, it's finicky, it's fragile." When it breaks, it forces a lot of extra compute, and monitoring shows many more requests.


"Every single time, I would ask Claude, what happened here? Claude would say, request volume increase, this is a capacity problem, you need to add more servers."


The problem, he said, is that Claude "will get wrong correlation versus causation." It's like a new joiner on the team who will think, "oh, it's a capacity problem," when actually you lost your cache.

"This is why we can't trust LLMs for incident response," said Palcuie. The problem is its inability to "step back and start discerning between causation and correlation... For us humans, it is hard as well."
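The trap is easy to model. In this toy sketch (made-up numbers, not Anthropic's metrics), a broken cache causes redone work and retries, which in turn inflate the request count, so "more requests" is the symptom a dashboard shows, while the lost cache is the cause:

```python
# Toy model: cache misses trigger extra work and retries, so observed
# request volume rises even though underlying demand is unchanged.
# The retry_factor is an arbitrary illustrative constant.
def observed_request_rate(true_demand, cache_hit_ratio, retry_factor=3.0):
    """Requests/sec a dashboard would show for a given cache health."""
    miss_ratio = 1.0 - cache_hit_ratio
    return true_demand * (1.0 + retry_factor * miss_ratio)

healthy = observed_request_rate(1000, cache_hit_ratio=0.95)  # ~1150 req/s
broken = observed_request_rate(1000, cache_hit_ratio=0.10)   # ~3700 req/s
print(healthy, broken)
```

An engineer (or an LLM) who only sees the output of such a system and not its mechanism will reach for "capacity problem" every time, which is exactly the failure Palcuie described.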

When Claude is asked to produce a postmortem report, it delivers "an 80 percent story that's pretty, it's readable and convincing," said Palcuie, but "it's really bad at root causes." Claude says "this was the thing, and we all know it is not one thing. It's not one root cause... It was never the rollout. It was never the code change. It was all the processes in the company that allowed the incident. And Claude doesn't know the history of your system, especially if your system has been there for ten years."

It is important, said Palcuie, to have SREs who "have been burnt before... they have the scar tissue." He worries that if AI is used more, "will we have our skills atrophy?" – echoing the concerns software developers often express about having AI write most of the code.

The Jevons Paradox, said Palcuie, is "the favorite paradox in the AI industry. It's when technological improvements increase the efficiency of our resources used, but the resulting lower cost causes consumption to rise rather than fall."

In the case of software, "it's easier to write software, so we write much more of it, so the complexity goes up and not down, which means things break in more interesting ways, which means more incidents, more on call... all the improvements in the tooling will be cancelled by this ever-growing complexity."
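A back-of-envelope illustration of the paradox as stated above, with made-up numbers: a 5x drop in the unit cost of writing software still raises total engineering load if demand grows more than 5x in response.

```python
# Jevons Paradox with invented figures: cheaper per unit, costlier overall.
def total_cost(unit_cost, demand):
    return unit_cost * demand

before = total_cost(unit_cost=10, demand=100)  # 1000 units of effort
after = total_cost(unit_cost=2, demand=800)    # 1600: consumption rose faster
print(before, after)
```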

Maybe, said Palcuie, AI agents can simplify and manage the complexity, maybe "do what we've collectively learned in our industry, but that's a big if."

He ended on a positive note, saying: "The models are the worst today that they'll ever be."

The overall message, though, is not to leave SRE to AI, but to keep training reliability engineers, because they will be needed in future. ®




[1] https://qconlondon.com

[2] https://regmedia.co.uk/2026/03/19/qcon-palcuie.jpg






"Claude goes down more often than any of us would like."

Jedit

Something else it's not good at, then.

F**k. Off.

af108

Palcuie reflected that having engineers on call is "a tax on humans because our systems are not good enough to look after themselves." Palcuie spoke of the stress of being on call.

This bellend is someone working for an organisation which is basically going to create metric fuck tons more of this in the future.

All of this shitty AI generated output is, at some point, going to get "dealt with" by real humans. There are an increasing number of people using it to manage ever more critical infrastructure. What could possibly go wrong?!

The sooner AI - and the morons behind it - get in the bin the better.

Fido

I think the comment on complexity is a relevant one. Human-managed projects become unnecessarily complex due to deadlines and people being too lazy to make it simple. The resulting technical debt can be seen as the result of an inefficient engineering process.

Bill Gates once said automating an inefficient process magnifies the inefficiency. Aside from that rumour about 640K being enough for anyone, this may be the most useful observation anyone has ever made about the computing industry.

Since in this context AI is basically a tool to automate coding, then AI can very easily magnify inefficiency in projects that don't control technical debt.
