Today is when the Amazon brain drain finally sent AWS down the spout
- Reference: 1760990156
- News link: https://www.theregister.co.uk/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/
- Source link:
And so, a quiet suspicion starts to circulate: where have the senior AWS engineers who've been to this dance before gone? And the answer increasingly is that they've left the building — taking decades of hard-won institutional knowledge about how AWS's systems work at scale right along with them.
What happened?
AWS reports that on October 20, at 12:11 AM PDT, it began investigating “increased error rates and latencies for multiple AWS services in the US-EAST-1 Region.” About an hour later, at 1:26 AM, the company confirmed “significant error rates for requests made to the DynamoDB endpoint” in that region. By 2:01 AM, engineers had identified [1]DNS resolution of the DynamoDB API endpoint for US-EAST-1 as the likely root cause, which led to cascading failures for most other things in that region. DynamoDB is a "foundational service" upon which a whole mess of other AWS services rely, so the blast radius for an outage touching this thing can be huge.
As a result, [2]much of the internet stopped working: banking, gaming, social media, government services, buying things I don't need on Amazon.com itself, etc.
AWS has provided increasing levels of detail as new information has come to light, as is its tradition when outages strike. Reading through it, one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow. To be clear: I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time.
Note that for those 75 minutes, visitors to the AWS status page (reasonably wondering why their websites and other workloads had just burned down and crashed into the sea) were met with an "all is well!" default response. Ah well, it's not as if AWS had [4]previously called out slow outage notification times as an area for improvement. [5]Multiple times even. We can [6]keep doing this if you'd like.
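Not that you have to take the status page's word for it. A minimal sketch along these lines (the endpoint name is the one AWS cited in its updates; everything else is my own illustration, not anything AWS actually runs) is enough to tell "DNS won't resolve the endpoint at all" apart from "the endpoint resolves and answers, but the service behind it is unhappy," which is roughly the fork AWS spent those 75 minutes standing at:

```python
# Purely illustrative diagnostic, not AWS tooling: check whether the regional
# DynamoDB endpoint resolves in DNS and completes a TLS handshake on port 443.
import socket
import ssl

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"  # the endpoint named in AWS's updates


def resolve(hostname: str) -> list[str]:
    """Return the addresses DNS hands back for the endpoint, or raise socket.gaierror."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def tls_reachable(hostname: str, timeout: float = 5.0) -> bool:
    """Confirm a TCP connection and TLS handshake succeed (reachability, not health)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, 443), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname):
            return True


if __name__ == "__main__":
    try:
        print(f"{ENDPOINT} resolves to: {', '.join(resolve(ENDPOINT))}")
        print("TLS handshake OK" if tls_reachable(ENDPOINT) else "TLS handshake failed")
    except socket.gaierror as exc:
        print(f"DNS resolution failed: {exc}")
    except OSError as exc:
        print(f"Endpoint unreachable: {exc}")
```

It fixes nothing, obviously, but it answers the first triage question in seconds rather than in an hour.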
The prophecy
AWS is very, very good at infrastructure. You can tell this is a true statement by the fact that a single one of their 38 regions going down (albeit a very important region!) causes this kind of attention, as opposed to it being "just another Monday outage." At AWS's scale, all of their issues are complex; this isn't going to be a simple issue that someone should have caught, precisely because they already hit the simpler versions of these issues years ago and ironed the kinks out of their resilience story.
Once you reach a certain point of scale, there are no simple problems left. What's more concerning to me is the way it seems AWS has been flailing all day trying to run this one to ground. Suddenly, I'm reminded of something I had tried very hard to forget.
At the end of 2023, Justin Garrison left AWS and [9]roasted them on his way out the door. He stated that AWS had seen an increase in Large Scale Events (or LSEs), and predicted significant outages in 2024. It would seem that he discounted the power of inertia, but the pace of senior AWS departures certainly hasn't slowed — and now, with an outage like this, one is forced to wonder whether those departures are themselves a contributing factor.
You can hire a bunch of very smart people who will explain how DNS works at a deep technical level (or you can hire me, and I'll incorrectly tell you that it's a database), but the one thing you can't hire for is the person who remembers that when DNS starts getting wonky, you check that seemingly unrelated system in the corner, because it's historically played a contributing role to some outages of yesteryear.
When that tribal knowledge departs, you're left having to reinvent an awful lot of in-house expertise that didn't want to participate in your RTO games, or play Layoff Roulette yet again this cycle. This doesn't impact your service reliability — until one day it very much does, in spectacular fashion. I suspect that day is today.
The talent drain evidence
This is The Register, a respected journalistic outlet. As a result, I know that if I publish this piece as it stands now, an AWS PR flak will appear as if by magic, waving their hands, insisting that "there is no talent exodus at AWS," à la Baghdad Bob. Therefore, let me forestall that time-wasting enterprise with some data.
It is a fact that there have been [15]27,000+ Amazonians impacted by layoffs between 2022 and 2024, continuing into 2025. It's hard to know how many of these were AWS versus other parts of its Amazon parent, because the company is notoriously tight-lipped about staffing issues.
Internal documents reportedly say that Amazon [16]suffers from 69 percent to 81 percent regretted attrition across all employment levels. In other words, "people quitting who we wish didn't."
The internet is full of anecdata of senior Amazonians lamenting the hamfisted approach of their Return to Office initiative; [17]experts have weighed in citing similar concerns.
If you were one of the early employees who built these systems, the world is your oyster. There's little reason to remain at a company that increasingly demonstrates apparent disdain for your expertise.
My take
This is a tipping point moment. Increasingly, it seems that the talent who understood the deep failure modes is gone. The new, leaner, presumably less expensive teams lack the institutional knowledge needed to, if not prevent these outages in the first place, significantly reduce the time to detection and recovery. Remember, there was a time when Amazon's "Frugality" leadership principle meant doing more with less, not doing everything with basically nothing. AWS's operational strength was built on redundant, experienced people, and when you cut to the bone, basic things start breaking.
I want to be very clear on one last point. This isn't about the technology being old. It's about the people maintaining it being new. If I had to guess what happens next, the market will forgive AWS this time, but the pattern will continue.
AWS will almost certainly say this was an "isolated incident," but when you've hollowed out your engineering ranks, every incident becomes more likely. The next outage is already brewing. It's just a matter of which understaffed team trips over which edge case first, because the chickens are coming home to roost. ®
[1] https://www.theregister.com/2025/10/20/aws_outage_chaos/
[2] https://www.theregister.com/2025/10/20/amazon_aws_outage/
[4] https://aws.amazon.com/message/12721/
[5] https://aws.amazon.com/message/11201/
[6] https://aws.amazon.com/message/41926/
[9] https://justingarrison.com/blog/2023-12-30-amazons-silent-sacking/
[15] https://www.cnbc.com/2025/07/17/amazon-web-services-has-some-layoffs.html
[16] https://www.engadget.com/amazon-attrition-leadership-ctsmd-201800110-201800100.html
[17] https://finance.yahoo.com/news/amazon-back-office-crusade-could-090200105.html/
Brain drain in this case is the mirror image of enshittification, applied on the inside of the service (employees) rather than the outside (users).
Fret not, shareholders
AI will save us!
Re: Fret not, shareholders
>AI will save us!
Funny you should say that. Here's Musk, this morning, not long after things started falling over.
https://twitter.com/elonmusk/status/1980221072512635117
Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.
Just let all of their hosted AI systems look at Amazon's web services - performance, network diagrams, outages, etc. and I'm sure they'll come up with some fixes. If all the circuits are software patchable then no need for techies plugging/unplugging.
Then throw in the Amazon financials and personnel organization plans and see if Bezos and every other human can't be sent packing.
I currently work with AWS on a daily basis (and for the most part it's been an infinitely better experience than working with Azure).
I've encountered several people who have worked for AWS - and most of them hated the culture and pressure. So much so that even though Amazon have tapped me up to go and work for them on what would probably be close to double my salary, I stay the hell away - I don't want to work in that kind of environment.
Today doesn't surprise me - it's been coming. They've known about the us-east-1 SPOFs for years and clearly haven't been bothered to fix them. The quality of technical support has been getting worse and worse.
Hopefully today will serve as a massive wake-up call for AWS. A lot of what they provide is great, and there are clearly still very good teams that work there. But they need to refocus on quality, and they can only do that with the best people. And in order to attract and retain the best, salary alone isn't enough any more for most of us. Daft thing is, if AWS sorted out their culture and working environment, they wouldn't necessarily even need to pay the top salaries, because the opportunity to work on interesting tech that underpins so many of the world's major companies would be a draw in itself.
"Hopefully today will serve as a massive wake-up call for AWS"
I wouldn't hold your breath. There will be incident reviews, meetings, assessments, analysis, etc., but it will basically boil down to "what can we do to stop this from happening again without actually spending any more money?" So no to hiring fresh talent or retaining the talent already in play, no to a radical overhaul of process and knowledge, no to remediation of known issues if it involves expenditure. Instead it will be do more with less: beat the employees harder, enforce more and more diligence and output from fewer and fewer people for the same or less money. Spin it like mad with catchy titles like knowledge sharing, centers of excellence, efficiency improvement initiatives, agile resilience, and continuous operational excellence.
There’ll be shiny PowerPoint decks about empowering ownership and shifting left, while the remaining engineers are shifting caffeine straight into their bloodstream at 3 a.m.
Next quarter, they’ll unveil a bold new policy called Focus Fridays which will be promptly filled with mandatory incident retrospectives. Someone will suggest replacing ancient tooling, only to be told, “We’ll revisit that next fiscal year,” which is code for never.
Then come the internal awards: “Unsung Hero of the Outage” goes to the one poor sod who rebooted the wrong thing but accidentally fixed it.
HR will roll out a “Resilience Recognition” badge on the intranet. This will be marketed with great fanfare and excitement, showcasing how the company truly values its employees and recognizes their contribution, because badges are cheap. Leadership will congratulate themselves for “learning from adversity,” and by the time the next blackout happens, they’ll have a snazzy new dashboard to watch it fail in real time, alongside the investment portfolio dashboard that takes up a greater fraction of their attention.
But don’t worry!!!! There’ll be a T-shirt. “I survived the 2025 AWS outage.” Comes in gray. Just like morale. If it weren't for the negative impacts on the employees and customers, the word Schadenfreude would be very applicable.
And it's a sad indictment of current management practices, and in particular the MBA brigade*, that this is all by design: acceptable losses on the altar of profit, albeit short-term profit. Efficiency theatre as far as the eye can see.
*Yes, the same people who think Jack Welch was a misunderstood visionary rather than the spiritual father of mass layoffs, short-termism, and shareholder-value human sacrifices. The kind who see burnout as a KPI and chaos as a “scaling opportunity.”
Next they’ll launch a “Transformation Task Force” whose primary transformation will be renaming the same broken process from post-mortem to value realization review. A new acronym, a new logo, and boom, problem solved at a low low cost, honest, the consultants said so. Until the next outage, at which point someone will quote Sun Tzu in Slack.
As well as brain drain at Amazon, there's brain rot (AI)...
[1]Universities Are Part of the Cursor Resistance
So these students were in for some culture shock when they spent the past summer interning at Amazon, where their managers strongly encouraged them to use AI coding tools. When they used Cline, the coding agent of choice for their teams, their managers told them to keep up the good work. When they didn’t use Cline, their managers asked why not.
One intern recounted bringing errors to his manager to help solve. The manager would copy and paste the code into Cline and instruct the AI to fix the error, instead of fixing the bug manually.
As a result, the intern said he wrote fewer than 100 lines of code himself over the summer, while Cline wrote thousands. A spokesperson for Amazon said employees are encouraged but not required to use AI tools.
If they're just copying and pasting and doing what the coding assistant says, eventually they're going to screw up.
[1] https://archive.ph/https://www.theinformation.com/articles/universities-part-cursor-resistance
Re: As well as brain drain at Amazon, there's brain rot (AI)...
Cline, as in Patsy? Because I'm crazy for crying/And crazy for trying/And crazy for trusting you? Or because It Falls to Pieces?
[Amazon suffers from 69 percent to 81 percent regretted attrition]
That's a whole lot; it means the working environment is really bad.
The great Godaddy outage of 2012
Everyone forgets that when GoDaddy went down in 2012, much of the internet stopped working. Certainly not because they hosted most of it, but because most people used them as a DNS registrar. That day, a config glitch sent their entire anycast DNS network spiraling down globally, causing glue records for some 70M domains to stop working at all. It took a good 12 hours to fix, I think, with network vendors involved.
Much the same, too many eggs in one basket, and AWS has very large baskets.
I did initially think that things like IAM will ultimately be tied to a single region, and that, just like on-prem, if AD DS goes down then you're screwed.
But then again, AD DS doesn't really go down due to a single server or datacentre failure, because it's multi-master. Why isn't AWS IAM the same?
DNS is also multi-master, so combined they shouldn't fail due to an issue in a single region. I can understand issues occurring, but we've hammered these out on-prem for decades - surely the big cloud providers can surpass this?
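For what it's worth, from the customer side the practical workaround is to fail over yourself, roughly along the lines of this sketch (assumptions: the table is replicated with DynamoDB Global Tables, and the table name, key, and region list below are made up for illustration):

```python
# Rough illustration only: fall back to a second region for a DynamoDB read
# when the primary region's endpoint is unreachable or erroring.
# Assumes the table is replicated via Global Tables; names below are made up.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, then fallback
TABLE = "example-sessions"            # hypothetical replicated table


def read_with_failover(key: dict) -> dict | None:
    """Try each region in turn; return the item from the first region that answers."""
    last_error: Exception | None = None
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
        )
        try:
            return client.get_item(TableName=TABLE, Key=key).get("Item")
        except (BotoCoreError, ClientError) as exc:  # covers DNS/connection failures too
            last_error = exc
            continue
    raise RuntimeError(f"all regions failed, last error: {last_error}")


if __name__ == "__main__":
    print(read_with_failover({"session_id": {"S": "abc123"}}))
```

It only helps on the data plane, of course; a single-homed control-plane dependency like IAM is exactly the kind of thing you can't route around from the client side, which is rather the point.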
Jesus, this makes very stressful reading for senior AWS executives. So stressful, in fact, that a significant remuneration bump is needed to keep them onboard. And a performance bonus if it doesn't happen again in the next 3 months.