A single DNS race condition brought Amazon's cloud empire to its knees
- Reference: 1761225492
- News link: https://www.theregister.co.uk/2025/10/23/amazon_outage_postmortem/
- Source link:
The incident began at 11:48 PM PDT on October 19 (6:48 AM UTC on October 20), when [1]customers reported increased DynamoDB API error rates in the US-EAST-1 (Northern Virginia) Region. The root cause was a race condition in DynamoDB's automated DNS management system that left the service's regional endpoint with an empty DNS record.
The DNS management system comprises two independent components (for availability reasons): a DNS Planner that monitors load balancer health and creates DNS plans, and a DNS Enactor that applies changes via Amazon Route 53.
Amazon's [3]postmortem says the error rate was triggered by "a latent defect" within the service's automated DNS management system.
The race condition occurred when one DNS Enactor experienced "unusually high delays" while the DNS Planner continued generating new plans. A second DNS Enactor began applying the newer plans and ran a clean-up process just as the first, delayed Enactor finished applying its now-outdated plan. The clean-up deleted that older plan as stale, immediately removing all IP addresses for the regional endpoint and leaving the system in an inconsistent state that prevented any DNS Enactor from applying further automated updates.
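AWS hasn't published the code behind this, but the shape of the failure is easy to sketch. The following is a hypothetical, heavily simplified Python model – every class, name, and number is invented, and it is not AWS's implementation – in which one enactor applies the newest plan and then sweeps "stale" plans, while a delayed enactor applies a long-outdated plan last, so the sweep deletes the very plan the endpoint now points at:

# Hypothetical, heavily simplified model of the planner/enactor race.
# All names and numbers are invented for illustration; this is not AWS code.
import threading
import time

class DnsStore:
    """Stands in for the regional endpoint's DNS record plus plan metadata."""
    def __init__(self):
        self.lock = threading.Lock()
        self.active_plan = None     # plan currently applied to the endpoint
        self.plans = {}             # plan_id -> list of IP addresses

    def apply(self, plan_id, ips):
        # Defect 1: an enactor applies whatever plan it is holding, even if a
        # newer plan has already been applied (no freshness check).
        with self.lock:
            self.plans[plan_id] = ips
            self.active_plan = plan_id

    def cleanup(self, newest_plan_id):
        # Defect 2: clean-up deletes anything older than the newest plan it
        # knows about, without checking whether a deleted plan is still active.
        with self.lock:
            for plan_id in list(self.plans):
                if plan_id < newest_plan_id:
                    del self.plans[plan_id]
            if self.active_plan not in self.plans:
                self.active_plan = None   # endpoint now resolves to nothing

    def resolve(self):
        with self.lock:
            return self.plans.get(self.active_plan, [])

store = DnsStore()

def slow_enactor():
    time.sleep(0.2)                              # "unusually high delays"
    store.apply(plan_id=1, ips=["10.0.0.1"])     # applies a long-stale plan last

def fast_enactor():
    store.apply(plan_id=5, ips=["10.0.0.5"])     # applies the newest plan...
    time.sleep(0.3)
    store.cleanup(newest_plan_id=5)              # ...then sweeps "stale" plans

threads = [threading.Thread(target=slow_enactor), threading.Thread(target=fast_enactor)]
for t in threads: t.start()
for t in threads: t.join()
print(store.resolve())   # [] – the regional endpoint is left with no addresses

Presumably the safeguards AWS says it will add before re-enabling this automation amount to checks of roughly this kind: an enactor refusing to apply a plan older than the one already in place, and the clean-up refusing to delete a plan that a record still references.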
Until engineers intervened manually, anything connecting to DynamoDB – customer traffic and internal AWS services alike – experienced DNS failures, which in turn impaired EC2 instance launches and network configuration, the postmortem says.
[6]With impeccable timing, AWS debuts automated cloud incident report generator
[7]AWS outage turned smart homes into dumb boxes – and sysadmins into therapists
[8]AWS admits more bits of its cloud broke as it recovered from DynamoDB debacle
[9]Amazon brain drain finally sent AWS down the spout
The DropletWorkflow Manager (DWFM), which maintains leases for physical servers hosting EC2 instances, depends on DynamoDB. When DNS failures caused DWFM state checks to fail, droplets – the EC2 servers – couldn't establish new leases for instance state changes.
After DynamoDB recovered at 2:25 AM PDT (9:25 AM UTC), DWFM attempted to re-establish leases across the entire EC2 fleet. At that scale the process took so long that leases began timing out before it could complete, tipping DWFM into "congestive collapse" that required manual intervention until 5:28 AM PDT (12:28 PM UTC).
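The arithmetic behind a collapse like that is straightforward. A back-of-the-envelope model – the fleet size, lease timeout, and renewal rate below are invented, since AWS gives no figures:

# Hypothetical numbers: if one full renewal pass over the fleet takes longer
# than the lease timeout, hosts renewed early in the pass expire again before
# the pass finishes, rejoin the queue, and the backlog never drains.
FLEET_SIZE      = 3_000_000   # hosts needing a fresh lease after recovery
LEASE_TIMEOUT_S = 600         # a lease expires if not renewed within this
RENEWALS_PER_S  = 2_000       # how fast the manager can re-establish leases

pass_seconds = FLEET_SIZE / RENEWALS_PER_S
print(f"one full pass: {pass_seconds:,.0f}s vs a {LEASE_TIMEOUT_S}s lease timeout")
if pass_seconds > LEASE_TIMEOUT_S:
    print("leases expire before the pass completes: congestive collapse")

Batching or rate-limiting the pass so each batch finishes inside the timeout lets the queue drain instead.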
Next, Network Manager began propagating a huge backlog of delayed network configurations, causing newly launched EC2 instances to experience network configuration delays.
These network propagation delays affected the Network Load Balancer (NLB) service. NLB's health checking subsystem removed new EC2 instances that failed health checks due to network delays, only to restore them when subsequent checks succeeded.
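That add-then-remove cycle is classic health-check flapping. Here is a minimal sketch of the pattern and the usual dampening fix – only changing a target's state after several consecutive identical results – with the thresholds and check sequence invented:

# Minimal sketch of health-check flapping versus a dampened checker.
# Thresholds and the sequence of check results are invented for illustration.
def flapping_checker(results):
    """Reacts to every single check, so transient failures cause churn."""
    return ["in service" if ok else "removed" for ok in results]

def dampened_checker(results, threshold=3):
    """Only changes state after `threshold` consecutive identical results."""
    state, streak, last = "in service", 0, None
    states = []
    for ok in results:
        streak = streak + 1 if ok == last else 1
        last = ok
        if streak >= threshold:
            state = "in service" if ok else "removed"
        states.append(state)
    return states

# A healthy target that is briefly slow to answer (say, waiting on delayed
# network configuration) fails a couple of checks:
checks = [True, True, False, True, False, False, True, True]
print(flapping_checker(checks))   # bounces between "removed" and "in service"
print(dampened_checker(checks))   # stays "in service" through the blips

Most load balancers, NLB's target groups included, expose healthy and unhealthy threshold counts for exactly this reason.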
With EC2 instance launches impaired, dependent services including Lambda, Elastic Container Service (ECS), Elastic Kubernetes Service (EKS), and Fargate all experienced issues.
AWS has disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide until safeguards are in place to prevent the race condition from recurring.
In its apology, Amazon stated: "As we continue to work through the details of this event across all AWS services, we will look for additional ways to avoid impact from a similar event in the future, and how to further reduce time to recovery."
The prolonged outage disrupted websites and services throughout the day, including government services. Some estimates suggest the resultant chaos and damage may yet run to [12]hundreds of billions of dollars. ®
[1] https://www.theregister.com/2025/10/20/amazon_aws_outage/
[3] https://aws.amazon.com/message/101925/
[6] https://www.theregister.com/2025/10/23/aws_cloudwatch_automated_incident_reports/
[7] https://www.theregister.com/2025/10/21/aws_outage_aftermath/
[8] https://www.theregister.com/2025/10/21/aws_outage_update/
[9] https://www.theregister.com/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/
[12] https://edition.cnn.com/business/live-news/amazon-tech-outage-10-20-25-intl
Re: Ouch
Not until you explain to mere mortals what exactly a "DNS Planner" is, and why it's needed in simple terms. It'll help if you assume (accurately) I know nothing about hyperscale or cloud operations.
Re: Ouch
I think it's part of their Route 53 service. I think it's to do with load balancing and DNS routing for latency. Read: "Voodoo".
Re: Ouch
Hmmm, I thought Route 53 was the bus service from Wolverhampton to Bilston, via Wednesfield. So I'm afraid there's no mind bleach for you, sir!
Recovery wasn't rate limited?
One thing missing in that description was any attempt to apply rate limiting to, well, any part of it.
So a huge pile of machines basically all try to come up at once, without the staggering that limiting would cause (or inflict, depending upon p.o.v.), and start getting into a mess.
Is this genuinely a surprise to anybody? Isn't everyone charged with engineering a system supposed to be asking "what happens if it all switches on at once?" no matter what the cause might be? From checking whether recovery from a power failure[1] means the hard drives[2] can be allowed to spin up all at once[3], to whether you can serve netboot images fast enough to prevent watchdog reboots, or even how many DNS leases you can serve out before you are swamped just handling renewals, because you KNOW you set the lease period way shorter than the DNS designers ever expected[4]. (See the sketch after the footnotes.)
[1] or Lady Florence pushing the Big Red Switch on Opening Day, not realising this one isn't a dummy
[2] or the dynamos, each racing to be the master frequency the rest have to sync to
[3] even in your home lab, can the circuit take that strain
[4] I *think* I understand what was going on here, allowing machine identities to move around as hardware becomes available to handle requests for user operations (please, if anyone can correct that understanding, do so) but is that how people normally do load balancing? Not my area at all, but this really feels like a misuse of DNS.
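The kind of staggering being asked about above usually looks something like this minimal sketch – every number here is invented:

# Staggered, jittered reconnection: each client waits a random "splay" before
# its first attempt and backs off with jitter after failures, so a fleet
# restarting at once doesn't hammer the service in lockstep.
import random
import time

def reconnect(attempt, max_splay=300, base=1.0, cap=120.0, max_tries=8):
    time.sleep(random.uniform(0, max_splay))   # spread the initial stampede
    for n in range(max_tries):
        if attempt():                          # attempt() returns True on success
            return True
        delay = min(cap, base * 2 ** n)        # exponential backoff...
        time.sleep(random.uniform(0, delay))   # ...with full jitter
    return False

Server-side admission control (a token bucket in front of the registration path) covers the clients that don't behave.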
Re: Recovery wasn't rate limited?
really feels like a misuse of DNS.
I also got the impression low-TTL DNS records were being used to route traffic, which, at this scale, really is playing with fire.
One of the advantages of IPv6 is that the address space is so large that each endpoint could be assigned a permanent address along with any number of ephemeral addresses, which could be assigned to processes, applications, etc., and follow them as they migrate around the cluster – hopefully pushing the routing out to the network and its dynamic routing processes.
I suspect much of this stuff is a dark art simply because many of its practitioners have avoided enlightenment. ;)
is that how people normally do load balancing?
In my career (of distant memory) we'd have a layer of boxes behind a single IP address that distributed "work" one layer down to the servers.
I am vaguely reminded of an incident from my past
PHB decided that machines left on overnight is bad.
Everyone switches machine off, goes home.
Gets to work. Powers up. Machine needs to download and install an update.
3,000 machines and nearly 24 hours later, when no one has done a day's work, it's IT's fault.
Re: I am vaguely reminded of an incident from my past
That's the funniest thing I've read in a while, superb!
You have to hand it to the ignorant PHBs who will gladly shout the techies down and implement some diktat (or we get fired); we do as commanded and still get a bollocking 'cos the PHB is desperately trying to keep the C-suite from chewing his arse to pieces over whatever his latest cock-up was!
bed is stuck
I wanna know when my fucking bed will start working again!
Re: bed is stuck
You have a special bed for fornicating ?
Maybe you can use your sleeping bed for the time being ? Or the couch ...
Re: bed is stuck
https://www.msn.com/en-us/news/other/this-weeks-aws-crash-made-smart-beds-overheat-get-stuck-in-wrong-position/ar-AA1P0rN1
Re: bed is stuck
You have a special bed for fornicating ?
Maybe you can use your sleeping bed for the time being ? Or the couch ...
Or a sock.
Re: bed is stuck
I have to question what state humanity is in when we have to have a f**king bed connected to the world's biggest network 24/7 or it won't work!
Too big to turn on?
So when DNS was fixed, there were so many services trying to restart that many of them failed.
Looks like they still didn't catch the cause
The report explains what the race condition was, but not why the Enactor was running so slowly in the first place, which was technically the cause of the problem. I wonder whether they know – a challenge of working at this scale is that you can have problems that only happen with production workloads, and it's hard to reproduce that and properly isolate the cause.
Re: Looks like they still didn't catch the cause
A cascade of FFS! causes:
1. The enactor should not have been running slow.
2. Nothing to monitor its status.
3. The various boxen had no idea what to do about it before jumping in feet-first, cuz nobody had thought through Murphy's Law.
4. Someone who knows what they are doing should have been retained with sweeteners, not driven out by bullying manglement.
I expect there are several more.
Asynchronous Programming is HARD
As soon as you have more than one or two operations happening at the same time, being (re)triggered by external sources, or just operating at scale, dealing with all of the permutations and combinations is rather complex – to say the least. Even supposedly simple systems can be mind-boggling sometimes.
At this point, earlier articles about a brain drain at AWS make a lot of sense. Experienced developers and network admins would understand these complexities, and probably not roll out "quick changes" to anything without a thorough review (including in-house knowledge about the parts of the system that are more prone to issues, etc.). Juniors are more likely to feel the pressure to deliver (and kowtow to a PHB), and roll something out without understanding the implications. Worse, if someone was cutting and pasting simplistic code generated from AI...
Ouch
As someone with some levels of AWS certification, that's terrifying. Mostly because I understood most of that gobbledygook explanation. Pass the mind bleach.