A Single Point of Failure Triggered the Amazon Outage Affecting Millions (arstechnica.com)
- Reference: 0179865214
- News link: https://slashdot.org/story/25/10/24/2212255/a-single-point-of-failure-triggered-the-amazon-outage-affecting-million
- Source link: https://arstechnica.com/gadgets/2025/10/a-single-point-of-failure-triggered-the-amazon-outage-affecting-millions/
> The outage that [1]hit Amazon Web Services and took out vital services worldwide was the [2]result of a single failure that cascaded from system to system within Amazon's sprawling network, according to a post-mortem from company engineers. [...] Amazon said the root cause of the outage was a software bug in the software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network. A race condition is an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers' control. The result can be unexpected behavior and potentially harmful failures.
>
> In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it "experienced unusually high delays needing to retry its update on several of the DNS endpoints." While the enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans. Then, a separate DNS Enactor began to implement them. The timing of these two enactors triggered the race condition, which ended up taking out the entire DynamoDB service. [...] The failure caused systems that relied on DynamoDB in Amazon's US-East-1 regional endpoint to experience errors that prevented them from connecting. Both customer traffic and internal AWS services were affected.
>
> The damage resulting from the DynamoDB failure then put a strain on Amazon's EC2 services located in the US-East-1 region. The strain persisted even after DynamoDB was restored, as EC2 in this region worked through a "significant backlog of network state propagations" that needed to be processed. The engineers went on to say: "While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation." In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. Affected AWS functions included creating and modifying Redshift clusters, Lambda invocations, Fargate task launches (including those used by Managed Workflows for Apache Airflow), Outposts lifecycle operations, and the AWS Support Center.
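The failure mode described above (one enactor falling behind while a second enactor applies newer plans, with nothing enforcing ordering) can be reduced to a small sketch. The names below (`endpoint_table`, `enactor`, plan generation numbers) are hypothetical stand-ins, not Amazon's actual implementation; the point is only to show how a delayed writer can clobber newer state.

```python
import threading
import time

# Hypothetical model of the setup described in the post-mortem: a planner
# emits numbered DNS plans, and independent enactors apply them to a shared
# endpoint table with no check on which plan is already in effect.
endpoint_table = {}          # endpoint name -> (plan generation, DNS records)
lock = threading.Lock()      # protects the dict itself, not the ordering logic

def enactor(name, plan_generation, records, delay):
    time.sleep(delay)        # a slow enactor "playing catch-up"
    with lock:
        # BUG: blindly overwrite, even if a newer generation is already live.
        endpoint_table["dynamodb.example"] = (plan_generation, records)
        print(f"{name} applied plan {plan_generation}")

# Enactor A picks up plan 1 but is delayed; enactor B applies plan 2 first.
a = threading.Thread(target=enactor, args=("enactor-A", 1, ["10.0.0.1"], 0.2))
b = threading.Thread(target=enactor, args=("enactor-B", 2, ["10.0.0.2"], 0.0))
a.start(); b.start(); a.join(); b.join()

# The delay here forces the stale overwrite; in production the outcome
# depends on timing -- which is exactly what makes it a race condition.
print(endpoint_table)        # ends up holding the stale plan 1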
Amazon has temporarily disabled its DynamoDB DNS Planner and DNS Enactor automation globally while it fixes the race condition and adds safeguards against applying incorrect DNS plans. Engineers are also updating EC2 and its network load balancer.
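One common safeguard against this class of bug (not necessarily the one AWS is building) is to make plan application conditional on a monotonically increasing generation number, so a delayed enactor cannot overwrite a newer plan. A minimal sketch, reusing the same hypothetical names:

```python
import threading

endpoint_table = {}
lock = threading.Lock()

def apply_plan_guarded(name, plan_generation, records):
    """Apply a DNS plan only if it is strictly newer than the one in effect."""
    with lock:
        current = endpoint_table.get("dynamodb.example")
        if current is not None and plan_generation <= current[0]:
            print(f"{name} skipped stale plan {plan_generation}")
            return False
        endpoint_table["dynamodb.example"] = (plan_generation, records)
        print(f"{name} applied plan {plan_generation}")
        return True

apply_plan_guarded("enactor-B", 2, ["10.0.0.2"])   # applied
apply_plan_guarded("enactor-A", 1, ["10.0.0.1"])   # rejected as stale
```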
Further reading: [3]Amazon's AWS Shows Signs of Weakness as Competitors Charge Ahead
[1] https://slashdot.org/story/25/10/21/1942240/amazons-dns-problem-knocked-out-half-the-web-likely-costing-billions
[2] https://arstechnica.com/gadgets/2025/10/a-single-point-of-failure-triggered-the-amazon-outage-affecting-millions/
[3] https://slashdot.org/story/25/10/24/1830258/amazons-aws-shows-signs-of-weakness-as-competitors-charge-ahead
xkcd (Score:1)
[1]Here [xkcd.com] is the xkcd you were thinking of.
[1] https://xkcd.com/2347/
Race condition at a single point of failure... (Score:2)
...sounds like the Exercise #1 in the first tutorial of my Operating Systems class in 1982.
I think I still have the notes in a 3-ring if Amazon needs them.
Re: (Score:2)
It's great that you can remember every example you've read in your life in a way that precludes you from ever repeating that mistake elsewhere. Unfortunately the rest of us are human beings.
Complicated Shit (Score:1)
Reading through the summary, my thought is that the kind of stuff Amazon (and others) do is some really complicated shit. I can't begin to imagine how the people involved learn, design, operate and manage all that shit. I mean....damn! No wonder they make the big bucks.
Re: (Score:2)
It's not that complicated. When you run a system like this, at every single point, you ask yourself, "What if this goes wrong? What will happen? How can I make it fail safe?"
Re: (Score:1)
I bet that you couldn't design, build, deploy and manage it.
If builders built buildings like programmers (Score:2)
If builders built buildings like programmers write programs, the first woodpecker to come along would destroy civilization.
-- Gerald Weinberg
Despite the title (Score:4, Insightful)
This was *not* a single point of failure. This was a failure *cascade*. Once the first failure was remedied, the other downstream failures were still far from solved.
Those downstream failures were separate failures all their own. Sure, their failure was triggered by the original DNS issue, but if those downstream services had been written in a more resilient way, the DNS issue wouldn't have resulted in long, drawn-out failures of those systems (like ECS for example).
Re: (Score:2)
Yes, it was bad engineering every step of the way.
Not a classic race condition (Score:2)
The classic race condition occurs within a single machine within a single process running multiple threads accessing the same data usually stored in memory. The DNS Enactor and DNS Planner for DynamoDB are likely part of a distributed system, which means they can be deployed on separate machines to handle different tasks efficiently. And the article describes each as having their own cache of the same data. This just sounds like a really bad design with poor timing control more than anything whereby the c
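For contrast, here is a minimal illustration of the "classic" in-process race the parent describes: two threads doing an unsynchronized read-modify-write on the same in-memory value, where the interleaving of the threads decides the result.

```python
import threading

counter = 0

def bump(iterations):
    global counter
    for _ in range(iterations):
        # Unsynchronized read-modify-write: another thread can interleave
        # between the read and the write, so increments can be lost.
        counter += 1

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Depending on the interpreter and timing, this can print less than 200000.
print(counter)
```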
Re: (Score:2)
And these types of people run the world.
Re: Not a classic race condition (Score:2)
And people blame DNS, but it has nothing to do with DNS; as you say, it's their mechanism that is broken. I'm a fan of text-based zone files, but I guess they wouldn't scale fast enough for their needs. Nevertheless, they need to redesign it.
AWS customer that stayed up (Score:2)
Wow, I'm on AWS and had zero downtime, apparently because I'm on a plain-vanilla Lightsail VPS. It's a tiny prod installation, but apparently too small to fail.
Maybe something positive to be said about "old ways" if you don't have to have instant scale...
So incompetence (Score:2)
Not that this is a surprise. Was "AI" involved in the coding?
It is always DNS (Score:4, Insightful)
Subject says it all
Re: (Score:2)
Denial: "It can't be DNS."
Anger: "Why doesn't DNS work?!"
Bargaining: "Maybe if I reload the network interface things will start working again."
Depression: "Where do I even begin troubleshooting? I can't believe this is happening to me."
Acceptance: "It was a faulty DNS record."
Re: (Score:2)
It's the lupus of the tech world.