AWS outage exposes Achilles heel: central control plane
- Reference: 1760976068
- News link: https://www.theregister.co.uk/2025/10/20/aws_outage_chaos/
- Source link:
The problems began just after midnight US Pacific Time today when Amazon Web Services (AWS) noticed increased error rates and latencies for multiple services running within its home US-EAST-1 region.
Within a couple of hours, Amazon's techies had identified DNS as a potential root cause of the issue – specifically the resolution of the DynamoDB API endpoint in US-EAST-1 – and were working on a fix.
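To make that failure mode concrete, a lookup like the one below (a minimal Python sketch of ours, not anything AWS has published) shows the step that was reportedly failing: if the regional DynamoDB endpoint name does not resolve, SDK calls fail before a connection to the service is ever attempted.

import socket

# The regional DynamoDB API endpoint whose DNS resolution was reportedly at fault.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # getaddrinfo performs an ordinary DNS lookup, the same first step an AWS SDK takes.
    records = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
    print(ENDPOINT, "resolves to:", sorted({r[4][0] for r in records}))
except socket.gaierror as err:
    # Roughly what clients saw during the incident: the name does not resolve,
    # so no request to DynamoDB can even be sent.
    print("DNS resolution failed for", ENDPOINT, "-", err)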
However, the issue was affecting other AWS services too, including global services and features that rely on endpoints operating from AWS' original region, such as IAM (Identity and Access Management) updates and DynamoDB global tables.
While Amazon worked to fully resolve the problem, the issue was already causing widespread chaos to websites and online services beyond the Northern Virginia locale of US-EAST-1, and even outside of America's borders.
As The Register reported earlier [4], Amazon.com itself was down for a time, while the company's Alexa smart speakers and Ring doorbells stopped working. But the effects were also felt by messaging apps such as Signal and WhatsApp, while in the UK, Lloyds Bank and even government services such as tax agency HMRC were impacted.
According to a BBC report [6], outage monitor Downdetector indicated there had been more than 6.5 million reports globally, with upwards of 1,000 companies affected.
How could this happen? Amazon has a global footprint, and its infrastructure is split into regions: physical locations, each comprising a cluster of datacenters. Each region consists of a minimum of three isolated and physically separate availability zones (AZs), each with independent power, connected via redundant, ultra-low-latency networks.
Customers are encouraged to design their applications and services to run in multiple AZs to avoid being taken down by a failure in one of them.
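As a rough illustration (a boto3 sketch of ours, not from Amazon's documentation), the zones that make up a region can be listed like this; spreading workloads across them guards against losing any one of them, but not against a fault in a region-wide or control-plane dependency:

import boto3

# List the isolated availability zones that make up one region.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)
for az in response["AvailabilityZones"]:
    print(az["ZoneName"], az["ZoneId"])

# Multi-AZ redundancy protects against the failure of one of these zones; it does
# not protect against a region-wide dependency such as the DynamoDB endpoint, nor
# against the control-plane issue described below.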
Sadly, it seems that the entire edifice has an Achilles heel that can cause problems regardless of how much redundancy you design into your cloud-based operations, at least according to the experts we asked.
"The issue with AWS is that US East is the home of the common control plane for all of AWS locations except the federal government and European Sovereign Cloud. There was an issue some years ago when the problem was related to management of S3 policies that was felt globally," Omdia Chief Analyst Roy Illsley told us.
He explained that US-EAST-1 can cause global issues because many users and services default to using it since it was the first AWS region, even if they are in a different part of the world.
Certain "global" AWS services or features are run from US-EAST-1 and are dependent on its endpoints, and this includes DynamoDB Global Tables and the Amazon CloudFront content delivery network (CDN), Illsley added.
Sid Nag, president and chief research officer for Tekonyx, agreed.
"Although the impacted region is in the AWS US East region, many global services (including those used in Europe) depend on infrastructure or control-plane / cross-region features located in US-EAST-1. This means that even if the European region was unaffected in terms of its own availability zones, dependencies could still cause knock-on impact," he said.
"Some AWS features (for example global account-management, IAM, some control APIs, or even replication endpoints) are served from US-EAST-1, even if you're running workloads in Europe. If those services go down or become very slow, even European workloads may be impacted," he added.
Any organization whose resiliency plans extend to duplicating resources across two or more different cloud platforms will no doubt be feeling smug right now, but that level of redundancy costs money, and don't the cloud providers keep telling us how reliable they are?
The upshot of this is that many firms will likely be taking another look at the assumptions underpinning their cloud strategy.
"Today's massive AWS outage is a visceral reminder of the risks of over-reliance on two dominant cloud providers, an outage most of us will have felt in some way," said Nicky Stewart, Senior Advisor at the Open Cloud Coalition.
Cloud services in the UK are largely dominated by AWS and Microsoft's Azure, with Google Cloud coming a distant third.
"It's too soon to gauge the economic fallout, but for context, last year's global CrowdStrike outage was estimated to have cost the UK economy between £1.7 and £2.3 billion ($2.3 and $3.1 billion). Incidents like this make clear the need for a more open, competitive and interoperable cloud market; one where no single provider can bring so much of our digital world to a standstill," she added.
"The AWS outage is yet another reminder of the weakness of centralised systems. When a key component of internet infrastructure depends on a single US cloud provider, a single fault can bring global services to their knees - from banks to social media, and of course the likes of Signal, Slack and Zoom," said Amandine Le Pape, Co-Founder of Element, which provides sovereign and resilient communications for governments.
[8]AWS wiped my account of 10 years, says open source dev
[9]Amazon will refund $1.5B to 35M customers allegedly duped into paying for Prime
[10]Under-qualified sysadmin crashed Amazon.com for 3 hours with a typo
[11]Bezos plan for solar powered datacenters is out of this world… literally
But there could also be compensation claims in the offing, especially where financial transactions may have failed or missed deadlines because of the incident.
"An outage such as this can certainly open the provider and its users to risk of loss, especially businesses that rely on its infrastructure to operate critical services," said Henna Elahi, Senior Associate at Grosvenor Law.
Elahi added that it would, of course, depend on factors such as the terms of service and any service level agreements between the business and AWS, the specific cause of the outage, and its severity and length.
"The impacts on Lloyds Bank, for example, could have very serious implications for the end user. Key payments and transfers that are being made may fail and this could lead to far reaching issues for a user such as causing breaches of contracts, failure to complete purchases and failure to provide security information. This may very well lead to customer complaints and attempts to recover any loss caused by the outage from the business," she said.
At 15.13 UTC today, AWS updated its Health Dashboard:
"We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."
Thirty minutes later, it added:
"We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services. We have also identified and are applying next steps to mitigate throttling of new EC2 instance launches." ®
[4] https://www.theregister.com/2025/10/20/amazon_aws_outage/
[6] http://www.bbc.co.uk/news/live/c5y8k7k6v1rt
[8] https://www.theregister.com/2025/08/06/aws_wipes_ten_years/
[9] https://www.theregister.com/2025/09/25/amazon_will_refund_15b_to/
[10] https://www.theregister.com/2025/07/21/who_me/
[11] https://www.theregister.com/2025/10/03/bezos_space_dcs/
Re: It's DNS. No Surprise.
Apart from when it's BGP. But it's normally DNS.
Re: It's DNS. No Surprise.
Unless it's Dave from accounts clicking the random link.
The fragility of the system
Modern computer systems are so fragile that a single line of code can bring down the house of cards.
It's a sad indictment of the state of our computer systems (and industry) that the whole lot can go mammaries skywards so easily.
It's long past time that companies started to get more active about making more resilient software.
Re: The fragility of the system
In fairness, that's always been the case. A wayward line of code in an OS kernel could certainly bring your system to a screeching halt, no matter how resilient the rest of it is. The main things which have changed are that so many distributed systems are dependent on external resources and that, in the case of AWS users, those systems are apparently dependent on a single point of failure. Resolving these problems is a trivial task and thus left as an exercise for the reader.
Re: The fragility of the system
One could argue that it's actually a sign of the complexity of those systems.
After all, something simple like rupturing a fuel line in a car can cause all sorts of problems. Is this any different?
I'm not excusing poor design and deployment, and heaven knows we see far too much of that especially now we have "professional managers" running the patch, but engineering of all kinds is all about finding the balance between cost and the acceptable likelihood of failure. Let's hope that the relevant parties here will reassess that based on today's incident.
From the Wikipedia article on DNS: "This mechanism provides distributed and fault-tolerant service and was designed to avoid a single large central database." (my emphasis)
Somebody forgot the "distributed" part and that fault tolerance depends on it.
It sounds like the DNS service was robust. It was robustly pointing things at the wrong place...
This isn't complexity.
It's penny pinching.
Somewhere I can guarantee there is a component that is a single point of failure that wasn't protected as it would have cost too much.
By definition, if you have redundancy, you have inefficiency. By all means choose efficiency over resilience, but for the love of god, own it when things go wrong.
Re: This isn't complexity.
There is also a lack of care. Due to vendor lock-in, what are those businesses and government departments going to do?
Re: This isn't complexity.
Still tickled by an old posting...
[1]Abend's Observation: "Many cloud systems are actually just distributed single points of failure"
And here the distributed single point of failure that all were referencing was AWS?
[1] https://forums.theregister.com/forum/all/2024/01/27/teams_outage_again/#c_4800389
Re: This isn't complexity.
"wasn't protected as it would have cost too much"
Nope, not at a company like Amazon. More likely it is one of these:
1) no one realized it was and always has been a single point
2) originally it was protected, but other changes caused it to become a single point without anyone realizing
3) they know it is a single point and there is a huge project underway to address that but it isn't complete yet
4) they know it is a single point but they can't address that without basically throwing out their entire design and starting over from scratch
HMRC
and even government services such as tax agency HMRC were impacted.
Surely HMRC shouldn't have anything stored in US-EAST-1, no? And if it did, I am sure the government will promptly launch an investigation?
(Setting aside the whole omnishambles of using AWS at all: due to the CLOUD Act, taxpayer data is not safe.)
Re: HMRC
They may well have no /data/ stored there - it should all be in the UK region (eu-west-2). But as the article explains, it turns out that the entire global AWS cloud still has critical dependencies on its 'mothership' region, the original hub in N Virginia.
Re: HMRC
As far as I know, the UK isn’t a dependency of the US.
This isn’t just an inconvenience - it’s a sovereignty issue. It’s absurd that critical UK government systems can go down because something broke in Virginia. Whether or not data is physically stored there is irrelevant: AWS is bound by the US Cloud Act, and no amount of “UK region” branding changes that.
Re: HMRC
To clarify: governments are not exempt from the US CLOUD Act, so placing data in eu-west-2 is just a coping mechanism. It has nothing to do with ensuring the safety of data about us. This whole procurement should have been audited.
Re: HMRC
Government should not be using any US hyperscaler. It's laziness from Crown Commercial, and maybe brown envelopes.
UK government infrastructure & data should be on UK owned infrastructure
Re: HMRC
> it’s a sovereignty issue
We've exercised our sovereignty and chosen to store our data using a US provider.
Maybe ask your MP why they're not using https://crownhostingdc.co.uk/
Re: HMRC
>> the entire global AWS cloud still has critical dependencies on its 'mothership' region
> As far as I know, the UK isn’t a dependency of the US
Grabbing the wrong end of the stick there
DIY or rely.
Terms and conditions. Vol. 4, p.632, Item 23:
If AWS staff are trying their best, no compensation is payable.
The way we used to do tech, on prem, was much more reliable. Even if someone nuked the East coast of America, your server room would keep whirring, and your stuff would just work.
All the stuff GAFA flog us to ensure themselves a healthy income (SaaS, cloud storage and AI) makes our tech inherently less resilient and forces us to pay regular fees to GAFA just to exist.
You'd like to think people would change after this. I'm sure 'lessons will be learned' but nobody will actually do anything differently. They will choose the lazy option and just keep paying the subs.
If you want to depend on a third party for your music, pay for streaming. If not, buy CDs. Ditto for enterprise level tech. DIY or rely. Personally, I still buy CDs.
Re: DIY or rely.
The way we used to do tech,
Ever been to an AWS sales meeting with a client?
The old-school on-prem engineers just don’t have the same woo and bullshitting skills.
Nice when you can directly blame.
We had Windows Virtual Desktops going "poof" after 20 mins.
Was it our Cisco VPN using AWS for logging?
Was it VMWare phoning home and having no cloud db?
Was it some other logging timing out?
What are we going to do? Run tcpdump on our pipe, log all the IPs and work out if they are in AWS? (Well, we will now.)
If our intranet was down all morning we'd be in the boardroom explaining.
Happily we can now just say "AWS was down" and everybody shrugs.
Re: Nice when you can directly blame.
Happily we can now just say "AWS was down" and everybody shrugs.
I used to say: Happily we can now just say "It's a MS product" and everybody shrugs.
Amazon's US-EAST-1 region outage caused widespread chaos, taking websites and services offline even in Europe and raising some difficult questions.
Yes, you're not wrong. So much for how seriously big business takes GDPR.
GDPR was created for big business to legitimise the personal data trade under the guise of protecting privacy (classic gaslighting).
Most users don't read what they consent to (courtesy of the Cookie Law, which trained them to click agree to anything).
GDPR != Cookie law.
You missed my point. The Cookie Law wasn’t about privacy - it was behavioural conditioning. It trained people to click “Agree” without reading anything.
GDPR was the next step: once users were preconditioned, it gave corporations a legal framework to collect and trade data with subject consent. Before GDPR, that kind of large-scale data trading was legally murky.
Well it trained me to click disagree without reading anything.
Ho-Hum here we go again ... again ... again ... (Error Recursion level too high !!!)
Multiple levels of 'Someone else’s computer/system/service' relying on each other !!!
Each level is configured to be 'Good enough' as being 'better than good enough' costs more and needs more 'expensive' labour.
Everyone ignores the obvious risks because it 'probably won't happen' ... when it does it is 'someone else’s fault' !!!
Rinse, repeat and 'make bank' as the Americans say, supposedly !!!
P.S.
Don't ever learn any lessons ... because they cost money too.
Also, the original C-level architect(s) of these disasters, who need to learn, are usually long gone, setting in place the next disaster to come ('AI' perhaps !!!???)
:)
So what was the problem?
Yes, it was DNS. But that could mean anything. Did a server fail - in which case their redundancy is hopeless - or did someone just put the wrong data in - in which case they need to improve their procedures - or what?
So the mighty AWS still has a single point of failure...
... in its original implementation, because nobody bothered to replicate it across other regions. Maybe because having authentication management close to Langley is a bonus? Or is it just the proverbial sysadmin laziness? "It works [most of the time], don't touch it..."
This points to the fundamental issue with cloud computing and "centralization"
Cloud computing only increases your security attack surface. Think of a wall of brains ... it only takes a cloud vendor's Igor to mess things up. You are still exposed to your own errors and have added the vendor's (even though they are generally very good). Another risk, though not in play today, is vendor lock-in: don't use proprietary tools unless you have to, and if you must ... do not use the cloud vendor's, or you have made them a monopoly.
Centralization is another issue, because AWS is so large that the likelihood that sites you depend on use AWS is high. So if AWS gets sick, it affects a lot of sites and services at once, and the true cause may be hard to deduce.
Distributed single point of failure
AWS's control plane being centralized in US-EAST-1 creates a dependency. When US-EAST-1 experiences issues, it can cause global impacts because orchestration and control APIs become unavailable or degraded.
TBTF
AWS is now critical infrastructure for humanity. Too big to fail. And, oops, it failed. It only takes one small corner of AWS to create this much chaos, worldwide? We kid ourselves that we have conquered the resilience problem...
When I was a lad...
Or even an old geezer, I wrote code that relied on remote databases and in said environment you would have alternate service "pipes" available.
So when a client process couldn't get a response to a request in a timely manner, it would try again and if that failed would try one of the alternate service "pipes".
For mission critical stuff, you had an active-active topology which meant that some alternate service "pipes" would route to the primary database and some to the secondary database.
And we are talking active-active systems with very large databases and 10,000's of active users.
Underpinning this were also comms connections that were physically separate (i.e. the digger driver can only take out one link) and that used different comms vendors with radically different routing (in the physical sense).
Of course, the above is an extreme (but necessary) topology for that app that is the life (and $$$$) blood of your company.
My reading of this AWS issue is that some relatively noddy (in terms of scale) but global DynamoDB database couldn't be reached. If said database is so important, then why aren't there multiple, synced copies in multiple regions/countries, with multiple pathways to those copies, and software that can detect a failed request and "self heal" by trying other pathways?
Kids today and all that but I sit here shaking my head that shit I and countless others did years ago with critical apps/databases has been lost to the current generation.
Bluck
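For what it's worth, the "alternate pipes" approach described above maps fairly directly onto a multi-region fallback. A minimal boto3 sketch against a DynamoDB global table might look like this (the region list, table name and key are hypothetical, purely for illustration):

import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical replica regions and table name for a global table.
REGIONS = ["us-east-1", "eu-west-2", "ap-southeast-2"]
TABLE = "orders"

def get_item_with_failover(key):
    """Try each replica in turn, with short timeouts so a dead 'pipe' fails fast."""
    last_error = None
    for region in REGIONS:
        client = boto3.client(
            "dynamodb",
            region_name=region,
            config=Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 1}),
        )
        try:
            return client.get_item(TableName=TABLE, Key=key)
        except (BotoCoreError, ClientError) as err:
            last_error = err  # note the failure and try the next replica
    raise RuntimeError(f"all replicas failed, last error: {last_error}")

# Example usage: item = get_item_with_failover({"order_id": {"S": "12345"}})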
AWS
Your one stop shop for decentralisation.
It's DNS. No Surprise.
It's always DNS.