News: 1761552914


The perfect AWS storm has blown over, but the climate is only getting worse

(2025/10/27)


Opinion When your cabbie asks you what you do for a living, and you answer "tech journalist," you never get asked about cloud infrastructure in return. Bitcoin, mobile phones, AI, yes. Until last week: "What's this AWS thing, then?" You already knew a lot of people were having a very bad day in Bezosville, but if the news had reached an Edinburgh black cab driver, new adjectives were needed.

As the world reluctantly touched grass, the [1]AWS outage of October 20 made the top of the mainstream news. It beautifully illustrated the success of the cloud concept as it took out banking services, gaming platforms, messaging apps, and [2]cat litter trays. Things got better after a few hours, and the nature of the collapse gradually revealed itself. A DNS failure led to a core database dropping off, leading to a [3]control plane malfunction that [4]broke load balancing.


Why this cascade was both possible and unexpected, and why it took so long to find and fix, is even more interesting. Here's [6]a clue: this kind of event had been predicted by an ex-Amazonian based on their perception that key engineering talent had been fleeing the company for years, removing irreplaceable wisdom built from knowledge and experience. Such a prediction, backed by the observation that AWS techs had to grope their way to the big picture, is compelling.

No similarly compelling answer exists for the final and best question of all: how do you stop this happening again? Building safeguards for this specific chain of events, even the class of such events, is obvious, as is re-engineering the chain of dependencies and contagion.

None of this answers the basic criticism that AWS itself is too complex to analyze for such systems of failure, at least with the resources and tools it has on hand right now. Exactly the same can be said of the systemic cybersecurity failings that power the [7]rolling thunder of ransomware and other eviscerating attacks.


Infrastructure expands to an event horizon where utility can no longer escape the gravitational pull of complexity. It's much cheaper to add more and more functionality than it is to add more and more stability. Eventually things will break. Most things will break in a small way, filling a sysadmin's day with entertainment and mild hypertension. Now and again, big things will break bigly, and you're in the news.


There are so many ways to add resilience to this picture. Edge services can keep cloudy IoT devices going during a central outage. Even better, enough local compute built into the devices will keep them working even if whatever godforsaken subscription revenue model underlies the initial offering can't keep the parent company alive. The same sort of tiered failover can work for all manner of apps, although it gets steadily more expensive the more functionality is maintained.
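To make the tiered failover idea concrete, here is a minimal Python sketch. It assumes a device with three ways of handling a command: a central cloud service, a nearby edge service, and on-device logic. The function names, the `clean_tray` command, and the simulated failures are all hypothetical, invented purely for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical tiers for an IoT device, ordered from richest to most basic.
Tier = Tuple[str, Callable[[str], str]]


def cloud_backend(command: str) -> str:
    # Stand-in for the central cloud service; here it simulates an outage.
    raise ConnectionError("central region unreachable")


def edge_backend(command: str) -> str:
    # Stand-in for a nearby edge service; also simulated as down.
    raise TimeoutError("edge node timed out")


def local_backend(command: str) -> str:
    # On-device logic: fewer features, but no network dependency.
    return f"local:{command}"


def run_with_failover(command: str, tiers: Sequence[Tier]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, result) from the first that works."""
    last_error: Optional[Exception] = None
    for name, handler in tiers:
        try:
            return name, handler(command)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # remember why this tier failed, then degrade
    raise RuntimeError("all tiers failed") from last_error


TIERS = [("cloud", cloud_backend), ("edge", edge_backend), ("local", local_backend)]

if __name__ == "__main__":
    print(run_with_failover("clean_tray", TIERS))
```

The ordering of the list is the design decision: each extra tier is another thing to build and test, which is where the rising cost comes from.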

The same realization is dawning, crudely, in ransomware defense, expressed as making sure your organization can carry on working with pencil and paper. Which it can't, of course, but some minimum level of independent tech will keep an acceptable minimum of life support going.


The consistent failure that prevents designing for resiliency is that while it will save billions during a rare but likely event, it eats away at the bottom line day by day, week by week, quarter by quarter. In that way, it's exactly like insurance. The difference is, capitalism and its symbiont governance have long recognized the enabling safety net that insurance provides. They don't feel that way about infrastructure resilience, certainly not enough to apply the legal and regulatory pressure to ensure its adoption.


It would be easier if, as in aviation, failure of resilience murdered hundreds of people in eye-catching explosions of fiery death, instead of silencing Snapchat for an afternoon. Lack of resilience certainly kills people, but in an invisible, slow, and ambiguous way as it saps resources from critical systems and their supply chains. In an industrial and political environment where anti-regulatory hollering is the primary discourse, even fiery explosions of death wouldn't make much difference.

Which means that when the correction does come, it's going to be a biggie. If our systems break as often and obviously as they do in today's climate, what would they do if things got stickier and our interconnected financial and commercial systems got given a proper nudge by those who do not have our best interests at heart?

Fortunately, resilience can be improved from the bottom up rather than waiting for top-down to happen. As responsible individuals, within departments, at board level, or as industry groups, what-ifs can be wargamed. You know what a power outage means, and what level of power pack, UPS, or backup generator is worth having. If AWS was to go away for weeks instead of hours, what would it look like? What would redundancy look like? Can you afford an experiment or two? Can you afford not to?
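One cheap way to start such a wargame is to write down what depends on what and ask what falls over when one piece disappears. The Python sketch below is an illustration under stated assumptions, not a model of AWS: the first four entries echo the cascade described earlier (DNS, core database, control plane, load balancing), while `checkout_app`, `office_printer`, and the map itself are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: each service lists what it directly relies on.
DEPENDS_ON: Dict[str, List[str]] = {
    "internal_dns": [],
    "core_database": ["internal_dns"],
    "control_plane": ["core_database"],
    "load_balancing": ["control_plane"],
    "checkout_app": ["load_balancing"],
    "office_printer": [],
}


def impacted_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents: Dict[str, List[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed component.
    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


if __name__ == "__main__":
    for component in DEPENDS_ON:
        print(component, "->", sorted(impacted_by(component, DEPENDS_ON)))
```

The hard part is the map, not the code: a real one is only as good as the institutional knowledge that goes into it.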

If you've never had this sort of conversation about any or all of your core technologies and services, then you're part of the problem. Taking them seriously is the start of the solution, at any level. The alternative is waking up to your world on fire, realizing that when your cabbie asked you about AWS, you should have smelled the smoke. ®




[1] https://www.theregister.com/2025/10/20/amazon_aws_outage/

[2] https://www.theregister.com/2025/10/21/aws_outage_aftermath/

[3] https://www.theregister.com/2025/10/20/aws_outage_chaos/

[4] https://www.theregister.com/2025/10/23/amazon_outage_postmortem/

[5] https://www.theregister.com/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/

[6] https://www.theregister.com/2025/10/20/aws_outage_amazon_brain_drain_corey_quinn/

[7] https://www.theregister.com/2025/10/22/jaguar_lander_rover_cost/


[12] https://www.theregister.com/2025/10/20/copilot_viva_insights_column/

[13] https://www.theregister.com/2025/10/06/at_last_microsoft_leads_the/

[14] https://www.theregister.com/2025/09/08/vmware_in_court_opinion/

[15] https://www.theregister.com/2024/04/08/cloud_vendor_opinion_column/

[16] https://whitepapers.theregister.com/



Future on a wing and a prayer

Acrimonius

Systems have grown far too complex for anyone with a hand on heart to say the system design, the coding and the testing can be relied upon. More resilience and redundancy just add more complexity that fewer and fewer can grasp or unravel. Buggy software added on top of buggy software. Complexity unfair on coders as they are only human. Will be like headless chickens soon. Cause and effect too deep to analyse. Root cause never found. Fixes become temporary and actually add even more fragility.

Re: Future on a wing and a prayer

cookiecutter

systems are complex because each project is done in isolation.

"we want to do X" - OK get in Accenshite or tati consultancy services & pay them a shit load of money to do it in isolation - then they leave

"we want to do Y" - OK get krappyMG or infoshite & pay them a shit load of money to do it in isolation - then they leave

"we want to do Z" - OK get craptia or shitpro in and pay them a shit load of money to do it in isolation - then they leave

At no point is anyone looking at the big picture. At no point is anyone doing the 20-30 year career overview across the environment and at no point does anyone come down and plan this stuff.

Someone comes in, they don't sit there for 6 months (which is what they SHOULD do), see what is already there, see how it fits together, see where it makes sense to keep things the same and where it makes sense to change things.

At no point does anyone ask "What is it that you ACTUALLY WANT TO DO?" - it's shovels in the ground and let's go so we can justify our high costs & as they actually work it out, charge them more.

There's a reason 99.5% of large projects across the entire world, across every country, across every industry over the last 200 years have failed on cost, deadlines or actual benefit of the program.

Re: Future on a wing and a prayer

The Organ Grinder's Monkey

A wise (and now long retired) man once told me that the space shuttle's computer was the last significant system ever built that had had every possible combination of input & output tested. Increasing complexity made that effectively impossible from then onwards.

it's too late baby, now, it's too late

Bluck Mutter

Launch dates for the big three:

aws - 2006

azure - 2010

google cloud - 2008

Thus these are mature businesses with associated legacy debt.

On top of that they have all moved fast to provision ever new services thus amplifying legacy debt.

And on top of that, they regularly undergo large purges of people, with little logic as to what knowledge the fired employees might have so important IP and institutional knowledge goes out the door.

And on top of all that, on-prem tech staff are passionate about the systems they design, deploy and maintain, they have a sense of pride 'cause they have skin in the game... cloud staff are in many cases an offshore commodity that don't have the mandate nor time to care.

And finally, due to time to market pressures, the old waterfall project delivery that I grew up with is dead, it's all [FR]Agile... don't need to worry about all the nitty gritty ... just get something out the door (an MVP) and we can patch it later but they never do.

Summary: if they didn't account for every potential gotcha upfront as you might in a waterfall deployment then there are close to two decades worth of unseen "stuff" that can trip them up.

Bluck

PS. Do these cloud providers ever do DR tests? Probably not, 'cause it's all too big, so the probability of it turning to shit is huge, so why risk it.

When a butterfly flaps its wings ...

Dr Who

The internet as a thing could be compared to the weather, or the climate. Chaos reigns and there are tipping points everywhere. And to those who insist on saying "that's the cloud for you - on prem only for me", you may as well say the same of the electricity grid, or the road network. Whether we like it or not, it is woven into our lives in a myriad of ways.

it's not the internet....

Bluck Mutter

The internet is the plumbing between the end points (AWS, your PC etc).

Even when it's DNS, it's their internal DNS not the internet's DNS.

Bluck

A resilient edge

Caver_Dave

Back last century I built a system (H/W and S/W) to conduct the front end transactions (transparently multi-protocol) with hundreds of pharmacists ordering drugs from a wholesaler (some ordering many times a day). '486SX with 50 modems talking to a mainframe (it was an early '486SX PC and so that dates it for you). The mainframe was so unreliable that, on a very regular basis, the front end had to transparently buffer all of the transactions until the mainframe appeared again and could be updated with the orders. And because the IT Manager of the wholesaler had recently heard about TSRs, the contract insisted that the front end ran as a TSR! It ran without interruption 24/7 for nearly 10 years until decommissioned. A resilient edge has been a thing for a long time.

been there, done that.

Bluck Mutter

The old "store and forward" messaging system.

Send the message, don't remove it from the queue until you know it reached the end point OK, and then archive it just in case you need to roll forward.

Bluck
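For readers who have not met the pattern, the store-and-forward idea described in this thread can be sketched in a few lines of Python. This is an illustrative in-memory toy, not the system either commenter built: the class names, the `send` callback that returns True on acknowledgement, and the `FlakyEndpoint` stand-in for an unreliable mainframe are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, List


class StoreAndForward:
    """Buffer messages locally and drop them only after a confirmed delivery."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        self._send = send              # must return True only on acknowledgement
        self.pending: Deque[str] = deque()
        self.archive: List[str] = []   # delivered messages kept for roll-forward

    def submit(self, message: str) -> None:
        self.pending.append(message)

    def flush(self) -> int:
        """Deliver in order, stopping at the first failure. Returns the count delivered."""
        delivered = 0
        while self.pending:
            message = self.pending[0]  # peek, do not remove yet
            try:
                acked = self._send(message)
            except OSError:
                acked = False
            if not acked:
                break                  # end point is down; keep everything queued
            self.archive.append(self.pending.popleft())
            delivered += 1
        return delivered


class FlakyEndpoint:
    """Hypothetical stand-in for an unreliable mainframe."""

    def __init__(self) -> None:
        self.up = False
        self.received: List[str] = []

    def send(self, message: str) -> bool:
        if not self.up:
            return False
        self.received.append(message)
        return True


if __name__ == "__main__":
    mainframe = FlakyEndpoint()
    front_end = StoreAndForward(mainframe.send)
    front_end.submit("order-1")
    front_end.submit("order-2")
    print(front_end.flush(), list(front_end.pending))  # mainframe down
    mainframe.up = True
    print(front_end.flush(), mainframe.received)       # mainframe back
```

A real implementation would persist the queue and archive to disk, and because a lost acknowledgement leads to a resend, the receiving end would need to tolerate duplicates.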

Anonymous Coward

It's always going to be a problem, because bean-counters are at the helm. C-Suite are beholden to the shareholders and their next big bonus payout. Getting rid of technical (aka expensive) expertise is always the go-to to make whatever insane profit margins are being demanded. As the technical knowledge pool diminishes, quick hacks to get services back up and running become BAU, and lack of time/investment just ends up leaving those hacks in place (documented or otherwise), just waiting for a time to rear their ugly heads.

Now I'll be the first to admit I am getting old and jaded - but I have seen this play out countless times. I don't see that changing during my career.

Enshittification and lock in

cookiecutter

This was totally predictable and completely avoidable.

The fact that idiots are STILL going on about Cloud First is madness, especially as they put stuff in Oracle Cloud after decades of seeing how Oracle treat their customers.

I was able to get off Facebook easily, something a LOT of people can't seem to do because of the perceived societal lock in.

However, the hyperscalers are using exactly the same methods of enshittification to get idiots into their clouds and now they're locked in. The figure that I heard was that every $1000 you spend going into the cloud will cost you $10,000 getting out.

Stage One: Be good to users - Cloud was cheap, easy to get into and you could move your loads in easily.

Stage Two: Good to Business customers - Cloud made it easy for vendors to sell it, other vendors to create virtualised versions of their products and sell them to customers in the cloud

Stage Three: A Giant Pile of Shit - Where we are now.

We have seen Amazon become more or less useless and they are still making money. Same with Anything Microsoft.

Azure had the Russians and Chinese bouncing around for 6 months before they noticed, Chinese engineers working on DoD machines, a 10 hour SQL outage across the whole of South America.

AWS is well on the way to the kind of nightmare Azure is. OCI will ramp up its costs as soon as various governments have loaded their stuff on there. Google regularly deletes people's data and yet developers are still allowed to put their stuff there!

Genuinely on some pricing, I could have migrated the last project I was on at 1/2 the price if not 1/3 of the price that the project was costing by doing a VMware migration and in 1/3 of the time rather than them sticking it into the cloud, but everyone makes more money for their shareholders sticking it in the cloud & "decision makers" get to look "up to date" being cloud first. The fact that the morons are probably now going on about somehow "Using AI" makes it even more depressing.

Too big to fail

glennsills@gmail.com

The technical problem is daunting but there might be non-technical remediation. Instead of hosting everything on Amazon, Google, and Microsoft cloud platforms, why not encourage lots of smaller cloud platforms with different teams? Currently lawyers of the big three would say that there is plenty of competition, but why not legally define competition as dozens or perhaps hundreds of cloud hosting companies? Each of these new companies would have the same problems as the big three, but one cloud hosting provider having problems would cause a lot fewer problems.

"The porcupine with the sharpest quills gets stuck on a tree more often."