

Google Cloud’s so-called uninterruptible power supplies caused a six-hour interruption

(2025/04/15)


Google has revealed that a recent six-hour outage at one of its cloudy regions was caused by uninterruptible power supplies not doing their job.

The outage commenced on March 29th and caused “degraded service or unavailability” for over 20 Google Cloud services in the us-east5-c zone. The us-east5 region is centered on Columbus, Ohio.

Google’s [1]incident report states that the outage started with “loss of utility power in the affected zone.”


Hyperscalers build to survive that sort of thing with uninterruptible power supplies (UPSes) that are supposed to immediately provide power if the grid goes dead, and keep doing so until diesel-powered generators kick in.
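As a rough illustration of that intended sequence (not Google's actual control logic; the states, checks, and priorities below are assumptions), the failover path can be sketched as a simple selection function:

    # Illustrative sketch of a datacenter power-failover sequence.
    # NOT Google's control logic; states and priorities are assumptions.

    import enum


    class PowerSource(enum.Enum):
        UTILITY = "utility"
        GENERATOR = "generator"
        UPS_BATTERY = "ups_battery"
        NONE = "none"          # the failure mode seen in this incident


    def select_power_source(utility_ok: bool, ups_battery_ok: bool,
                            generator_online: bool) -> PowerSource:
        """Pick the source the racks should be running on."""
        if utility_ok:
            return PowerSource.UTILITY
        if generator_online:
            # Generators take minutes to start; once online they carry the load.
            return PowerSource.GENERATOR
        if ups_battery_ok:
            # Batteries are only meant to bridge the gap until generators start.
            return PowerSource.UPS_BATTERY
        return PowerSource.NONE


    # On March 29 utility power was lost AND the UPS batteries failed, so the
    # racks saw the equivalent of PowerSource.NONE until the UPSes were
    # manually bypassed and generator power could reach the load.
    print(select_power_source(utility_ok=False, ups_battery_ok=False,
                              generator_online=False))   # PowerSource.NONE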



Google’s UPSes, however, suffered a “critical battery failure” and didn’t provide any juice. They also appear to have prevented generator power from reaching Google’s racks, because the incident report states the advertising giant’s engineers had to bypass the UPSes before power became available.

Engineers were alerted to the incident at 12:54 Pacific Time and their efforts saw generators come online at 14:49.
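For the arithmetic: that is roughly one hour and 55 minutes between the alert and generator power, within an outage that ran to about six hours overall:

    from datetime import datetime

    # Times from the incident report (Pacific Time, March 29).
    alerted = datetime(2025, 3, 29, 12, 54)
    generators_online = datetime(2025, 3, 29, 14, 49)

    print(generators_online - alerted)   # 1:55:00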


“The majority of Google Cloud services recovered shortly thereafter,” the incident report states, although “A few services experienced longer restoration times as manual actions were required in some cases to complete full recovery.”

[6]Datacenters near Heathrow seemingly stay up as substation fire closes airport

[7]'Once in a lifetime' IT outage at city council hit datacenter, but no files lost

[8]Oracle outage hits US Federal health records systems

[9]Microsoft blames Outlook's wobbly weekend on 'problematic code change'

Google is terribly sorry this happened and “committed to preventing a repeat of this issue in the future.” To avoid similar messes, the web giant has promised to do the following:

Harden cluster power failure and recovery path to achieve a predictable and faster time-to-serving after power is restored.

Audit systems that did not automatically failover and close any gaps that prevented this function.

Work with our uninterruptible power supply (UPS) vendor to understand and remediate issues in the battery backup system.

Oh, to be a fly on the wall when Google meets with that UPS vendor.

Hyperscalers promise resilience and mostly succeed, but even their plans can sometimes go awry. The lesson for the rest of us is that regular testing of all disaster recovery infrastructure and procedures, including what to do when public clouds have outages, is not optional or something that can be put off. ®
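As a crude illustration of what "not optional" testing can look like in practice, here is a minimal sketch of a scheduled readiness check. The endpoints, region names, and thresholds are made-up assumptions, not anyone's production tooling or a particular cloud API:

    # Minimal sketch of a scheduled disaster-recovery readiness check.
    # Endpoints and names are illustrative assumptions only.

    import urllib.request

    ENDPOINTS = {
        "primary (us-east5)": "https://primary.example.com/healthz",
        "standby (us-central1)": "https://standby.example.com/healthz",
    }


    def check(name: str, url: str, timeout: float = 5.0) -> bool:
        """Return True if the endpoint answers with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception as exc:
            print(f"ALERT: {name} failed health check: {exc}")
            return False


    if __name__ == "__main__":
        results = {name: check(name, url) for name, url in ENDPOINTS.items()}
        # A standby that cannot serve is a DR plan that only exists on paper.
        if not results.get("standby (us-central1)", False):
            raise SystemExit("Standby region is not ready to take traffic")

Run on a schedule (cron, CI, or similar), a check like this turns "we have a failover plan" into something that complains loudly the moment the plan stops being true.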




[1] https://status.cloud.google.com/incidents/N3Dw7nbJ7rk7qwrtwh7X





[6] https://www.theregister.com/2025/03/21/heathrow_closure_datacenter_resilience/

[7] https://www.theregister.com/2025/03/19/nottingham_outage_sitrep/

[8] https://www.theregister.com/2025/03/07/oracle_outage_federal_health_records/

[9] https://www.theregister.com/2025/03/03/microsoft_outlook_outage/




You had just one job to do...

Paul Crawford

...and that is keeping power on. I wonder who the UPS vendor is?

From bitter experience: we had some Dell-branded APC 5kVA UPSes and they were useless, often failing when tested and failing hard, with the internal bypass also broken. All five were dead within two years of operation. APC is now owned by Schneider, so I wonder if that crap has appeared elsewhere?

Tubz

APC has gone downhill badly under Schneider. We're currently using Eaton and actually had to tweak its reporting, as it was telling me far more information than was truly a BAU requirement.

cyberdemon

This.

I bought a little Schneider UPS from scamazon, but so far it has had a lower reliability factor than my house's electricity supply (quite an achievement, given my over-sensitive RCD and its propensity to trip if I scorch a naan bread in the toaster). It is currently bypassed because it threw a wobbly for no apparent reason yesterday (continuous tone, light flashing, no way to shut it up without shutting everything down - it was only on about 1/3 load).

Curious as to what counts as "Too much information" from your Eaton UPS though? Was it talking about its piles?

Will Godfrey

Were they swollen?

Unbelievable

Will Godfrey

A battery failure should never take anything else down!

I can remember seeing a bit of military kit in the 1960s (at a government surplus place in Reading) that solved the problem quite simply. There were two identical lead/acid batteries, each with a series fuse and diode, so whichever had the highest voltage delivered the current (although I suspect the battery internal resistances meant they both delivered some current). Recharging/trickle balance was done with separate fuse and current-limiting resistor combos. These days I would have thought a UPS for something as important as a major data centre would do considerably better.
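A toy model of that diode-OR arrangement, with assumed voltages, diode drop, and internal resistances (not the original kit's values), shows why both batteries tend to share current under load:

    # Toy model of two lead-acid batteries feeding one load through a series
    # fuse and diode each ("diode-OR"). All values are illustrative assumptions.

    V1, V2 = 12.8, 12.4        # open-circuit battery voltages (V)
    R1 = R2 = 0.05             # internal resistance per battery (ohms)
    V_DIODE = 0.7              # forward drop of each series diode (V)
    I_LOAD = 10.0              # load current (A)

    # If battery 1 carried the whole load on its own, the bus would sit at:
    bus_alone = V1 - V_DIODE - I_LOAD * R1        # 12.8 - 0.7 - 0.5 = 11.6 V
    # Battery 2's diode starts conducting once the bus falls below:
    threshold_2 = V2 - V_DIODE                    # 12.4 - 0.7 = 11.7 V

    print(f"bus with battery 1 alone: {bus_alone:.1f} V")
    print(f"battery 2 conducts below: {threshold_2:.1f} V")
    # 11.6 V < 11.7 V, so under load both batteries end up sharing current,
    # as suspected above; either one alone can still hold the bus up.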

P.S. There was a guy who worked there who was very informative and helpful to us youngsters. Can't remember his name after all these years :(

Re: Unbelievable

Natalie Gritpants Jr

All well and good till you discover the diodes are backwards

Re: Unbelievable

Will Godfrey

Shirley, nobody would do that

Timop

The UPS was literally uninterrupted during the power outage.

Procurement error

IanRS

They bought the Unavailable Power Supply instead.
