To avoid disaster-recovery disasters, learn from Reg readers' experiences
- Reference: 1743535993
- News link: https://www.theregister.co.uk/2025/04/01/on_call_dr_lessons/
You can find answers in the pages of The Register, specifically our reader-contributed tales of tech support triumph and terror: [1]On-Call and [2]Who, Me?
I was told to make backups, not test them. Why does that make you look so worried? [3]READ MORE
We've carefully reviewed both columns, by hand, plus perused unused submissions in both columns’ inboxes, and distilled them into the following causes of disaster recovery fails:
Whatever backup equipment and media you use has been [4]ignored for years and is now [5]incapable of reading and/or writing data; you therefore lack effective and/or recent backups to restore from. You will discover this after a disaster.
Your users have stored data in places you don’t protect, including [6]tempfile directories and [7]the Windows Recycle Bin. When you cannot restore their data, you will be blamed for their errors.
You cannot restore backups on-site if you [8]can’t access your office . At this point you’re thinking cloud backup can save you …
… but you are wrong because some users [9]think cloud backup is a magic protection halo and will do things to nobble it.
Plans to [10]migrate and/or update systems pay too little attention to data migration, and you may find yourself needing to recover data at the same time you are rebuilding hardware.
Colleagues who think they have mastered bulk data erase commands such as rm -rf have not mastered them, and will [11]delete the wrong directory – or an [12]entire drive at the worst possible moment.
Your network [13]will choke when you need to perform an emergency restore.
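Whether a network will choke during an emergency restore is largely arithmetic that can be done before the disaster. The sketch below is a rough estimate only: the function name, the 70 percent usable-bandwidth figure, and the example sizes are assumptions for illustration, not figures from any of the reader tales.

```python
def estimate_restore_hours(data_gb: float, link_mbps: float,
                           efficiency: float = 0.7) -> float:
    """Estimate the hours needed to pull data_gb gigabytes over a link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable once protocol overhead and other traffic are counted.
    """
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and rates must be positive, efficiency in (0, 1]")
    data_megabits = data_gb * 8 * 1000  # decimal gigabytes to megabits
    seconds = data_megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical case: 2 TB of backups behind a 100 Mbit/s line.
    hours = estimate_restore_hours(data_gb=2000, link_mbps=100)
    print(f"Estimated restore time: {hours:.1f} hours")
```

If the estimate is longer than the business can tolerate, that is worth knowing while the office is still standing.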
The fix for all of the above is developing and observing proper backup and restoration processes, spending whatever it takes on infrastructure, on site and off site, that can securely store data and restore it at speed, then testing everything often and rigorously.
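One way to make "testing everything" concrete is to restore a backup to a scratch location and compare it against checksums recorded at backup time. What follows is a minimal sketch under stated assumptions: the function names are hypothetical, it uses only the Python standard library, and it checks file contents only, not permissions, ownership, or application-level consistency.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Return a list of problems found in a test restore (empty means OK)."""
    restored = build_manifest(restored_root)
    problems = []
    for name, expected in manifest.items():
        actual = restored.get(name)
        if actual is None:
            problems.append(f"missing: {name}")
        elif actual != expected:
            problems.append(f"corrupt: {name}")
    return problems
```

In use, the manifest would be built from the live data when the backup is taken, stored alongside it, and checked with `verify_restore` after every scheduled test restore. An empty list means every file in the manifest came back intact.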
Of course, you knew that already. So do tech giants like Google and Cloudflare – both of which recently [14]lost [15]customer data.
[16]Bizarre backup taught techie to dumb things down for the boss
[17]Sysadmin flees asbestos scare with disk drive, blank pay cheques, angry builders in pursuit
[18]I don't have to save my work, it's in The Cloud. But Microsoft really must fix this files issue
[19]Undergrad thought he had mastered Unix in weeks. Then he discovered rm -rf
[20]The sad tale of the Alpha massacre
Even backup software vendor [21]Veeam recently lost some of its own data.
Those incidents tell us disaster recovery is hard. So hard, in fact, that even rocket scientists can't always get it right: NASA [22]appears to still be restoring data from tape after its [23]November 2024 server room flood destroyed several servers. ®
[1] https://www.theregister.com/Tag/On%20Call/
[2] https://www.theregister.com/Tag/Who%2C%20Me%3F/
[3] https://www.theregister.com/2025/02/07/on_call/
[4] https://www.theregister.com/2020/01/31/on_call/
[5] https://www.theregister.com/2025/02/07/on_call/
[6] https://www.theregister.com/2023/09/01/on_call/
[7] https://www.theregister.com/2023/07/14/on_call/
[8] https://www.theregister.com/2016/10/21/on_call/
[9] https://www.theregister.com/2019/07/12/on_call/
[10] https://www.theregister.com/2024/05/24/on_call/
[11] https://www.theregister.com/2024/11/18/who_me/
[12] https://www.theregister.com/2024/11/11/who_me/
[13] https://www.theregister.com/2024/12/13/on_call/
[14] https://www.theregister.com/2025/03/24/google_maps_timeline_data_loss/
[15] https://www.theregister.com/2024/11/27/cloudflare_logs_data_loss_incident/
[16] https://www.theregister.com/2023/07/14/on_call/
[17] https://www.theregister.com/2016/10/21/on_call/
[18] https://www.theregister.com/2019/07/12/on_call/
[19] https://www.theregister.com/2024/11/18/who_me/
[20] https://www.theregister.com/2024/11/11/who_me/
[21] https://www.theregister.com/2025/02/17/veeam_forums_data_loss/
[22] https://solarweb1.stanford.edu/JSOC_Emergency_Resources.html
[23] https://www.theregister.com/2025/02/07/nasa_solar_mission_data_recovering/
[24] https://whitepapers.theregister.com/
And there's
the oft-hilarious tale of the guy who made a disaster recovery partition on his single hard drive to save his valuable data in case of failure... yeah, we know where this one is going... failing to account for his HDD failing.
"Hello Boris... can you have a look at my computer..."
Where's the "head banging into a wall" icon?
Re: And there's
Yup. Home user wanted help to get his laptop working. I get there and the laptop says cannot find boot drive, and it is making clicking noises (aka, click of death).
Sorry Jim, it's dead. Unless you want to spend $400+ to send the drive off for data recovery.
One copy is no copy, two copies are half
tl;dr: Data is never safe. Don't make fun of those poor sysadmins.
I once had the design of a computer archive explained to me. It stored irreplaceable audio and video recordings: the last recordings of dead languages and vanished peoples.
The primary system was a computer that was always running. Bits not on a live system rot away. Next to it, in the same room, was a copy of that computer. Both mirrored each other.
There were two other such running twin systems in another country, far away from each other. These other systems were live-copied, mirrored.
Parts of the archive were also stored on other continents. And the archivists were worrying about the long-term readability of audio and video formats.
The archivists had secured funding for 50 years for preserving the bit streams (just the bits). And I could sense some anxiety in the room about the future integrity of the data.
Bits die when no one looks at them anymore.
Archiving data, backing up, is hard, very hard.
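The "one copy is no copy, two copies are half" rule has a simple mechanical reading: with only two copies that disagree, nothing says which one rotted. The sketch below illustrates that with a majority vote over file digests. It is an illustration only, not the archive's actual method (which was not described in that detail), and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_divergent_copies(copies: list[Path]) -> list[Path]:
    """Return the copies whose content differs from the majority.

    Raises ValueError when no digest is held by more than half of the
    copies, because then there is no way to tell which copy is good.
    """
    if not copies:
        raise ValueError("no copies given")
    digests = {copy: file_digest(copy) for copy in copies}
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) / 2:
        raise ValueError("no majority: cannot tell which copy is good")
    return [copy for copy, d in digests.items() if d != majority_digest]
```

Run periodically against the same file on each mirror, a check like this is also a way of "looking at" the bits so that rot is noticed while good copies still exist.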
Re: One copy is no copy, two copies are half
I thought three backup tapes were good enough (a long time ago).
Then my Amiga PC erred during a backup. So now I had two good backups, maybe. Luckily it was a file in the disk backup routine I wrote that was cross-linked (or something) during the previous backup. So once I tracked that down and changed my code, all was good.
That is, good after I bought several more tapes.
"You cannot restore backups on-site if you can’t access your office."
If you can't access your office, not being able to restore them may well be the least of your problems. Recovering your organisation from a fire can be interesting.
Bombs too
It always surprised me how BT would take an absolute age to do anything. But when the Houndsditch bomb (the dump truck, not St Mary Axe or Liverpool St, which took out just glass) went off, they could move all our lines from our office around the corner to Cobham in Surrey within 2hrs.
'reviewed ... by hand'?
We've carefully reviewed both columns, by hand, plus perused unused submissions in both columns’ inboxes
Seriously, haven't you heard about AI? Just feed the articles into ChatGPT and ask for a summary.*
But seriously, I was once consulting to a UK government agency whose HQ was located in Bristol. Their plan for their essential system was that in the event of a disaster in Bristol, they would use their backup site in London, 2 hours later. I pointed out that unless they had a Harrier jump jet waiting on the roof** for each member of staff they needed to transfer from Bristol to London, they would never make the trip in 2 hours, let alone get there in time to load up the DR system and get it running in that time.
Moral of the story - it is not just the IT that counts.
*(Then you'll be really f**ked.)
** OK, I might not have said the bit about the Harrier out loud.