News: 1738915209

  ARM Give a man a fire and he's warm for a day, but set fire to him and he's warm for the rest of his life (Terry Pratchett, Jingo)

I was told to make backups, not test them. Why does that make you look so worried?

(2025/02/07)


On Call Each week at work creates memories many are happy to forget, but some are willing to share with fellow Register readers in On Call, our Friday column that tells your tales of tech support.

This week, meet a reader we'll Regomize as "Lionel" who recounted a story from a moment in his career that saw him serve as "senior developer/L2 help desk/the guy taking care of hardware for a mainframe software development team."

Lionel's role sprawled into that range of responsibilities "because almost everybody else on the team thought that Intel servers, PCs, and laptops would rob you of your soul."

But the 80-strong team of mainframe devs Lionel worked with couldn't escape the modern age entirely because the server that backed the code they created was an Intel box.

"Source code was stored on an SMB share on a large Intel server on which a backup was taken daily onto an 8 mm tape drive," Lionel told On Call. "We even kept weekly and monthly backups, just for good measure."

One day, the chap who maintained the backup box – let's call him "Richard" – decided he'd had enough of sullying himself with such chores after having done the job for around two years.

Lionel was given the gig and kicked things off by asking Richard when he last conducted a test restore.

The answer was "never" because Richard had been asked to make backups, not test them. But he assured Lionel that the tapes used for backups could all be found in a cabinet next to the server.

Lionel asked how Richard knew backups were working and was shown a backup software log file that included the line "Backup completed successfully."

Unconvinced that this was proof of successful, recoverable backups, Lionel asked Richard if he had ever verified the backups.

A blank stare ensued.

Lionel therefore tried to restore. The tape drive quickly complained that the tape he inserted was blank.

Then, as now, tape was finicky stuff so Lionel wondered if perhaps the tape drive needed cleaning before it could read the backup. Richard produced a cleaning tape he had inherited from a predecessor and had proudly used every month since … without realizing it was only supposed to be used five times.

And Richard had handled the backups for at least two years.

Lionel ordered fresh cleaning tapes, and watched in horror as a first pass produced a thick brown smear of dirt and left the drive still unable to read tapes. He tried a couple more tapes and eventually spotted one that carried a warning that it could only be used ten times before the vendor could not guarantee readability. Of course, Richard had never replaced the tapes.

After more experimentation, Lionel realized no amount of cleaning would allow the drive to read a tape.

[6]Arrr! Can a sailor's marlinspike fix a busted backplane?

[7]User said he did nothing that explained his dead PC – does a new motherboard count?

[8]Tech support fill-in given no budget, no help, no training, and no empathy for his plight

[9]Devs sent into security panic by 'feature that was helpful … until it wasn't'

By now more than a little worried about the company's data, he bought a new and expensive replacement that thankfully managed to complete a restore.

Of course, Lionel was not thanked for this feat. Instead, his managers asked about the unexpected and substantial spending on the replacement tape drive.

Lionel explained that the company may have gone without backups for a year or two and, to illustrate why, held up one of the tapes he'd tried to coax back to life.

"I held it up against the light. It had been worn so thin that you could see through it."

Which made it impossible for his managers to object to his expenditure.

"This was a profound learning experience that has stayed with me ever since," Lionel told On Call. "Unfortunately, even today, test restores are sadly very uncommon, in my experience. Perhaps because it's called 'Backups' and not 'Restores,' leading people to the mistaken belief that restores are only necessary when everything else fails and their world is ending."

"Fortunately, as long as that belief prevails, skilled and competent technical IT people will never be without a job," he concluded.

What's the longest period of time you've seen bad backups go undetected? And what happened when the time came to restore? Refresh your memory of data recovery romps, then [10]click here to send On Call an email so we can share your story on a future Friday. ®



[6] https://www.theregister.com/2025/01/31/on_call/

[7] https://www.theregister.com/2025/01/24/on_call/

[8] https://www.theregister.com/2025/01/17/on_call/

[9] https://www.theregister.com/2025/01/10/on_call/

[10] mailto:oncall@theregister.com

[11] https://whitepapers.theregister.com/



Ochib

The only verified backup is one that you have restored from

Joe W

The only backup is one that you have successfully restored from.

The things I have seen... also in my current job. Thankfully there is (at my current job) a team dedicated to backups, their implementation, testing etc. Things fall down (oh so hard) when people decide "they know better". Ugh.

The only verified backup is one that you have restored from

Headley_Grange

When I got my first Mac nearly 20 years ago I was amazed at the simplicity of setting up Time Machine backups after doing battle with Windows, Acronis and a NAS. It worked fine and invisibly. It saved my neck a couple of times by allowing me to roll back a couple of days on files that I messed up. Fast forward too many years and when I finally got round to buying a new Mac I selected the option to transfer all my old stuff from a TM backup. It failed because (in those days) the OS was also backed up along with user data. The new machine wouldn't restore from a TM run on an earlier OS. The solution was to update the old machine to the latest OS, but it was too old. That taught me to always have a plain copy backup of important stuff so now, as well as TM I use rsync to just copy my data to a couple of drives, one of which is stashed in the shed.

Re: The only verified backup is one that you have restored from

Pascal Monett

I do my backups on a rewritable DVD and my 4-bay NAS configured in RAID-5.

I haven't yet got to the point of renting a box in the bank to store another copy . . .

Re: The only verified backup is one that you have restored from

elsergiovolador

I do on BluRay. I think it is more durable. Then I bought a plot of land in the middle of nowhere, where I store them underground. All encrypted.

Proper cold storage.

Re: The only verified backup is one that you have restored from

Ishura

Although Time Machine has some cleverness, there's a folder on the drive (I think it's called "Latest") that just contains a plain copy of all your files. You can manually restore any files you need via Finder without needing to do a formal TM restore.

Re: The only verified backup is one that you have restored from

Tony W

This is the difference between a backup and an archive. A backup is for emergencies, it can be restored to the system it was backed up from. An archive can be restored to any current system. As I learnt when I got a new PC with a later version of Windows, can't remember which one but quite a long time ago. I'd backed up my Outlook Express email files with the backup facility provided, thinking the backup would form an archive, but the new version of Outlook Express was different and there was absolutely no way to convert the old backups to the new system. Not surprisingly there were complaints, and MS recommended a solution. This was to keep a PC running the old version of Windows, for as long as you think you might want the old emails.

I presume the programs were licensed from third parties and MS had found a cheaper source, with zero concern for the users. After all, only home users used Outlook Express and they have no clout.

Evil Auditor

But, but... "veeam guarantees the restore!"

werdsmith

I keep telling people that you don't have a recovery process if you haven't rehearsed it and drilled it and revised the docs at least every six months. That is to take into account changes in personnel and changes in infrastructure, software, and business processes.

They are not interested because cost vs risk and it seems that to a manager, having the backups is sufficient to tick the box.

mickaroo

>> The only verified backup is one that you have restored from <<

Back in the day, we ran an application on OS/2 that generated sequences of files with OS/2 long filenames. Our backup department (off in another building) ran backups of our data daily.

One day, we needed to restore some data. The restore was successful with one caveat... all the files restored with DOS 8.3 filenames and were completely unusable.

After faffing around for a couple of weeks trying different things with different backups (same result), one of the ladies in the backup group called and said "There's a checkbox in the top right corner to restore long filenames. Should I try that?"

That's a big "YES"...!!!

Operational vs Disaster Recovery

Anonymous Coward

Aside from never testing the restores, we only ever carry out mandated DR tests. There's less knowledge about how to bring up a single system without conflicting with those around it, unless they're all restarted in the order of the DR run book.

Ah, memories...

Yorick Hunt

About a quarter of a century ago, I installed a painfully expensive tape backup system for a customer and left instructions on best practices, including regular manual trial restorations.

About a year down the track, disaster struck - as luck would have it, when I was interstate with no easy/quick way to get back.

I had a local agent pop out with a replacement drive, but he said there were no valid backups to be found. Quizzing the customer yielded something to the effect of "we started getting tape errors so we removed the tape and the errors stopped."

They hadn't performed a proper backup in months, had only done one trial restore about a week after I set the backup up, and didn't even think to get a replacement tape or simply roll onto the next day's tape - nor even to call me for advice.

No skin off my nose as they were only a casual customer, but it really makes you wonder just which cereal packet some people got their brains from.

Here are the copies

Mishak

A friend who used to work for a long-defunct mainframe company got a callout to help a customer after a disk crashed (in the days when it was a spectacular event).

After replacing the hardware he asked "do you have the backup copies so that I can restore from them?"

He was handed two sheets of paper with "See, I remembered to copy both sides"!

At which point the cause of the crash became apparent - the "operator" had taken out the platter so that it could be (photo)copied...

Re: Here are the copies

Prst. V.Jeltz

C'mon,

I have heard nearly every dumb IT support story going but that is just cartoon level dumb

I've heard of the cleaning lady pulling the plug out

I've heard of the home user ringing the shop only to reveal there's a power cut

I've heard about 5.25 disks stuck to filing cabinets with fridge magnets

I've heard of same disks being hole punched for a ring binder

these are all "user" issues

.

An actual "I.T. professional", a mainframe operator, photocopying a platter??

I've heard those mainframe things were pretty expensive, only to be maintained by people with some idea what they are doing, surely!?

"A friend who used to work ..." is how all urban myth stories start.

Re: Here are the copies

Neil Barnes

I dunno... I've _seen_ photocopies of 8" floppies, back in the day... gentle words were had.

Re: Here are the copies

werdsmith

Many years ago a developer stapled a 5 1/4 inch floppy to a piece of paper and tried to blame an admin lady.

It was suspected that he had done it because he was way behind on his project and wasn't going to meet the deadline.

I remember him blanching when told a specialist company had managed to recover most of the data from the unaffected tracks.

Re: Here are the copies

MrBanana

Our standard method for "gotta ship something, anything, fast" was to open the drive door (QIC tape or 5 1/4 floppy) while writing a few blocks of something plausible to the media. Put media in a jiffy bag, put an obvious footprint on the package and generally make it scuffed up. Post to customer. This would buy you a couple of days extra development/test time.

Re: Here are the copies

munnoch

Upvote for reminding me I still have QIC tapes somewhere and possibly a drive or two... From the days when high speed data transfer meant sticking one of those in a taxi and sending it to the customer.

Hence the quotes round "operator"

Mishak

Said person was one of the office admin staff who had "volunteered" to take on the role.

Been there...

Mentat74

Once had to take over for a colleague working at another location of the company who was sick, only to find out that the "backups" he had been making every day and every week were empty!

Lazlo Woodbine

For many years I worked for a large retailer, we had over 300 stores in the UK.

Each store had a SUN server with a DAT backup unit.

Each store ran a backup to DAT every night. The first job each morning was to pop out last night's tape and slot in the next tape.

Each store was supplied a box of 10 DATs with the shiny new server, one tape each for Mon - Thur, Sat & Sun, and 4 for Fridays.

When I left the company a good 4 years after they'd installed the Sun gear, I'm not aware of any store replacing any of the tapes, even though they were only supposed to be used 10 times each. Apart from the Friday tapes, each of the other tapes had been used at least 200 times.
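The numbers in that comment can be checked with a rough sketch. It assumes exactly 52 weekly cycles a year and that the four Friday tapes took turns evenly; the function name `uses_per_tape` is made up for illustration.

```python
WEEKS_PER_YEAR = 52
RATED_USES = 10  # from the comment: tapes were only supposed to be used 10 times


def uses_per_tape(years: float, tapes_sharing_slot: int) -> float:
    """Uses per tape when one weekly slot is shared evenly by several tapes."""
    return years * WEEKS_PER_YEAR / tapes_sharing_slot


# One tape per weekday slot (Mon-Thur, Sat, Sun): used once every week.
weekday_uses = uses_per_tape(4, 1)   # 208.0
# Four Friday tapes rotate, so each is used once every four weeks.
friday_uses = uses_per_tape(4, 4)    # 52.0

print(weekday_uses / RATED_USES)  # 20.8 times the rated life
print(friday_uses / RATED_USES)   # 5.2 times the rated life
```

On those assumptions even the Friday tapes were well past the rated ten uses after four years.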

The store staff would be completely unaware of the backup status, as the servers had no local monitor, they were only accessed remotely, or by a visiting tech guy.

I remember the system was so nasty, one year the tech support company quadrupled their fee because they wanted out, and the retailer carried on using them, as no other company had tendered for the job.

It takes too long

Mishak

Another friend used to work as a developer for a well-known automotive OEM, with all of their work being stored on a central file server.

All the "important stuff" was backed up on daily basis, which was fortunate as a major failure trashed the content of nearly the whole server.

Hardware was replaced and the restore process initiated - followed immediately by "Error - tape is blank". Not a problem as the tapes from the previous day could be used - which were also empty. As were all the tapes in the rotation for most of the previous year.

On investigation it was found that the backup+verify cycle had grown to the point it was taking over 24 hours, so a "temporary" measure was introduced to drop the "verify". That temporary measure was, of course, forgotten about and became permanent.

At some point the wires to the write head on the tape drive broke, leaving just the erase head operational - leading to a lot of well-erased tapes in the rotation set.

Been there

Anonymous Coward

IT department didn't need to test backups because "we're asked to restore so often".

Come the day of the failure the backup was corrupted, and the previous day, and the previous day. In fact every backup was FUBAR for the previous month.

The backups had been trying to write all data to the first 1 KB of the tape then proudly stating 'backup successful'.

So obviously they started doing test restores after that... no they didn't, don't be stupid. They stuck to "we restore so often we don't need to", backed up eventually by "we've got volume shadow copies anyway" and ending with "365 replication is the backup".

I've left since then.

Once upon a time...

42656e4d203239

I was working for a company doing IT/Network support, including backing stuff up on disparate systems and, yep, testing restores.

Then I was encouraged to leave. So with good grace I did - it was the right decision for everyone.

A few years later I heard that one of the systems I was responsible for backing up had a problem; disk failure or some such. The company I left hadn't made provision to hand over to anyone else (they couldn't, but that is another tale) so the system hadn't had any backups but had kept on trucking... until it didn't, several years down the line.

On hearing of the system's demise and subsequent discovery of "oops! no backups for n>3 years" I smiled sweetly and wondered about reaping what you sow... did they learn from the experience? You would have to ask them; the matter is in the big bucket containing things that are "Not my problem".

Hogbert

Always consider backups to be a quantum phenomenon. Unless you observe it, its state is unknown.

Michael H.F. Wilkinson

When I was doing my PhD research, I had a habit of doing weekly back-ups of my development (MS-DOS) machine in duplicate on a pile of 3.5" floppy disks (talk about a tedious chore). I would then restore one of these on a bigger "production" image processing machine, thus testing and verifying the back-up. After that, the whole shebang was copied to tape. I am not sure if the tape jockey verified anything, but for good measure I took the other back-up home, and restored that on my home machine. Paranoid? Perhaps, but I didn't lose any data during that time.

Christoph

Due to the boss keeping multiple copies of absolutely everything, the backup tape filled up completely and the backup failed. So when the system crashed our latest backup was a month old, and we lost a lot of stuff.

A year later, I asked the hardware guy what our latest backup was. That same tape, now 13 months old.

But the boss saved some money by not fixing the backup system!

a long time

andy the pessimist

I was sharing a SPARC (yes, that long ago) with customers. I said, what about backups? The PM didn't know. I asked for and got a tape streamer. The backups were kept offsite. I can't comment on restores.

Key - what key?

ColinPa

A customer told me that they had had a problem. They were taking backups, and doing a test restore - which worked fine.

The tapes were then sent to the backup site for long term storage.

They had a major problem, and needed to restore from a tape at the backup site. Unfortunately, the data was encrypted on the tapes. The primary site had the key, but the backup site didn't.

There was a quick panic while they found out how they could get the key exported from the primary system, and entered on the backup system.

They didn't know if the encryption was on the tape hardware or in the software.... which added to the confusion. I think it was both.

I had a check list when I went to customers. I added .... when did you last test that the backups can be processed at the remote sites?

Backup to /dev/null

Evil Auditor

I've surely posted this here before...

Early 2000s I asked a client, a small bank, whether they performed restore tests. Same answer: no, we get the daily "backup successful" message. At least, I managed to convince them that restore tests are a rather necessary task. And a couple of weeks later I get a phone call from that client after they found that all of their backup tapes were empty.

While setting up and testing the backup procedure, someone didn't want to wait for an hours-long backup to finish and directed the data stream to /dev/null. And then never changed it to write to the tape.

Anonymous Coward

I have 3 instances.

First one was a backup that I was not allowed to run as root (this is going back to the mid 80's, and for some reason access to root was extremely hard to get). What I ended up with was a Frankenstein of a backup. It would write a header, write my data, then wait for a piece of code to back up a database (written by the DBAs) before writing a trailer. To a tape device, which used a no-rewind device (nst) until the trailer was written, at which point it would rewind the tape. I would check that a trailer file had been written as it signalled to me that the backup had finished. What could go wrong?

Well, the DBA decided to change the tape device to a rewind device in his part of the backup sequence - so all that actually got saved was the trailer! I went to do a restore one day and that's when I found the problem. Now in my defence, I was not a sysadmin. I had no way of testing a full restore (only a couple of ICL Team servers and I was the first in Ops to work on UNIX). It was a hard lesson to learn early on in my career.
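What went wrong there can be shown with a toy model. This is only a sketch: the `Tape` class and `run_backup` function are invented for illustration, and the model assumes that writing at the head position makes anything beyond it unreachable, which is consistent with what the commenter found on restore. Real tape drivers involve filemarks and other details not modelled here.

```python
class Tape:
    """Toy sequential tape: a list of files plus a head position."""

    def __init__(self) -> None:
        self.files: list[str] = []
        self.position = 0

    def write(self, name: str, rewind_on_close: bool) -> None:
        # Writing at the head discards whatever lay at or beyond it.
        del self.files[self.position:]
        self.files.append(name)
        self.position += 1
        # A rewinding device returns the head to the start when closed;
        # a no-rewind device leaves it after the file just written.
        if rewind_on_close:
            self.position = 0


def run_backup(database_step_rewinds: bool) -> list[str]:
    """Run the four-step sequence from the comment and return the tape contents."""
    tape = Tape()
    tape.write("header", rewind_on_close=False)
    tape.write("data", rewind_on_close=False)
    tape.write("database", rewind_on_close=database_step_rewinds)
    tape.write("trailer", rewind_on_close=True)
    return tape.files


print(run_backup(database_step_rewinds=False))  # ['header', 'data', 'database', 'trailer']
print(run_backup(database_step_rewinds=True))   # ['trailer']
```

In both runs a trailer gets written, which is why checking for the trailer alone could not reveal the problem.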

The second was a few years ago. There was a request to restore a very large database (nothing I dealt with this time - there was an entire offshore backup team that managed backups). Trouble is, it had been failing for months and no-one said a thing. There were some very worried people running around, and I believe some serious words were had.

Third one was when I was a support engineer for an app back at the turn of the century. Received a panicked call from a customer in the USA. She was complaining that the app would not start. Found out that it was unable to start the database. Then, after questioning her and checking a few things, realised that the database was missing. Turns out she was running out of space and the best course of action was to delete the database. I felt sorry for her (she was nearly in tears), as the last known good backup was the last monthly backup about 2 weeks old. They lost a LOT of work.

KittenHuffer

I have only 2 instances.

One I was directly involved in, where the HD in a PDP11-73 decided to turn up its toes between completing the backup and verifying the backup. DEC were called to replace the drive and get us to the point of attempting the restore. The monthly full system restore went fine ....... then the unverified backup from the night of the crash also went fine! No data lost, full restore achieved!

And the one I was only indirectly involved in. A system crashed and it was then discovered that due to 'an unpublished Oracle issue' the last good backup was 6 weeks old. And it was necessary for everyone to scramble to reinput 6 weeks' worth of work. My involvement was that even though my team were scrabbling to help with the problem I was the only member of the team that was not only not involved in the scrabble, but had not actually been told that there was an issue to scrabble about! To this day I believe that this happened because my mangler correctly assessed that I would be the one person to stand up in a meeting and say 'the Emperor has no clothes' concerning the reason that was given that there were no recent backups.

test, test, test

Colin Miller

My mantra on this is "If you're not testing your backups, then you're not doing backups"
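As a minimal sketch of what such a test can look like, the following backs up a directory to a tar archive, restores it into a scratch directory, and compares checksums. The function names and the choice of a gzip-compressed tar file are assumptions for illustration, not anything a commenter described; a real setup would restore from the actual backup medium.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, archive: Path) -> None:
    """Write everything under source into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_matches_source(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it with source."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return checksums(Path(scratch)) == checksums(source)
```

Calling `backup(src, archive)` and then `restore_matches_source(src, archive)` should return `True` for a healthy archive. An archive with nothing in it, the file equivalent of a blank tape, restores without error but fails the comparison, which is exactly the case a "backup successful" log line cannot catch.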

Backing up the Internet

Anonymous Coward

I spent a couple of years working for a decent sized boarding school.

Much of this time coincided with covid lockdowns, so we spent a lot of the lockdowns updating systems, installing new servers etc.

One day, the director of IT popped his head around the door and mentioned in passing that the Governors had been looking at disaster recovery and decided they didn't trust OneDrive not to lose data.

This prompted one of the Governors to offer up space in a data warehouse he had shares in, which was nice of him.

A colleague wrote some code and set in motion the backup of 300TB or so of data our lovely students had amassed, made up of over 90 million files (probably mostly movies, music and game saves), to a bit barn somewhere in Buckinghamshire.

6 weeks later, when I changed job, the backup was still in progress...

No backup existed

ralphh

Decades ago I started a new job and created a new team.

IT supplied drive space and we dutifully wrote source code and saved it to our drive. Two years later IT came to me and explained there had been a drive failure and they discovered that they'd not added my department's drive space to the backup schedule.

Being head of department I'd had my own backup schedule with off-site backups. Lost nothing. Had most of the QA department's data too.

Anonymous Coward

AC for reputation

Worked for a Bank in Live Support on call, was called into a Sev 2 Major Incident one night, circa 2am. Middleware server had had a newbie project resource make a disaster zone of failed changes and finger ferkups... can I fix it.. it was in such a state I didn't even know where to start, he had no notes, no recollection of what he had changed etc so I did the dutiful thing and asked Unix support to start working out a restore from backup (Tivoli). Restore was approved on the call and so off they went to crack on.. whole file partition was about 30 GB so not huge even back then.

1 hour went, 2 hours went, 3 hours went.. "How long is this going to take?" I ask.. Well... it's like this: as we are on day 29 of our monthly backup process, which is a full on the 1st then incremental, we need to work through each one, but the tape system has an automated tape switching process and the way it works it needs to switch tapes to do each file, which takes 6 mins each time..

17 sodding hours that one restore took.. since then I always ensure I am in full control of every single change I do, if I am changing anything I have full backups myself that I control. I rely on no other team or tech to be able to just put it all back as it was and walk away if necessary. It's saved me a few times when deployment files were not checked before being passed to me to deploy.
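For a sense of scale, here is a back-of-envelope sketch of that restore. It assumes the 6-minute tape switch was the only significant cost, which the comment implies but does not state, and the function names are made up for illustration.

```python
def restore_hours(tape_switches: int, minutes_per_switch: float = 6.0) -> float:
    """Hours spent if every tape switch costs a fixed number of minutes."""
    return tape_switches * minutes_per_switch / 60


def switches_for(hours: float, minutes_per_switch: float = 6.0) -> float:
    """Number of tape switches that would account for a given restore time."""
    return hours * 60 / minutes_per_switch


print(switches_for(17))    # 170.0
print(restore_hours(170))  # 17.0
```

On that assumption, 17 hours corresponds to roughly 170 tape switches, so the mechanics of the tape library rather than the 30 GB of data would have set the restore time.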

QOTD:
If it's too loud, you're too old.