Under-qualified sysadmin crashed Amazon.com for 3 hours with a typo

(2025/07/21)


Who, Me? Welcome again to "Who, Me?" – The Register's Monday column in which readers admit to making mistakes and explain how they managed to keep their careers going afterwards.

This week, meet a reader we'll Regomize as "Ken" who told us that over 20 years ago he scored a job at Amazon.com as a Linux sysadmin, a role for which he admitted he was "completely unqualified."

He previously worked as a Solaris admin, experience that earned him an interview at Amazon. He quickly studied Linux, got the job, and soon found that the Red Hat Enterprise Linux environment in place at the time was very different to Solaris!

Despite his inexperience, Amazon gave Ken the job of upgrading the e-tail giant's tape backup application.

"I spent months planning and testing because with this upgrade configuration files changed and we were required to make new ones and push them out with the update," Ken told Who, Me? "I created those files and did all the necessary tests. Everything appeared to be fine, and the day came when we pushed the button."

For several hours, everything worked as intended. "We sat and watched for several hours after the update, everything worked great, so we patted ourselves on the back, called it a job well done, and went home."

And then at about 7PM, Ken's pager "started going crazy."

Within minutes, Ken joined a conference call in which very, very senior people – including then-CEO Jeff Bezos – wanted to know why all of Amazon.com was down.

"This, many considered, was bad," Ken told Who, Me?

Ken and his colleagues eventually noticed that the primary database for Amazon's bookstore had stopped doing anything, despite the enormous cluster of computers it ran on operating normally.

Ken knew the backup app he built would copy the database's logs to tape, then delete the logs on the servers that hosted the database. Ken checked the backup process, and found it was working just fine.

[6]Junior developer's code worked in tests, destroyed data in production

[7]Yes, I wrote a very expensive bug. In my defense I was only seven years old at the time

[8]Junior sysadmin’s first lines of code set off alarms. His next lot crashed the company

[9]Techie went home rather than fix mistake that caused a massive meltdown

He kept digging and eventually checked the configuration files he had so carefully created... and found a typo that meant the system didn't delete logs after backup.

"This wasn't an issue for many hours, but eventually the partition holding the logs filled up and the database just gave up and started complaining that nobody loved it anymore," he told Who, Me?
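The story does not say what the typo was or what format the configuration files used. As a purely illustrative sketch, assume a simple `key = value` file and a hypothetical option named `delete_after_backup`. A lenient lookup treats a misspelled key as absent and quietly falls back to the default, while a strict check for unknown keys would flag the same file before it is pushed out:

```python
# Hypothetical option names; the real backup application's settings are unknown.
KNOWN_KEYS = {"log_dir", "delete_after_backup"}


def parse_config(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def validate(config):
    """Raise ValueError on unknown keys instead of silently ignoring them."""
    unknown = sorted(set(config) - KNOWN_KEYS)
    if unknown:
        raise ValueError("unknown config keys: " + ", ".join(unknown))
    return config


def should_delete(config):
    """Lenient lookup: a misspelled key falls back to the default (no deletion)."""
    return config.get("delete_after_backup", "no").lower() == "yes"


if __name__ == "__main__":
    typo = parse_config("log_dir = /db/logs\ndelete_after_bakup = yes\n")
    print(should_delete(typo))  # the misspelled key is ignored
    try:
        validate(typo)
    except ValueError as err:
        print(err)
```

Rejecting unknown keys at load time is one common way to turn this kind of silent misconfiguration into an immediate, visible failure.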

After satisfying himself that no log files had been lost, Ken and a database administrator deleted the logs on the cluster and watched as the database came back to life – and so did Amazon.com.

Ken fixed the typo in the configuration file, then went home and spent a restless night pondering the need to find a new job.

"I drove into the office the next morning to see my manager standing outside in the parking lot where I normally parked, which did not seem like a good omen," Ken said.

"I got out of my car and shuffled over to him. He stood in silence for about 15 seconds just giving me a hard look. Suddenly, he got a huge grin on his face, shook my hand, and said, 'Congratulations, you're no longer a virgin.' We walked inside, where everyone razzed me for a long time."

"And that's how I brought down Amazon."

What have you broken with a typo? To share your story, [10]click here to send email to Who, Me? We'd love to tell your tale on some future Monday. ®

[6] https://www.theregister.com/2025/07/14/who_me/

[7] https://www.theregister.com/2025/07/07/who_me/

[8] https://www.theregister.com/2025/06/30/who_me/

[9] https://www.theregister.com/2025/06/23/who_me/

[10] mailto:whome@theregister.com



Logs

Dr Watson

This is why you ALWAYS put /var/log in its own partition. That way when the logs get full it doesn't stop the rest of the system!

Re: Logs

Anonymous Coward

> This is why you ALWAYS put /var/log in its own partition. That way when the logs get full it doesn't stop the rest of the system!

Looks like they were database logs, the most recent of which might be required to complete a transaction. If the RDBMS cannot write an intent entry, it doesn't start the transaction? A full dedicated file system wouldn't necessarily get you any further ahead.

These days I would put transient/volatile files on their own logical volume / file system rather than a physical partition.

Re: Logs

Doctor Syntax

I'm guessing these are the logs that the database engine would use in the event of a database restoration. They would be used to restore any transactions after the last database backup. If the engine ignored the fact that it couldn't write new ones it wouldn't be able to restore the new transactions if the need arose. Better to stop than compromise integrity.

Re: Logs

Pete Sdev

Generally good advice.

However, some applications, when unable to write to a log, will pass the error up the stack and refuse to do anything. In some cases this is justified, e.g. database logs in a cluster.

What's worrying about this story is:

i) No monit sending a warning "hey this partition is 95% full you should take a look"

ii) Inadequate testing of the script before use in production.
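For point i), a minimal sketch of that kind of check, assuming Python's standard `shutil.disk_usage` is acceptable and using the 95% threshold mentioned above. The function names are illustrative, and a real setup would hand the warning to whatever paging or alerting system is in place rather than print it:

```python
import shutil


def usage_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # pseudo-filesystems can report a zero size
        return 0.0
    return 100.0 * usage.used / usage.total


def check_partition(path, warn_at=95.0):
    """Return a warning string if usage is at or above `warn_at`, else None."""
    pct = usage_percent(path)
    if pct >= warn_at:
        return f"WARNING: {path} is {pct:.1f}% full, you should take a look"
    return None


if __name__ == "__main__":
    # "/" is only a placeholder; point this at the partition holding the logs.
    message = check_partition("/")
    if message:
        print(message)
```

Run from cron or a similar scheduler every few minutes, a check like this would have given hours of notice before the log partition filled up.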

"which did not seem like a good omen"

Pascal Monett

I have to admit, if I were responsible for something like that in such a company, my butthole would be so clenched I wouldn't even be able to fart until I was told that no, I wasn't fired.

I'm glad that he escaped that episode unscathed (well, almost).

Anonymous Anti-ANC South African Coward

At least he did not use the venerable and powerful rm -rf * (the stuff legends are made of)

did not use the venerable and powerful rm -rf *

Anonymous Coward

Ironically in this case that would have worked in the log directory.

Re: did not use the venerable and powerful rm -rf *

Doctor Syntax

And left the database in need of an immediate backup.

m4r35n357

The One True Command is: rm -rf /

Get it right!

Korev

> We walked inside, where everyone razzed me for a long time."

They probably thought "Not fu Ken him again"

Big logs

KittenHuffer

Big logs that have not been removed from their storage area would bring me to a halt as well.

In fact I can recall sitting for long periods of time contemplating just this issue.

---------> Mine's the one hung on the back of the toilet door!

Job half done

Pope Popely

Unfortunately, no permanent fix for Amazon.

A Non e-mouse

Kudos to the manager for not firing "Ken" or throwing him under the bus.

tip pc

firing the guy who figured out the issue and resolved it is never a good idea.

KittenHuffer

I've never known that to stop Manglement though!

Sam not the Viking

As his boss, I took the blame for a number of cock-ups that our new graduate trainee introduced: Poor supervision on my part although I was worried about competence.....

I finally lost sympathetic-mode when instead of admitting defeat, he falsified a set of test results. The series of figures recorded didn't compute to the results displayed. It wouldn't have been so bad except that the data should have been recorded automatically and the answer calculated within the spreadsheet. Instead, he hand-typed numbers into the cells.

He expected things to be done for him, so he could pass them on and take the credit. I now realise he was middle-management material. But I/we sacked him. We later found out that the recruitment agency had failed to verify his qualifications but omitted to pass this information on..... We sacked them too.

Anonymous Coward

I had similar issues with a support person. He was likely to fail his probation due to poor timekeeping anyway but he then made an unauthorised configuration change which took down the ERP system database. Had he admitted it he might have had a second chance but he denied it even when presented with the evidence that he was the only admin logged on at the time so he had to go.

It later turned out that the recruitment agency had been economical with the info they'd provided on him but still insisted that we had to pay the full fee as we'd passed the cancellation date specified in the contract. Shortly after we stopped using employment agencies*!

*When I recruited his replacement three out of four candidates put forward turned out to be unsuitable for the role, and I suspect had been on the agency books for a long time. Fortunately the fourth is still with us.

Be truthful

PCScreenOnly

I have found that is the best way.

If you admit it, it can be quicker to find a resolution

People appreciate the truth

If something else hits the fan in the future and you say "It wasn't me" or "I did this within the last x", people are more likely to believe you and can check what you did instead of random guessing of the cause

Strangely enough,

jake

These days it is unqualified Amazon drivers crashing ...
