CrowdStrike blames a test software bug for that giant global mess it made

(2024/07/24)


CrowdStrike has blamed a bug in its own test software for the mass-crash-event it caused last week.

A Wednesday update to its [1]remediation guide added a Preliminary Post Incident Review (PIR) that offers the vendor's view of how it brought down 8.5 million Windows boxes.

The explanation opens by detailing that CrowdStrike's Falcon Sensor ships with "Sensor Content" that defines its capabilities. The software is updated with "Rapid Response Content" that allows it to detect and collect info on new threats.

Sensor Content relies on "Template Types" – code that includes pre-defined fields for threat detection engineers to leverage in Rapid Response Content.

Rapid Response Content is delivered as "Template Instances," which CrowdStrike describes as "instantiations of a given Template Type."

Each Template Instance maps to specific behaviors for the sensor software to observe, detect or prevent.

In February 2024, CrowdStrike introduced a new "InterProcessCommunication (IPC) Template Type" that the vendor designed to detect "novel attack techniques that abuse Named Pipes."

The IPC Template Type passed testing on March 5, so a Template Instance was released to use it.

Three more IPC Template Instances were deployed between April 8 and April 24. All ran without crashing 8.5 million Windows machines – although, as we [6]reported earlier this week, Linux machines had problems with CrowdStrike in April.

On July 19, CrowdStrike introduced two more IPC Template Instances. One included "problematic content data" – but made it into production anyway, because of what CrowdStrike described as "a bug in the Content Validator."

The post doesn't detail the Content Validator's role – we'll assume it's supposed to do what the name suggests.

[8]How did a CrowdStrike config file crash millions of Windows computers? We take a closer look at the code

[9]CrowdStrike CEO summoned to explain epic fail to US Homeland Security committee

[10]Life, interrupted: How CrowdStrike's patch failure is messing up the world

[11]Cybercrooks spell trouble with typosquatting domains amid CrowdStrike crisis

Whatever the Validator does or is supposed to do, it did not prevent the release of the July 19 Template Instance, despite it being a dud. That happened because CrowdStrike assumed that tests that passed the IPC Template Type delivered in March, and subsequent related IPC Template Instances, meant the July 19 release would be OK.

History tells us that was a very bad assumption. It "resulted in an out-of-bounds memory read triggering an exception."

"This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash."

On around 8.5 million machines.
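
CrowdStrike hasn't published the faulty code, but the failure mode it describes – content data driving an unchecked memory read – can be sketched in a few lines. Everything below (the struct, the field names, the idea of instances referencing a type's fields by index) is an assumption for illustration, not the Falcon sensor's actual design:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch only: a "Template Type" defines a fixed set of
// detection fields, and a delivered "Template Instance" refers to them
// by index. None of this is CrowdStrike's actual data layout.
struct TemplateType {
    std::vector<uint32_t> fields;  // pre-defined fields of the type
};

// Unchecked access: if the content data asks for a field the type does
// not define, this reads past the end of the array -- the class of bug
// described as "an out-of-bounds memory read".
uint32_t read_field_unchecked(const TemplateType& t, std::size_t index) {
    return t.fields.data()[index];  // no bounds check: undefined behavior
}

// Checked access: validate the index against the type before reading,
// so "problematic content data" is rejected instead of crashing the host.
bool read_field_checked(const TemplateType& t, std::size_t index,
                        uint32_t& out) {
    if (index >= t.fields.size()) {
        return false;  // bad content: refuse it, don't dereference
    }
    out = t.fields[index];
    return true;
}
```

In user space the unchecked version usually just crashes the process; in a kernel driver it takes the whole machine down, which is why the checked version matters so much here.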

The incident report includes promises to test future Rapid Response Content more rigorously, stagger releases, offer users more control over when to deploy it, and provide release notes.

You read that right: release notes. Be still your beating heart.txt.

The report also includes a pledge to release a full root cause analysis, once CrowdStrike has finished its investigation.

Take all the time you want: some of us are still busy rebuilding machines you broke. ®

[1] https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/

[6] https://www.theregister.com/2024/07/21/crowdstrike_linux_crashes_restoration_tools/

[8] https://www.theregister.com/2024/07/23/crowdstrike_failure_shows_need_for/

[9] https://www.theregister.com/2024/07/23/crowdstrike_ceo_to_testify/

[10] https://www.theregister.com/2024/07/19/life_interrupted_how_crowdstrikes_patch/

[11] https://www.theregister.com/2024/07/23/typosquatting_crowdstrike_crisis/



It worked on my machine!

TReko

A sanity test of actually installing it on a few real machines before deploying it worldwide to 8.5M machines is something that used to be standard QA practice.

Crowdstrike's practices sound like criminal negligence.

Re: It worked on my machine!

Joe Gurman

And someone at the customer sites installing the update on one (1) testbed system each before deploying to every mission-critical production system also used to be standard QA practice. Still is some places™.

Re: It worked on my machine!

142

Many customers *had* configured such a staggered roll-out for their CrowdStrike updates, but CS actively overruled that setting for this release...
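
The staggered roll-out these commenters describe is simple to express. The sketch below is a generic canary policy, not CrowdStrike's actual channel mechanism; the ring structure, sizes, and health check are all invented for illustration:

```cpp
#include <functional>
#include <string>
#include <vector>

// A deployment "ring": a named group of hosts that receives the update
// together. A real system would track individual hosts; this is a sketch.
struct Ring {
    std::string name;
    std::size_t host_count;
};

// Push the update ring by ring, stopping at the first ring whose
// post-update health check fails (e.g. hosts stop phoning home).
// Returns how many hosts ended up with the update installed.
std::size_t staged_rollout(
    const std::vector<Ring>& rings,
    const std::function<bool(const Ring&)>& healthy_after_update) {
    std::size_t updated = 0;
    for (const auto& ring : rings) {
        updated += ring.host_count;       // deploy to this ring
        if (!healthy_after_update(ring))  // verify before going wider
            return updated;               // halt the rollout here
    }
    return updated;
}
```

With rings of, say, 10 canary hosts, 1,000 early adopters, and everyone else, a July 19-style dud stops after the first ring: 10 broken machines instead of 8.5 million.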

Re: It worked on my machine!

Crypto Monad

Maybe Crowdstrike should release their code to their *own* desktops and servers, an hour or so before releasing it to the rest of the world.

Re: It worked on my machine!

Anonymous Coward

Criminal negligence is a good term for what went down. I genuinely hope the justice system sees it that way as well. Either way, tons of businesses lost a ton of money, and if CrowdStrike doesn't cough it up, then, unfortunately and inevitably, the customers will. That is not acceptable.

Re: It worked on my machine!

Anonymous Coward

This also demonstrates that contracts don't really help that much in terms of damage. Even if you can get money back from CrowdStrike, it is unlikely to cover the actual damage caused, and in some cases perhaps your business is already dead anyway.

Let's imagine you're using a cloud service for your core IT and they "do a CrowdStrike" and are down for several days. For many companies that's game over, and compensation from a contract which may or may not be honoured isn't going to help.

Re: It worked on my machine!

gv

The old chestnut "don't put all your eggs in one basket" should always apply to your core IT in all circumstances.

Re: It worked on my machine!

Optimaximal

But given the increasing reliance on SaaS, Cloud & External computing, you can't always rely on others also applying that logic/practice.

Re: It worked on my machine!

bsdnazz

The Crowdstrike software license limits their liability to the software fees paid.

Computers are general purpose devices and can be put to many different uses. No software vendor is going to refund consequential losses while charging a standard software fee unless they can control very specifically what you do with their software and thus the risk they're exposed to.

Re: It worked on my machine!

Anonymous Coward

Here's hoping that the negligence part allows an end run around license agreements that are typically used to avoid coughing up for the kind of brutal cockups we have just seen.

Not holding my breath, to be honest.

Re: It worked on my machine!

Dave K

I would add that deploying a ring 0/kernel level driver that takes input from a regularly updated content file and which does not perform sanity checking on that input file is also criminally negligent.

Even given their dodgy/insufficient testing processes, this whole mess could have been avoided if the driver validated the content file before attempting to execute it...
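
The sanity check Dave K describes – the driver validating the content file before acting on it – looks roughly like this. The header layout, magic number, and entry size are invented for illustration; CrowdStrike has not published its channel-file format:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical content-file header; fields and values are made up.
struct ContentHeader {
    uint32_t magic;        // expected file signature
    uint32_t version;      // content format version
    uint32_t entry_count;  // number of entries that should follow
};

constexpr uint32_t kExpectedMagic = 0xC0FFEE01;  // made-up signature
constexpr std::size_t kEntrySize = 16;           // made-up entry size

// Returns true only if the buffer is big enough to hold a header,
// carries the right signature, and its declared entry count matches
// its actual length -- rejecting truncated or padded files outright.
bool validate_content(const std::vector<uint8_t>& buf) {
    if (buf.size() < sizeof(ContentHeader)) return false;
    ContentHeader h;
    std::memcpy(&h, buf.data(), sizeof(h));
    if (h.magic != kExpectedMagic) return false;
    std::size_t expected =
        sizeof(ContentHeader) + std::size_t(h.entry_count) * kEntrySize;
    return buf.size() == expected;
}
```

A check this cheap, run before the driver ever dereferences the file's contents, is exactly the kind of last line of defence the comment is asking for.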

Pascal Monett

You're supposing the local driver would be able to detect issues better than the testing suite that was written by the same company.

I don't see that happening.

Re: It worked on my machine!

simonlb

Yeah, you would expect the Validator to at least verify that the file it's validating actually contains data which conforms to the format specific to the design of that template, rejecting anything that isn't correct. That's also ignoring the requirement that the service running on the server performs input validation on any file it's ingesting, although that might not be possible due to the way Windows works. Either way, it's a massive fail.

LosD

They didn't test the tester! Or the tester tester! Or the tester tester tester!

Typical *eyeroll*

AndrueC

It does appear to have been a testes up.

Anonymous Coward

Yeah, no excuse for not deploying on test machines or canary channels. They are not talking themselves out of this one. They are quite literally The Man Who Sold The World.

Automation....

hoola

This revelation just sums up the insanity of where we are.

So much is now reliant on software to test stuff with simulations or whatever shite it does that the fundamental concept of actually TESTING something in a live environment has gone.

This is not an oversight or anything like that; it is a disaster that has been waiting to happen. Now it has happened, however, nothing will change because it is a cultural issue. Too many just will not believe that the old-fashioned way of actually installing something to see if it works is ultimately better.

This is because it is considered 'legacy'......

Utter morons.

Re: Automation....

Anonymous Coward

I'd say you can blame Microsoft twice for this.

First for creating an OS and selling it via questionable means that, despite actual terabytes of updates, still by default more closely resembles a colander from a security perspective and thus requires all sorts of shoring up with IT plasters and bandages to keep it together; and next for sacking their testers and making it acceptable to push shockingly shoddy, not-even-beta-quality code out without any apology or shame, and so making it acceptable for other organisations to do the same.

The problem is that it has never had any real consequences for Microsoft (except with us, but we're such a tiny exception it doesn't even register) - they still get paid. As long as that does not change I do not expect any improvement any time soon.

This WILL happen again.

I don't understand this sentence

EBG

it reads evasively

"CrowdStrike assumed that tests that passed the IPC Template Type delivered in March, and subsequent related IPC Template Instances, meant the July 19 release would be OK."

What didn't they test, and why ?

Not sure what language they use

Mishak

But it isn't normally difficult to capture and handle unexpected exceptions:

try {
    instantiateTemplate();
}
catch (...) {
    handleBadThingsThatShouldNotHappen();
}

Making sure that the exception handler can't throw an exception, of course.

PS - anyone know how to stop the html code blocks on here from adding space around newlines and removing leading spaces?

Re: Not sure what language they use

MrBanana

You're thinking about user code written in a high-level language, where there is a safety net in the kernel to catch your screwup and gracefully return an exception. If you're running as a kernel driver you will be writing in low-level C and possibly assembler: no safety net, no exception handlers, just a hard crash. Made even worse by insisting on running at first boot – total borkage.

Re: Not sure what language they use

Jon 37

The Windows kernel is written in C. C itself doesn't have exceptions at all. Although Windows does have an exceptions mechanism that's kludged in there.

But, that's the wrong solution.

In C, you can write to a wild (invalid) pointer, and that might be caught by the OS or might just write to a random bit of RAM. In the kernel, you can corrupt any RAM that way, causing some other part of the system to go wrong (perhaps much later) in an unpredictable way.

So, you absolutely have to write your code correctly so it doesn't try to write to an invalid pointer. This is not optional. If you're doing that, then you can use the same techniques to make sure you don't read from an invalid pointer.

And once you've done that, you don't need to try to catch exceptions from using invalid pointers. And you shouldn't even try, because there is nothing sensible you can do if you catch one.

If you're a C# or Java programmer, then you might not have come across the concept of invalid pointers. One of the big improvements in those languages, is that they ensure that pointers are valid. They don't have raw pointers, instead they wrap them in object references and arrays. That makes this entire class of bugs impossible.

Rust also makes this class of bugs impossible, which is why the Linux kernel is introducing Rust for some parts. (Both Java and C# use a "garbage collector", which does not fit in an existing kernel easily. Rust doesn't, which makes it a better fit for gradually converting parts of an existing kernel to a safer language.)

Re: Not sure what language they use

Jellied Eel

Rust also makes this class of bugs impossible, which is why the Linux kernel is introducing Rust for some parts. (Both Java and C# use a "garbage collector", which does not fit in an existing kernel easily. Rust doesn't, which makes it a better fit for gradually converting parts of an existing kernel to a safer language.)

Alternatively... leave the kernel well alone? This kinda reminds me of a debate from late last century on my degree, in which Z was inflicted on us to learn formal methods and software assurance. But proof in Z meant not much when we then had to hack away in C or assembler and hope the compiler didn't have any bugs on top of the ones we were writing. Which seems to be the problem, ie the kernel is the core of the OS, so if you want all the cruft that's wrapped around it to have a chance of working, it should be left to its own devices. Which I guess has been the problem, ie the pressure to allow hooks into the kernel.

Anonymous Coward

Who tests the tester?

This is a bit silly really. You could have tests for the tester, then tests for the tester tester, then tests for the tester tester tester. This could go on forever. What happened to having a team to test rather than relying on what we now know to be faulty test automation? How can you even have test automation when, as in this instance, the fault was unknown to the test automation so never got flagged as a fault? I think it boils down to the age-old adage of paying money and ways to avoid it. Why have actual testers when we can do it without them, or with fewer, for less? I can even imagine the meeting they had at some point where they talked about test automation and the money they could save by laying off staff. Pats on the back all round, chaps.

Jellied Eel

Who tests the tester?

Richard Feynman wrote some good stuff on testing. Take two teams, one to make it, one to break it. It's one of those things where subconscious biases can affect things. I design something to the best of my ability, work through a bunch of failure scenarios and pass it off to another team, who promptly think of something I didn't think of and break it.

This is a bit silly really. You could have tests for the tester then tests for the tester tester then tests for the tester tester tester.

Yep. But break it down with your trusty Occam's Razor and you get a faulty O-ring. Or in this case, assuming a simulated test was a real test. It sounds like there wasn't actually any pre-deployment test by letting the update loose on a bunch of test environments and seeing what happened. Around 8.5M systems found that out the hard way.

Test, test, then test again.

Admiral Grace Hopper

I'm explaining the value of testing to a bunch of wanna-be junior techies this week and the CrowdStrike shenanigans have been an excellent example.

sitta_europea

Has anybody else noticed that "Safe Mode" presumably means the other mode - the one you normally use on Windows - must be "Unsafe Mode"?

Do you think the techies in 1995 ran the name past marketing first?

Admiral Grace Hopper

The same techies that put the Shut Down command on the Start menu?

Never assume, never promise

Julian Poyntz

Two things I learnt way back in the past that have been so true all these years

break testing

Julian Poyntz

The line "worked in March" is a bit of a bell-ringer. Why think it would still work now?

I also wonder what "break" testing they did. See it so often with testing where things are tested to show they work as expected, so that when something happens that should not (such as a button going missing) it is missed.

However, reading a config file of any type without internal validation is really rather worrying, though I imagine the excuse is "to keep it as small as possible", which would not be the first time to hear that excuse, but looking at the size of numerous files it is complete bollocks.

Secure boot?

Anonymous Coward

The details don't make clear if these data files were signed and validated by the driver in any way - if they're not then surely this could be a secure boot violation, given (clearly by the BSOD encountered) they can cause operations to be executed with kernel privileges...

tallen

Asking for a friend: So just how do you read the release notes before the auto-update tanks your machine?

Testing as a Service

ComputerSays_noAbsolutelyNo

Big customers, who stand to lose much, could send a small sample of their machine park to DownStrike, so that new updates can be tested prior to the full roll-out.

Evil Marketing guy: That's such a nice operation, you have running. It would be such a shame if some bad update would happen to it.

Ah "the dog ate my homework" excuse...

xyz

As others have noted, nothing beats real machine testing. I tested some code once (not my code) on about 5 machines and it would work, or it fell over, or it would work etc. I nearly got done for sabotage... Turns out the previous testing had been done on vms and what I'd exposed was a showstopper and my execution was cancelled.
