Cloudflare Explains Its Worst Outage Since 2019
- Reference: 0180141637
- News link: https://tech.slashdot.org/story/25/11/19/1447256/cloudflare-explains-its-worst-outage-since-2019
Users attempting to access websites behind Cloudflare's network received error messages. The outage affected multiple services. Turnstile security checks failed to load. The Workers KV storage service returned elevated error rates. Users could not log into Cloudflare's dashboard. Access authentication failed for most customers.
Engineers initially suspected a coordinated attack. The configuration file was automatically regenerated every five minutes. Database servers produced either correct or corrupted files during a gradual system update. Services repeatedly recovered and failed as different versions of the file circulated. Teams stopped generating new files at 14:24 UTC and manually restored a working version. Most traffic resumed by 14:30 UTC. All systems returned to normal at 17:06 UTC.
[1] https://tech.slashdot.org/story/25/11/18/120222/cloudflare-outage-knocks-many-popular-websites-offline
[2] https://blog.cloudflare.com/18-november-2025-outage/
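The summary above describes a producer-side failure: an auto-regenerated file that was sometimes corrupted, flapping through the fleet on a five-minute cadence. Below is a minimal Rust sketch of the usual mitigation, validating a freshly generated file before publishing it and keeping the last known-good version otherwise. All names, the entry-count bound, and the structure are illustrative assumptions, not Cloudflare's actual pipeline.

// Hypothetical generate/validate/publish step, sketching how a bad build can
// be stopped before it reaches the edge. Not Cloudflare's code.
const MAX_ENTRIES: usize = 200;

struct FeatureFile {
    entries: Vec<String>,
}

// Stand-in for the database query that, during the gradual update, sometimes
// produced a correct file and sometimes a corrupted (oversized) one.
fn generate_from_database() -> FeatureFile {
    FeatureFile { entries: vec!["feature".to_string(); 60] }
}

// Reject anything that violates the invariants consumers rely on.
fn validate(file: &FeatureFile) -> Result<(), String> {
    if file.entries.is_empty() {
        return Err("empty feature file".to_string());
    }
    if file.entries.len() > MAX_ENTRIES {
        return Err(format!("{} entries exceeds limit of {}", file.entries.len(), MAX_ENTRIES));
    }
    Ok(())
}

fn publish(file: &FeatureFile) {
    println!("publishing {} entries to the fleet", file.entries.len());
}

fn main() {
    // The real file was regenerated every five minutes; one iteration stands
    // in for that loop here.
    let candidate = generate_from_database();
    match validate(&candidate) {
        Ok(()) => publish(&candidate),
        // Keeping the last known-good version is the step the teams
        // eventually performed by hand at 14:24 UTC.
        Err(e) => eprintln!("keeping previous version: {}", e),
    }
}

Whether a check like this would have caught this particular corruption depends on exactly what was wrong with the file; the point made in the comments below is that both the producer and the consumer can enforce such a limit.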
No QA before production release?? (Score:1, Insightful)
Seriously?
Even most of the shitty, under-resourced startups I was at had a basic dev -> qa -> staging -> production lifecycle in place.
This sort of failure is a result of sheer incompetence, bad systems engineering, and clueless management at all levels.
Outages? Yes, shit happens.
Preventable outages by huge critical infrastructure company in their key systems? Clown show.
Re: (Score:2)
who said that flow (dev -> qa -> staging -> prod) didn't happen?
how many people have had bugs show up in prod that never appeared in the earlier stages? how many have had problems surface only hours after deploying to prod?
you can't always test every condition: dev and qa may not have the volume of users/access/data to really reproduce a problem, staging may read the prod DB but never trigger the conditions, maybe they're intermittent, require a special corner case, etc.
in this case, a prod DB change is ALWAYS harder
Wait wait wait.... (Score:3)
Is this real?!?
Like, really really for real?
For once, IT WASN'T DNS!!!!!!!!!!!!!!!!!!
NOT DNS! (Score:2)
I'm also shocked!
Built In Limit? (Score:2)
> The software had a built-in limit of 200 bot detection features. The enlarged file contained more than 200 entries. The software crashed when it encountered the unexpected file size.
A built in limit is:
if (rule_count > 200) {
    log_urgent("rule count exceeded")
    break
} else {
    rule_count++
    process_rule()
}
This sounds like it did not have a built-in limit but rather walked off the end of an array or something when the count went over 200.
Re: (Score:1)
Indeed. Space-limit, hard-placed default-deny or something. In any case something placed incompetently and then not tested for. Amateurs.
Re: (Score:2)
can i see your code, to see what assumptions you made in the early stages and never bothered to go back and fix?! :D
Re: (Score:3)
They explain it and you can see their code toward the end of the linked blog post.
> Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features.
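For readers who don't follow the link, here is a minimal Rust sketch of the situation the quote describes and the commenters argue about: a buffer preallocated up to a hard limit, and the difference between treating an over-limit file as a recoverable error versus letting it kill the process. The names and structure are hypothetical, not Cloudflare's actual code.

// Illustrative only: a bounded, preallocated feature buffer and two ways an
// over-limit file can be handled.
const FEATURE_LIMIT: usize = 200;

struct Features {
    values: Vec<f32>,
}

impl Features {
    fn new() -> Self {
        // Preallocate up to the limit, as the quote describes.
        Features { values: Vec::with_capacity(FEATURE_LIMIT) }
    }

    // Checked append: exceeding the limit is a recoverable error.
    fn push(&mut self, v: f32) -> Result<(), String> {
        if self.values.len() >= FEATURE_LIMIT {
            return Err(format!("feature limit {} exceeded", FEATURE_LIMIT));
        }
        self.values.push(v);
        Ok(())
    }
}

fn load(raw: &[f32]) -> Features {
    let mut f = Features::new();
    for &v in raw {
        // The failure mode under discussion: unwrapping the error here turns
        // an oversized config file into a process-killing panic.
        //   f.push(v).unwrap();
        // Handling it instead keeps traffic flowing on a truncated feature set.
        if let Err(e) = f.push(v) {
            eprintln!("bot management config rejected: {}", e);
            break;
        }
    }
    f
}

fn main() {
    let oversized = vec![0.1_f32; 300]; // more than 200 entries, like the bad file
    let features = load(&oversized);
    println!("loaded {} of {} features", features.values.len(), oversized.len());
}

Whether Cloudflare's real code failed in exactly this way is for the linked blog post to say; the sketch just shows why "there is a limit" and "the limit is enforced gracefully" are two different properties.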
Re: (Score:2)
That's an explanation, not an excuse. There should have been a limit check, end of. Probably written by some clueless kid just out of college, because any semi-competent dev would have put that check in.
Re: (Score:2)
I had a sneak peek at their code; it was more like: if (rule_count > 200) crash_whole_system() else process_rule()
Re: Cough ... AI ... cough (Score:1)
AI DS?
Engineers initially suspected a coordinated attack (Score:1)
When something goes wrong your mind always jumps to hackers, but most of the time it's your own fault.
Re: (Score:1)
Looks like they modded you down so you got your answer. Yet more proof that Rust is more a cult now than a favoured dev language.
History repeats (Score:2)
Apparently nobody learned anything from the CrowdStrike Falcon crash.
n/a (Score:3)
maybe centralization isn't such a good thing after all?
Re: (Score:3)
In this case centralization isn't a bad idea. Okay, occasionally there is a problem, but when there is, a massive amount of resources gets thrown at it and it gets fixed quickly. Meanwhile their software is updated and constantly tested, so it's more secure and stable than most in-house efforts. It's their full-time job, whereas it's usually just the IT guy's background task when the company manages it themselves.
What matters is that there is still competition, to keep the market working properly, and that such services are properly regulated.
Re:n/a (Score:4, Insightful)
Nope, not seeing it.
No centralization: One site goes down, inconveniences a few people, problem gets fixed a bit more slowly.
Centralization: A quarter of the internet becomes nonfunctional.
Centralization still seems like a really, really bad idea to me. It makes it MUCH harder for the internet to just route around damage.
Re: (Score:2)
i don't care if 1/4 of the internet goes down, i care about my site.
was the CF downtime bigger or smaller than the downtime i'd have on my own side?
can i even replicate their features, to get a similar service?
So while this downtime is always bad, all sites have some downtime... maybe i was lucky, but i was barely affected by this
either way, no, i cannot really replicate the cloudflare solution locally, the costs would be huge: more people, more servers, more knowledge, and it still would not reach the same level. Just the
Re: n/a (Score:2)
this is basically the political debate of communism vs federal republics haha
Re: (Score:2)
It's occasional mass outages for a short time, vs more frequent small outages and security issues.
Don't forget that Cloudflare handles a lot of the security for sites that use it. Not just DDOS protection, but things like user authentication and HTTPS.
Re: (Score:2)
> What matters is that there is still competition, to keep the market working properly, and that such services are properly regulated.
What world are you living in?
Re: (Score:2)
I mean that it matters, not that there actually is good competition. Cloudflare is rather dominant.
Re: (Score:2)
20% of the world's traffic goes through Cloudflare. The other 80% would like a word with you.
Re: n/a (Score:2)
Dunno.
What strikes me as odd is that 20% of ALL sites use cloudflare. Why?
This, in my book, makes them just as big and potentially evil as google/amazon/meta