> 7/18/24 10:20PT - Hello everyone - We have widespread reports of BSODs on windows hosts, occurring on multiple sensor versions. Investigating cause. TA will be published shortly. Pinned thread.
> SCOPE: EU-1, US-1, US-2 and US-GOV-1
> Edit 10:36PT - TA posted: https://supportportal.crowdstrike.com/s/article/Tech-Alert-W...
> Edit 11:27 PM PT:
> Workaround Steps:
> Boot Windows into Safe Mode or the Windows Recovery Environment
> Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
> Locate the file matching “C-00000291*.sys”, and delete it.
> Boot the host normally.
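If you're scripting the published workaround by hand, here is a minimal PowerShell sketch of the delete step. It assumes the OS volume is mounted as C: and BitLocker is already unlocked; in WinRE the drive letter may differ and you may only have cmd available.

    # Delete the bad channel file(s) named in the TA, then reboot normally.
    $dir = 'C:\Windows\System32\drivers\CrowdStrike'
    Get-ChildItem -Path $dir -Filter 'C-00000291*.sys' | ForEach-Object {
        Write-Host "Deleting $($_.FullName)"
        Remove-Item -Path $_.FullName -Force
    }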
Also in Australia
1. Enter drive C:  2. System32 folder  3. Drivers  4. Rename the CrowdStrike folder to something else; doesn't matter what.
A Crowdstrike update being able to blue-screen Windows Desktops and Servers.
Whilst Crowdstrike are going to cop a potentially existence-threatening amount of blame, an application shouldn't be able to do this kind of damage to an operating system. This makes me think that, maybe, Crowdstrike were unlucky enough to have accidentally discovered a bug that affects multiple versions of Windows (i.e. it's a Windows bug, maybe more so than it is a Crowdstrike bug).
There also seems to have been a ball dropped with regard to auto-updating everything. Yes, you've got to keep your infrastructure up to date to prevent security incidents, but is this done in test environments before it's put into production?
Un-audited dependence on an increasingly long chain of third-parties.
All the answers are difficult, time consuming, and therefore expensive, and are only useful in times like now. And if everyone else is down, then there's safety in the crowd. Just point at "them too", and stay the path. This isn't a profitable differentiation. But it should be! (raised fists towards the sky).
Update: 911 is down in Oregon too, no more ambulances at least.
Open the Command Prompt as an administrator. Run the following command to uninstall the current version: shell
sc delete csagent
https://www.nzherald.co.nz/nz/bank-problems-reports-bnz-asb-...
https://www.reddit.com/r/crowdstrike/comments/1e6vmkf/bsod_e...
Dell laptops are affected here, observed after the blue-screen dump. The server is up and running fine.
The temporary workaround leads to a compliance issue.
From India
Thank fuck Netflix runs on Linux. I just hope the full chain from my TV to Netflix is immune...
I'm getting more up to date technical details from the regular media.
This outage looks to be huge.
Did it work for you?
Looks like this is a big deal.
I went into our CrowdStrike policies and disabled auto-update of the sensor. Hopefully this means it doesn't hit everything. Double-check your policies!!!
Edit:
Crowdstrike has an article out on the manual fix:
https://supportportal.crowdstrike.com/s/article/Tech-Alert-W...
- Major banks, media and airlines affected by major IT outage
- Significant disruption to some Microsoft services
- 911 services disrupted in several US states
- Services at London Stock Exchange disrupted
- Sky News is off air
- Flights in Berlin grounded
- Reports the issue relates to problem at global cybersecurity firm Crowdstrike
There exists a workaround but CS does not make it clear whether this means running without protection or not. (The workaround does get the windows boxes unstuck from the boot loop, but they do appear offline in the CS host management console - which of course may have many reasons).
Obviously bugs are inevitable, but why this wasn't progressively rolled out is beyond me.
Edit: took out a bit of snark.
With hearing 911 and other safety critical systems going down, I hope that the worst that comes out of this is a couple delayed flights and a couple missed bank payments.
"711 has been affected by the outage … went in to buy a sandwich and a coffee and they couldn’t even open the till. People who had filled up their cars were getting stuck in the shop because they couldn’t pay."
Can't even take CASH payment without the computer, what a world!
On a related note, this also demonstrates the danger of centralized cloud services. I wish there were more players in this space and that governments would try their very best to prevent consolidation here. Alternatively, I really wish CS did not have this centralized architecture that allows for such failure modes. The software industry should learn from great, age-old engineering design principles. For example, large ships have watertight doors that prevent compartments from flooding in case of a breach. It appears that CS didn't think the current scenario was possible and therefore didn't invest in anything meaningful to prevent this nightmare scenario.
1. Stop putting mission critical systems on Windows, it's not the reliable OS it once was since MS has cut off most of its QA
2. AV solutions are unnecessary if you properly harden your system, AV was needed pre-Vista because Windows was literally running everything as Administrator. AV was never a necessity on UNIX, whatever MS bundles in is usually enough
3. Do not install third party software that runs in kernel mode. This is just a recipe for disaster, no matter how much auditing is done beforehand by the OEM. Linux has taught multiple times that drivers should be developed and included with the OS. Shipping random binaries that rely on a stable ABI may work for printers, not for mission critical software.
I take that to mean that systems can’t even boot. Right?
Can this be fixed over the air?
1) mistakes in kernel-level drivers can and will crash the entire os
2) do not write kernel-level drivers
3) do not write kernel-level drivers
4) do not write kernel-level drivers
5) if you really need a kernel-level driver, do not write it in a memory unsafe language
Boot Windows into Safe Mode or the Windows Recovery Environment (you can do that by holding down the F8 key before the Windows logo flashes on screen). Navigate to the C:\Windows\System32\drivers\CrowdStrike directory. Locate the file matching “C-00000291*.sys”, right-click and rename it to “C-00000291*.renamed”. Boot the host normally.
Certainly feels like it's disproportionately affecting us down under.
Most of the media I found say it’s because “cloud infrastructure”. I am yet to see any major source actually factually report this is caused by a bad patch in Crowdstrike software installed on top of Windows.
Goes to show how little competency there is in journalism nowadays. And it begs the question: how often do they misinterpret and misreport things in other fields?
At least the central flight booking system is up, I guess. Google bought it years ago and it's a mainframe.
Hence why google flights is so tapped in :)
Like run stuff on Linux, windows and freebsd servers, so that you have OS redundancy should an issue affect one in particular (kernel or app).
Just like you want more than a single server handling your traffic, you’d want 2 different base for those servers to avoid impacting them both with an update.
I work in hardware development and such a failure is almost impossible to imagine. It has to work, always. It puzzles me why this isn't the case for software. My SWE colleagues often get mad at us HW guys because we want to see their test coverage for the firmware/drivers etc. The focus is on having something which compiles, pushing the code to production as fast as possible, and then regressing in production. Most HW problems are a result of this. I found it's often better to go over the firmware myself and read it line by line to understand what the code does. It saves so much time from endless debugging sessions later. It pisses off the firmware guys, but hey, you have to break some eggs to make an omelette.
When the entire society and economy are being digitized AND that digitization is controlled and passes through a handful of choke points, it's an invitation to major disaster.
It is risk management 101, never put all your digital eggs in one (or even a few) baskets.
The love affair with oligopoly, cornered markets and power concentration (which creates abnormal returns for a select few) is priming the rest of us for major disasters.
As a rule of thumb there should be at least ten alternatives in any diversified set of critical infrastructure service providers, all of them instantly replaceable / forced to provide interoperability...
Some truths will hit you in the face again and again until you acknowledge the nature of reality.
CrowdStrike in this context is an NT kernel loadable module (a .sys file) which does syscall-level interception and logs them to a separate process on the machine. It can also STOP syscalls from working if they are trying to connect out to other nodes or accessing files they shouldn't be (using some drunk-ass heuristics).
What happened here was they pushed a new kernel driver out to every client without authorization to fix an issue with slowness and latency that was in the previous Falcon sensor product. They have a staging system which is supposed to give clients control over this but they pissed over everyone's staging and rules and just pushed this to production.
This has taken us out and we have 30 people currently doing recovery and DR. Most of our nodes are boot looping with blue screens which in the cloud is not something you can just hit F8 and remove the driver. We have to literally take each node down, attach the disk to a working node, delete the .sys file and bring it up. Either that or bring up a new node entirely from a snapshot.
This is fine but EC2 is rammed with people doing this now so it's taking forever. Storage latency is through the roof.
I fought for months to keep this shit out of production because of this reason. I am now busy but vindicated.
Edit: to all the people moaning about windows, we've had no problems with Windows. This is not a windows issue. This is a third party security vendor shitting in the kernel.
Maybe time to reconsider how solid a ground clouds are.
https://old.reddit.com/r/wallstreetbets/comments/1e6ms9z/cro...
I know, I'm dreaming.
So there wasn't any new kernel driver deployed, the existing kernel driver just doesn't fail gracefully.
"62 minutes could bring your business down"
I guess they could bring all the businesses down much quicker.
edit: link https://www.crowdstrike.com/en-us/#teaser-79minutes-adversar...
Isn't this done as well with automatic updates of end user software or embedded systems and if not, why not?
ba-dum ching!
Running Windows on bare-metal was always obviously very stupid. The consequences of such stupidity are just being felt now.
Premature deployment of Crowdstrike AGI disaster response plan.
(also, great choice of name i must say)
They better pin this on a rogue employee, but even then, force pushing updates shouldn't be in their capability at all! They must guarantee removal of that capability.
Lawsuits should be interesting. They offer(ed?) $1 mil breach insurance to their customers, so if they were to pay only that much per customer this might be compensation north of $10B. But to be honest, wouldn't surprise me if they can pay up without going bankrupt.
The sad situation is, as twitter people were pointing out, IT teams will use this to push back against more agents for a long time to come. But in reality, these agents are very important.
Crowdstrike Falcon alone is probably the single biggest security improvement any company can make and there is hardly any competition. This could have been any security vendor, the impact is so widespread because of how widely used they are, but there is a reason why they are so widely used to begin with.
Oh and just fyi, the mitigation won't leave you unprotected, when you boot normal, the userspace exe's will replace it with a fixed version.
I mean, don't they do canary updates on CrowdStrike too? Every Windows admin has done this for the last 5+ years, test Windows updates on a small number of systems to see if they are stable. Why not do the same for 3rd party software?
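The ring idea doesn't need anything vendor-specific; a toy PowerShell sketch of the loop follows. Invoke-UpdateOn and Test-HostHealthy are hypothetical stand-ins for whatever deployment and health-check tooling you actually have, and the host lists are placeholders.

    # Toy staged rollout: update a ring, let it bake, and only continue if healthy.
    # $canaryHosts / $pilotHosts / $broadHosts are placeholder arrays of hostnames.
    $rings = @(
        @{ Name = 'canary'; Hosts = $canaryHosts; BakeHours = 24 },
        @{ Name = 'pilot';  Hosts = $pilotHosts;  BakeHours = 24 },
        @{ Name = 'broad';  Hosts = $broadHosts;  BakeHours = 0 }
    )
    foreach ($ring in $rings) {
        foreach ($h in $ring.Hosts) { Invoke-UpdateOn -ComputerName $h }      # hypothetical
        Start-Sleep -Seconds ($ring.BakeHours * 3600)
        $unhealthy = $ring.Hosts | Where-Object { -not (Test-HostHealthy -ComputerName $_) }  # hypothetical
        if ($unhealthy) {
            Write-Warning "Ring '$($ring.Name)' has unhealthy hosts: $($unhealthy -join ', '); halting rollout."
            break
        }
    }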
on https://azure.status.microsoft/en-gb/status the message is currently:
> We have been made aware of an issue impacting Virtual Machines running Windows Client and Windows Server, running the CrowdStrike Falcon agent, which may encounter a bug check (BSOD) and get stuck in a restarting state.
"62 minutes could bring your business down"
I guess they could bring all the businesses down much quicker.
https://www.crowdstrike.com/en-us/#teaser-79minutes-adversar...
(Repeating my comment because other story is duped)
This is not a small startup with some SaaS, these guys are in most computers of too many huge companies. Not rolling out the updates to everyone at the same time seems just too obvious
Microsoft needs to take control and forbid anyone and anything from running software with that kind of behavior.
https://www.nasdaq.com/market-activity/stocks/crwd/short-int...
But maybe this kind of thing can actually impart the lesson that loading your OS up with always-on, internet-connected agents that include kernel components in order to instrument every little thing any program does on the system is, uh, kinda risky.
But maybe not. I wonder if we'll just see companies flock to alternative vendors of the exact same type of product.
My org which does mission critical healthcare just deployed ZScaler on every computer which is now in the critical path of every computer starting up and then in the critical path of every network connection the computer makes. The risk of ZScaler being a central point of failure is not considered. But - the risk of failing the compliance checkbox it satisfies is paramount.
All over the place I'm seeing checkbox compliance being prioritised above actual real risks from how the compliance is implemented. Orgs are doing this because they are more scared of failing an audit than they are of the consequences failure of the underlying systems the audits are supposed to be protecting. So we need to hold regulatory bodies accountable as well - when they frame regulation such that organisations are cornered into this they get to be part of the culpability here too.
So many places use the "emergency break glass rollout procedure" on every deploy because it doesn't require all the hassle
Plus, for your critical communication systems, you must have a disaster recovery plan that actually helps you recover quickly in minutes, not hours or days. And you have to exercise this plan regularly.
If you are CrowdStrike, shame on you for not testing your product better. You failed to meet a very low bar. You just shipped a 100% reproducible, widely impactful bug. Your customers must leave you for a more diligent vendor.
And I really hope the leadership teams in every software engineering organization learn a valuable lesson from this – listen to that lone senior engineer in your leadership team who pushes for better craft and operational rigor in your engineering culture; take it seriously - it has real business impact.
https://en.wikipedia.org/wiki/July_2024_global_cyber_outages
/sarcasm
/but is it really?
https://www.crowdstrike.com/resources/reports/total-economic...
This software was utter shit, and broke stuff all over the place. And it installs itself as basically malware into critical paths everywhere. We objected to ever using it as a SPOF, but were overruled.
So yeah, not remotely surprised this happened.
Any kind of middleware/dynamic agent is highly suspect in my experience and to be avoided.
Crowdstrike is very expensive.
Mostly because I lived through Y2K, and every fear about Y2K has just materialised; only the cause was Crowdstrike instead.
I can't imagine the amount of wasted work this will create; not only the loss of operations across many industries, but recovery will be absolute hell with BitLocker. How many corporate users have access to their encryption keys? And when stored centrally, how many of those servers have CrowdStrike running and just got stuck in a boot loop now?
I don't envy the next days/weeks for Windows IT admins of the world...
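On the key question: if your org escrows BitLocker recovery keys to on-prem Active Directory (rather than Entra/Intune), an admin with RSAT can at least pull them in bulk instead of reading 48-digit keys to users one at a time. A hedged sketch, assuming keys were actually backed up to AD; the hostname is a placeholder.

    # Requires the RSAT ActiveDirectory module and rights to read the
    # msFVE-RecoveryInformation objects stored under each computer object.
    Import-Module ActiveDirectory
    $computer = Get-ADComputer -Identity 'SOME-HOSTNAME'   # placeholder hostname
    Get-ADObject -SearchBase $computer.DistinguishedName `
                 -Filter 'objectClass -eq "msFVE-RecoveryInformation"' `
                 -Properties 'msFVE-RecoveryPassword' |
        Select-Object Name, 'msFVE-RecoveryPassword'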
Architecting technical systems is MUCH easier than architecting socio-economic systems. I hope one day all those tech-savvy web3 wannabe revolutionaries will start to do the real job of designing socially working systems, not just technically-barely-working, cryptographically strong hamster-tapping scams.
You don’t need conventional war any more. State actors can just focus on targeting widely deployed “security systems” that will bring down whole economies and bring as much death and financial damage as a missile, while denying any involvement…
Yet the chaos seems to continue. Could it be that this fix can't be rolled out automatically to affected machines because they crash during boot - before the Crowdstrike Updater runs?
They exist solely to tick the box. That’s it. Nobody who pushes for them gives a shit about security or anything that isn’t “our clients / regulators are asking for this box to be ticked”.
The box is the problem. Especially when it’s affecting safety critical and national security systems. The box should not be tickable by such awful, high risk software. The fact that it is reflects poorly on the cybersecurity industry (no news to those on this forum of course, but news to the rest of the world).
I hope the company gets buried into the ground because of it. It’s time regulators take a long hard look at the dangers of these pretend turnkey solutions to compliance and we seriously evaluate whether they follow through on the intent of the specs. (Spoiler: they don’t)
We don't ask customers to switch all systems from Windows to Ubuntu, but to consider moving maybe a third to Ubuntu so they won't sit completely helpless the next time Windows fails spectacularly.
While I see more and more Ubuntu systems, and have recently even spotted Landscape in the wild, I don't think they were as successful as they hoped with that strategy.
That said, maybe there is a silver lining on today's clouds, both WRT Ubuntu and Linux in general, and also WRT IT departments stopping to reconsider some security best practices.
- CS: Have a staging (production-like) environment for proper validation. It looks like CS has one of these but they just skipped it.
- IT Admins: Have controlled roll-outs, instead of doing everything in a single swoop.
- CS: Fuzz test your configuration files (a toy sketch of the idea follows below).
Anything I have missed?
but being in the industry for so long, I don't expect any changes whatsoever; it's either CS or some other tool
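On the fuzzing point, even a dumb mutation loop would have caught a parser that dies on malformed content. A toy sketch; Test-ChannelFile and the sample path are hypothetical stand-ins for whatever actually parses these files.

    # Naive mutation fuzzer: flip random bytes in a sample file and feed each
    # mutant to a (hypothetical) parser, recording anything that blows up.
    $sample = [System.IO.File]::ReadAllBytes('C:\samples\channel-file.bin')   # placeholder sample
    $rand   = [System.Random]::new(42)
    for ($i = 0; $i -lt 1000; $i++) {
        $mutant = [byte[]]$sample.Clone()
        1..16 | ForEach-Object { $mutant[$rand.Next($mutant.Length)] = [byte]$rand.Next(256) }
        $path = "C:\fuzz\mutant_$i.bin"
        [System.IO.File]::WriteAllBytes($path, $mutant)
        try   { Test-ChannelFile -Path $path }                                # hypothetical parser under test
        catch { Write-Warning "Mutant $i broke the parser: $_" }
    }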
In corporate environments, IT staff struggle to contain these issues using antivirus software, firewalls, and proxies. These security measures often slow down PCs significantly, even on recent multi-core systems that should be responsive.
Microsoft is responsible for providing an operating system that is inherently insecure and vulnerable. They have prioritized user lock-in, dark patterns, and ease of use over security.
Apple has done a much better job with macOS in terms of security and performance.
The corporate world is now divided into two categories: 1. Software-savvy companies that run on Linux or BSD variants, occasionally providing macOS to their employees. These include companies like Google, Amazon, Netflix, and many others. 2. Companies that are not software-focused, as it's not their primary business. These organizations are left with Microsoft's offerings, paying for licenses and dealing with slow and insecure software.
The main advantage of Microsoft's products is the Office suite: Excel, Word and Powerpoint but even Word is actually mediocre.
EDIT: improve expression and fix errors:
I mean, there should be extensive automated testing using many different platforms and hardware combinations as a prerequisite for any rollout.
I guess this is what we get when everything is opaque, not only the product and the code, but also the processes involved in maintaining and evolving the solution. They would think twice about not investing heavily in testing their deployment pipelines if everyone could inspect their processes.
It might also be the case that they indeed have a thorough production and testing process deployed to support the maintenance of crowdstrike solutions, but we are only left to wonder and to trust whatever their PR will eventually throw at us, since they are a closed company.
I haven't used windows in years, but from what I read you need to be in safe mode to delete a crowdstrike file in a system directory, but you need some 48 char key to get into safe mode now if it is locked down?
1. CS normally pushes global updates to entire user base simultaneously?
2. This made it through their testing. Not only 'just' QA but likely CS employees internally run a version or two ahead of their customer base?
Just speculation - folks who know either answer can validate or debunk.
> I'm in Australia. All our banks are down and all supermarkets as well so even if you have cash you can't buy anything.
I hope the national security/defense people are looking at this closely. Because you can bet the bad guys are. What's the saying, civilisation is only ever three days away from collapse or something?
I am pretty convinced this is a fuckup not an attack, but if Iran or someone managed something like this, there would be hell to pay.
Edit: got in touch with an admin:
C-00000291-00000000-00000029.sys SHA256 1A30..4B60 is the bad file (timestamp 0409 UTC)
C-00000291-00000000-00000030.sys SHA256 E693..6FAE is the fix (timestamp >= 0527 UTC)
Do not rely on the hashes too much as these might vary from org to org I've read.
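If you want to see which channel file a given box actually ended up with (hashes may differ between orgs, as noted), something like this from an elevated prompt:

    # List the C-00000291* channel files with timestamps and SHA256 hashes.
    Get-ChildItem 'C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys' | ForEach-Object {
        [pscustomobject]@{
            Name         = $_.Name
            LastWriteUtc = $_.LastWriteTimeUtc
            Sha256       = (Get-FileHash -Path $_.FullName -Algorithm SHA256).Hash
        }
    }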
Ex-hackers often talk about security as if it's something you need to add to your systems... Security is achieved through good software development practices and it's about minimalism. You can't take intrinsically crappy, over-engineered, complex software and make it more secure by adding layers upon layer of complex security software on top.
I hope it’s just a bug.
So assuming everyone uses sneaker-net to restart what’s looking like millions of windows boxes, there comes recriminations but then … what?
I think we need to look at minimum viable PC - certain things are protected more than others. Phones are a surprisingly good example - there is a core set of APIs and no fucker is ever allowed to do anything except through those. No matter how painful. At some point MSFT is going to enforce this the way Apple does. The EU court cases be damned.
For most tasks for most things it’s hard to suggest that an OS and a webbrowser are not the maximum needed.
We have been saying it for years - what I think we need is a manifesto for much smaller usable surface areas
I've worked on 4-person software teams that at least followed a basic user-group rolling-release system.
The short version was: we're a civic tech lab, so we have a bunch of different production websites made at different times on different infrastructure. We run Crowdstrike provided by our enterprise. Crowdstrike pushed an update on a Friday evening that was incompatible with up-to-date Debian stable. So we patched Debian as usual, everything was fine for a week, and then all of our servers across multiple websites and cloud hosts simultaneously hard crashed and refused to boot.
When we connected one of the disks to a new machine and checked the logs, Crowdstrike looked like a culprit, so we manually deleted it, the machine booted, tried reinstalling it and the machine immediately crashes again. OK, let's file a support ticket and get an engineer on the line.
Crowdstrike took a day to respond, and then asked for a bunch more proof (beyond the above) that it was their fault. They acknowledged the bug a day later, and weeks later had a root cause analysis that they didn't cover our scenario (Debian stable running version n-1, I think, which is a supported configuration) in their test matrix. In our own post mortem there was no real ability to prevent the same thing from happening again -- "we push software to your machines any time we want, whether or not it's urgent, without testing it" seems to be core to the model, particularly if you're a small IT part of a large enterprise. What they're selling to the enterprise is exactly that they'll do that.
A third-party closed-source Windows kernel driver that can't be audited. It gathers a massive amount of activity data and sends it back to the central server (where it can be sold), as well as executing arbitrary payloads from the central server.
It becomes a single point of failure for your whole system.
If an attacker gains control of the sysadmin's PC, it's over.
If an attacker gains administrator privilege on an EDR-installed system, they run with the same privilege as the EDR, so the attacker can hide their activities from it. There aren't many EDR products in the world where that can be prevented.
I'd like to call it "full trust security model".
I manage a simple Tier-4 cloud application on Azure, involving both Windows and Linux machines. Crowdstrike, OMI, McAfee and endpoint protection in general has been the biggest thorn in my side.
Humanity 1 - Technology 0
Edit: The outage of all ATMs in Morocco was yesterday, not today, so I'm not sure how the two are related.
Like the most useful Canary Island in the Coal Mine.
edit: aha https://news.ycombinator.com/item?id=41005936
They did do this to Linux, but in the past. Maybe whatever they did to deal with it saved Linux this time around
"The most important property of a program is whether it accomplishes the intention of its user."
C.A.R. Hoare
They still won't learn anything from CrowdStrike's mistakes!
Maybe it is time for me to ditch that stock.
We put too much code in kernel simply because it’s considered more elite than other software. It’s just dumb.
Also - if a driver is causing a crash MSFT should boot from the last known-good driver set so the install can be backed out later. Reboot loops are still the standard failure mode in driver development…
https://www.reddit.com/r/ProgrammerHumor/comments/f79iag/don...
I can control and manage my own systems. I do not need nanny state auto updating for me.
Crowdstrike should be held liable for financial losses associated with this nonsense.
Is this specific to only Windows machines “protected” with CS or is this impacting Linux/macOS as well?
I hate lawyers, but this is the reason why companies outsource. Why take the blame (and spend the money) when you can blame the vendor?
Until I see an explanation of how this got past testing, I will assume negligence. I wasn't directly affected, but it seems every single Windows machine running their software in my org was affected. With a hit rate that high I struggle to believe any testing was done.
> A "content update" is how it was described. So, it wasn’t a major refresh of the cyber security software. It could have been something as innocuous as the changing of a font or logo on the software design.
He can't be serious, right? Right?
[1] https://www.bbc.co.uk/news/live/cnk4jdwp49et?post=asset%3Abd...
It is for social security, taxes, unemployment benefits, whatever. And running under a foreign TLD, .ME for Montenegro. I am not a security specialist. But I think this is asking for trouble.
By the way, do you remember when fuck.yu became fuck.me ?
Why would an anti-malware program be allowed to install a driver automatically ... or ever for that matter?
Added: OK, from another post I now know Crowdstrike has some sort of kernel mode that allows this sort of catastrophe on Linux. So I guess there is a bigger question here...
BitLocker is a storage driver, so that code turned into a circular dependency. The attempt to page in the code resulted a call to that not-yet-paged-in code.
The reason I didn't catch it with local testing was because I never tried rebooting with BitLocker enabled on my dev box when I was working on that code. For everyone on the team that did have BitLocker enabled they got the BSOD when they rebooted. Even then the "blast radius" was only the BitLocker team with about 8 devs, since local changes were qualified at the team level before they were merged up the chain.
The controls in place not only protected Windows more generally, but they even protected the majority of the Windows development group. It blows my mind that a kernel driver with the level of proliferation in industry could make it out the door apparently without even the most basic level of qualification.
I saw it was Windows and went to bed. What a great feeling.
I'm sorry to those of you dealing with this. I've had to wipe 1200 computers over a weekend in a past life when a virus got in.
Did I receive any appreciation? Nope. I was literally sleeping under cubicle desks bringing up isolated rows one by one. I switched everything in that call center to Linux after that. Ironically, it turned out it was a senior engineer's SSH key that got leaked somehow and was used to get in and dig around servers in our datacenter outside of my network. My filesystem logging (in Windows, coincidentally) alerted me.
IT is fun.
Or perhaps Microsoft is just garbage and soon will be as irrelevant as commercial real estate office parks and mega-call centers
I used to work at MS and didn’t like their 2:1 test to dev ratio or their 0:1 ratio either and wish they spent more work on verification and improved processes instead of relying on testing - especially their current test in production approach. They got sloppy and this was just a matter of time. And god I hate their forced updates, it’s a huge hole in the threat model, basically letting in children who like to play with matches.
My important stuff is basically air-gapped. There is a gateway but it’ll only accept incoming secure sockets with a pinned certificate and only a predefined in-house protocol on that socket. No other traffic allowed. The thing is designed to gracefully degrade with the idea that it’ll keep working unattended for decades, the software should basically work forever so long as equivalent replacement hardware could be found.
- lifts won't operate.
- can't disarm the building alarms (they've been blaring nonstop...).
- cranes are all locked in standby/return/err.
- laser aligners are all offline.
- lathe hardware runs but the controllers are all down.
- can't email suppliers.
- phones are all down.
- HVAC is also down for some reason (it's getting hot in here).
the police drove by and told us to close up for the day since we don't have 911 either.
alarms for the building are all offline/error so we chained things up as best we could (might drive by a few times today).
we don't know how many orders we have; we don't even know who's on schedule or if we will get paid.
This is just basic IT common sense. You only do updates during a planned outage, after doing an easily reversible backup, or you have two redundant systems in rotation and update and test the spare first. Critical systems connected to things like medical equipment should have no internet connectivity, and need no security updates.
I follow all of this in my own home so a bad update doesn’t ruin my work day… how do big companies with professional IT not know this stuff?
[AWS Health Dashboard](https://health.aws.amazon.com/health/status)
"First, in some cases, a reboot of the instance may allow for the CrowdStrike Falcon agent to be updated to a previously healthy version, resolving the issue.
Second, the following steps can be followed to delete the CrowdStrike Falcon agent file on the affected instance:
1. Create a snapshot of the EBS root volume of the affected instance
2. Create a new EBS volume from the snapshot in the same Availability Zone
3. Launch a new instance in that Availability Zone using a different version of Windows
4. Attach the EBS volume from step (2) to the new instance as a data volume
5. Navigate to the \windows\system32\drivers\CrowdStrike\ folder on the attached volume and delete "C-00000291*.sys"
6. Detach the EBS volume from the new instance
7. Create a snapshot of the detached EBS volume
8. Create an AMI from the snapshot by selecting the same volume type as the affected instance
9. Call replace root volume on the original EC2 Instance specifying the AMI just created"
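If you're scripting that across a fleet, steps 1, 2 and 4 look roughly like the following with the AWS Tools for PowerShell (AWS.Tools.EC2 module). The IDs, AZ and device name are placeholders; step 5 happens on the helper instance once Windows assigns the attached volume a drive letter; and the AMI/replace-root-volume steps (6-9) are left out of this sketch.

    # Rough sketch of AWS steps 1, 2 and 4 (snapshot, new volume, attach).
    $snap = New-EC2Snapshot -VolumeId 'vol-0123456789abcdef0' -Description 'pre-fix backup'   # placeholder volume ID
    # ...wait for the snapshot to complete before creating the volume...
    $vol  = New-EC2Volume -SnapshotId $snap.SnapshotId -AvailabilityZone 'us-east-1a'         # placeholder AZ
    Add-EC2Volume -InstanceId 'i-0fedcba9876543210' -VolumeId $vol.VolumeId -Device 'xvdf'    # placeholder helper instance
    # ...on the helper instance: delete <attached drive>:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys ...
    Dismount-EC2Volume -VolumeId $vol.VolumeId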
Currently waiting in line for 2 hours + waiting for Delta to tell me when my connecting leg can be booked. My current flight is delayed 5 hours.
There's a strong link between the DNC, Hillary, and CrowdStrike. Here's one piece that links a cofounder of CrowdStrike with Hillary pretty far back: https://www.technologyreview.com/innovator/dmitri-alperovitc...
This 2017 piece talks about doubt behind CrowdStrike's analysis of the DNC hack being the result of Russian actors. One of the groups disputing CrowdStrike's analysis was Ukraine's military. https://www.voanews.com/a/crowdstrike-comey-russia-hack-dnc-...
This detailed analysis of CrowdStrike's explanation of the DNC hack goes so far as to say "this sounded made up" https://threatconnect.com/resource/webinar-guccifer-2-0-the-...
The Threat Connect analysis is also discussed here: https://thehill.com/business-a-lobbying/295670-prewritten-gu...
"For one, the vulnerability he claims to have used to hack the NGP VAN ... was not introduced into the code until an update more than three months after Guccifer claims to have entered the DNC system."
Noted at the end of this story, they mention that CrowdStrike installed its software on all of the DNC's systems: https://www.ft.com/content/5eeff6fc-3253-11e6-bda0-04585c31b...
Finally, there's this famous but largely forgotten story of the time Bernie's campaign was accused of accessing Hillary's data: https://www.npr.org/2015/12/18/460273748/bernie-sanders-camp...
"This was a very egregious breach and our data was stolen," Mook said. "We need to be sure that the Sanders campaign no longer has access to our data."
"This bug was a brief, isolated issue, and we are not aware of any previous reports of such data being inappropriately available," the company said in a blog post on its website.
(edited for spelling)
The CEO of Crowdstrike, George Kurtz, was the CTO of McAfee back in 2010 when it sent out a bad update and caused similar issues worldwide.
If at first you don't succeed, .... ;-) j/k
A $75 billion valuation, CNBC analysts praising the company this morning on how well it is run!... When in reality they can't master the most basic phased-deployment methodologies, known for 20 years...
Hundreds of handsomely paid CTOs, at companies with billions of dollars in valuations, critical healthcare, airlines, who can't master the most basic of concepts: "Everything fails all the time"...
This whole industry is depressing....
I will admit we've done pretty well with kernel drivers (and better than I would have ever expected tbh), but given our new security focused environment it seems like now is the time to start pivoting again. The trade offs are worth it IMO.
But it's deeper than that: the industry realizes that, once you get to a certain size, no one can hurt you much. Crowdstrike will not pay a lasting penalty for what has just happened, which means executives will shrug and treat this as a random bolt of lightning.
Too bad ChromeOS seems be on the way out at Google.
https://www.youtube.com/watch?v=NcOb3Dilzjc
Interconnected systems can fail spectacularly in unforeseen ways. Strange that something so obvious is so often dismissed or overlooked.
one simple reason: all eggs in one Microsoft PC basket
why in one Microsoft PC basket?
- most corporate desktop apps are developed for Windows ONLY
Why most corporate desktop apps are developed for Windows ONLY?
- it is cheaper to develop and distribute, since 90% of corporations use Windows PCs (chicken-and-egg problem)
- alternative Mac laptops are 3x more expensive, so corporations can't afford them
- there are no robust industrial-grade Linux laptops from PC vendors (lack of support, fear that Microsoft may penalize them for promoting Linux laptops, etc.)
1/ Most large corporations (airlines, hospitals etc.) can AFFORD & DEMAND that their software vendors provide their 'business desktop applications' in both Windows and Linux versions, and install a mix of both operating systems.
2/ The majority of corporate desktop applications can be web applications (browser-based), removing the dependence on single-vendor Microsoft Windows PCs/laptops
-
Previously;
It appears that someone was able to take my previous comment in this thread completely off hacker news, it's not even listed as flagged. It was at 40pts before disappearing, perhaps there is some reputation management going on here. If it was against the site rules it would be helpful to know which ones.
Edit; the link is https://news.ycombinator.com/item?id=41007985 it was a high up comment that no longer appears even though flagged comments do appear. I checked if it has been moved but the parent comment is still the same. This feels like hellbanned in that there isn't an easy way for me to see if I've been shadowbanned. But I really don't know. I was commenting in good faith.
Either that or Crowdstrike is testing critical software meddling in ring zero so poorly, causing crashes and bootloops out in the wild on 100% of the deployments, that they need to get sued out of existence.
I hope for their sake it's the former.
I do not have access to c:\windows\system32\drivers\crowdstrike folder to delete the corrupted .sys file
I was able to boot on recovery mode with network, after waiting 30 min, I rebooted and BSOD persisted.
Are there other alternatives on how to recover?
Imagine what our IT systems would look like with someone _intentionally_ messing with them.
I am not sure in which one of his talks he briefly mentioned that one of his concerns is that we are basically building a digital Alexandria library, and if it burns, well ...
Even more devastating events like this will happen in the future.
We stand on the shoulders of giants and yet we learned nothing.
What happened to the QA testing, staggered rollouts, feature flags, etc.? It's really this easy to cause a boot loop?
To me, BSOD indicates kernel level errors, which I assume Crowdstrike would be able to cause because it has root access due to being a security application. And because it's boot-looping, there's not a way to automatically push out updates?
Does Linux require Crowdstrike style AV software?
What is the problem they are solving?
What is the difference between what an operating system contains and can do and what you need it to do?
Why would I want to rent a server to run a program that performs a task, and also have the same system performing extra tasks - like intrusion detection, intrusion detection software updates, etc.
I just don't understand why compiled program that has enough disk and memory would ever be asked to restart for a random fucking reason having nothing to do with the task at hand. It seems like the architecture of server software is not created intelligently.
>talked to pres of Crowdstrike. His forthrightnes was refreshing. He said “We got it wrong.”
>They are working with Microsoft to understand why this happened.
Pretty much the message minus even more boilerplate talk.
It's unfortunate, the ambulances are still running in our area of responsibility, but it's highly likely that the hospitals they are delivering patients to are in absolute chaos.
It seems like a chicken and egg problem.
I ran a team that developed a remote agent, and this was my nightmare scenario.
But also I don’t understand why this corporate garbageware is still a thing in 2024 when it adds so little value.
yes CRWD is a shitty company but seems they are a "necessity" by some stupid audit/regulatory board that oversees these industries. But at the end of the day, these CIOs/CTOs are completely fucking clueless as to the exact functions this software does on a regular basis. A few minions might raise an issue but they stupidly ignore them because "rEgUlAtOrY aUdIt rEqUiReS iT!1!"
- unaccountable black boxes
- of questionable, and un-auditable, quality
- requires kernel modules, drivers, LocalSystem, root access, etc.
- updates at random times with no testing
- download these updates from where? and immediately trust and run that code at high privilege. using unaccountable-black-box crypto to secure it.
- all have known patterns of bad performance, bugs, and generally poor quality
all in the name of security. let's buy multiple "solutions" and widely deploy them to protect us from one boogeyman, or at least the shiny advertisements say. while punching all sorts of serious other holes in security. why even look for a Windows ZeroDay when we can look for a McAfee or Crowdstrike zero day?
I left thinking about how anti-anti-fragile our systems have become. Maybe we should force cash operations…
DoD shouldn't have given up on MULTICS. That premature optimization is going to sink the US and the Free World.
Personally, I'm still waiting for Genode to be my daily driver.
Just don't do it. Windows Defender is a thing, it does just fine. For everything else there is least-privilege and group policy.
Any insight from those affected?
This outage represents more than just a temporary disruption in service; it's a black swan célèbre of the perilous state of our current technological landscape. This incident must be seen as an inflection point, a moment where we collectively decide to no longer tolerate the erosion of craftsmanship, excellence, and accountability that I feel we've been seeing all over the place. All over critical places.-
Who are we to make this demand? Most likely technologists, managers, specialists, and concerned citizens with the expertise and insight to recognize the dangers inherent in our increasingly careless approach to ... many things, but, particularly technology. Who is to uphold the standards that ensure the safety, reliability, and integrity of the systems that underpin modern life? Government?
Historically, the call for accountability and excellence is not new. From Socrates to the industrial revolutions, humanity has periodically grappled with the balance between progress and prudence. People have seen - and complained about - life going to hell, downhill, fast, in a hand basket without brakes since at least Socrates.-
Yet, today’s technological failures have unprecedented potential for harm. The CrowdStrike outage killed, halted businesses, and posed serious risks to safety: consequences that were almost unthinkable in previous eras. This isn't merely a technical failure; it’s a societal one, revealing a disregard for foundational principles of quality and responsibility. Craftsmanship. Care and pride in one's work.-
Part of the problem lies in the systemic undervaluation of excellence. In pursuit of speed and profit uber alles. Many companies have forsaken rigorous testing, comprehensive risk assessments, and robust security measures. The very basics of engineering discipline—redundancy, fault tolerance, and continuous improvement—are being sacrificed. This negligence is not just unprofessional; it’s dangerous. As this outage has shown, the repercussions are not confined to the digital realm but spill over into the physical world, affecting real lives. As it always has. But never before have the actions of so few "perennial interns" affected so many.-
This is a clarion call for all of us with the knowledge and passion to stand up and insist on change. Holding companies accountable, beginning with those directly responsible for the most recent failures.-
Yet, it must go beyond punitive measures. We need a cultural shift that re-emphasizes the value of craftsmanship in technology. Educational institutions, professional organizations, and regulatory bodies must collaborate to instill and enforce higher standards. Otherwise, lacking that, we must enforce them ourselves. Even if we only reach ourselves in that commitment.-
Perhaps we need more interdisciplinary dialogue. Technological excellence does not exist in a vacuum. It requires input from ethical philosophers, sociologists, legal experts. Anybody willing and able to think these things through.-
The ramifications of neglecting these responsibilities are clear and severe. The fallout from technological failures can be catastrophic, extending well beyond financial losses to endanger lives and societal stability. We must therefore approach our work with the gravity it deserves, understanding that excellence is not an optional extra but an essential quality sine qua non in certain fields.-
We really need to make this an actual turning point, and not just another Wikipedia page.-
Also, you can mount BitLocker partitions from Linux iirc. If it encounters a BitLocker partition, have it read a text file of possible keys off the USB drive.
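If you end up doing it from WinRE/Safe Mode instead of Linux, the built-in manage-bde can unlock the volume with the 48-digit recovery key before you delete the file. A minimal sketch; drive letter and key are placeholders, and if only cmd is available use del instead of Remove-Item.

    # Unlock the BitLocker-protected OS volume, then remove the bad channel file.
    manage-bde -unlock C: -RecoveryPassword 111111-222222-333333-444444-555555-666666-777777-888888
    Remove-Item 'C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys' -Force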
I can understand the frustration their customers feel. But how could a software company ever bear liability for all the possible damage they can cause with their software? If they built CrowdStrike to space mission standards nobody could afford it.
We were trialing CrowdStrike and about to purchase next week. If their rep doesn't offer us at least half off, we are going with Sentinel One which was half the price of CS already.
The incompetence that allowed this is baffling to me. I assumed with their billions of dollars they'd have tiers of virtual systems to test updates with.
I remember this happening once with Sophos where it gobbled up Windows system files. If you had set to Delete instead of Quarantine, you were toast.
The irony is dawning on me that for much of the recent computing era we've developed defenses against massive endpoint outages (worms, etc.) and one of them is now inadvertently reproducing the exact problem we had mostly eradicated.
As something of a friendly reminder, it was Microsoft this time, but it's a matter of "when" not "if" till every other OS with that flavor of security theatre is similarly afflicted (and it happens much more frequently when you consider the normal consequences of a company owning the device you paid for -- kicked out of email forever, ads intruding into basic system functions, paid-in-full device eventually requires a subscription, ...). Be cautious with automatic updates.
My first encounter with CrowdStrike was overwhelmingly negative. I was wondering why for the last couple weeks my laptop slowed to a crawl for 1-4 hours on most days. In the process list I eventually found CrowdStrike using massive amounts of disk i/o, enough to double my compile times even with a nice SSD. Then they started installing it on servers in prod, I guess because our cloud bill wasn’t high enough.
If I see some news I will update this comment.
1) A key recovery step requires a snapshot to be taken of the disk. The Portal GUI is basically locking up, so scripting is the only way to do this for thousands of VMs. This command is undocumented and has random combinations of strings as inputs that should be enums. Tab-complete doesn't work! See: https://learn.microsoft.com/en-us/powershell/module/az.compu...
E.g.: What are the accepted values for the -CreateOption parameter? Who knows! Good luck using this in a hurry. No stress, just apply it to a production database server at 1 am. (A hedged sketch of the snapshot step follows after the query below.)
2) There has been a long-standing bug where VMs can't have their OS disk swapped out unless the replacement disk matches its properties exactly. For comparison, VMware vSphere has no such restrictions.
3) It's basically impossible to get to the recovery consoles of VMs, especially VMs stuck in reboot loops. The serial console output is buggy, often filled with gibberish, and doesn't scroll back far enough to be useful. Boot diagnostics is an optional feature for "reasons". Etc..
4) It's absurdly difficult to get a flat list of all "down" VMs across many subscriptions or resource groups. Again, compare with VMware vSphere where this is trivial. Instead of a simple portal dashboard / view, you have to write this monstrous Resource Graph query:
Resources
| where type =~ 'microsoft.compute/virtualmachines'
| project subscriptionId, resourceGroup, Id = tolower(id), PowerState = tostring( properties.extended.instanceView.powerState.code)
| join kind=leftouter (
HealthResources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| where tostring(properties.targetResourceType) =~ 'microsoft.compute/virtualmachines'
| project targetResourceId = tolower(tostring(properties.targetResourceId)), AvailabilityState = tostring(properties.availabilityState))
on $left.Id == $right.targetResourceId
| project-away targetResourceId
| where PowerState != 'PowerState/deallocated'
| where AvailabilityState != 'Available'
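For what it's worth, the -CreateOption value you want when snapshotting an existing managed disk is Copy. A hedged sketch of that step for one affected VM with the Az module; resource group, VM and snapshot names are placeholders.

    # Snapshot the OS disk of one affected VM (Az.Compute module).
    $vm  = Get-AzVM -ResourceGroupName 'rg-prod' -Name 'vm-affected-01'
    $cfg = New-AzSnapshotConfig -SourceUri $vm.StorageProfile.OsDisk.ManagedDisk.Id `
                                -Location $vm.Location -CreateOption Copy
    New-AzSnapshot -ResourceGroupName 'rg-prod' -SnapshotName 'vm-affected-01-osdisk-snap' -Snapshot $cfg

Loop that over the output of the Resource Graph query above to cover everything that's down.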
https://github.com/cookiengineer/fix-crowdstrike-bsod
Releases section contains prebuilt binaries, but of course, I always recommend to check the source and then build it yourself.
https://news.ycombinator.com/item?id=41002195&p=2
Broken updates have caused far more havoc than being a few hours or even days late on a so-called critical patch.
https://arstechnica.com/information-technology/2006/10/7998/
I suppose true genius is seldom understood within someone's lifetime.
Windows has pretty good facilities for locking down the system so that ordinary users, even those with local admin rights, cannot run or install unauthorised code. So if nothing can get in, why would the system need checking for viruses?
So why do most companies not lock down their machines?
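Windows does ship the tooling for it, e.g. AppLocker. A minimal sketch that generates publisher/hash rules from what's already installed; review the generated rules (and run in audit mode) before enforcing, and note AppLocker needs a suitable Windows edition plus the Application Identity service.

    # Build an AppLocker policy from executables already under Program Files
    # and merge it into the local policy (run elevated; review before enforcing).
    $info   = Get-AppLockerFileInformation -Directory 'C:\Program Files' -Recurse -FileType Exe
    $policy = $info | New-AppLockerPolicy -RuleType Publisher, Hash -User Everyone -Optimize
    Set-AppLockerPolicy -PolicyObject $policy -Merge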
The faulting driver in the stack trace was csagent.sys.
Now, Crowdstrike has got two mini filter drivers registered with Microsoft (for signing and allocation of altitude).
1) csagent.sys - Altitude 321410. This altitude falls within the range for anti-virus filters.
2) im.sys - Altitude 80680. This altitude falls within the range for access-control drivers.
So, it is clear that the driver causing the crash is their AV driver, csagent.sys.
The workaround that CrowdStrike has given is to delete C-00000291*.sys files from the directory: C:\Windows\System32\Drivers\CrowdStrike\
The files suggested for deletion are not actual driver files (despite the .sys extension) but probably some kind of virus-definition database files.
The reason they name these files with the .sys extension is possibly to leverage the Windows System File Checker tool's ability to restore deleted system files.
This seems to be a workaround and the actual fix might be done in their driver, csagent.sys and the fix will be rolled out later.
Anyone with access to a Falcon endpoint might see a change in the timestamp of the driver csagent.sys when the actual fix rolls out.
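If you want to check this on a Falcon endpoint yourself, the built-in fltmc tool lists loaded minifilters and their altitudes, and the driver timestamp is easy to watch from an elevated PowerShell prompt:

    # csagent should appear with altitude 321410 if it matches the above.
    fltmc filters
    # Watch for the timestamp changing when a fixed csagent.sys actually ships.
    Get-Item 'C:\Windows\System32\drivers\CrowdStrike\csagent.sys' |
        Select-Object Name, LastWriteTimeUtc, Length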
It seems, and I hope, that after all is said and done there is no major life-threatening consequence of this debacle. At the same time, my heart goes out to the dev who pushed the troubling code. It's very easy to point at them or the team's processes, but we need to introspect on our own setups and also recognize that not all of us work on crucial systems like this.
Are there any protections to prevent repeating reboots?
See for example 6000 flights cancelled or the many statements posted here regarding it negatively impacting healthcare and other businesses.
The fanout is a robustness measure on systems. If we can control the fanout we increase reliability. If all it takes is a handful of bits in a 3rd party update to kill IT infrastructure, we are doing it wrong.
This worries me. Does this mean intel has the ability to remotely access my machine?!?!
Security is a great business: you play on people's fears, and your product doesn't have to deliver the goods.
Like the lock maker: you sell a lock, the thief breaks it, but it is not your problem, and you sell a bigger, badder lock the next year, which promptly gets broken.
As a business, you don't face any consequences for how your product works or doesn't work. What a great business to be in!!
https://www.theverge.com/2024/7/20/24202527/crowdstrike-micr...
just sayin'
"Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."
(https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule)
Arthur C. Clarke's third law:
"Any sufficiently advanced technology is indistinguishable from magic."
(https://en.wikipedia.org/wiki/Clarke%27s_three_laws#:~:text=....)
Apparently we now have the following, as well:
"Any sufficiently bad software update is indistinguishable from a cyberattack…"