In just 78 minutes, a faulty update from CrowdStrike caused global chaos, grounding flights, disrupting hospitals, and halting banking services. This incident is a stark reminder of the urgent need for enterprises to bolster their resilience strategies. Read on to learn the essential steps enterprises must take to prepare for future disruptions. For more details, reach out to us.
What happened, and how did it happen?
CrowdStrike pushed a faulty sensor configuration update for Falcon that caused Windows devices to crash; Linux and Mac devices were not affected. The update was pushed on July 19, 2024, at 04:09 UTC, and remediation followed at 05:27 UTC the same day. Those 78 minutes were enough to create waves with major economic and societal impacts. CrowdStrike, like other large software providers, can make kernel-level changes in Windows, and it was a kernel-level change that resulted in the Blue Screen of Death (BSOD) error. The Mac takes a very different approach: Apple revoked kernel access for technology providers in 2020, although that forced many of them to rewrite their software entirely.
Microsoft confirmed in its recent press release that close to 8.5 million Windows devices were impacted, less than 1% of all Windows devices globally. Even so, the severity of the impact cannot be ignored.
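The two headline figures above can be checked with a few lines of arithmetic. The sketch below uses only the timestamps and device count quoted in this article; the implied installed base is a lower bound derived from the "less than 1%" statement, not a number taken from any published source.

```python
from datetime import datetime, timezone

# Timestamps as quoted in this article (UTC).
pushed = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
remediated = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)

# Length of the window during which the faulty update was being served.
exposure_minutes = int((remediated - pushed).total_seconds() // 60)
print(f"Exposure window: {exposure_minutes} minutes")

# About 8.5 million impacted devices, stated to be under 1% of all Windows devices.
impacted_devices = 8_500_000
max_share_percent = 1

# If 8.5 million is less than 1%, the installed base must exceed this value.
implied_min_installed_base = impacted_devices * 100 // max_share_percent
print(f"Implied installed base: more than {implied_min_installed_base:,} devices")
```

The point of the second calculation is scale: even a share below 1% corresponds to millions of machines, many of them in customer-facing roles.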
Impacts of the faulty CrowdStrike update
Some of the most severe impacts were felt by companies that deal directly with end consumers, including:
- Airlines: Thousands of flights were canceled across the globe owing to the system outage on Windows devices. Delta alone reported that the pause in its operations resulted in more than 3,500 canceled Delta and Delta Connection flights through July 20. Airports suffered severely too, with disruptions reported around the world, including in Hong Kong, Sydney, Berlin, and Amsterdam
- Healthcare: Several hospitals across the globe were impacted by the outage, and in some cases non-critical surgeries were canceled. US-based Kaiser Permanente, which runs 16 hospitals and 197 medical offices across Southern California and provides care to 12.6 million members in the United States, said that all of its hospitals were affected and that it activated backup systems to keep caring for patients. In the UK, doctors were unable to access their online booking systems, and non-critical surgeries were reportedly canceled in Germany
- Banks: Multiple banks across the globe saw services disrupted. Leading banks whose services were unavailable included Arvest Bank, Bank of America, Capital One, Charles Schwab, Chase, TD Bank, US Bank, and Wells Fargo. Outages were also reported in Asia; the Reserve Bank of India (RBI) said that 10 Indian banks and non-banking financial companies (NBFCs) experienced minor service disruptions due to the CrowdStrike update
Microsoft called this outage a demonstration of the “interconnected nature of our broad ecosystem.” The outage also raises questions about how software updates are pushed, whether enterprises should trust every update, and what to do when an update fails. In one interview, the Chair of the Federal Trade Commission said, “These incidents reveal how concentration can create fragile systems.”
Typical enterprise challenges that make these incidents more severe
This is not a one-off incident, and it will not be the last. Enterprises face several challenges in managing incidents of this kind; some of the biggest are as follows:
- Lack of agility: Enterprises often struggle to quickly adapt to and mitigate unexpected issues due to rigid processes and slow decision-making
- Complex infrastructure: Diverse and outdated systems increase the difficulty in identifying and resolving issues, prolonging outages
- Gigantic scale: Large enterprises operate vast and interconnected systems, making it challenging to quickly isolate and resolve issues, leading to widespread disruptions
- Limited asset visibility: Inadequate tracking of assets hampers the ability to pinpoint and address affected components swiftly, exacerbating the impact of incidents
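To make the asset visibility point concrete, here is a minimal sketch of the kind of query a usable inventory should answer in seconds during an incident: which machines run the affected operating system and agent, and which business services depend on them. The `Asset` record, the field names, the host names, and the agent label `falcon-sensor` are all hypothetical and chosen only for illustration; a real inventory would come from an asset management or CMDB tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Asset:
    """Hypothetical inventory record; field names are illustrative."""
    hostname: str
    os: str
    agents: Tuple[str, ...]
    business_service: str


def affected_assets(inventory: List[Asset], os_name: str, agent: str) -> List[Asset]:
    """Return assets that run the given OS and have the given agent installed."""
    return [a for a in inventory if a.os == os_name and agent in a.agents]


def services_at_risk(assets: List[Asset]) -> List[str]:
    """Return the distinct business services that depend on the given assets."""
    return sorted({a.business_service for a in assets})


# Illustrative inventory, not real data.
inventory = [
    Asset("checkin-kiosk-01", "windows", ("falcon-sensor",), "airport check-in"),
    Asset("teller-ws-07", "windows", ("falcon-sensor", "backup-agent"), "branch banking"),
    Asset("build-server-02", "linux", ("falcon-sensor",), "internal CI"),
    Asset("design-mac-11", "macos", (), "marketing"),
]

hits = affected_assets(inventory, "windows", "falcon-sensor")
print([a.hostname for a in hits])
print(services_at_risk(hits))
```

If an enterprise cannot produce this list quickly, remediation teams end up discovering affected machines one support ticket at a time.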
What should enterprises do for a long-term fix?
Enterprises must prioritize building business resilience to address black swan events, such as the CrowdStrike update incident or the COVID-19 pandemic. Business resilience is the ability of an enterprise to quickly adapt to disruptions while maintaining continuous operations and safeguarding people, assets, and brand equity. This approach not only ensures long-term sustainability but also provides a competitive advantage, as demonstrated by airlines and banks that remained unaffected.
One of the core pillars of business resilience is cyber resilience, which focuses on dealing with events such as zero-day attacks that can halt a company’s business operations. We have internally developed a cyber resilience framework called 5R, which can help enterprises remain cyber resilient in the face of such black swan events.
A parallel can be drawn for operational resilience, the other important half of business resilience, using the same framework: enterprises can look at the individual 5Rs (Ready, Respond, Recover, Reinforce, and Revamp) from a business perspective. In the case of CrowdStrike’s faulty update specifically, enterprises need to focus on Reinforcing their learnings and leveraging supply chain best practices to minimize the impact of black swan events.
To summarize, here are some key actions enterprises should take for a long-term fix:
- Emphasize innovation in business resilience: While enterprises understand its importance, there has been little innovation in business resilience. Invest in solutions that match advancements in cybersecurity, cloud, and apps
- Focus on cyber resilience: Develop strategies to manage zero-day attacks and other cyber threats, using frameworks like the internally developed 5R framework
- Enhance operational resilience: Ensure continuity during disruptions by adopting best practices and integrating supply chain management to mitigate unexpected impacts
- Foster strategic collaboration: Collaborate closely with service providers to build effective resilience frameworks, moving beyond treating them as mere order-takers
- Establish Objectives and Key Results (OKRs) and Service Level Agreements (SLAs) on business resilience: Implement OKRs and SLAs to measure and ensure business resilience, aligning them with strategic goals for continuous improvement
While talking to some enterprises over the “outage weekend,” we learned that industry leaders are looking to build stronger OKRs around business resilience and tie them to SLAs. Some of the OKRs and corresponding SLAs that we discussed are listed below:
| Objective | Key result | SLA |
|---|---|---|
| Ensure operational continuity | Reduce system downtime by XX% | Maximum allowable downtime of XX hours per month |
| Enhance disaster recovery capabilities | Implement automated backup solutions across all systems | Data backup completed within XX hours of changes |
| Strengthen cybersecurity posture | Decrease security incidents by XX% | Incident response time of less than XX minutes |
| Improve supply chain resilience | Diversify suppliers for key components | XX% of key suppliers with alternative sourcing options |
| Boost employee readiness | Conduct quarterly business resilience training sessions | XX% employee participation in training sessions |
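As an illustration of how the first row could be measured, the sketch below totals a month of outages, checks the total against a maximum-downtime SLA, and computes the downtime reduction against a baseline. The XX placeholders in the table are deliberately left open; the numbers used here (a 2-hour monthly limit and a 240-minute baseline) are assumptions for the example only, and the function names are hypothetical.

```python
from typing import Dict, List


def downtime_reduction_pct(baseline_minutes: float, current_minutes: float) -> float:
    """Percentage reduction in downtime relative to a baseline period."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    return 100.0 * (baseline_minutes - current_minutes) / baseline_minutes


def evaluate_month(
    outage_minutes: List[float],
    max_downtime_hours: float,
    baseline_minutes: float,
) -> Dict[str, object]:
    """Summarize one month of outages against the SLA and the key result."""
    total = sum(outage_minutes)
    return {
        "total_downtime_minutes": total,
        # SLA: total downtime must not exceed the agreed monthly maximum.
        "sla_met": total <= max_downtime_hours * 60,
        # Key result: reduction compared with the baseline month.
        "downtime_reduction_pct": downtime_reduction_pct(baseline_minutes, total),
    }


# Illustrative values only: three outages this month, 240 minutes in the baseline month.
report = evaluate_month([12, 30, 18], max_downtime_hours=2, baseline_minutes=240)
print(report)
```

Keeping the SLA check and the key result in the same report makes it harder for one to be met on paper while the other quietly slips.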
How should enterprises partner with service providers to establish business resilience?
Enterprises should strategically identify and align with key service providers within their ecosystem to enhance business resilience, including preparation for black swan events. Service providers specializing in infrastructure management and cybersecurity services are ideal partners, as these areas are the most crucial to overall business resilience. Opting for one or two partners enhances accountability and effectiveness in resilience efforts. Here are key recommendations for enterprises choosing a strategic partner for business resilience:
- Enhanced protection strategies: Partner with service providers to implement comprehensive protection solutions, including real-time risk detection and response. This collaboration helps safeguard against disruptions, ensuring continuous operations
- Frequent data back-ups and recovery services: Ensure service providers offer automated, regular data backups and quick recovery solutions. This strategy enables swift restoration of operations after data loss or corruption, minimizing downtime
- Better asset visibility: Work with service providers to gain enhanced visibility into digital assets through advanced tools and platforms. Effective monitoring and management of infrastructure allow for quick identification and resolution of potential issues
- Robust supply chain through sandboxing: Encourage service providers to implement sandboxing techniques to test and validate software supply chain updates in a controlled environment. This approach ensures robust and resilient supply chain operations that can adapt to disruptions
- Training employees on business resilience: Collaborate with service providers to conduct regular training sessions for employees on business resilience strategies. This training equips employees with the knowledge and skills needed to handle disruptions and maintain operational continuity
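The sandboxing recommendation above can be sketched as a staged rollout: an update goes to a sandbox group first, then to a small pilot group, and only then to production, halting as soon as a group shows failures. This is a minimal sketch that assumes the enterprise or its service provider can control when an update reaches each group of machines, which may not hold for every product or update type. The ring names, host names, and the `apply_update` and `is_healthy` callbacks are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    rings: Sequence[Tuple[str, Sequence[str]]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Apply an update ring by ring and stop at the first unhealthy ring.

    Returns the names of the rings that completed within the failure threshold.
    """
    completed: List[str] = []
    for ring_name, hosts in rings:
        failures = 0
        for host in hosts:
            apply_update(host)
            if not is_healthy(host):
                failures += 1
        failure_rate = failures / len(hosts) if hosts else 0.0
        if failure_rate > max_failure_rate:
            print(f"Halting rollout: ring '{ring_name}' failure rate {failure_rate:.0%}")
            break
        completed.append(ring_name)
    return completed


# Hypothetical rings, ordered from least to most business-critical.
rings = [
    ("sandbox", ["sandbox-vm-1", "sandbox-vm-2"]),
    ("pilot", ["pilot-ws-1", "pilot-ws-2"]),
    ("production", ["prod-ws-1", "prod-ws-2", "prod-ws-3"]),
]

# Simulate a faulty update: every host that receives it becomes unhealthy.
touched: List[str] = []
completed = staged_rollout(rings, touched.append, lambda host: False)
print(completed)
print(touched)
```

In the simulated faulty update, only the two sandbox machines ever receive the change; the pilot and production groups are never touched. The trade-off is speed: staging delays protection updates, so the acceptable delay per ring is a decision to agree with the provider in advance.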
The recent CrowdStrike update incident underscores the vital need for robust business resilience. To mitigate future disruptions, enterprises should invest in innovative resilience strategies, enhance cybersecurity measures, and collaborate with service providers to ensure continuous operations and safeguard their assets. To learn more about the 5R framework or for questions, reach out to Arjun Chauhan or Kumar Avijit.