Lessons from the CrowdStrike Conundrum and why you need Chaos Engineering

Why IT Departments Must Embrace Chaos Engineering to Safeguard Against Unforeseen Disruptions

Jul 20, 2024

"Security is a process, not a product." — Bruce Schneier

The recent CrowdStrike issue demonstrates that even robust cybersecurity solutions can fail in ways that disrupt global operations. This incident highlights the need for organizations to review their resiliency strategies and implement chaos engineering and resilience testing to ensure their systems can withstand unforeseen disruptions.

The Incident

CrowdStrike encountered a significant issue with its Falcon Sensor software, leading to widespread outages and the notorious 'Blue Screen of Death' (BSOD) across numerous systems. This incident, coupled with a major Microsoft outage, underscores the complexity of the IT ecosystem and the potential for cascading failures.

Microsoft Outage

This week, Microsoft faced an unplanned widespread outage, causing disruptions across multiple services and leaving users worldwide grappling with the BSOD. The root cause was a faulty update to CrowdStrike's Falcon Sensor software, a tool designed to protect systems from cyberattacks. Despite attempts to roll back the update, many machines remained affected.

The outage had a global impact, affecting various platforms, including Microsoft 365, Azure, Amazon Web Services, and even consumer sites like Instagram and eBay. It grounded flights, disrupted live broadcasts, and caused payment processing issues at supermarkets.

Response and Resolution

CrowdStrike's CEO, George Kurtz, provided an update stating that the issue had been identified and isolated, and a fix had been deployed.

Although the issue was resolved, this incident highlights the risks associated with heavy reliance on cloud services and the necessity for robust resiliency strategies.

Chaos Engineering vs. Resilience Testing

Chaos Engineering

Popularized by Netflix, chaos engineering involves deliberately introducing failures into a system to uncover unknown issues in production-like environments. This practice helps identify vulnerabilities and strengthen system resilience against real-world disruptions.

Chaos Experiments: Conduct controlled experiments to simulate failures and observe system responses, as detailed in "Chaos Engineering" by Casey Rosenthal and Nora Jones.
Continuous Testing: Regularly test systems under various failure scenarios to ensure continuous resiliency, echoing the continuous improvement ethos in "The Lean Startup" by Eric Ries.

Resilience Testing

Resilience testing focuses on validating known failure scenarios and the system's ability to recover from them. This ensures that systems can handle anticipated disruptions and maintain operational integrity.

Scenario Validation: Test systems against predefined failure scenarios to ensure they can recover effectively.
Recovery Verification: Ensure that recovery mechanisms function correctly, maintaining system availability and performance.
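
To make recovery verification concrete, here is a minimal sketch in Python. It assumes a hypothetical FlakyService that becomes healthy again a fixed number of ticks after a failure; the class, its methods, and the tick-based clock are illustrative stand-ins, not part of any real tool. The test triggers a known failure and checks that recovery happens within a deadline.

```python
class FlakyService:
    """Hypothetical service that recovers a fixed number of ticks after failing."""

    def __init__(self, recovery_ticks):
        self.recovery_ticks = recovery_ticks
        self.ticks_since_failure = None  # None means the service is healthy

    def fail(self):
        # Simulate the known failure scenario.
        self.ticks_since_failure = 0

    def tick(self):
        # Advance the simulated clock by one step.
        if self.ticks_since_failure is not None:
            self.ticks_since_failure += 1
            if self.ticks_since_failure >= self.recovery_ticks:
                self.ticks_since_failure = None  # recovery complete

    def healthy(self):
        return self.ticks_since_failure is None


def verify_recovery(service, max_ticks):
    """Trigger the failure and return ticks needed to recover, or None if too slow."""
    service.fail()
    for elapsed in range(1, max_ticks + 1):
        service.tick()
        if service.healthy():
            return elapsed
    return None


fast = verify_recovery(FlakyService(recovery_ticks=3), max_ticks=5)
slow = verify_recovery(FlakyService(recovery_ticks=10), max_ticks=5)
print(fast, slow)
```

In a real system the tick loop would likely be replaced by polling a health endpoint against a wall-clock recovery objective; the simulated clock just keeps the sketch fast and deterministic.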

Principles of Chaos Engineering

Drawing from PhoenixNAP, here are some principles to guide effective chaos engineering:

Define the Steady State: Establish normal system behavior metrics as a baseline.
Formulate Hypotheses: Predict how the system will react to specific disruptions.
Plan Experiments: Design detailed failure simulations targeting critical components.
Execute Experiments: Implement disruptions and monitor system responses.
Analyze Results: Compare outcomes against the baseline to identify weaknesses.
Improve and Iterate: Use insights to enhance system resilience continuously.
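
The steps above can be sketched as a small, self-contained experiment. This is a toy model rather than a real chaos tool: the two-replica service, the retry flag, and the 1% tolerance are all assumptions chosen for illustration. The experiment measures a baseline error rate, takes one replica down, and checks whether the error rate stays within tolerance.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    chaos_error_rate: float
    hypothesis_held: bool


def call_service(rng, replica_down, retry):
    """Toy two-replica service (hypothetical); returns True on success."""
    first = rng.choice([0, 1])
    if first != replica_down:
        return True
    # The first replica is down: optionally retry on the other one.
    return retry and (1 - first) != replica_down


def error_rate(rng, replica_down, retry, requests=1000):
    failures = sum(
        not call_service(rng, replica_down, retry) for _ in range(requests)
    )
    return failures / requests


def run_experiment(retry, seed=0, tolerance=0.01):
    rng = random.Random(seed)
    # Define the steady state: error rate with every replica healthy.
    baseline = error_rate(rng, replica_down=None, retry=retry)
    # Hypothesis: losing one replica keeps the error rate within tolerance.
    # Execute: take replica 0 down and measure again.
    chaos = error_rate(rng, replica_down=0, retry=retry)
    # Analyze: compare the outcome against the baseline.
    held = (chaos - baseline) <= tolerance
    return ExperimentResult(baseline, chaos, held)


print(run_experiment(retry=True))
print(run_experiment(retry=False))
```

In this toy model the hypothesis holds when clients retry on the surviving replica and fails when they do not, which is the kind of weakness the "Improve and Iterate" step would then address.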

Best Practices for Chaos Engineering

Based on the Medium article by The Cloud Architect, here are some best practices for a successful chaos engineering journey:

Start Small: Begin with small-scale experiments to build confidence and understanding.
Automate: Use automation tools to implement and monitor chaos experiments.
Collaborate: Foster a culture of collaboration among teams to address discovered issues.
Document: Keep detailed records of experiments, outcomes, and improvements.
Scale Gradually: Expand the scope of experiments gradually to encompass more significant parts of the system.
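
One way to start small and scale gradually is to limit each experiment to a fixed percentage of hosts and raise that percentage over time. The sketch below is an assumption about how this could be done, not a description of any particular tool: it hashes each (hypothetical) host name into one of 100 buckets, so the set of targets at 5% is always contained in the set at 25%.

```python
import hashlib


def in_blast_radius(host, percent):
    """Return True if the host falls inside the experiment's percentage."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < percent


def select_targets(hosts, percent):
    return [host for host in hosts if in_blast_radius(host, percent)]


# Hypothetical fleet of 200 hosts.
hosts = [f"web-{i:03d}" for i in range(200)]
small = select_targets(hosts, 5)
larger = select_targets(hosts, 25)
print(len(small), len(larger))
```

Because the bucket depends only on the host name, widening the percentage adds hosts without reshuffling the ones already in the experiment, which keeps results comparable between rounds.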

DevOps/SRE Strategy Integration

Chaos engineering and resilience testing are crucial components of a comprehensive DevOps and Site Reliability Engineering (SRE) strategy. These practices should be integrated with existing engineering practices like automated testing and environment management to ensure robust system reliability and performance.

Automated Testing

Automated testing ensures that code changes do not introduce new failures. Integrating chaos engineering with automated testing helps identify potential disruptions early in the development lifecycle, improving overall system resilience.
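
As a rough illustration of what this integration might look like, the sketch below uses Python's standard unittest and unittest.mock modules to inject a dependency outage inside an ordinary test case. InventoryClient and Checkout are hypothetical classes invented for this example, and the fallback policy (accept the order during an outage) is an assumption, not a recommendation.

```python
import unittest
from unittest import mock


class InventoryClient:
    """Hypothetical client for a downstream inventory service."""

    def stock(self, sku):
        return 5


class Checkout:
    def __init__(self, inventory):
        self.inventory = inventory

    def can_order(self, sku):
        try:
            return self.inventory.stock(sku) > 0
        except ConnectionError:
            # Assumed policy: degrade gracefully and reconcile stock later.
            return True


class CheckoutChaosTest(unittest.TestCase):
    def test_order_survives_inventory_outage(self):
        inventory = InventoryClient()
        # Inject the failure: every call to stock() raises ConnectionError.
        with mock.patch.object(
            inventory, "stock", side_effect=ConnectionError("injected")
        ):
            self.assertTrue(Checkout(inventory).can_order("sku-1"))

    def test_order_rejected_when_out_of_stock(self):
        inventory = InventoryClient()
        with mock.patch.object(inventory, "stock", return_value=0):
            self.assertFalse(Checkout(inventory).can_order("sku-1"))


if __name__ == "__main__":
    unittest.main()
```

Tests like these run on every commit, so a change that removes the fallback path fails the build long before a real outage exposes it.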

Environment Management

Effective environment management includes maintaining consistent development, testing, and production environments. Chaos engineering can validate that these environments can withstand disruptions, ensuring reliable deployments and operations.

Types of Chaos Engineering Tests

Infrastructure Failure Tests: Simulate failures in the underlying infrastructure, like instance terminations.
Application Failure Tests: Force shutdowns of critical services to identify single points of failure.
Dependency Failure Tests: Assess how the system handles failures in third-party services.
Network Failure Tests: Introduce latency or packet loss to test network robustness.
Security Chaos Engineering Tests: Simulate cyber-attacks to evaluate system defenses.
Operational Failure Tests: Test response to routine maintenance and unexpected operational issues.
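
Two of these test types, dependency failures and network latency, can be sketched with a simple fault-injection wrapper. Everything here is illustrative: fetch_price stands in for a third-party call, the latency is reported as a number rather than actually slept, and the cached fallback and 200 ms timeout are assumptions.

```python
import random


class DependencyError(Exception):
    """Raised when an injected dependency failure occurs."""


def make_faulty(func, rng, failure_rate=0.0, added_latency_ms=0.0):
    """Wrap func so calls fail with probability failure_rate and report extra latency."""

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyError("injected dependency failure")
        return func(*args, **kwargs), added_latency_ms

    return wrapper


def fetch_price(item):
    """Hypothetical third-party pricing call."""
    return {"widget": 9.99}[item]


def price_with_fallback(fetch, item, cached, timeout_ms=200.0):
    """Use the live price unless the dependency fails or responds too slowly."""
    try:
        value, latency_ms = fetch(item)
    except DependencyError:
        return cached[item], "fallback"
    if latency_ms > timeout_ms:
        return cached[item], "fallback"
    return value, "live"


rng = random.Random(42)
cached = {"widget": 9.50}
healthy = make_faulty(fetch_price, rng)
broken = make_faulty(fetch_price, rng, failure_rate=1.0)
slow = make_faulty(fetch_price, rng, added_latency_ms=500.0)

print(price_with_fallback(healthy, "widget", cached))
print(price_with_fallback(broken, "widget", cached))
print(price_with_fallback(slow, "widget", cached))
```

The same wrapper shape could be pointed at infrastructure or application faults by swapping what gets injected; the key design choice is that the caller, not the wrapper, decides how to degrade.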

Conclusion

The CrowdStrike incident is a powerful reminder that even top-tier cybersecurity solutions are not immune to issues that can cause widespread disruption. Organizations must adopt comprehensive resiliency strategies, including chaos engineering and resilience testing, to ensure robust defenses and operational efficiency.

As Bruce Schneier aptly puts it, "Security is a process, not a product," underscoring the ongoing journey toward true resiliency. By learning from these disruptions and proactively strengthening their systems, organizations can better navigate the complexities of the digital age.