The day IT stumbled: Lessons learned from CrowdStrike outage

A flawed update to software developed by cybersecurity company CrowdStrike set off a global tech crisis a week ago, disrupting critical infrastructure and affecting around 8.5 million Windows devices

2024-07-25 21:18:44
- The update caused system crashes across the world by incorrectly flagging legitimate activities as threats, according to Timothy Lethbridge, a British-Canadian computer science professor

- Now, companies, airlines, hospitals, financial systems, and others that were affected by the outage are scrambling to deal with the fallout and take measures to prevent such meltdowns in the future

ISTANBUL

On July 19, a faulty update by cybersecurity company CrowdStrike to its Falcon software triggered a global tech meltdown that impacted around 8.5 million Windows devices and disrupted critical infrastructure across multiple sectors.

Airlines, hospitals, and financial systems were affected across an enormous region spanning the Americas, Europe, and Asia, with many users confronted by the so-called "blue screen of death" that Microsoft Windows displays when it hits a critical error.

While the impact of the ordeal was unprecedented, its lingering effects and the breadth of the affected area have made it difficult to determine the exact extent of the harm caused.

CrowdStrike issued an official blog post to clarify the cause, which was identified as a problematic update distributed via automatic channels.

What happened?

The update incorrectly flagged legitimate activities as threats, leading to widespread system failures. CrowdStrike addressed the issue by correcting the update and implementing measures to prevent future occurrences through more rigorous testing.

Timothy Lethbridge, a British-Canadian computer science professor at the University of Ottawa, confirmed to Anadolu that a coding error in the update was responsible, not a malicious actor.

"It turns out that that description had a fault in it. The understanding right now is that it wasn't a robot or a bad actor or anything like that. It was somebody made a coding error," Lethbridge explained.

He noted that the problem stemmed from a fault in a description of the bad behavior that the software is designed to guard against, which caused the software itself to crash.

"CrowdStrike sent one of these new descriptions, these files, to all the computers that are running their CrowdStrike Falcon software so that they're all protected," Lethbridge said.

The faulty update, he explained, led to simultaneous failures of all affected systems.
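The failure mode Lethbridge describes can be illustrated with a minimal sketch. This is not CrowdStrike's actual code: it assumes a hypothetical agent that receives threat descriptions as a JSON file, checks the file before using it, and keeps the last known-good version if the new one is malformed. The file format, the field names, and the functions `validate` and `load_descriptions` are all assumptions made for illustration.

```python
import json

# Hypothetical schema: every rule must carry these fields.
REQUIRED_FIELDS = {"id", "pattern", "action"}


def validate(raw: str) -> list:
    """Parse a description file and reject it if any rule is malformed."""
    rules = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad input
    if not isinstance(rules, list):
        raise ValueError("expected a list of rules")
    for rule in rules:
        if not isinstance(rule, dict):
            raise ValueError("each rule must be an object")
        missing = REQUIRED_FIELDS - rule.keys()
        if missing:
            raise ValueError(f"rule is missing fields: {sorted(missing)}")
    return rules


def load_descriptions(new_raw: str, last_good: list) -> list:
    """Return the new rules if they are valid, otherwise keep the last known-good set."""
    try:
        return validate(new_raw)
    except ValueError:
        # A bad file is ignored instead of being allowed to crash the agent.
        return last_good
```

In this sketch a malformed file costs the machine only the newest protections, whereas software that trusts every file it receives can fail outright on a bad one.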

Financial impact

The outage impacted over 1,000 organizations, resulting in a temporary decline in the stock prices of companies using CrowdStrike services.

Financial news website MarketWatch reported an average stock price drop of 3% to 5% in the cybersecurity and technology sectors during the initial days following the incident.

Bloomberg estimated global financial losses from business interruptions at hundreds of millions of dollars or more, encompassing lost revenue, increased security costs, and other operational disruptions.

Health care disruptions

Besides financial losses, the outage led to significant delays in electronic health record systems and telemedicine services.

According to a report by the website Health IT News, several hospitals experienced temporary disruptions affecting around 1.5 million patients worldwide.

While no major incidents of patient harm were reported, The New York Times highlighted concerns about potential risks due to delays in accessing critical medical records.

Transportation problems

UPI reported that the CrowdStrike outage caused widespread disruptions in air travel, resulting in thousands of flight cancellations globally, with backlogs continuing through the week.

Nearly 26,000 flight delays were also reported, it added, with the US experiencing over 3,000 cancellations and more than 12,000 delays on the first day alone.

Major airlines like Delta were among the most heavily affected, with many reportedly seeking to compensate passengers with travel waivers.

Impact on daily life

Many consumer-facing applications and services reliant on CrowdStrike for cybersecurity faced outages or degraded performance, affecting online shopping, banking apps, and other digital services.

Major banks like JPMorgan Chase and Bank of America, as well as payment card services provider Visa, faced login and payment issues, affecting millions of customers, according to People magazine.

Numerous other financial institutions and e-wallets globally experienced disruptions, causing delays in transactions and service accessibility, it added.

Media and broadcasting services were also hit, with NBC affiliates and Sky News experiencing blackouts, leaving stations off the air for hours.

Additionally, billboards in Times Square went blank during the outage, highlighting the extensive reach of the incident.

The outage affected not only businesses but also public services, with DMV offices in several states shutting down temporarily.

The Guardian reported increased public anxiety and confusion, though the overall effect on daily routines was relatively contained.

CrowdStrike's Falcon software

CrowdStrike's Falcon software is designed to monitor and protect systems from threats using real-time updates.

Forbes highlighted Falcon's use of AI and machine learning to mitigate threats.

Lethbridge elaborated that companies rely on CrowdStrike for continuous protection, though the faulty update disrupted this protection temporarily.

"CrowdStrike provided this Falcon software, which is constantly monitoring the computer, but it's doing it in an interesting way," he said.

The setup enables CrowdStrike to send real-time updates about newly detected threats, ensuring continuous protection, he added.

Recovery

According to Lethbridge, all computers worldwide that were switched on and running Falcon went down simultaneously as the automatic update took effect on July 19.

“All the computers running CrowdStrike Falcon software that were live at 4.09 a.m. (0409GMT) went down simultaneously,” he said, describing the widespread impact.

TechCrunch reported that CrowdStrike promptly identified and addressed the issue, curbing the disruption as much as it could.

Recovery involved manual intervention to restart affected systems and issue a corrected update. Despite the damage, CrowdStrike has received praise for its transparent communication and the rapid recovery of many systems, helping maintain customer confidence.

Technical and testing challenges

In the aftermath of the outage, TechRadar uncovered vulnerabilities in the stability of CrowdStrike's systems when experiencing high traffic.

Meanwhile, information technology media outlet InfoWorld highlighted issues with testing new updates and configurations, suggesting that testing protocols and disaster recovery plans need to be improved.

“There's always a risk of error, but the risk is supposed to be really reduced by extensive testing,” Lethbridge warned.

“CrowdStrike didn't somehow manage to do better testing of this,” he added, suggesting that improved protocols and internal simulations could prevent similar incidents in the future.
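One widely used safeguard of the kind Lethbridge alludes to is a staged, or "canary," rollout, in which an update reaches a small share of machines first and spreads further only if those machines stay healthy. The following is a minimal sketch under stated assumptions, not a description of CrowdStrike's process: the function `staged_rollout`, the `apply_update` callback, the stage percentages, and the failure threshold are all hypothetical.

```python
def staged_rollout(hosts, apply_update, stages_percent=(1, 10, 100), max_failure_rate=0.05):
    """Push an update to growing shares of hosts, halting if too many fail.

    apply_update(host) is a hypothetical callback that returns True when the
    host is still healthy after the update and False when it has failed.
    """
    total = len(hosts)
    updated = 0
    failures = 0
    for percent in stages_percent:
        # Cumulative number of hosts that should be updated after this stage.
        target = min(total, max(1, total * percent // 100))
        batch = hosts[updated:target]
        if not batch:
            continue
        batch_failures = sum(1 for host in batch if not apply_update(host))
        updated = target
        failures += batch_failures
        if batch_failures / len(batch) > max_failure_rate:
            # Stop before the faulty update reaches the remaining hosts.
            return {"halted": True, "updated": updated, "failures": failures}
    return {"halted": False, "updated": updated, "failures": failures}
```

With 1,000 hosts and an update that crashes every machine, this sketch stops after the first 10 hosts instead of taking down all 1,000 at once.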

Role of AI

Acknowledging the growing use of AI tools in software development, Lethbridge speculated that there could be “maybe a 20% chance that AI-assisted tools contributed to the error” that afflicted Falcon.

He noted that while AI played a crucial role in diagnosing and addressing the issue, it may also introduce errors that risk going unnoticed.

Technology news website The Register noted that AI plays a role in detecting anomalies, though its real-time analysis limitations delayed root cause identification in the CrowdStrike case.

Wired and ZDNet, meanwhile, emphasized the need for better predictive capabilities and incident response improvements.

Preparing for future outages

Now, CrowdStrike is reportedly investing in advanced AI tools and revising incident response protocols to enhance resilience against future outages.

ZDNet has noted improvements in testing environments and redundancy measures. Lethbridge, however, warned that such outages could recur and might be more severe, stressing the importance of building resilient systems and having contingency plans for critical services.

What is certain is that this incident highlighted the need for improved cybersecurity practices and enhanced software testing to protect digital infrastructure as experts seek ways to prevent outages in the future.