The day IT stopped
Dire warnings were blared, describing pandemonium in airports, halted stock market trading, credit card transactions not working – and so forth. Worldwide monitoring committees were instituted to monitor the IT world once all computer systems passed that dreaded second of midnight, Jan. 1, 2000.
For those too young to remember, IT professionals were long wary of how computers would behave once the year 2000 was crossed. This problem was dubbed the “Y2K Problem.” The root of this problem was that many early computer programs only recognized the last two digits of each year. As a result, the years “1900” and “2000” could not be differentiated. That is sufficient cause for concern if, say, an air traffic computer believes that a plane scheduled to land on Jan. 1, 2000 had already landed a hundred years before.
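For the technically curious, here is a minimal sketch of that two-digit problem in Python. It is only an illustration and not code from any real system: the function names and the pivot value of 50 are my own assumptions. The second function shows “windowing,” one of the common repair techniques of the time, which guesses the century from the two digits.

```python
def years_elapsed(start_yy: int, end_yy: int) -> int:
    """Naive elapsed-years calculation using only the last two digits."""
    return end_yy - start_yy


def expand_with_window(yy: int, pivot: int = 50) -> int:
    """Windowing fix: two-digit years at or above the (assumed) pivot
    are read as 19xx, the rest as 20xx."""
    return 1900 + yy if yy >= pivot else 2000 + yy


# A record created in 1995 ("95") and checked in 2000 ("00").
naive = years_elapsed(95, 0)
fixed = expand_with_window(0) - expand_with_window(95)

print(naive)  # negative: the program thinks time ran backwards
print(fixed)  # the true gap once the century is restored
```

With two digits alone, the year “00” sorts before “95,” so any age, interest or schedule calculation goes wrong. Windowing only postpones the problem to the pivot year, which is why many systems were eventually moved to four-digit years.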
Since the whole point of storing two digits instead of four was to save precious program space, it’s quite understandable that those same programs were not created to ponder how an airplane could possibly land on an airstrip three years before airplanes were invented. Nonetheless, as the first few seconds into the 2000s ticked by, IT professionals breathed deep sighs of relief and popped the champagne. Planes didn’t fall from the sky, and our electrical and water systems chugged on as normal. Programmers managed to save the day!
So years passed by, and the notion of a “Global IT Outage” faded. That is, until Friday, July 19, 2024.
The first caller I heard from believed their organization had been compromised by a hacker. While receiving such calls is common for us, I knew this organization to be quite secure and well funded against attacks. “All our servers are affected, I need you to look into this,” the grave voice on the line said.
I hung up quite perplexed. For an attacker to do that sort of damage, there are usually a lot of visible warning signs prior – especially for an advanced organization such as this one. Then posts in IT and IT security forums came rushing in. Whole financial and transport systems ceased to work because their computers simply crashed.
There is a consensus that this was the largest such IT outage in history. One IT insurance firm (Parametrix Insurance) estimated the cost to Fortune 500 companies at $5.4 billion. Delta Air Lines in particular canceled more than 5,500 flights within five days of the outage. Other airlines, including some in the ASEAN region, had to resort to handling transactions manually while frantically restoring their IT services.
By now I suppose the general public is somewhat aware of what happened, though I suspect there remains some confusion due to its technical nature. I hope to offer my own “laymanized” explanation of what went wrong, and of what we can learn from it.
Most readers know what an “Anti-Virus (AV)” program is, and what it does. I want to emphasize, however, that AVs are described as “passive.” They detect known malicious programs and stop them from running. Generally, that’s it. For large companies, however, this isn’t sufficient. We need to know how that malicious program got there in the first place. Was the user phished? Are we already infiltrated? Did this come from somebody else who was compromised?
There is a need to isolate the computer and analyze it, similar to how police would detain and interrogate a suspect. To do this, a more advanced program called “Endpoint Detection & Response (EDR)” is used. Think of it as an AV on steroids. Unlike a passive AV that simply blocks and deletes malicious files, an EDR allows for further interaction. An IT team, for example, may isolate a compromised computer (so a hacker inside the computer cannot move elsewhere in the organization) and investigate the computer as well. Bear in mind, the EDR lets an IT team do this remotely – on computers they manage across the world.
You’re probably imagining that an EDR is a very powerful piece of software, and you’d be correct. It is “powerful” in the sense that it directly interacts with the very heart of the Windows operating system – a section of computer code literally referred to as the “kernel.” Mind you, this is not something computer programs typically do. They normally interact with the Windows operating system from safer and restricted areas.
I suppose this needs an analogy. When hotel guests need to charge their phones, they don’t typically need security access to the main electrical facility housing the circuit breakers. Guests simply plug their devices into the safe, fauxwood veneer electrical sockets in the comfort of their rose-scented rooms.
EDRs, however, not only need direct access to those main electrical facilities – but are constantly given new instructions on what to do once they are in that room. That very room where one single mistake can blow the lights out of the entire building.
And yes, this is basically what happened with the EDR made by “CrowdStrike.” See, as with AVs, an EDR constantly gets updated to recognize the latest malicious software. It needs to, because after all, new threats are concocted daily by hackers – who aim to evade detection.
Unfortunately, as with all software, mistakes happen. The CrowdStrike update that took place on July 19 had a bug. While bugs occur with all software all the time (hence making IT security a viable career to put food on the table), recall that CrowdStrike directly works inside the “kernel” – the proverbial main electrical room of an entire building.
Could this have been prevented in the first place if CrowdStrike did not have direct access to the vaunted kernel? Certainly, as macOS does force similar software to operate in a safer area. That said, Microsoft has pointed to a 2009 agreement with the European Commission, made on competition grounds, that requires it to give third-party security vendors the same level of access to Windows that its own products get.
CrowdStrike is a leading company in the global EDR segment, hence explaining the widespread impact. However, due to the digital manner in which modern companies interact, companies that do not use CrowdStrike directly were affected as well. Companies that don’t have an EDR would definitely feel the impact of their third-party payment system going offline.
Unfortunately, in a world where corporations interact with each other digitally in order to survive, the chances of the digital armageddon once imagined with Y2K are all too real. While we may not have all the answers now to reduce the chances of such outages happening, as always, full awareness of the problem is never a bad first step.