Why system resilience should mainly be the job of the OS, not just third-party applications

October 03, 2024

Building efficient recovery options will drive ecosystem resilience

Last week, a US congressional hearing regarding the CrowdStrike incident in July saw one of the company’s executives answer questions from policy makers. One point that caught my interest during the ensuing debate was the suggestion that future incidents of this magnitude could be avoided by some form of automated system recovery.

Without getting into the technical details of the incident and how it could have been avoided, the suggestion raises a fundamental question: should automated recovery be the responsibility of the third-party software vendor? Or is it better framed as a wider issue of operating system (OS) resilience, with the OS initiating some form of auto-recovery process in collaboration with the third-party application?

A system that heals itself

A catastrophic boot error that causes a blue screen of death (BSOD) occurs when the device fails to load the software required to present the user with a working operating system, along with the applications installed on the device. For example, it can be triggered when software is installed or updated; in this particular instance, a corrupted/bad update file called on during the boot process of the device triggered the BSOD that ultimately resulted in a well-documented global IT meltdown.

Some software, such as security applications, requires low-level access, known as ‘kernel mode’. If a component at this level fails, a BSOD is a potential outcome. Rebooting the device results in the same BSOD, creating a loop, and you need expert intervention to break this cycle. (Of course, a BSOD can also occur in ‘user mode’, which provides a more restricted environment for software to operate in.)

Now, if the mention of kernel mode lost you, let me use an analogy to make things clearer: Think of an engine in a gasoline car. The engine requires a spark to ignite the fuel-air mixture, which is where a spark plug comes in. On a regular maintenance schedule, spark plugs need replacing, otherwise the engine may well fail to perform as expected. A mechanic pops the hood of the car and in go new spark plugs. Turn the key (or push the start button) and the engine starts – except when it doesn’t. That’s roughly what happened in this incident, but from a software standpoint.

Now, the question arises: should it be the responsibility of a spark plug manufacturer, of which there are many, to create an auto-recovery mechanism for this scenario? In the software context, should the third-party vendor be responsible? Or should the mechanic just pop the hood again, revert to the used and known-to-be-working spark plugs, and restart the car in its previous working state?

In my view, the recovery process should be the same in all circumstances, regardless of the third-party software (or spark plugs) involved. Now, the reality is, of course, a little more complex than my analogy, as the spark plugs (the software) are being updated and replaced without the knowledge of the mechanic (the OS). Still, I hope the analogy helps provide a visual of the issue.

The case for OS-managed recovery

Suppose that every time a third-party software package updates and changes the core workings of the device, for example by installing a new or modified file needed during the boot process, it had to register that change with the operating system, and the previous working file or state was set aside rather than overwritten. If the next startup then ended in a BSOD, a subsequent boot could, as its first task, check whether the previous boot failed and offer the user the option to restore the previous version of the file or state, removing the update. The same approach could apply to all third-party software that has kernel-mode access.
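To make the idea concrete, here is a minimal sketch of the register, set aside and roll back flow in Python. It is an illustration of the concept only, not how any real operating system implements recovery: the RecoveryRegistry class, its method names and the sensor.sys file name are all hypothetical, and the state is held in memory, whereas a real mechanism would have to persist it across reboots and protect it from tampering.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RecoveryRegistry:
    """Hypothetical OS-side record of boot-time files replaced by updates."""
    backup_dir: Path
    pending: dict = field(default_factory=dict)  # live file -> set-aside copy
    boot_in_progress: bool = False  # stays True if a boot never completes

    def register_update(self, live_file: Path, new_content: bytes) -> None:
        # Set the known-good version aside instead of overwriting it.
        backup = self.backup_dir / (live_file.name + ".previous")
        backup.write_bytes(live_file.read_bytes())
        live_file.write_bytes(new_content)
        self.pending[live_file] = backup

    def begin_boot(self) -> bool:
        """Return True if the previous boot failed and a rollback is available."""
        previous_boot_failed = self.boot_in_progress
        self.boot_in_progress = True
        return previous_boot_failed and bool(self.pending)

    def boot_succeeded(self) -> None:
        # The update proved itself, so the set-aside copies can be discarded.
        self.boot_in_progress = False
        for backup in self.pending.values():
            backup.unlink(missing_ok=True)
        self.pending.clear()

    def roll_back(self) -> list:
        """Restore every set-aside file, removing the pending updates."""
        restored = []
        for live_file, backup in self.pending.items():
            backup.replace(live_file)  # overwrite the update with the old file
            restored.append(live_file)
        self.pending.clear()
        return restored


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        driver = root / "sensor.sys"  # hypothetical boot-time file
        driver.write_bytes(b"v1 known good")
        registry = RecoveryRegistry(backup_dir=root)
        registry.register_update(driver, b"v2 faulty")
        registry.begin_boot()  # this boot crashes: boot_succeeded() is never called
        if registry.begin_boot():  # the next boot detects the earlier failure
            registry.roll_back()  # a real OS might ask the user first
        return driver.read_bytes()
```

In this sketch, demo() simulates an update followed by a failed boot, and the second boot restores the earlier file. The design choice worth noting is that the rollback logic lives in one place and knows nothing about the vendor whose file it restores, which is the point of the argument.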

There is already a precedent for this kind of OS-managed recovery. When a new display driver is installed but fails to initialize correctly during the boot process, the failure is captured and the operating system automatically reverts to a default state, offering a very low-resolution driver that works with all displays. This exact scenario obviously does not work for cybersecurity products, because there is no default state, but there could be a previous working state from before the update.

Having a recovery option built into the OS for all third-party software would be more efficient than relying on each software vendor to develop their own solution. It would, of course, need consultation and collaboration between OS and third-party software vendors to ensure the mechanism functions and could not be exploited by bad actors.

I also accept that I may have (over)simplified the heavy lifting needed to develop such a solution, but even so, it would be more robust than having thousands of software developers each trying to create their own system recovery method. Ultimately, this could go a long way toward improving system resilience and preventing widespread outages – like the one triggered by the faulty CrowdStrike update.