Don’t blame outages on human error and technical debt; improve the system instead

The FAA outage in January that caused the first nationwide ground stop of all flights in the US since 9/11 is kind of old news now, but there’s one detail that I can’t stop thinking about. In the aftermath of the incident the cause was determined to be a database sync issue:

The F.A.A. said in a statement that the workers had been trying to “correct synchronization” between the main database for the Notice to Air Missions alerts and a backup database when the files were mistakenly deleted, causing the outage that snarled air traffic throughout the day on Jan. 11.

CNN added a little more detail:

A contractor working for the Federal Aviation Administration unintentionally deleted files related to a key pilot safety system, leading to a nationwide ground stop and thousands of delayed and canceled flights last week, the FAA said Thursday.

The FAA determined the issue with the Notice to Air Missions (NOTAM) system occurred when the contractor was “working to correct synchronization between the live primary database and a backup database.”

The unsurprising narrative that came out of the tech world following the incident can basically be summarized as “ha ha, silly contractors!” But that feels like a lazy response to me. I didn’t see anyone ask what I believe is the more important question: How do we improve the system (people, processes, technology) that enables one person to inadvertently take down all air traffic in the US?

Let’s remember that this kind of thing can happen to absolutely anyone. Etsy even hands out a “three-armed sweater” award to the engineer who had the most spectacular mishap in any given year:

Kate’s story is a nail-biter, involving a tiny code change that unexpectedly brought down Etsy.com. All of her coworkers rallied around her to help get the site back online, while offering words of encouragement and reassurance.

So it might be really convenient to blame the FAA outage on “contractor error” and then just keep going. But that’s not going to prevent the next incident from happening.

It is also tempting to blame the entire issue on “tech debt” and call it a day. And, fair enough, there’s certainly plenty of that going around in FAA systems. Ars Technica has a good overview of some of the major issues and how the FAA wants to fix them. But like all giant “replatforming” projects (this one is called NextGen, because of course it is) things are… not going great:

FAA tech problems were previously described in a March 2021 report by the US Department of Transportation Office of Inspector General. The report discusses the FAA’s Next Generation Air Transportation System (NextGen), “a multibillion dollar infrastructure project aimed at modernizing our Nation’s aging air traffic system to provide safer and more efficient air traffic management.”

“NextGen’s actual and projected benefits have not kept pace with initial projections due to implementation challenges, optimistic assumptions, and other factors,”¹ the report said.

But blaming tech debt—and especially blaming individuals—is not going to get us very far. Tech debt will always be there (although I have some thoughts on how to prioritize it), and individual mistakes are not going to go away. What we can do is examine the system that enables, in this case, a database sync to corrupt the primary live db, and figure out how to prevent that from happening in the first place.

Almost 30 years ago Jakob Nielsen published his 10 Usability Heuristics for User Interface Design, and his “error prevention” heuristic is as relevant today as it was then:

Good error messages are important, but the best designs carefully prevent problems from occurring in the first place. Either eliminate error-prone conditions, or check for them and present users with a confirmation option before they commit to the action.

The example I always think of here is how you often see battery packs shaped in a certain way so that it’s impossible to insert them incorrectly (contrast that with the terrors of trying to insert a USB cable the correct way the first time!).

In a situation like the one the FAA experienced, yes, it’s important to acknowledge human error and talk about the underlying tech issues, but that’s not enough. We have to figure out how to add preventative measures to our systems and pipelines². To put it another way, they might not be able to replace their battery packs with NextGen solar yet, but they can certainly change the shape of the battery to prevent contractors from blowing up the camera.
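To make “changing the shape of the battery” a bit more concrete, here is a minimal sketch of what an error-prevention guard around a destructive operation could look like. To be clear, this is not how the NOTAM system works (the public statements don’t go into that level of detail); the `Target` class, the `delete_files` function, and the database names are all hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


class UnsafeOperationError(Exception):
    """Raised when a guard blocks a destructive action."""


@dataclass(frozen=True)
class Target:
    # Both fields are hypothetical labels for this sketch.
    name: str         # e.g. "notam-primary"
    is_primary: bool  # True for the live database


def delete_files(target: Target, paths: List[str],
                 confirm_name: Optional[str] = None) -> List[str]:
    """Return the paths that would be deleted once the guards pass."""
    # Guard 1: eliminate an error-prone condition outright.
    if not paths:
        raise UnsafeOperationError("refusing to run with an empty path list")
    # Guard 2: ask for explicit confirmation before touching the live system.
    if target.is_primary and confirm_name != target.name:
        raise UnsafeOperationError(
            f"{target.name} is the live primary; retype its name to confirm"
        )
    # A real tool would delete here; this sketch only reports the plan.
    return list(paths)
```

Working on the backup goes through without ceremony, while the same call against the primary fails unless the operator retypes the target’s name. That is Nielsen’s “confirmation option before they commit to the action” in code form: it doesn’t make mistakes impossible, but it makes the dangerous path harder to take by accident.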

¹ My emphasis added, because who among us has not heard those words before…
² For further reading on what to do after a major incident, check out Will Larson’s “Move past incident response to reliability.”
