<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="https://elezea.com/wp-content/themes/elz_2023/styles/pretty-feed-v3.xsl" type="text/xsl"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
  xmlns:slash="http://purl.org/rss/1.0/modules/slash/" >
  <channel>
    <title>Elezea by Rian van der Merwe - RSS Feed</title>
    <atom:link href="https://elezea.com/2023/03/faa-outage-contractor-database/feed/" rel="self" type="application/rss+xml" />
    <link>https://elezea.com/2023/03/faa-outage-contractor-database/</link>
    <description>A personal blog about product, technology, and interesting things that are worth sharing.</description>
    <lastBuildDate>Thu, 02 Apr 2026 17:43:52 +0000</lastBuildDate>
    <language></language>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <generator>https://wordpress.org/?v=6.9.4</generator>
          <item>
        <title>Don’t blame outages on human error and technical debt; improve the system instead</title>
        <link>https://elezea.com/2023/03/faa-outage-contractor-database/</link>
        <pubDate>Sat, 04 Mar 2023 15:54:58 +0000</pubDate>
        <dc:creator>Rian van der Merwe</dc:creator>
        <guid isPermaLink="false">https://elezea.com/?p=7920</guid>
        <description>
          <![CDATA[The big FAA outage in January was blamed on "human error" but that ignores the bigger question: how can we improve the system that enables one person to inadvertently take down all air traffic in the US?]]>
        </description>
        <content:encoded>
          <![CDATA[<p>The <a href="https://edition.cnn.com/travel/article/faa-computer-outage-flights-grounded/index.html">FAA outage in January</a> that caused the first nationwide ground stop of all flights in the US since 9/11 is kind of old news now, but there’s one detail that I can’t stop thinking about. In the aftermath of the incident the <a href="https://www.nytimes.com/2023/01/19/travel/faa-outage-alert-system.html">cause was determined to be a database sync issue</a>:</p>
<blockquote>
<p>The F.A.A. said in a statement that the workers had been trying to “correct synchronization” between the main database for the Notice to Air Missions alerts and a backup database when the files were mistakenly deleted, causing the outage that snarled air traffic throughout the day on Jan. 11.</p>
</blockquote>
<p>CNN <a href="https://www.cnn.com/2023/01/19/business/faa-notam-outage/index.html">added a little more detail</a>:</p>
<blockquote>
<p>A contractor working for the Federal Aviation Administration unintentionally deleted files related to a key pilot safety system, leading to a nationwide ground stop and thousands of delayed and canceled flights last week, the FAA said Thursday.</p>
<p>The FAA determined the issue with the Notice to Air Missions (NOTAM) system occurred when the contractor was “working to correct synchronization between the live primary database and a backup database.”</p>
</blockquote>
<p>The unsurprising narrative that came out of the tech world following the incident can basically be summarized as “ha ha, silly contractors!” But that feels like a lazy response to me. I didn’t see anyone ask what I believe is the more important question: <strong>How do we improve the <em>system</em> (people, processes, technology) that enables one person to inadvertently take down all air traffic in the US?</strong></p>
<p>Let’s remember that this kind of thing can happen to absolutely anyone. <a href="https://www.linkedin.com/posts/etsy_lifeatetsy-engineeringatetsy-activity-6780161824160563200-Rhu8/">Etsy even hands out a “three-armed sweater” award</a> to the engineer who had the most spectacular mishap in any given year:</p>
<blockquote>
<p>Kate’s story is a nail-biter, involving a tiny code change that unexpectedly brought down Etsy.com. All of her coworkers rallied around her to help get the site back online, while offering words of encouragement and reassurance.</p>
</blockquote>
<p>So it might be really convenient to blame the FAA outage on “contractor error” and then just keep going. But that’s not going to prevent the <em>next</em> incident from happening.</p>
<p>It is also tempting to blame the entire issue on “tech debt” and call it a day. And, fair enough, there’s certainly plenty of that going around in FAA systems. <a href="https://arstechnica.com/tech-policy/2023/01/faa-outage-that-grounded-flights-blamed-on-old-tech-and-damaged-database-file/">Ars Technica has a good overview</a> of some of the major issues and how the FAA wants to fix them. But like all giant “replatforming” projects (this one is called <em>NextGen</em>, because of course it is), things are&#8230; not going great:</p>
<blockquote>
<p>FAA tech problems were previously described in a March 2021 report by the US Department of Transportation Office of Inspector General. The report discusses the FAA&#8217;s Next Generation Air Transportation System (NextGen), &#8220;a multibillion dollar infrastructure project aimed at modernizing our Nation&#8217;s aging air traffic system to provide safer and more efficient air traffic management.&#8221;</p>
<p>“NextGen&#8217;s actual and projected benefits have not kept pace with initial projections <strong>due to implementation challenges, optimistic assumptions, and other factors</strong>,”<sup id="fnref:7920.1"><a href="#fn:7920.1" rel="footnote">1</a></sup> the report said.</p>
</blockquote>
<p>But blaming tech debt—and especially blaming individuals—is not going to get us very far. Tech debt will always be there (although <a href="https://elezea.com/2023/02/how-to-prioritize-technical-debt/">I have some thoughts on how to prioritize it</a>), and individual mistakes are not going to go away. What we <em>can</em> do is examine the system that enables, in this case, a database sync to corrupt the primary live db, and figure out how to prevent that from happening in the first place.</p>
<p>Almost 30 years ago Jakob Nielsen published his <a href="https://www.nngroup.com/articles/ten-usability-heuristics/">10 Usability Heuristics for User Interface Design</a>, and “error prevention” is still as true today as it was then:</p>
<blockquote>
<p>Good error messages are important, but the best designs carefully prevent problems from occurring in the first place. Either eliminate error-prone conditions, or check for them and present users with a confirmation option before they commit to the action.</p>
</blockquote>
<p>The example I always think of here is how you often see battery packs shaped in a certain way so that it’s impossible to insert them incorrectly (contrast that with the terrors of trying to insert a USB cable the correct way the first time!).</p>
<p>In a situation like the one the FAA experienced, yes, it’s important to acknowledge human error and talk about the underlying tech issues, but that’s not enough. We have to figure out how to add preventative measures to our systems and pipelines<sup id="fnref:7920.2"><a href="#fn:7920.2" rel="footnote">2</a></sup>. To put it another way, they might not be able to replace their battery packs with <em>NextGen</em> solar yet, but they can certainly change the shape of the battery to prevent contractors from blowing up the camera.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:7920.1">
<p>My emphasis added because who among us has <em>not</em> heard those words before&#8230;&#160;<a href="#fnref:7920.1" rev="footnote">&#8617;</a></p>
</li>
<li id="fn:7920.2">
<p>For further reading on what to do after a major incident, check out Will Larson&#8217;s <strong><a href="https://github.com/readme/guides/incident-response">Move past incident response to reliability</a></strong>.&#160;<a href="#fnref:7920.2" rev="footnote">&#8617;</a></p>
</li>
</ol>
</div>
          ]]>
        </content:encoded>
                      </item>
      </channel>
</rss>