
Don’t blame outages on human error and technical debt; improve the system instead

The FAA outage in January that caused the first nationwide ground stop of all flights in the US since 9/11 is kind of old news now, but there’s one detail that I can’t stop thinking about. In the aftermath of the incident the cause was determined to be a database sync issue:

The F.A.A. said in a statement that the workers had been trying to “correct synchronization” between the main database for the Notice to Air Missions alerts and a backup database when the files were mistakenly deleted, causing the outage that snarled air traffic throughout the day on Jan. 11.

CNN added a little more detail:

A contractor working for the Federal Aviation Administration unintentionally deleted files related to a key pilot safety system, leading to a nationwide ground stop and thousands of delayed and canceled flights last week, the FAA said Thursday.

The FAA determined the issue with the Notice to Air Missions (NOTAM) system occurred when the contractor was “working to correct synchronization between the live primary database and a backup database.”

The unsurprising narrative that came out of the tech world following the incident can basically be summarized as “ha ha, silly contractors!” But that feels like a lazy response to me. I didn’t see anyone ask what I believe is the more important question: How do we improve the system (people, processes, technology) that enables one person to inadvertently take down all air traffic in the US?

Let’s remember that this kind of thing can happen to absolutely anyone. Etsy even hands out a “three-armed sweater” award to the engineer who had the most spectacular mishap in any given year:

Kate’s story is a nail-biter, involving a tiny code change that unexpectedly brought down Etsy.com. All of her coworkers rallied around her to help get the site back online, while offering words of encouragement and reassurance.

So it might be really convenient to blame the FAA outage on “contractor error” and then just keep going. But that’s not going to prevent the next incident from happening.

It is also tempting to blame the entire issue on “tech debt” and call it a day. And, fair enough, there’s certainly plenty of that going around in FAA systems. Ars Technica has a good overview of some of the major issues and how the FAA wants to fix them. But like all giant “replatforming” projects (this one is called NextGen, because of course it is) things are… not going great:

FAA tech problems were previously described in a March 2021 report by the US Department of Transportation Office of Inspector General. The report discusses the FAA’s Next Generation Air Transportation System (NextGen), “a multibillion dollar infrastructure project aimed at modernizing our Nation’s aging air traffic system to provide safer and more efficient air traffic management.”

“NextGen’s actual and projected benefits have not kept pace with initial projections due to implementation challenges, optimistic assumptions, and other factors,”[1] the report said.

But blaming tech debt—and especially blaming individuals—is not going to get us very far. Tech debt will always be there (although I have some thoughts on how to prioritize it), and individual mistakes are not going to go away. What we can do is examine the system that enables, in this case, a database sync to corrupt the primary live db, and figure out how to prevent that from happening in the first place.

Almost 30 years ago Jakob Nielsen published his 10 Usability Heuristics for User Interface Design, and “error prevention” is still as true today as it was then:

Good error messages are important, but the best designs carefully prevent problems from occurring in the first place. Either eliminate error-prone conditions, or check for them and present users with a confirmation option before they commit to the action.

The example I always think of here is how you often see battery packs shaped in a certain way so that it’s impossible to insert them incorrectly (contrast that with the terrors of trying to insert a USB cable the correct way the first time!).

In a situation like the one the FAA experienced, yes, it’s important to acknowledge human error and talk about the underlying tech issues, but that’s not enough. We have to figure out how to add preventative measures to our systems and pipelines[2]. To put it another way, they might not be able to replace their battery packs with NextGen solar yet, but they can certainly change the shape of the battery to prevent contractors from blowing up the camera.
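To make the battery-shape idea concrete, here is a minimal sketch of one such preventative measure: a destructive operation that refuses to touch the live primary unless the caller types the database name back, much like a confirmation dialog. Everything in it (the `Database` type, the `is_live_primary` flag, the `delete_files` function, and the database names) is hypothetical and invented for illustration. It says nothing about how the FAA's systems actually work.

```python
from __future__ import annotations

from dataclasses import dataclass


class UnsafeOperationError(Exception):
    """Raised when a destructive action targets a protected database."""


@dataclass(frozen=True)
class Database:
    name: str
    is_live_primary: bool  # hypothetical flag; a real system would read this from config


def delete_files(db: Database, paths: list[str], confirm_name: str | None = None) -> list[str]:
    """Return the paths that would be deleted, refusing unsafe requests.

    Deleting from the live primary requires the caller to pass the
    database name back explicitly, so a slip of the hand is not enough.
    """
    if db.is_live_primary and confirm_name != db.name:
        raise UnsafeOperationError(
            f"{db.name} is the live primary; pass confirm_name={db.name!r} to proceed"
        )
    # A real implementation would perform the deletion here.
    return list(paths)
```

With this shape, `delete_files(backup, paths)` works as before, while the same call against the primary raises an error until the operator confirms the target by name. The mistake is still possible, but it now takes a deliberate extra step.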


  1. My emphasis added because who among us has not heard those words before… 

  2. For further reading on what to do after a major incident, check out Will Larson’s Move past incident response to reliability

Link roundup for March 3, 2023

The African Bricks 3. Mosaic artworks inspired by the culture and beauty of Africa, by Charis Tsevis.

The Cello in Soho Square. I like this description by Michael Lopp of the difference between “dabblers” and “S-tier” people (who are the absolute best at something): “There is an infinite list of exciting things to learn, but the Dabbler knows they have finite time, so they dabble. They get 80% of the juice, and they move on. Respect. S-Tier knows the last 10% of the challenge is the hardest, but it also teaches you the most.”

Physicists Say Aliens May Be Using Black Holes as Quantum Computers. This is fine. “In a recent study, a German-Georgian team of researchers proposed that advanced extraterrestrial civilizations (ETCs) could use black holes as quantum computers. This makes sense from a computing standpoint and offers an explanation for the apparent lack of activity we see when we look at the cosmos.”

Honestly, it’s probably the phones. Don’t dismiss this argument just from the headline, like I almost did. There’s some solid evidence presented here. “If we’re looking for one big ‘silver bullet’ or ‘grand unified theory’ of modern teenage unhappiness, phones are probably the place to start looking.”

Papercraft Models by Rocky Bergen. “Construct the computer from your childhood or build an entire computer museum at home with these paper models, free to download and share. Print, Cut, Score, Fold and Glue.”

In an Uncertain Job Market, How Can Companies Retain Workers? The conventional wisdom that people tend to hunker down when there are layoffs around them might not be accurate: “Layoffs ‘create an environment where people worry it might happen to them next,’ said Laszlo Bock, who was Google’s SVP for people operations. Poorly handled reductions may ‘degrade trust in management as people start hearing rumors of further cuts, and that in turn raises anxiety, which causes more people to quit.’” (NYT gift article)

How the Phonograph Created the 3-Minute Pop Song. I can’t resist a good “technologies people thought would ruin everything” article, and this is another fascinating one: “Plenty of folks worried that records would destroy musical culture. John Philip Sousa figured it would demotivate anyone from learning to play an instrument themselves. Why bother, when you could just put on music by a true virtuoso? ‘When music can be heard in the homes without the labor of study,’ he fretted in a 1906 article, ‘it will be simply a question of time when the amateur disappears entirely.’”

The Case for Hanging Out. I love this essay. “Pushed further into isolation by the pandemic, we’re all losing the ability to engage in what I view as the pinnacle of human interaction: sitting around with friends and talking shit.”

Explore. I think it’s probably too late for a viable LinkedIn alternative, but this site would be a great contender.

Meaningful metrics: How data sharpened the focus of product teams

In Meaningful metrics: How data sharpened the focus of product teams Erin Gustafson goes into detail on how Duolingo grew their Daily Active Users (DAUs) by 4x since 2019. It all starts with the growth model they built:

The Growth Model is a series of metrics we developed to jump-start our growth strategy with data. It is a Markov Model that breaks down topline metrics (like DAU) into smaller user segments that are still meaningful to our business. To do this, we classify all Duolingo learners (past or present) into an activity state each day, and monitor rates of transition between states.

Once they were confident in the model, they ran a bunch of simulations to build a hypothesis about where they should focus for the most growth:

With the Growth Model in place and trained on historical data, we began to run growth simulations. The goal of these simulations was to identify new metrics that—when optimized—were likely to increase DAU. We did this by systematically pulling each lever in the model to see what the downstream impact on DAU would be.
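As a rough illustration of what “pulling each lever” can look like, here is a toy sketch with just two states and made-up numbers. Duolingo’s actual model uses more activity states and real historical transition rates; the function names, states, and parameters below are all assumptions for the sake of the example.

```python
def simulate_dau(active, inactive, retention, reactivation, new_per_day, days):
    """Step a two-state Markov-style model forward and return the final DAU.

    retention:    share of today's active users who are active again tomorrow
    reactivation: share of inactive users who come back tomorrow
    """
    for _ in range(days):
        stay = active * retention        # active -> active
        back = inactive * reactivation   # inactive -> active
        lapsed = active - stay           # active -> inactive
        active, inactive = stay + back + new_per_day, inactive - back + lapsed
    return active


def pull_levers(baseline, bump=0.02, days=30):
    """Nudge each transition rate in turn and report the change in DAU."""
    base_dau = simulate_dau(days=days, **baseline)
    impact = {}
    for lever in ("retention", "reactivation"):
        scenario = dict(baseline, **{lever: baseline[lever] + bump})
        impact[lever] = simulate_dau(days=days, **scenario) - base_dau
    return impact


# Made-up starting point, purely illustrative.
baseline = dict(active=1000.0, inactive=5000.0, retention=0.80,
                reactivation=0.01, new_per_day=50.0)
impact = pull_levers(baseline)
```

Comparing the values in `impact` shows which rate moves DAU the most in this toy setup, which is the same kind of question the simulations in the article were designed to answer.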

Click through to see a visualization of the model, and where they are planning to take this work next. The article also pairs well with Jorge Mazal’s How Duolingo reignited user growth.

Link roundup for March 1, 2023

Open Circuits is “a photographic exploration of the beautiful design inside everyday electronics. Its stunning cross-section photography unlocks a hidden world full of elegance, subtle complexity, and wonder.”

Good conversations have lots of doorknobs. This is a fascinating essay about the elements of good conversation and the difference between “takers” who keep things going, “givers” who tend to ask a lot of questions, and how the wrong match-up can cause a conversation to stall. Includes good advice backed up by tons of academic research. This is one to save and revisit often.

Why do modern pop songs have so many credited writers? Some of the examples are wild. “When these cases are settled in favor of the plaintiff, more songwriting credits are added after a song’s release. This is why the number of songwriters listed on Mark Ronson’s ‘Uptown Funk’ has increased over the years. To avoid a Mark-Ronson-style-courtroom-induced headache, artists will sometimes preemptively credit writers of older songs even if the similarity between the older song and their composition is purely coincidental.”

A “Last of Us” Episode 7 musical mystery (light spoilers). I just want to say don’t worry The Last of Us fans, I’m thinking about the important things over here.

The choice is easy. Robin Sloan with a good reminder: “Anyone who adds one of those email newsletter pop-ups to a website demeans themselves and makes the world worse for everyone else.” Reminder that if you are an author using Substack you can turn off “Subscribe prompts on post pages” in Settings.

Quick Review Summary. Ok this seems like an actually good use of OpenAI. Instead of poring over hundreds of reviews of a hotel, copy the Tripadvisor URL of the hotel into this website and it will generate a summary of the general sentiment of the hotel.

Neurodiversity Design System. Great resource. “The NDS is a coherent set of standards and principles that combine neurodiversity and user experience design for Learning Management Systems. Design accessible learning interfaces supporting success and achievement for everyone.”

SoundPrint is an app to “discover quiet places and share them with others.” This looks really useful, especially if you’re a fellow tinnitus sufferer.

The 90s, having time, and always rushing to the next thing

I’m sure every generation writes lots of articles like Freddie deBoer’s It’s So Sad When Old People Romanticize Their Heydays, Also the 90s Were Objectively the Best Time to Be Alive. But hear me out. This is the impassioned, forceful, yet balanced Gen X take I wish I had the skill and wherewithal to write. It is a balm to the nostalgic soul in a way that somehow doesn’t feel like cringey old-person fanfic.

Here he is on the experience of visiting a record store:

When you were there you were Doing Music. Now we’re never doing anything—we’re always getting through something to get to something else to get through, using various time-saving techniques that maximize the amount of time we have to get through things while keeping our attention divided into a thousand things we then get through. When you went to a record store you were intent on music, and sometimes, you’d care enough about a particular artist that you paid for their album, real money, so that the artist got a cut that was more than the .002 cents they get per stream now.

This reminds me of the question Alan Jacobs asks: What exactly are we rushing towards with all our 2x listening and cliff notes skim-reading?

My question about all this is: And then? You rush through the writing, the researching, the watching, the listening, you’re done with it, you get it behind you—and what is in front of you? Well, death, for one thing. For the main thing. 

But in the more immediate future: you’re zipping through all these experiences in order to do what, exactly? Listen to another song at double-speed? Produce a bullet-point outline of another post that AI can finish for you?

Maybe the 90s have a thing or two to teach us yet.

Don’t give up on the value of product management because of bad past experiences

Maybe 10 years ago I would’ve gotten upset about an article like Spencer Fry’s No PM, no problem: how we ship great products fast, in which he explains why they don’t have product managers at Podia and how great that is. Luckily I’m now too old to stay up late just because I think someone’s wrong on the internet. Instead, I approach articles like these—ones I viscerally disagree with right off the bat—with a bit more curiosity. What is the source of the author’s assumptions? What is the data that led them to this particular set of conclusions? What is the problem they’re trying to solve, and what led them to this viewpoint as the solution?

As it turns out, we get the answers to those questions pretty early on in Fry’s post:

Why shouldn’t the developers—or designers—be tasked to work through the problems, instead of being handed a set of solutions?

Every single project, a developer is assigned what we call a Champion role and it’s that person’s responsibility to act as the PM in addition to their work as an individual contributor. This approach, as opposed to handing off a spec to stitch together with code, makes for much more engaged developers who feel more ownership of the work.

Ah, see, this makes sense! I can see why Fry concluded that PMs are unnecessary if his experience is that they (1) “hand off a spec to stitch together with code”, and (2) don’t give developers ownership over their work. The problem is likely that he has never worked with a PM who understands their role and does it well, so of course the data would lead to the conclusion “no PM, no problem”.

So let’s talk about those two assumptions for a minute.


The Myth of Velocity

Randy Silver in The Myth of Velocity:

When we measure how quickly teams ship stories & code, we’re measuring speed—how quickly they move. It’s only when we measure the effect it has on the target metric—the value that we’re after—that we’re actually looking at velocity.

It doesn’t matter how much you ship if the end result doesn’t deliver value to your customers and your company. If you’re measuring story points, you’ve fallen into the trap of measuring outputs, not outcomes. When we talk about slowing down to speed up, this is the point: the only thing that matters in this equation is how quickly we can deliver actual value. Everything else is theater.

You can’t stand under my umbrella

In You can’t stand under my umbrella the Raw Signal team makes the case for when it’s not appropriate for managers to be “sh*t umbrellas” for their teams:

When things are steady, and people know the right things to work on, teams are constrained by velocity. We know the course we’re racing, the question is just how fast we can go. In that context, it makes sense for a manager to clear every obstacle out of our way. But during times of significant change, teams are constrained by agility. It’s not that velocity doesn’t matter, it still does. But when everything has changed about the race, we need the ability to steer. A manager who tries to preserve velocity at all costs risks running us into a wall.

They go on to talk about how to Accept, Adapt, and Act in such moments of significant change.
