Posts tagged “engineering”

Why stable software development teams are more effective than “agile” teams

In the latest Platformer piece, Meta doubles down on layoffs, we see a perfect example of why stable software development teams are more effective than “agile” teams where people are seen as interchangeable cogs in a machine. When leaders think that people can be moved around between projects and “initiatives” at will and without knock-on effects, they run headlong into the basics of systems thinking, as shown here by Mark Zuckerberg’s realization:

In retrospect, I underestimated the indirect costs of lower priority projects. It’s tempting to think that a project is net positive as long as it generates more value than its direct costs. But that project needs a leader, so maybe we take someone great from another team or maybe we take a great engineer and put them into a management role, which both diffuses talent and creates more management layers. […] Indirect costs compound and it’s easy to underestimate them.

As a side note, this is honestly a pretty frustrating thing to read. It seems like such a basic software development concept—was there no one in Mark’s orbit that could tell him about the indirect costs of building VR headsets? And now that epiphany is costing Meta another 10,000 jobs. Ugh.

How to Be a PM That Engineers Don’t Hate

From How to Be a PM That Engineers Don’t Hate:

You see it everywhere: Engineers complaining about the product managers that they work with. Hating on PMs is kind of like complaining about your utility provider or the TSA—so universal that it’s always good for a light chuckle in the right circles. The PMs don’t know how the technology works. All they do is send emails and take credit. They’re meeting generation machines.

When I interview engineers my first question is always, “Tell me about your experience working with PMs.” I do it to calibrate what kind of PMs they are used to working with, or to put it another way, how much I should apologize for our profession before we continue…

Anyway, the linked post is less about that, and more about some things PMs can do to work more collaboratively with other teams—not just engineering. Lots of good, practical tips and examples here.

Using “steel threads” to reduce product delivery risk

Jade Rubick resurrects an old engineering concept (and dead Wikipedia page!) in Steel threads are a technique that will make you a better engineer:

A steel thread is a very thin slice of functionality that threads through a software system. They are called a “thread” because they weave through the various parts of the software system and implement an important use case. They are called “steel” because the thread becomes a solid foundation for later improvements.

With a Steel Thread approach, you build the thinnest possible version that crosses the boundaries of the system and covers an important use case.

He explains this in detail in the full post, with lots of helpful examples.

Don’t blame outages on human error and technical debt; improve the system instead

The FAA outage in January that caused the first nationwide ground stop of all flights in the US since 9/11 is kind of old news now, but there’s one detail that I can’t stop thinking about. In the aftermath of the incident the cause was determined to be a database sync issue:

The F.A.A. said in a statement that the workers had been trying to “correct synchronization” between the main database for the Notice to Air Missions alerts and a backup database when the files were mistakenly deleted, causing the outage that snarled air traffic throughout the day on Jan. 11.

CNN added a little more detail:

A contractor working for the Federal Aviation Administration unintentionally deleted files related to a key pilot safety system, leading to a nationwide ground stop and thousands of delayed and canceled flights last week, the FAA said Thursday.

The FAA determined the issue with the Notice to Air Missions (NOTAM) system occurred when the contractor was “working to correct synchronization between the live primary database and a backup database.”

The unsurprising narrative that came out of the tech world following the incident can basically be summarized as “ha ha, silly contractors!” But that feels like a lazy response to me. I didn’t see anyone ask what I believe is the more important question: How do we improve the system (people, processes, technology) that enables one person to inadvertently take down all air traffic in the US?

Let’s remember that this kind of thing can happen to absolutely anyone. Etsy even hands out a “three-armed sweater” award to the engineer who had the most spectacular mishap in any given year:

Kate’s story is a nail-biter, involving a tiny code change that unexpectedly brought down Etsy.com. All of her coworkers rallied around her to help get the site back online, while offering words of encouragement and reassurance.

So it might be really convenient to blame the FAA outage on “contractor error” and then just keep going. But that’s not going to prevent the next incident from happening.

It is also tempting to blame the entire issue on “tech debt” and call it a day. And, fair enough, there’s certainly plenty of that going around in FAA systems. Ars Technica has a good overview of some of the major issues and how the FAA wants to fix them. But like all giant “replatforming” projects (this one is called NextGen, because of course it is) things are… not going great:

FAA tech problems were previously described in a March 2021 report by the US Department of Transportation Office of Inspector General. The report discusses the FAA’s Next Generation Air Transportation System (NextGen), “a multibillion dollar infrastructure project aimed at modernizing our Nation’s aging air traffic system to provide safer and more efficient air traffic management.”

“NextGen’s actual and projected benefits have not kept pace with initial projections due to implementation challenges, optimistic assumptions, and other factors,”[1] the report said.

But blaming tech debt—and especially blaming individuals—is not going to get us very far. Tech debt will always be there (although I have some thoughts on how to prioritize it), and individual mistakes are not going to go away. What we can do is examine the system that enables, in this case, a database sync to corrupt the primary live db, and figure out how to prevent that from happening in the first place.

Almost 30 years ago Jakob Nielsen published his 10 Usability Heuristics for User Interface Design, and “error prevention” is still as true today as it was then:

Good error messages are important, but the best designs carefully prevent problems from occurring in the first place. Either eliminate error-prone conditions, or check for them and present users with a confirmation option before they commit to the action.

The example I always think of here is how you often see battery packs shaped in a certain way so that it’s impossible to insert them incorrectly (contrast that with the terrors of trying to insert a USB cable the correct way the first time!).

In a situation like the one the FAA experienced, yes it’s important to acknowledge human error, and talk about the underlying tech issues, but that’s not enough. We have to figure out how to add preventative measures to our systems and pipelines[2]. To put it another way, they might not be able to replace their battery packs with NextGen solar yet, but they can certainly change the shape of the battery to prevent contractors from blowing up the camera.
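To make the "change the shape of the battery" idea a bit more concrete, here is a minimal Python sketch of the two kinds of error prevention Nielsen describes. It is purely illustrative and says nothing about how the FAA's systems actually work: the `Database` class, the `sync_backup` and `delete_records` functions, and the `GuardError` exception are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, Iterable, Optional


class GuardError(Exception):
    """Raised when a guard blocks a risky operation (hypothetical)."""


@dataclass
class Database:
    """Toy stand-in for a real database (hypothetical)."""
    name: str
    role: str  # "primary" or "backup"
    records: Dict[str, str] = field(default_factory=dict)


def sync_backup(primary: Database, backup: Database) -> None:
    """Copy records from the primary into the backup, never the reverse.

    "Eliminate error-prone conditions": swapped arguments are rejected,
    and the primary is only ever touched through a read-only view.
    """
    if primary.role != "primary" or backup.role != "backup":
        raise GuardError("sync only copies from a primary to a backup")
    source = MappingProxyType(primary.records)  # read-only view of the live data
    backup.records.clear()
    backup.records.update(source)


def delete_records(
    db: Database, keys: Iterable[str], confirm: Optional[str] = None
) -> int:
    """Delete the given keys and return how many were actually removed.

    "Confirmation before commit": deleting from a primary requires the
    caller to pass the database's name explicitly.
    """
    if db.role == "primary" and confirm != db.name:
        raise GuardError(
            f"refusing to delete from primary {db.name!r} without confirmation"
        )
    removed = 0
    for key in keys:
        if key in db.records:
            del db.records[key]
            removed += 1
    return removed
```

With guards like these, calling `sync_backup(backup, primary)` by mistake raises an error instead of overwriting live data, and `delete_records(primary, keys)` fails unless the caller also passes `confirm=primary.name`. Real systems would need far more than this (permissions, audit logs, tested restores), but the principle is the same: make the dangerous path harder to take by accident.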


  1. My emphasis added because who among us has not heard those words before… 

  2. For further reading on what to do after a major incident, check out Will Larson’s Move past incident response to reliability

Technical debt, product debt, and how to prioritize addressing it

Mike Fisher argues that we should rebrand technical debt as “product debt”, and I think it’s a good argument! That said, I’d like to add some considerations to two of his points. First:

We usually think of refactoring as “cleaning up” code, where we change the code to be more easily understood, perform better (faster or more efficiently), or follow current conventions/standards. The goal of refactoring is to change the code without changing its functionality; it should continue to pass all unit and functional tests. 

I take a slightly different approach to refactoring, and how to prioritize the work. I believe it’s important for teams to have a stated and agreed-upon value of “leave the code better than I found it.” This means that refactoring shouldn’t be a separate activity, for its own sake, that needs to be scheduled. It should be a natural part of feature development.

If you’re creating a mechanism for add-ons on the product, spend a few extra days to refactor the billing code you’re already working on. If you are adding metrics to the dashboard in your UI, take the time needed to refactor the front-end code to make it more performant. Whatever code you’re touching while you’re working on a project, leave it better than you found it. It is way more efficient to extend a project by a week to refactor code you’re already working on than it is to create a separate project that needs to be planned, prioritized, and worked into the roadmap.

Second point:

So, how do we ensure we are paying down technical debt when there is so much pressure to ignore it until things really break? I think one part of the answer is to use a different term. Instead of tech debt, which implies it is the responsibility of the tech team, let’s call it product debt.

I think this is a good first step to getting more teams to care about technical debt—but it’s not enough. One of the issues with getting technical/product debt prioritized is that often “the business” doesn’t see the value in statements like “we’re going to clean up the code so that it doesn’t break a few months from now”. Instead, we need to frame the work in terms of the benefits to customers and/or the business.

For instance, we could make the case that refactoring this piece of code would significantly increase our deployment speed, which would mean faster time to market. Or we could argue that fixing our slow staging environments would result in happier, more productive engineering teams.

With technical debt—as with most things in software development—the thing you do is never the main thing. The main thing is what the thing you do enables. What value it brings to customers and the business. That’s the framing we need for working on technical debt.

Move past incident response to reliability

Here’s an interesting article by Will Larson with advice on how to move past incident response to reliability in our products. Among other things it reminded me to watch out for “incident legalism”:

Incident legalism is when an incident response and analysis program—trying to better drive reliability improvements—becomes focused on compliance and loses empathy for the engineers and teams operating within the program’s processes.

He goes on to propose a more holistic, expanded model for reliability to help teams diagnose their systemic problems—and how to solve them:

Finally, you study the mitigated incidents, determining how to prevent them from recurring, and they become remediated incidents.

How the fediverse can help us collaborate better at work

Mehul Kar says he’s not super excited about the “fediverse” in the context of social media. However, he sees a huge need for The Fediverse At Work. The issue? The lack of integration across all the tools we use at work has become incredibly tedious and hard to keep track of:

Sometimes there are Figma design specs, with their own set of comments. And Loom walkthroughs, also with comments and likes. And any number of other things over time. The combinatorial complexity of these tools across these platforms (not to mention emails) can be quite messy to track. It’s really hard to remember where a conversation took place. Coworkers often repeat the same text in multiple places, prefixing with phrases like “Shared this in Notion comment also, but…” or “Just left a review, but high level: …”.

He believes that a decentralized platform for all these tools to effectively talk to each other would be hugely beneficial:

Maybe the protocols that make up the Fediverse can help. What if, instead of sharing a Github Pull Request URL in Slack, your Slack team channel could instead be subscribed to the Github repository. Maybe new Pull Requests are broadcasted to followers, and replies from Slack users to those posts are sent as comments to the Pull Request in addition to being threaded in Slack. Maybe the Notion document is treated the same way. Maybe the Loom walk through is a reply to a Slack thread, and comments on the video appear in Slack. Maybe the Slack thread is a series of comments displayed on a Figma design.

There are more examples in his post. I really hope we can get to this type of philosophy for our work tools. It does sound a little bit like the problem that Luro is trying to solve.

Principles for building software for developers

Kathy Korevec started a series about her principles for building software/tools for developers. Since I work on Postmark—one such tool—I read the intro post with great interest. The second installment is on the principle she calls You are a chef cooking for chefs:

Developers are masters of building applications, so when you’re building tools and experiences for them, you’re cooking in their kitchen. You can marvel at the delight you bring to the experience because no one can appreciate your hard work more than another developer. Developers can spot inconsistencies, antipatterns, and hurdles a mile away, so you must pay close attention to these details. At the same time, they know the challenges, understand the concerns, appreciate the details, and can provide crucial feedback to make your product even better.

This is one of the main reasons why I love working on developer tools. It’s an audience that can be brutal critics. But for the most part they do that because they care and want to see the product succeed, not because they want to fight just for the sake of it. And because they care, feedback generally has a degree of specificity that is invaluable for troubleshooting, use case discovery, and improving the product.

Anyway, this looks like a fantastic series and I can’t wait to read the rest. You can sign up for Kathy’s newsletter here.

Advice For Engineers, From A Manager

Marco Rogers has been an engineer and manager of engineers for 20 years. In this post he shares some short, practical (but not always easy to follow!) advice for engineers. A few of my favorites:

  • Learn what the true scope of the project needs to be. Back away from “story points” and understand what the project needs to accomplish. More context about the goals will help you negotiate what’s in and what’s out of scope.
  • Collaborate on designs. Designs never have the level of detail that matters. When you run into UX problems, work with people to develop a solution. Don’t just ask for more mocks. Own the details of what you’re building.
  • Don’t just write code. Solve problems. Make sure you understand the value of your work and you talk to people about that. Not just “features”. For example, “this needs to ship by Fall because it’s our big strategic bet for the year.” Tell people how to achieve the strategic goal.

Read the rest of his post for the others.

20 Things I've Learned in my 20 Years as a Software Engineer

Old technologies that have stuck around are sharks, not dinosaurs. They solve problems so well that they have survived the rapid changes that occur constantly in the technology world. Don’t bet against these technologies, and replace them only if you have a very good reason. These tools won’t be flashy, and they won’t be exciting, but they will get the job done without a lot of sleepless nights.

—Justin Etheredge, 20 Things I’ve Learned in my 20 Years as a Software Engineer.