Menu

How we got here (it’s not a “root cause”, it’s the system)

Lorin Hochstein shares a characteristically solid systems-thinking take in CrowdStrike: how did we get here?:

Systems reach the current state that they’re in because, in the past, people within the system made rational decisions based on the information they had at the time, and the constraints that they were operating under. The only way to understand how incidents happen is to try and reconstruct the path that the system took to get here, and that means trying to as best as you can to recreate the context that people were operating under when they made those decisions.

The “no root cause” concept is something I’ve been thinking about a lot as I’m working on a particularly complex project at work. Somehow we constantly forget that things usually are the way they are not because of a single “mistake”, but because of a the culmination of a bunch of legitimate reasons.

Systems get the way they are because of decisions made in good faith based on the data available at the time. And the worst thing you can do as a new person coming in to improve things is to hunt for a single “root cause” to fix. That’s just not how software (or people!) work. So take the time to understand Chesterson’s fence. Go ahead and draw boxes and arrows until no one disagrees any more about how the system works. And then figure out which parts can be improved, and in which order.


PS. Also see How Complex Systems Fail:

Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents. Each of these is necessarily insufficient in itself to create an accident. Only jointly are these causes sufficient to create an accident. Indeed, it is the linking of these causes together that creates the circumstances required for the accident. Thus, no isolation of the ‘root cause’ of an accident is possible. The evaluations based on such reasoning as ‘root cause’ do not reflect a technical understanding of the nature of failure but rather the social, cultural need to blame specific, localized forces or events for outcomes.