Menu

Why "Correction of Error" Gets Incidents (and Product Failures) Wrong

I’ve covered “root cause” thinking in incident reviews before, and Lorin Hochstein takes aim at a related issue: AWS’s “Correction of Error” terminology:

I hate the term “Correction of Error” because it implies that incidents occur as a result of errors. As a consequence, it suggests that the function of a post-incident review process is to identify the errors that occurred and to fix them. I think this view of incidents is wrong, and dangerously so: It limits the benefits we can get out of an incident review process.

What makes his critique compelling is the observation that production systems are full of defects that never cause outages:

If your system is currently up (which I bet it is), and if your system currently has multiple undetected defects in it (which I also bet it does), then it cannot be the case that defects are a sufficient condition for incidents to occur. In other words, defects alone can’t explain incidents.

This applies to product work too. When users report problems, our instinct is to find “the bug” and fix it. But often the bug has been there for months—what changed is the context around it. A new user flow, a spike in traffic, a feature interaction we didn’t anticipate. If we stop at “fixed the bug,” we miss the chance to understand why the system let that failure through in the first place.