Move past incident response to reliability

Here’s an interesting article by Will Larson with advice on how to move past incident response to reliability in our products. Among other things, it reminded me to watch out for “incident legalism”:

Incident legalism is when an incident response and analysis program—trying to better drive reliability improvements—becomes focused on compliance and loses empathy for the engineers and teams operating within the program’s processes.

He goes on to propose a more holistic, expanded model for reliability to help teams diagnose their systemic problems and work out how to solve them:

Finally, you study the mitigated incidents, determining how to prevent them from recurring, and they become remediated incidents.