This is an older post from Lorin Hochstein but it’s new to me, and really insightful. It’s about how to best use our knowledge about the past behavior of a software system to figure out where we should invest our time to improve the system—and how the common method of error budgets is generally not a good way to do this:
I’m skeptical about relying on predefined metrics, such as reliability, for getting insight into the risks of the system that could lead to big incidents. Instead, I prefer to focus on signals, which are not predefined metrics but rather some kind of information that has caught your attention that suggests that there’s some aspect of your system that you should dig into a little more.
So basically, vibe-based incident analysis is where it’s at.