Interesting Learnings from Outages

Here’s a good post from Gergely Orosz, Interesting Learnings from Outages. It covers internal vs. public postmortems, how investing in reliability can have bumps along the way, and the difficult choice between trying to fix something on the spot and doing a lengthy restore. This point stood out to me:

“Move fast with autonomous teams” often builds up infrastructure debt. Reddit is a fast-moving scaleup where teams move fast, and it sounded like they had autonomy in infrastructure decisions. The wide range of infra configurations caused several outages, and the company is now paying down this “infrastructure debt.” This is not to say that autonomous teams moving fast is a bad thing, but it’s a reminder that this approach introduces tradeoffs that could impact reliability and will eventually have to be paid down, often by dedicated teams.