It's not your fault
Occasionally there’s a thread on tech Twitter inviting people, senior developers especially, to share a time they broke production.
I like these threads for two reasons. First, because production breakage usually involves a bizarre constellation of accidents and coincidences that are interesting to think through. Second, because in an industry with a lot of egos and mythology about software engineering, it’s good for newer people to see that mistakes happen all the time, and that they’re made even by (especially by!) experienced people.
That said, there’s something often missing in these threads that is worth emphasizing: breaking prod is not your fault. Put differently: breaking prod is a systems failure, not an individual one.
In my experience living in the US and Europe[1], there is a lot of emphasis on individualism. Collectivity is often downplayed. You can see an example of this in current COVID pandemic policies in the US, UK and European countries, where the emphasis is on “personal responsibility” rather than shared rules. Or look, for example, at the content produced to manufacture consent[2] for, or to explain, international conflicts: notice how much of the analysis focuses on individual leaders as the mechanism to explain war or politics, rather than economic, social, or historical factors.
In software engineering, I think we see this outlook manifest in conscious or unconscious glorification of the hero programmer:
“Hero Programmer” is a derogatory name for a programmer who chooses to fix problems in epic, caffeine-fueled 36-hour coding sessions that frequently just kick the can down the road to the next heroic 36-hour coding blitz. Hero programmers would rather react than plan. Projects with hero programmers working on them often make a lot of progress initially, but never arrive at a stable state of completion.[3]
It’s basically our industry’s version of “Great man theory”[4]. When you look at things through this lens, all the successes of a website, an application, or an organization flow from the talents and genius of a few individuals. It’s a compelling outlook because, well, empirically it can definitely appear this way, and it’s naturally aligned with the other dominant societal ideas we have about individuality.
But it’s also wrong, and toxic to sustainable development and equitable environments.
When production breaks, think about all the moments that led up to the moment where things went wrong. Sure, the final step might have been someone (you, even!) mistyping a command or running a command against the wrong target (as in the 2017 GitLab database incident). But before that came the structures and systems–or the lack of them–like code review, automated testing, CI, and automation of playbooks. The training and documentation, or lack thereof. The other person who might have been pairing with you.
It isn’t so much about how you, with your own hands, broke prod, but how it was that the system you’re working on–its social and technical rules–allowed prod to be broken. That’s what we should be optimizing for and focusing on. People are always going to make mistakes. But working collectively and building resilient systems lets us minimize the fallout.
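To make "technical rules" a little more concrete, here is a minimal sketch of one such rule: a guard that refuses a destructive action against a protected host unless the operator retypes that host's exact name. Everything in it is hypothetical and for illustration only (the `guard_destructive` function, the `GuardError` exception, and the `example.com` host names); it isn't taken from any real tool, and a real setup would layer it with review, automation, and access controls.

```python
from __future__ import annotations


class GuardError(Exception):
    """Raised when a destructive action is refused."""


def guard_destructive(target: str, typed_confirmation: str, protected: set[str]) -> None:
    """Refuse a destructive action on a protected host unless the operator
    has retyped the exact host name. Returns None when the action may proceed."""
    if target in protected and typed_confirmation != target:
        raise GuardError(
            f"refusing to run against protected host {target!r}: "
            f"confirmation {typed_confirmation!r} does not match"
        )


# Hypothetical host names, for illustration only.
PROTECTED_HOSTS = {"db-primary.example.com"}

# Non-protected target: allowed without any confirmation.
guard_destructive("db-replica.example.com", "", PROTECTED_HOSTS)

# Protected target, and the operator retyped its name: allowed.
guard_destructive("db-primary.example.com", "db-primary.example.com", PROTECTED_HOSTS)

# Protected target, but the operator believed they were on the replica: refused.
try:
    guard_destructive("db-primary.example.com", "db-replica.example.com", PROTECTED_HOSTS)
except GuardError as err:
    print(err)
```

The point of the sketch is that the check lives in the system rather than in anyone's memory: a tired person who believes they are on the replica gets stopped by the mismatch instead of by luck.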
The last point is that despite my negativity about individualism above, I do think we should celebrate and recognize individual work and successes. I’m all for it! People are precious, and we don’t celebrate each other enough. It’s just that when we do, let’s also acknowledge the people around us who helped make that success happen. And when production crashes and burns, we should work collectively to strengthen the system, not blame the individual.