Talks · 20 July 2014 · Ian Malpass
This is a transcript of a talk I gave at DevOps Days Minneapolis on 18 July 2014. It’s transcribed from memory, but should be a close approximation to what I said at the time.
I’ve been a software engineer at Etsy for almost five years now. What that has meant is that I’ve been lucky to have been involved in our transformation into a DevOps culture from the start. One of the aspects of this that has particularly engaged me has been our attitude to, and philosophy about, blame.
But first, some history.
In ancient Israel, each year the priests would take two goats. They would ritually place the sins of the Israelites onto these goats. One goat would be taken into the temple and sacrificed. The other—the scapegoat—would be driven into the wilderness to die.
Whether this was an effective process is a question for theologians, but we can say with certainty that it sucked to be a goat.
The ancient Greeks, never ones to do things by halves, would respond to calamities such as plague, famine, or invasion by selecting a person, usually a cripple, or a beggar, or a criminal. This person—the pharmakos—would be beaten, stoned, and driven from the city. Modern science suggests this didn’t work very well.
But this idea that we can achieve cleansing, that we can make our situation better, by the expulsion of some individual is remarkably persistent, and the word “scapegoat” is deeply entrenched in our language, and our thinking.
What we can observe in those historical examples is that in both cases the victim is what we might term an “other”, an outsider: a cripple, a criminal, even an animal. We still tend to seek to blame those different from ourselves and our groups today.
In the usual software engineering environment, others can be found in our functional silos: dev here, ops there, marketing over there, legal last seen two weeks ago talking to some people in suits. Ops didn’t keep to their SLA. Dev made unreasonable demands two days before launch. Legal made us change everything halfway through building the product. More insidiously, we find others in our unconscious biases—things like gender, or educational background.
With blame comes fear: fear of loss of status amongst our peers, fear of never getting promoted or even losing our job, fear of never getting to work on the interesting projects, the risky projects.
This isn’t healthy. It’s also not effective. We don’t learn from our mistakes this way. The Greeks kept succumbing to plagues because it turns out that beating up a beggar is less effective at preventing disease than coming up with a functional sanitation system.
In a DevOps environment, we’re actively working to eliminate these functional silos, and as we increase collaboration throughout the product development lifecycle, we remove these convenient “others”. So while all of this is applicable to any organisation, I’d argue that it’s absolutely vital for a DevOps organisation.
So. Why do we blame?
“With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none.”
— Friedrich Nietzsche, The Twilight Of The Idols
The first explanation in most cases is “human error”. Look at news articles or investigation reports about accidents and the words come up time and again. In software engineering, our dominant narrative is that computers are understandable, deterministic, and safe. Humans interfere. Humans do the wrong thing at the wrong time. Humans are fallible.
The psychologist B. F. Skinner was the inventor of the Operant Conditioning Chamber, or Skinner Box. This device allowed experimenters to give a rat a treat if it did something they wanted, or an electric shock if it did something they didn’t. The learnings from these experiments have been widely adopted, from Facebook games to engineering management [kidding about the managers, I promise].
“A person who has been punished is not thereby simply less inclined to behave in a given way; at best, he learns how to avoid punishment.”
— B. F. Skinner, Beyond Freedom and Dignity
What Skinner observed, though, was that punishment merely taught the rats to avoid punishment. Blame doesn’t teach us to avoid failure; blame merely teaches us to avoid being blamed.
Because failure happens anyway: it’s an emergent property of complex systems. In a complex system, we simply can’t know all of the interactions that may arise, or all of the ways in which it might fail. And we are working with complex systems: from the complex social systems of our companies, to the software we write and all the ways the parts of that software interact. From the servers we run on, to the networks our data travels across. From the client devices and browsers a web app is served to, to the complex, unpredictable humans who use it. It’s complex systems all the way down, and these fail.
So if failure happens, all we’re teaching people to do is to sweep failure under the carpet, to hide mistakes and direct blame elsewhere. What we’re not doing is learning.
And we should be learning! Failures are windows into complex systems. They allow us to understand interactions that were hidden to us before. They’re opportunities to make our systems safer. (Not safe, alas, but at least we can try to get some of the way there.)
Here’s the thing: we don’t come to work to do a bad job. You don’t arrive at your desk on a Monday morning and say “today, I shall bring the site down”. If you do, your organisation has bigger problems than any incident you may cause—problems with hiring, retention, and culture, and these should be addressed. But assuming we really do come to work to do a good job, why do we still make mistakes? Why are humans still fallible?
In reality, what we are labouring under are multiple, conflicting constraints. Get the feature out on time, without bugs, and without using too many resources. Monitor everything, alert in good time, but don’t cause alert fatigue. It turns out that what humans are really, really good at is balancing these conflicting constraints and in doing so they allow systems to operate more safely.
How we do this—day in, day out—is by choosing between efficiency and thoroughness: what Erik Hollnagel terms the Efficiency-Thoroughness Trade-Off (or ETTO) Principle. We can be efficient—we can get lots of stuff done—but we can’t be thorough and dive deep into every issue. We can be thorough and really dig into the task at hand and understand it well but this takes time: it is inefficient. More commonly, rather than choosing one extreme or another, we’re deciding, through experience and learned heuristics, where on this spectrum of efficiency and thoroughness we need to operate for any given system at any given time, based on our understanding of the data available and the constraints we’re under.
You’d think, from a systems safety point of view, we’d really want to operate on the thoroughness side of things, to really understand what’s going on. And yet this brings its own problems.
“If I’d observed all the rules, I’d never have got anywhere.”
— Marilyn Monroe
One of the most effective forms of industrial action isn’t a strike, but a “work to rule”. Follow every rule, every regulation, to the letter, every time. The organisation rapidly grinds to a halt because it actually relies on people knowing what to skip, where to trade thoroughness for efficiency. And an inability to react to changing circumstances rapidly can lead to failure too.
When you make a change to some code, do you have everyone in the company check your diffs? When you deploy, do you check every graph of every metric you collect? Or do you commit, skip the tests, push to production, shout “YOLO!” and go for lunch?
In reality, what you do depends strongly on what you’re working on. If you’re doing some complex brain surgery on your master database you will likely tend towards thoroughness: lots of consultation and discussion, game days, contingency planning, and when the time comes to do the work, everyone is at their desk and knows what to do. Whereas if you’re pushing out a small CSS change, you’ll eyeball it, run the tests, deploy, eyeball it in production, and call it a day.
But, because these are complex systems, it is likely that there is some unknown interaction lurking in the database work you’ve planned, some hidden source of failure. Maybe you’ll get lucky, maybe you won’t. And I can tell you that a colleague of mine once took Etsy down entirely using only a CSS change.
As such, the fallibility of humans is not that we willfully do the wrong thing. Rather, we do what makes sense to us at the time. We do the thing that we believe will give us the result we want, based on our understanding of the system when we make our decision. And sometimes we’re wrong.
Given that formulation, blame becomes ludicrous. Instead we find ourselves asking questions like “Why?”. Why did what you did make sense at the time? Why did you think it would yield a good result? What was your understanding of the system? What constraints were you trying to satisfy?
You won’t get good answers to these questions if people are scared. Instead, you move to a place of trust. You hire well, you train well, and you trust that people are doing the right thing. And if things fail, you understand that this happens and you learn from it.
Blamelessness is not just about being nice (although that’s a pleasant side effect). It’s enabling you to get to an understanding of a problem that is impossible without it.
But, how do you go about it?
The first and most important part is to stop blaming yourself. This may sound all very “self help guru”, but it’s important. When we experience failure, when we get that “ice water down the spine” feeling as we see things going horribly wrong, we experience that Nietzschean anxiety and the first thing we come to is “I’m a failure, I don’t know what I’m doing, I suck”. That way lies Impostor Syndrome and misery, and we recoil from it and shut down that introspection, and we shut down our ability to learn.
Overcoming this is hard, but it’s very valuable. The great thing is that it works even if you’re in an organisation that hasn’t embraced blamelessness. If you learn one thing from this post, it should be to ask yourself “why did what I did make sense to me at the time?” and to ask that often.
But it’s easier to be blameless in a group. A supportive group of peers can help you avoid and overcome that tendency to self-blame, and provide both extra perspective and deeper lines of questioning to help you understand and learn from the failure. Another benefit of group discussions is that it reinforces the notion that blamelessness is a Thing That We Do. It’s not just something we heard about at a conference that sounded good, but actually part and parcel of our daily work.
With groups, it is beneficial to cast the net widely. At Etsy, reviews are open to all. Anyone from any part of the company can attend and participate, broadening the perspectives available to us. For example, we’ve even had our Chief Financial Officer come to reviews, particularly those involving money. What she provides is not only her intelligence and insight, but also the perspective of the Finance department. What are Finance’s goals and constraints? How do they align with Engineering’s constraints? How do they differ and conflict? We get a better, more holistic picture of the incident as it applies to the entire organisation, rather than just within Engineering’s narrow view. We learn better.
Make a habit of doing blameless reviews. You don’t need to wait for failure to strike. You can review successes: what went right? Or near misses: what did we do to save the day? You can review things other than engineering incidents. How did your hiring process work? Did your PR campaign do what you expected? This habit of review both embeds the philosophy of blamelessness and hones your skills for looking at incidents and learning from them.
While I said that overcoming self-blame is hard, overcoming hindsight is even harder. Hindsight is a fundamental part of how humans look at the past and understand what happened. The problem is that, as the saying goes, hindsight is 20/20. We look back and see that the catastrophe that befell us was inevitable. The steps that led us there are clear as day and those who didn’t see it at the time are apparently idiots.
We find ourselves with counterfactuals: statements and reasoning about things that didn’t happen. More simply, the “coulda, woulda, shouldas”.
“He could have talked to Jennifer in Ops and she would have told him that the change would kill the database.”
“A good engineer would have tested this more thoroughly.”
Statements like these aren’t helpful. Why did he not talk to Jennifer? Did he not know to? Did he talk to someone else who said it would be fine? Had he made these changes a hundred times before without incident? Would it have been fine had it not been for some other operation that was also loading the databases at the time?
We see phrases like “a good” engineer all the time too, and they are incredibly blameful. “A good engineer would have tested this more thoroughly. You didn’t test this more thoroughly. You are a bad engineer.” But in reality, what we’re doing is making a judgement about an efficiency-thoroughness trade-off that this engineer made. With hindsight we see that it was wrong, but why was that not clear at the time? And bear in mind that a different decision at the time might have led to us sitting in a review asking “why did the feature not go out on time and why did you use so many QA resources?”.
Typically, in a review we find ourselves seeking a root cause. We are conditioned to think of failure as “Y happened because X” and we want to know what X was. But in a complex system failure, this rarely happens. Instead of a simple cause-and-effect process, we have multiple failures, each necessary to the broader calamity but not individually sufficient to cause it. Picking one of these and labeling it the root cause is false and misleading.
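One way to picture “necessary but not individually sufficient” is a small Python sketch. The three factor names are hypothetical, invented purely for illustration, and the all-factors-required model is a deliberate simplification of how real incidents unfold.

```python
# Hypothetical contributing factors for one imagined incident.
# The names are invented for illustration; they describe no real outage.
FACTORS = ("config_change_deployed", "cache_was_cold", "traffic_spike")


def incident_occurs(present):
    """Simplified model: the incident happens only if every factor is present."""
    return all(factor in present for factor in FACTORS)


def necessary_factors():
    """Factors whose removal alone would have prevented the incident."""
    everything = set(FACTORS)
    return [f for f in FACTORS if not incident_occurs(everything - {f})]


def sufficient_factors():
    """Factors that would cause the incident entirely on their own."""
    return [f for f in FACTORS if incident_occurs({f})]


if __name__ == "__main__":
    print("necessary:", necessary_factors())
    print("sufficient:", sufficient_factors())
```

In this toy model every factor is necessary and none is sufficient, so each one has an equal claim to being “the” root cause, which is another way of saying that none of them does.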
Likewise, we look back with hindsight and construct a nice, straight, linear sequence of events that we believe describes the failure: X then Y then Z then bang. But in reality, we’re dealing with that wibbly-wobbly ball of timey-wimey stuff that is complex system failure. Multiple failures all mixed together in a big jumble, not neat and tidy and certainly not linear as observed and experienced by those involved at the time.
As such, root causes are constructed, not found. They are merely the place where you stop digging and say “this’ll do”. Remember that incident reviews are also subject to efficiency-thoroughness trade-offs.
Out of this, the idea is to learn. How you learn depends very much on your organisation, those involved, and the nature of the failure at hand, but typically you end up with a bunch of remediation items—things you intend to do to make sure that this never happens again. But try very hard to seek good remediation items. Review every one critically and ask what hazards it may cause, and what costs. It could be that the cost of remediation (in time, complexity, increased risk elsewhere in the system, etc.) exceeds the cost of the failure you’re trying to avoid. It could be that a failure is just the cost of doing business, and should it crop up again you’ll just deal with it.
Similarly, a common response is “we’ll implement a process”. A process such that no-one can ever stray from the correct path again. This, too, is problematic.
“Process is an embedded reaction to prior stupidity.”
— Clay Shirky, Wikis, Grafitti, and Process
Shirky’s adage is pithy, yes, but also rather blameful: “stupidity” isn’t something we want to see in reviews. But, his point is well taken. Processes are the scar tissue that builds up from your failures. Left unchecked, the growth of this tissue slows you down, and you find yourself less and less able to move and respond to problems. Every process you implement is one more thing for those fallible humans to make efficiency-thoroughness trade-off decisions on: follow the process and be thorough, or skip it and get stuff done (in which case you have learned precisely nothing and gained no increase in safety at all).
Once you’ve worked out what you’ve learned, it is necessary to communicate it. At Etsy, as a blameless organisation, what we observe is that those closest to the incident (those normally most likely to be blamed) are actually leading the charge in review, and taking ownership of remediation afterwards. We get emails to the company saying “this is what I did, this is why it was bad, this is how the site went down, this is what we’ve learned and what we’re going to do in the future”: the very opposite of what you’d expect in a blameful environment. In fact, this behaviour is now expected: the norm, not the exception.
Now, this is harder in a culture where blamelessness isn’t 100% adopted, but the best way to promote the philosophy is to demonstrate its results. Managers have a huge role to play here, both in fostering blamelessness within their teams, and in advocating for it amongst their peers and their own managers.
As with all parts of DevOps culture, there is no “royal road” to blamelessness. I can’t give you a Docker image for blamelessness. You can’t Chef blamelessness out to your colleagues however much you might wish to do so. Despite all the advice here I don’t have a simple list of steps to take that will result in a blameless company. Instead, it’s an evolution: an evolution of your organisation and of the people within it. For those of you who want to make a start, I’ve compiled a reading list to help you.
Embracing blamelessness will make your organisation a happier, healthier, safer place, and as you set to work on it, I wish you the very best of luck.