Books for software engineers and managers

Drift into Failure

From Hunting Broken Components to Understanding Complex Systems

by Sidney Dekker

Categories:
CTO,
Engineering Manager,
Tech Lead

Drift is about continued normalization

We deviate slightly from the norm. Everything seems to work fine, establishing a new norm. Repeat. That’s drift.

Don’t standards help against drift? Sometimes. But standards have their own problems. Today’s standards reflect our understanding at some point in the past. Standards rot. Standards also experience drift.

Complexity now exceeds our understanding

Dekker’s main point is that our reductionist approach to understanding problems doesn’t work with the highly complex systems we see today, and we don’t have a good replacement. For instance, when a plane crashes, there are many factors at play, spanning many people and organizations over many years.

We want answers. What broke? In complex systems, those answers are often unsatisfactorily multiple and relational.

We reason locally with global consequences

Leading up to the 2008 housing collapse, everyone made locally rational decisions. Whoops.

In software engineering, local reasoning can also have catastrophic consequences.

Deprioritize something like a SQL injection vulnerability and a script kiddie blows away your data. Of course, you’ve never had this happen before, so restoring your backup (you have backups, right?) takes a while. Customer data deleted, unexpected downtime, and revenue halted.

But in the moment, it really felt like that SQL injection hole wasn’t a big deal. You were right… until you weren’t.

Of course, this example assumes you know about the SQL injection problem. More likely, it’s sitting in your system without your knowledge and you’re actually rationalizing not looking for it because you have higher priority jobs.
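To show how small that hole can look in code, here is a minimal sketch in Python using the standard-library sqlite3 module. The users table, the function names, and the payload are all hypothetical; I made them up for illustration and they don’t come from the book.

```python
import sqlite3


def make_demo_db():
    # Hypothetical in-memory database with a tiny users table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # Vulnerable: user input is spliced directly into the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Parameterized: the driver treats name as data, never as SQL.
    query = "SELECT id, name FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_demo_db()
payload = "nobody' OR '1'='1"

leaked = find_user_unsafe(conn, payload)  # the OR clause matches every row
safe = find_user_safe(conn, payload)      # no user has that literal name
```

The two functions differ by one line, which is part of why the unsafe version is easy to rationalize leaving alone.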

Accidents come from relationships, not parts

Instead of looking for broken parts or people, we should look for broken relationships. When systems fail, what interactions aren’t working as expected?

Debugging relationships is counterintuitive. We like blaming parts and people.

To practice debugging relationships, not parts, I’ve started asking, “Where was the breakdown?” instead of “What broke?”

Link system fixes to strategic priorities

As a technical leader, when you identify a system drifting toward failure, you should attach the fix to a strategic priority for your organization. Do this early and often.

Attaching risk mitigation to organizational priorities lowers the friction for buy-in. You’re not implementing the fix instead of working on the product roadmap; you’re implementing the fix in support of the product roadmap.

Productivity and efficiency correlate with drift

Engineering managers like me read books like High Output Management, Peopleware, and Accelerate, and get pumped up about engineering productivity. Of course, productivity is a good thing.

But Dekker uncomfortably points out that productivity and efficiency also correlate with drift.

Get too focused on productivity and you’re likely deviating from norms in sometimes unsafe ways. But these decisions come down to risk tolerance.

In a startup environment, we tolerate more risk than more established companies do. If you’re not productive in a startup, you’re guaranteed to die. But if you’re highly productive, you might live and have some occasional fallout from drift.

Diversity is a safety valve

With more diverse experiences on our team, we’ll have more options when solving a problem. Diversity leads to creativity.

Creativity is the application of knowledge from one domain to another. Conversion rate optimization applied to recruiting. Teacher lesson planning applied to engineering training. Strength training principles applied to project management.

When hiring, diversity of experience is a significant part of the value that non-traditional engineering candidates offer. Self-taught and code school candidates often bring a new perspective that gives the team an opportunity for boosted creativity and maturity.

How strongly do I recommend Drift into Failure?
8 / 10

Fair warning: Drift into Failure reads a bit dry and academic. But the information inside will help software engineering managers identify potential upcoming failures and understand past failures. In particular, I think this book is useful for anyone facilitating retros or reviewing post-mortems.