Release It!

Design and Deploy Production-Ready Software

by Michael Nygard

Engineering Manager,
Tech Lead,
Star Engineer

How strongly do I recommend Release It!?
7 / 10

Review of Release It! Book

Despite what the title suggests, Release It! is not about increasing deployment frequency, a primary interest of mine in building high-performance engineering teams. That turned out to be a pleasant surprise.

Release It! is a good read for both engineers and engineering managers, particularly product engineers who also take on systems and DevOps responsibilities. The author smartly ties tactical recommendations to broader concepts and themes, which gives you a shared vocabulary for communicating with other software engineers.

When your service goes down unexpectedly, your first job is to restore service

Restore service first. Then worry about deep diving into the problem.

However, there’s a catch. Restoring service often comes at the cost of not understanding the problem.

Your systems and machines are in a bad state. When restoring them to a good state, you often inhibit your ability to debug the issue. But that’s a trade-off you sometimes need to make.

Preventing cascading failure is the key to resilience

Cascading failures occur when one system experiences an issue, subsequently causing issues in another system, potentially causing issues in yet another system. On and on we go.

In other words, cascading failures exist because of relationships between systems.
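To make the "relationships between systems" point concrete, here is a minimal sketch that walks a dependency graph and lists every service that could be affected when one fails. The service names and the graph are hypothetical, and the function assumes the worst case: no isolation at all, so every caller of a failed service also fails. It is an illustration of the idea, not something taken from the book.

```python
from collections import defaultdict, deque

# Hypothetical services. "web": ["checkout", "search"] means web calls both.
DEPENDS_ON = {
    "web": ["checkout", "search"],
    "checkout": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`.

    Worst-case assumption: a failure always propagates to every caller.
    """
    # Invert the arrows: for each service, record who calls it.
    callers = defaultdict(list)
    for caller, callees in depends_on.items():
        for callee in callees:
            callers[callee].append(caller)

    # Breadth-first walk up the caller chain from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers[current]:
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    print(sorted(affected_by("inventory", DEPENDS_ON)))
```

In this made-up graph, a failure in a leaf service such as inventory reaches everything upstream of it, while a failure in web affects nothing else. The blast radius comes from the arrows, not the boxes.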

Nygard provides a lot of commentary on this topic throughout the book. Some of my favorite thoughts are:

  • Good architects focus more on arrows than boxes
  • Foster the ability to restart components, not entire servers or systems
  • Automation has no judgement

Chaos Engineering is a good way to understand how resilient your systems are against cascading failures. Netflix’s Chaos Monkey famously shuts down random services and servers in production to test how dependent systems respond.

You can't control incoming requests, whether legitimate or malicious

You can load balance, govern requests, shed load, fail fast, and do plenty more to mitigate risk. But fundamentally, you can’t control the volume of requests your system receives or the nature of those requests.
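As a rough illustration of shedding load and failing fast, the sketch below caps the number of requests handled at once and rejects the rest immediately instead of queueing them. The `LoadShedder` class, its `max_in_flight` parameter, and the status codes are hypothetical names chosen for this example. Real systems usually do this at the load balancer or in framework middleware.

```python
import threading


class LoadShedder:
    """Reject work immediately once `max_in_flight` requests are in progress."""

    def __init__(self, max_in_flight):
        # Each in-progress request holds one slot of the semaphore.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request, handler):
        # Non-blocking acquire: fail fast rather than wait when saturated.
        if not self._slots.acquire(blocking=False):
            return 503, "overloaded, try again later"
        try:
            return 200, handler(request)
        finally:
            # Always free the slot, even if the handler raises.
            self._slots.release()


if __name__ == "__main__":
    shedder = LoadShedder(max_in_flight=2)
    print(shedder.handle("hello", str.upper))
```

This does not control how many requests arrive. It only bounds how much work the service accepts, so a surge degrades into quick rejections rather than a slow collapse.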

Nothing is as permanent as a temporary fix

Temporary fixes often arise in two situations:

  1. Firefighting. Temporary fixes can help us restore service quickly.
  2. Prototyping. Temporary fixes in prototyping usually don’t perform well in production. They either fail on edge cases or don’t scale. But when prototyping, they do help us maintain development momentum.

I give my engineers the benefit of any doubt. In my experience, engineers usually just forget about that temporary fix. They’re constantly bombarded with new problems consuming their attention.

Your engineers usually just need a reminder. As a manager, when my team is done firefighting or prototyping, I like to ask what needs to be done to make the code production-worthy.

Then comes the hard part. You personally need to value the process of transforming a temporary fix into a production-worthy fix, and support that value in the face of mounting pressure from your product roadmap.

To improve productivity, watch the work moving through your system, not the workers

Rather than focusing on how efficient your developers are, focus on how efficiently work moves through your process.

Focusing on process and not people can feel counter-intuitive for managers. Aren’t we managers of people? Yes, but your job is to make work possible, and that often means addressing process failures.

Diagramming your value delivery chain is a good place to start. The DevOps Handbook provides insight on how exactly to map your value delivery chain, but the basic idea is to list every step in your process and record how long that step takes, both in execution time and in total process time (including waiting). For instance, a code review may require only 5 minutes of execution, but often takes hours or days to complete because the change sits waiting for a reviewer.
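The arithmetic behind such a diagram is simple enough to sketch. The example below uses made-up steps and durations, and the names `Step`, `flow_efficiency`, and `biggest_wait` are hypothetical helpers for this illustration. It compares hands-on execution time with total elapsed time for each step, then reports the overall ratio and the step with the most waiting.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    execution_minutes: float  # hands-on work time
    elapsed_minutes: float    # total time for the step, including waiting


# Hypothetical numbers, for illustration only.
STEPS = [
    Step("write code", 240, 480),
    Step("code review", 5, 1440),
    Step("run tests", 30, 60),
    Step("deploy", 10, 120),
]


def flow_efficiency(steps):
    """Fraction of total elapsed time spent on hands-on work."""
    total_execution = sum(s.execution_minutes for s in steps)
    total_elapsed = sum(s.elapsed_minutes for s in steps)
    return total_execution / total_elapsed


def biggest_wait(steps):
    """Step with the largest gap between elapsed and execution time."""
    return max(steps, key=lambda s: s.elapsed_minutes - s.execution_minutes)


if __name__ == "__main__":
    print(f"flow efficiency: {flow_efficiency(STEPS):.1%}")
    print(f"largest wait: {biggest_wait(STEPS).name}")
```

With these invented numbers, most of the elapsed time is waiting rather than work, and code review is where the work waits longest. That is the kind of limiting factor the diagram is meant to surface.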

Value delivery chain diagrams usually help identify a few limiting factors that slow your release frequency and increase lead times, two of the key engineering performance metrics tracked in Accelerate. Some common bottlenecks I’ve seen slow teams' releases are:

  • Slow tests
  • Slow code reviews
  • Rework caused by conceptual holes in the feature/solution
  • Dependency on external parties
  • Inability to work full stack

