Books for software engineers and managers

Release It!

Design and Deploy Production-Ready Software

by Michael Nygard

Categories:
Engineering Manager,
Tech Lead,
Star Engineer

How strongly do I recommend Release It!?
7 / 10

Review of Release It! Book

Judging by the title, I expected Release It! to be about increasing deployment frequency, a primary interest of mine in building high-performance engineering teams. It is not, but that turned out to be a pleasant surprise.

Release It! is a good read for both engineers and engineering managers, particularly product engineers who also take on systems and DevOps responsibilities. The author smartly ties tactical recommendations to broader concepts and themes, which will help you communicate with other software engineers.



When your service goes down unexpectedly, your first job is to restore service

Restore service first. Then worry about deep diving into the problem.

However, there’s a catch. Restoring service often comes at the cost of not understanding the problem.

Your systems and machines are in a bad state. When restoring them to a good state, you often inhibit your ability to debug the issue. But that’s a trade-off you sometimes need to make.

Preventing cascading failure is the key to resilience

Bugs cannot be eliminated, but they can be survived by preventing propagation.

Cascading failures occur when one system experiences an issue, subsequently causing issues in another system, potentially causing issues in yet another system. On and on we go.

In other words, cascading failures exist because of relationships between systems.

Nygard provides a lot of commentary on this topic throughout the book. Some of my favorite thoughts are:

  • Good architects focus more on arrows than boxes
  • Foster the ability to restart components, not entire servers or systems
  • Automation has no judgment

Chaos Engineering is a good way to understand how resilient your systems are against cascading failures. Netflix’s Chaos Monkey famously shuts down random services and servers in production to test how dependent systems respond.

You can't control incoming requests, whether legitimate or malicious

You can load balance, govern requests, shed load, fail fast, and do plenty more to mitigate risk. But fundamentally, you can’t control the volume of requests your system receives or the nature of those requests.

Nothing is as permanent as a temporary fix

Temporary fixes often arise in two situations:

  1. Firefighting. Temporary fixes can help us restore service quickly.
  2. Prototyping. Temporary fixes in prototyping usually don’t perform well in production. They either fail on edge cases or don’t scale. But when prototyping, they do help us maintain development momentum.

I give my engineers the benefit of any doubt. In my experience, engineers usually just forget about that temporary fix. They’re constantly bombarded with new problems consuming their attention.

Your engineers usually just need a reminder. As a manager, when my team is done firefighting or prototyping, I like to ask what needs to be done to make the code production-worthy.

Then comes the hard part. You personally need to value the process of transforming a temporary fix into a production-worthy fix, and support that value in the face of mounting pressure from your product roadmap.

To improve productivity, watch the work moving through your system, not the workers

Rather than focusing on how efficient your developers are, focus on how efficiently work moves through your process.

Focusing on process and not people can feel counter-intuitive for managers. Aren’t we managers of people? Yes, but your job is to make work possible and that often means addressing process failures.

Diagramming your value delivery chain is a good place to start. The DevOps Handbook explains in detail how to map your value delivery chain, but the basic idea is to list every step in your process and how long that step takes, both in execution time and in total process time (elapsed time, including waiting). For instance, a code review may require only 5 minutes of execution time but often takes hours or days to complete once waiting is included.
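As a rough illustration of that mapping, here is a minimal Python sketch. The step names and minute values are made up for the example, and `Step`, `flow_efficiency`, and `biggest_wait` are hypothetical helpers, not something taken from The DevOps Handbook. It compares hands-on execution time with total elapsed time per step and points at the step with the most waiting.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    execution_minutes: float  # hands-on time actually spent on the step
    elapsed_minutes: float    # total time from start to finish, waiting included


def flow_efficiency(steps):
    """Share of total elapsed time that is spent actually working."""
    total_execution = sum(s.execution_minutes for s in steps)
    total_elapsed = sum(s.elapsed_minutes for s in steps)
    return total_execution / total_elapsed


def biggest_wait(steps):
    """Step with the largest gap between elapsed and execution time."""
    return max(steps, key=lambda s: s.elapsed_minutes - s.execution_minutes)


# Hypothetical numbers, for illustration only.
steps = [
    Step("write code", execution_minutes=240, elapsed_minutes=300),
    Step("code review", execution_minutes=5, elapsed_minutes=480),
    Step("run tests", execution_minutes=20, elapsed_minutes=30),
    Step("deploy", execution_minutes=10, elapsed_minutes=90),
]

print(f"flow efficiency: {flow_efficiency(steps):.0%}")
print(f"most waiting: {biggest_wait(steps).name}")
```

With these made-up numbers, code review is the step with the largest gap between execution and elapsed time, so that is where I would look first.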

Value delivery chain diagrams usually help identify a few limiting factors that reduce your release frequency and increase lead times, two of the key engineering performance metrics tracked in Accelerate. Some common causes I’ve seen slow teams’ releases are:

  • Slow tests
  • Slow code reviews
  • Rework caused by conceptual holes in the feature/solution
  • Dependency on external parties
  • Inability to work full stack

 

New architects focus on the boxes in system diagrams, experienced architects focus on the arrows

Here are some good questions to ask when confronted with a system diagram:

  1. How does the data flow between systems?
  2. What is the typical order of operations?
  3. Which of these interactions or transactions is most likely to fail and why?

Focus on the arrows more than the boxes.

Paranoia is good engineering

Andy Grove said of leadership that only the paranoid survive, and the same is true for software engineering. Evaluate your systems with a cynical eye. Identify your bottlenecks and assume they will be overwhelmed at some point.

View integration points skeptically. Systems that you don’t control will eventually fail you in unexpected ways.

Load shedding is the most important tool for controlling incoming requests

Your application will eventually face a denial of service, whether from a malicious attack or a friendly hug of death. When this happens, you need the ability to shed load so that your system can recover and respond correctly to the requests that do make it through.

Services should monitor their own response times and react accordingly. In a service-oriented or microservice architecture, each service should be able to respond appropriately to the pressure being applied to it.
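Here is a minimal Python sketch of load shedding inside a service. It is an illustration under assumptions, not code from the book: it uses the number of in-flight requests as a simple stand-in for a response-time signal, and `LoadShedder`, `handle`, and the 503 response text are hypothetical names chosen for the example.

```python
import threading


class LoadShedder:
    """Rejects new work once too many requests are already in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        """Claim a slot; return False if the service is at capacity."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Give the slot back once the request has finished."""
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run `work` if there is capacity, otherwise fail fast with a 503."""
    if not shedder.try_acquire():
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

The point of the design is that a rejected request costs almost nothing, so the requests that are admitted still get a correct and timely response while the service recovers.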

Automation has no judgment, and when it goes awry things happen quickly

Humans apply judgment based on context. Automation does not. When your automation is misconfigured, the system acts on the mistake quickly, and you often don’t learn about the mistake until it is too late.

Your job is to make sure that automation does not go horrifically wrong and throw your system into an unrecoverable state.
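One way to limit the damage is to cap how much automation may change in a single step, so a bad configuration is slowed down enough for a human to notice. The Python sketch below assumes a hypothetical autoscaler; `plan_scale_down` and the 20% default are illustrative choices, not values from the book.

```python
def plan_scale_down(current, desired, max_fraction_per_step=0.2):
    """Return the instance count automation may move to in one step.

    Scaling up is never limited. Scaling down is capped so that a single
    step removes at most `max_fraction_per_step` of the current instances
    (but always at least one, so progress is still possible).
    """
    if desired >= current:
        return desired
    max_removal = max(1, int(current * max_fraction_per_step))
    return max(desired, current - max_removal)


# A misconfigured target of zero instances only removes 20% per step.
print(plan_scale_down(current=100, desired=0))
```

Reaching the target then takes several steps instead of one, which gives monitoring and people time to intervene before the system is in an unrecoverable state.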

Monitor business outcomes, user behaviors, and system level metrics

Here are some examples:

  • Business outcomes to monitor: revenue per hour, transactions per minute
  • User behaviors to monitor: comments created, successful and failed logins
  • System level metrics to monitor: requests per minute, CPU, database connections, HTTP responses by status code

These don’t need to be on the same dashboard, but they should be accessible and ideally configured with anomaly detection, so that your team receives push alerts when something goes wrong without needing to watch the metrics constantly.
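Anomaly detection can be as simple or as sophisticated as you like. As a minimal sketch, assuming a metric sampled at regular intervals, the Python function below flags a new value that sits more than a few standard deviations from recent history. `is_anomalous`, the threshold of 3, and the sample numbers are all hypothetical; real monitoring tools usually account for trends and seasonality as well.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it is far from the recent average.

    `history` is a list of recent samples of one metric, for example
    transactions per minute. Distance is measured in standard deviations.
    """
    if len(history) < 2:
        return False  # not enough data to judge
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != average
    return abs(latest - average) / spread > threshold


# Hypothetical samples of transactions per minute.
recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 40))
print(is_anomalous(recent, 101))
```

A check like this can sit behind a push alert, so nobody has to stare at a dashboard to notice that transactions per minute just fell off a cliff.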
