Practical Monitoring

How strongly do I recommend Practical Monitoring?
7 / 10

Review of Practical Monitoring

Practical Monitoring is a good book for product engineering teams just learning to take on DevOps responsibilities. It will point you in the right direction with enough detail to investigate further.

Top Ideas in This Book

Production-ready includes monitoring
Favor composable monitoring and structured logs over comprehensive tools
Start monitoring as close to the user as possible, like API response times and login failure rates
Establish incident response roles for people on your team when entering a firefight

Production-ready includes monitoring

Especially in a startup environment, it’s easy to treat monitoring as a post-launch activity, which creates two problems.

The first problem is that by delaying monitoring, we’re probably overlooking things during the build process. In this way, monitoring is similar to automated testing: adding it reveals things about your code and systems that otherwise aren’t obvious.

But the second problem is bigger. When we delay monitoring, we often forget it altogether, because we’re immediately expected to start building the next thing.

This is where technical leadership during the build process really matters. Tech leads and senior engineers in particular have a responsibility to ensure that monitoring is a pre-production step.

Favor composable monitoring and structured logs over comprehensive tools

Composable tools like collectd and StatsD apply the UNIX philosophy of small, pipe-connected programs to monitoring, separating the five components of monitoring into distinct services:

Data collection
Data storage
Visualization
Analytics and reporting
Alerting

By favoring composable tools, we can swap between different tools as our needs change and grow.

To make composable tools easier to implement, we should also favor structured logs over unstructured logs. For instance, write log entries as JSON objects where possible rather than as free-form lines of text.

Structured logs also help humans debug and analyze log files. The downside is that structured log files are larger because every entry repeats its keys in addition to its values, so log rotation becomes more important.
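To make the idea concrete, here is a minimal sketch of JSON-structured logging using only Python’s standard library. This is my own illustration, not code from the book. The JsonFormatter class, the build_logger helper, and the convention of passing extra={"fields": {...}} are all hypothetical names chosen for this example.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention for this sketch: callers attach structured
        # key/value pairs with extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info("login failed", extra={"fields": {"user_id": 42, "reason": "bad_password"}})
```

Because every line is a self-describing object, a downstream collector can filter on keys such as level or reason without a custom parser. The repeated keys are also exactly where the extra file size comes from.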

Start monitoring as close to the user as possible, like API response times and login failure rates

Engineers are often tempted to start monitoring with CPU, disk usage, or similar background metrics.

Instead, start with metrics closer to the user like API response time, page rendering time, and user issues like login success/failure rate.

Not only will business leaders appreciate these metrics more, but the metrics themselves will tell you which layers of the onion to peel back next.
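As a rough sketch of what user-facing metrics can look like in code, the example below times a request and counts login failures, then emits both in the plain-text StatsD line format (name:value|type) over UDP. It is my own illustration rather than anything prescribed by the book. The Metrics class, the metric names api.response_time and login.failure, and the assumption of a StatsD-compatible listener on 127.0.0.1:8125 are all hypothetical.

```python
import socket
import time
from contextlib import contextmanager


def statsd_line(name: str, value: int, metric_type: str) -> str:
    """Format one metric in the StatsD line format: name:value|type."""
    return f"{name}:{value}|{metric_type}"


class Metrics:
    """Tiny fire-and-forget metrics emitter (illustrative, not a full client)."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        # Assumed address of a StatsD-compatible listener.
        self._addr = (host, port)
        self._sock = None

    def send(self, line: str) -> None:
        # Create the UDP socket lazily so constructing Metrics has no side effects.
        if self._sock is None:
            self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.sendto(line.encode("ascii"), self._addr)

    def incr(self, name: str) -> None:
        """Increment a counter by one ("c" is the StatsD counter type)."""
        self.send(statsd_line(name, 1, "c"))

    @contextmanager
    def timer(self, name: str):
        """Time the enclosed block and report it in milliseconds ("ms" type)."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = int((time.perf_counter() - start) * 1000)
            self.send(statsd_line(name, elapsed_ms, "ms"))


if __name__ == "__main__":
    metrics = Metrics()
    with metrics.timer("api.response_time"):
        time.sleep(0.05)  # stand-in for handling a real request
    metrics.incr("login.failure")
```

Only after these numbers look wrong would I reach for CPU or disk graphs to explain why.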

Establish incident response roles for people on your team when entering a firefight

When software systems fail and you’re in firefighting mode, start by establishing incident response roles.

As Director of Engineering at my company, my role is to communicate status throughout the department and company. I rarely play the role of decision maker. Rather, I leave the decision maker role to a senior engineer.

You’ll also need people in other roles, like scribes and investigators. These roles often suit the individual contributor engineers on your team.

In my experience, establishing these roles alleviates pressure and gives engineering managers clear moments to challenge their senior engineers to step up their decision-making abilities.
