Books for software engineers and managers

Practical Monitoring

Effective Strategies for the Real World

by Mike Julian

Categories:
Tech Lead,
Star Engineer

Production-ready includes monitoring

Especially in a startup environment, it’s easy to treat monitoring as a post-launch activity, which creates two problems.

The first problem is that by delaying monitoring, we’re probably forgetting things during the build process. In this way, monitoring is similar to automated testing: adding it reveals things about your code and systems that otherwise aren’t obvious.

But the second problem is bigger. When we delay monitoring, we often never get to it, because we’re immediately expected to start building the next thing.

This is where technical leadership during the build process really matters. Tech leads and senior engineers in particular have a responsibility to ensure that monitoring is a pre-production step.

Favor composable monitoring and structured logs over comprehensive tools

Composable tools like collectd and StatsD apply the UNIX philosophy of small, pipe-like programs to monitoring, separating the five components of monitoring into distinct services:

  1. Data collection
  2. Data storage
  3. Visualization
  4. Analytics and reporting
  5. Alerting

By favoring composable tools, we can swap between different tools as our needs change and grow.
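
As a rough illustration of why this composability is cheap for the application, here is a minimal sketch of emitting a metric in the plain-text StatsD line format (`name:value|type`) over UDP. The helper names `format_metric` and `send_metric` and the metric name are hypothetical, and the host and port assume a StatsD-compatible collector listening locally on the conventional port 8125.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD line: <name>:<value>|<type>.

    Common types are "c" (counter), "ms" (timer), and "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one metric line as a fire-and-forget UDP datagram.

    The host and port are assumptions for illustration; point them at
    whatever collector is actually deployed.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric: count one failed login.
login_failure_line = format_metric("api.login.failure", 1, "c")
```

Calling `send_metric(login_failure_line)` would hand the data point to the collector. Because the application only knows this small wire format, the storage, visualization, and alerting pieces behind the collector can be replaced without touching application code.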

To make composable tools easier to adopt, we should also favor structured logs over unstructured logs. For instance, write log entries as JSON when possible rather than as free-form text lines.

Structured logs also help humans debug and analyze log files. The downside is that structured log files are larger because every entry repeats the key names in addition to the values, so log rotation becomes more important.
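
To make the idea of JSON-structured logs concrete, here is a small sketch using Python's standard `logging` module. The `JsonFormatter` class and the `fields` attribute are hypothetical conventions chosen for this example, not something the book prescribes.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is an assumed convention: callers attach extra
        # key/value pairs via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger whose handler emits JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With this in place, `build_logger().info("login failed", extra={"fields": {"user_id": 42}})` emits one JSON object per line, which downstream tools can parse without custom regular expressions. The repeated key names in every entry are also where the extra file size comes from.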

Start monitoring as close to the user as possible, like API response times and login failure rates

Engineers are often tempted to start monitoring with CPU, disk usage, or similar background metrics.

Instead, start with metrics closer to the user like API response time, page rendering time, and user issues like login success/failure rate.
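
As a sketch of what two such user-facing metrics might look like once the raw numbers are collected, the snippet below computes a login failure rate and a nearest-rank percentile of response times. The function names and sample values are invented for illustration; a real setup would more likely compute these in the storage or analytics layer.

```python
import math
from typing import Sequence


def failure_rate(successes: int, failures: int) -> float:
    """Fraction of login attempts that failed (0.0 when there were none)."""
    total = successes + failures
    return failures / total if total else 0.0


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 response time."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of
    # the samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up API response times in milliseconds.
response_times_ms = [120, 80, 200, 95, 150]
p95_ms = percentile(response_times_ms, 95)
```

A percentile is usually more telling than an average here, since a handful of very slow requests can hide behind a healthy-looking mean.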

Not only will business leaders appreciate these metrics more, but the metrics themselves will tell you which layers of the onion to peel back.

Establish incident response roles for people on your team when entering a firefight

When software systems fail and you’re in firefighting mode, start by establishing incident response roles.

As Director of Engineering at my company, my role is to communicate status throughout the department and company. I rarely play the role of decision maker. Rather, I leave the decision maker role to a senior engineer.

You’ll also need people in other roles like scribes and investigators. These roles often suit individual contributors on your team.

In my experience, establishing these roles alleviates pressure and provides clear moments for engineering managers to challenge their senior engineers to step up their decision-making abilities.

How strongly do I recommend Practical Monitoring?
7 / 10

Practical Monitoring is a good book for product engineering teams just learning to take on DevOps responsibilities. It will point you in the right direction with enough detail to investigate further.