Books on Distributed Systems and Resilience Engineering

Summary

As my career progressed, I found myself working on increasingly large and complex systems. These systems served more users and came with higher expectations for reliability and performance.

These books will help you and your team of engineers identify common issues in distributed systems, providing you with a language and framework for systems thinking.

Thinking in Systems

A Primer

by Donella Meadows

Recommended: 7 / 10

When asked to draw a systems diagram, many engineers instinctively focus on the components – our databases, servers, devices, and other tangible pieces.

Systems thinking shifts the focus to the lines between those components, which represent the interactions.

Thinking in Systems is not directly about software engineering. The author uses metaphors from everyday life to explain core systems thinking ideas like stocks and flows. I found this abstraction useful because it prevented me from getting overly focused on software-specific details.
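To connect stocks and flows back to software, here is a minimal sketch of my own (not an example from the book). It treats a request backlog as the stock, arriving requests as the inflow, and processed requests as the outflow. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def simulate_backlog(inflow_per_tick, outflow_capacity, ticks, initial_stock=0):
    """Track a stock (queued requests) as an inflow and an outflow change it."""
    stock = initial_stock
    history = []
    for _ in range(ticks):
        stock += inflow_per_tick               # inflow: requests arriving this tick
        stock -= min(stock, outflow_capacity)  # outflow: cannot drain more than is queued
        history.append(stock)
    return history


# Inflow exceeds outflow by 20 per tick, so the stock grows steadily.
print(simulate_backlog(inflow_per_tick=120, outflow_capacity=100, ticks=5))
# → [20, 40, 60, 80, 100]
```

Neither component is broken in this sketch. The growing backlog comes purely from the relationship between the two flows, which is the kind of behavior a component-only diagram tends to hide.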

See top ideas from Thinking in Systems

Designing Data-Intensive Applications

The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

by Martin Kleppmann

Recommended: 7 / 10

Data lives longer than code, so to build resilient systems we need to understand the fundamentals of our data storage systems.

This book operates at a lower level than the other books listed, getting into technical details about data storage concepts and specific implementations used by software authors.

For instance, the book covers a variety of conflict resolution strategies and the mechanisms behind them in database systems like MySQL and SQL Server.
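To give a flavor of what conflict resolution means, here is a minimal sketch of one widely used strategy, last write wins, in which the write with the newest timestamp is kept. This is my own illustration rather than code from the book or from any particular database. The `Write` type, its fields, and the tie-breaking rule on `replica_id` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: float  # assumed to come from the replica's own clock
    replica_id: str   # hypothetical tie-breaker for equal timestamps


def last_write_wins(a: Write, b: Write) -> Write:
    """Keep the write with the newest timestamp; break ties by replica id."""
    return max(a, b, key=lambda w: (w.timestamp, w.replica_id))


# Two replicas accept conflicting writes for the same key.
from_a = Write(value="alice@old.example", timestamp=100.0, replica_id="a")
from_b = Write(value="alice@new.example", timestamp=101.5, replica_id="b")

print(last_write_wins(from_a, from_b).value)
# → alice@new.example
```

The losing write is silently discarded, and the outcome depends on clocks that may not agree across replicas. Trade-offs like this are why the fundamentals are worth understanding.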

See top ideas from Designing Data-Intensive Applications

Release It!

Design and Deploy Production-Ready Software

by Michael Nygard

Recommended: 7 / 10

I thought Release It! would be about deployment frequency, but I was wrong. This book is about resilience engineering in distributed and complex software systems.

The stories and examples in Release It! will resonate with every seasoned engineer, providing a common language for you and your team to discuss principles and strategies for responding to failure and increasing reliability.

Release It! is a light read and one that I recommend for your own engineering team book club.

See top ideas from Release It!

Drift into Failure

From Hunting Broken Components to Understanding Complex Systems

by Sidney Dekker

Recommended: 8 / 10

Drift into Failure is about the cultural and environmental causes of system failure. Like Thinking in Systems, this book is not directly about software engineering, and that abstraction is helpful for understanding the broader concept of system failure.

Drift into Failure is an especially good read for software engineering managers at startups who are concerned about the security and reliability of their systems.

See top ideas from Drift into Failure

Practical Monitoring

Effective Strategies for the Real World

by Mike Julian

Recommended: 7 / 10

Practical Monitoring is about getting the instrumentation right so you can properly assess system reliability and performance.

In this book you will read about common antipatterns, principles for monitoring, and pragmatic advice for selecting the right monitoring tools.

Practical Monitoring is only about 130 pages and an easy read to share with your team.

See top ideas from Practical Monitoring