As my career progressed, I found myself working on increasingly large and complex systems. These systems served more users and came with higher expectations for reliability and performance.
The books below will help you and your team of engineers identify common issues in distributed systems and give you a language and framework for systems thinking.
Thinking in Systems by Donella Meadows
When asked to draw a systems diagram, many engineers instinctively focus on the components – our databases, servers, devices, and other tangible pieces.
Systems thinking shifts the focus to the lines between those components, which represent their interactions.
Thinking in Systems is not directly about software engineering. The author uses metaphors from everyday life to explain core systems thinking ideas like stocks and flows. I found this abstraction useful because it prevented me from getting overly focused on software-specific details.
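To make stocks and flows concrete for software readers, here is a minimal sketch of my own, not an example from the book. It treats a request backlog as the stock, with arrivals as the inflow and completed work as the outflow. The function name `simulate_stock` and all the numbers are assumptions chosen for illustration.

```python
def simulate_stock(initial, inflow, outflow, steps):
    """Return the stock level after each step, never dropping below zero.

    Hypothetical helper: the stock could be a request backlog, the inflow
    new requests per tick, and the outflow requests completed per tick.
    """
    levels = []
    stock = initial
    for _ in range(steps):
        # The stock only changes through its flows.
        stock = max(0, stock + inflow - outflow)
        levels.append(stock)
    return levels


# Inflow exceeds outflow, so the backlog grows by 2 every tick.
print(simulate_stock(initial=10, inflow=5, outflow=3, steps=4))
```

Even in this toy version, the interesting behavior comes from the relationship between the two flows rather than from any single component.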
Designing Data-Intensive Applications by Martin Kleppmann
Data lives longer than code, so to build resilient systems we need to understand the fundamentals of our data storage systems.
This book operates at a lower level than the other books listed, getting into technical details about data storage concepts and specific implementations used by software authors.
For instance, you can read about a variety of conflict resolution strategies and mechanisms in database systems like MySQL and SQL Server.
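As a rough illustration of what a conflict resolution strategy looks like, here is a minimal sketch of last-write-wins, one of the simplest approaches. This is my own simplified example, not code from the book and not how any particular database implements it. The `Write` type, its fields, and the tie-break on replica ID are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """Hypothetical record of one replica's write to a key."""
    value: str
    timestamp: float  # assumed to come from the writing replica's clock
    replica_id: str   # used only to break timestamp ties deterministically


def last_write_wins(a: Write, b: Write) -> Write:
    """Keep the write with the later timestamp; the other is discarded."""
    return max(a, b, key=lambda w: (w.timestamp, w.replica_id))


first = Write(value="alice@old.example", timestamp=100.0, replica_id="r1")
second = Write(value="alice@new.example", timestamp=105.0, replica_id="r2")
print(last_write_wins(first, second).value)
```

The tradeoff is that the losing write is silently dropped, and the result depends on clocks that may not agree across replicas.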
Release It! by Michael Nygard
I thought Release It! would be about deployment frequency, but I was wrong. This book is about resilience engineering in distributed and complex software systems.
The stories and examples in Release It! will resonate with every seasoned engineer, providing a common language for you and your team to discuss principles and strategies for responding to failure and increasing reliability.
Release It! is a light read and one that I recommend for your own engineering team book club.
Drift into Failure by Sidney Dekker
Drift into Failure is about the cultural and environmental causes of system failure. Like Thinking in Systems, this book is not directly about software engineering, and that abstraction is helpful for understanding the broader concept of system failure.
Drift into Failure is an especially good read for software engineering managers at startups who are concerned about the security and reliability of their systems.
Practical Monitoring by Mike Julian
Practical Monitoring is about getting the instrumentation right so you can properly assess system reliability and performance.
In this book you will read about common antipatterns, principles for monitoring, and pragmatic advice for selecting the right monitoring tools.
Practical Monitoring is only about 130 pages and an easy read to share with your team.