The Art of Monitoring


TL;DR - I am writing a book about monitoring and you can sign up for updates here.

Let’s begin with an origin story. Once upon a time(-series) there was a sysadmin. She managed infrastructure that lived in a data center. Every time a new host was added to that environment she installed some software and set up some checks. Every now and again one of those servers would break and a check would trigger. An alert would be sent and she would wake up and run rm -fr /var/log/*.log to fix it.

For many years this approach worked just fine. Oh, there were some dramas: sometimes things would go wrong for which there wasn’t a check, or there just wasn’t time to act on some alerts, or some applications and services on top of those hosts weren’t monitored. But things were mostly fine.

Then things started to change in the IT industry. Virtualization was introduced and a lot more hosts appeared. Many of those hosts were run by people who weren’t sysadmins or were even outsourced to third parties. Then some of the hosts in her data center were moved into the Cloud or replaced with Software-as-a-Service applications.

Most importantly, applications and services that were previously seen as mere technology now became critical to selling to customers and providing high-quality customer service. Suddenly IT wasn’t a cost centre but rather something the company’s revenue relied on.

As a result, aspects of monitoring began to break down. It became hard to keep track of hosts (there were a lot more of them!), applications and infrastructure became more complex, and expectations around availability and quality became more aggressive. It became harder and harder to check for all the possible things that could go wrong using the current system. More and more alerts piled up. More hosts and services meant more demand on monitoring systems, most of which could only scale vertically. Faults and outages became harder to find and slower to detect under these loads.

Additionally, the organization began demanding more and more data, both to demonstrate the quality of the service it was delivering to customers and to justify the increasing spend on IT services. Many of these demands were for data that existing monitoring simply wasn’t measuring or couldn’t generate. The monitoring system became a tangled mess.

This is monitoring right now for many people in the industry. But it doesn’t have to be like that. You can build a better solution that addresses the change in the way IT works and that scales for the future.

Welcome to The Art of Monitoring.

This is a hands-on book that teaches you how to build a modern, scalable monitoring environment using up-to-date tools and techniques.

We include lessons for both sysadmins and developers. We’ll show developers how they can better enable monitoring and metrics and we’ll show sysadmins how to take advantage of that data to do better fault detection and get insights into performance.

We try to address the changes that virtualization, containerization and the Cloud have brought to IT environments. We show you how to build a monitoring environment that lets you and your customers manage IT better.

The book will contain:

  • Chapter 1: An Introduction to Monitoring
  • Chapter 2: Building a metrics-centric monitoring environment
  • Chapter 3: Metrics, metrics and measurement
  • Chapter 4: Building a service-centric and dynamic fault detection system
  • Chapter 5: Alerting
  • Chapter 6: Trending
  • Chapter 8: Visualization
  • Chapter 9: Anomaly Detection for fun and profit

(Likely to change…)

In the book we look at a variety of open source tools.

The book will be published late in 2015.

You can find more information on the book and its status here and you can sign up for updates here.