Monitoring Sucks - A Rant

James Turnbull included in Blog devops Nagios

2013-01-29

In 2011 a Twitter hashtag called #monitoringsucks appeared. It was a response to the dearth of modern monitoring systems. Indeed, if you look around the open source ecosystem, the dominant player is still the venerable Nagios. There is a scattering of other players, but innovation has been sadly lacking in the monitoring ecosystem.

Perhaps you could argue the lack of innovation stems from the fallacy that monitoring is a solved problem. Monitoring is far from a solved problem. Two things have made that particularly self-evident: virtualization and cloud. Virtualization means hosts are quick, easy and inexpensive to spawn, and virtual machine sprawl has quickly followed. Where organizations once had 100 hosts they might now have 1,000. Together these have created two problems: you have to monitor a lot more, and what you are monitoring changes frequently. Cloud exacerbates the problem, because frequent scaling up and down brings equally frequent changes in monitoring configuration.

So how have the existing monitoring solutions coped with this change? The best answer is: not well. I’m going to pick on Nagios since it’s such a ubiquitous presence in open source monitoring and it is a tool I know well. Nagios has a number of key failings that the emergence of virtualization and cloud has exposed:

It doesn’t scale. Despite being written in C and reasonably fast, a lot of Nagios works in series, especially checks. With thousands of hosts and tens of thousands of checks, Nagios simply can’t cope.
It requires complex and verbose text-based configuration files. Even with configuration management tools like Puppet and Chef, the Nagios DSL is not easy to parse or generate programmatically. Additionally, the service requires a restart to recognize added, changed or removed configuration. In a virtualized or cloud world that could mean Nagios is being restarted tens or hundreds of times in a day. It also means Nagios can’t readily auto-discover nodes or services you want it to monitor.
It has a very binary view of the world. This means it’s not a useful tool for decision support. Whilst it supports thresholds, it really can only see a resource as being in a “good” state or a “bad” state, and it usually lacks any context around that state. A commonly cited example is disk usage. Nagios can trigger on a percentage threshold, for example when the disk is 90% full. But it doesn’t have any context: 90% full on a 5GB disk might be very different from 90% full on a 1TB drive. It also doesn’t tell you the most critical piece of information you need to make decisions: how fast disk usage is growing. With no context and no conception of time series data or trending, you have to investigate every alert rather than being able to make a decision based on the data you have. This creates inefficiency and cost.
It is not very stateful. Unless you add additional components, Nagios only retains recent state, or maintains state in esoterically formatted files. Adding an event broker to Nagios, which is the recommended way to make it more stateful, requires considerable configuration and still does not ensure the data is readily accessible or usable.
It isn’t easily extensible. Nagios has a series of fixed interface points and it lacks a full API. It’s also written in C, which isn’t approachable for a lot of SysAdmins, who are its principal users. It also lacks a strong community contributing to its development.
It is not modular. The core product contains monitoring, alerting, event scheduling and event processing. It’s an all-or-nothing proposition.
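
To make the disk usage point concrete, here is a minimal sketch in Python of the kind of context a percentage threshold lacks. It is not how Nagios or any particular tool works; the function names, the `(timestamp_seconds, used_bytes)` sample format and the assumption that usage grows roughly linearly are all mine, purely for illustration.

```python
from typing import Optional, Sequence, Tuple

GB = 1024 ** 3  # bytes in a gigabyte (binary), used only for the examples

# A sample is (timestamp in seconds, bytes used). This format is an
# assumption made for this sketch, not something any real tool emits.
Sample = Tuple[float, float]


def growth_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of used bytes over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one point in time")
    return sum((t - mean_t) * (u - mean_u) for t, u in samples) / spread


def seconds_until_full(samples: Sequence[Sample],
                       capacity_bytes: float) -> Optional[float]:
    """Estimate seconds until the disk fills, assuming linear growth.

    Returns None when usage is flat or shrinking, since no fill time
    can be projected from the trend.
    """
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    latest_used = samples[-1][1]
    return max(capacity_bytes - latest_used, 0) / rate
```

Feed it two disks that are both 90% full and both growing by 100MB an hour, and the 5GB disk comes out at roughly five hours from full while the 1TB drive has around six weeks. A percentage threshold alerts identically on both; the trend tells you which one to get out of bed for.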

I could go on but this post isn’t about bashing Nagios. Despite its shortcomings, Nagios is still heavily used and there remain solid use cases for it. Indeed, we’ll probably see it in service for some time to come. These shortcomings are not limited to Nagios either. Scratch the surface of a dozen solutions, many of them closely modeled on Nagios, leveraging its plugin API or reflecting lessons learnt from its design, and they all have one or more of the issues I’ve detailed above. For modern monitoring purposes, Nagios and many of the related tools are not a viable solution for an increasing number of organizations.

This is the point at which someone says, “But what about ‘insert name of product here’?” or “You can run Nagios with x add-on and that fixes x.” Yes, there are some products out there, both open source and commercial, that have solved some of these issues. But these are not comprehensive solutions, and many have solved one problem at the expense of another. In the case of add-ons, many are hackish and difficult to integrate.

None of these solutions represents the paradigm shift we actually need to address the new challenges we face as the result of our brave new world. We need to redesign monitoring solutions from the ground up to address the needs we have now and the needs we’re going to have in the future, because the sprawl and complexity are only going to get worse as the market further commodifies.

I don’t yet know what that paradigm shift is. Perhaps it’s Sensu-like, or perhaps Sensu will play a role, but I am looking forward to both finding out and hacking on the end result.