Looking up events in the Riemann index
Forthcoming book - The Art of Monitoring
One of the classic problems with monitoring alerts is that they are often very cryptic. Coupled with the challenge of alert fatigue[1], this makes working out what to do next when you receive an alert quite tricky. Additionally, alerts often arrive when we're not at the top of our game: a 4am Sunday morning alert is not likely to foster an exemplary response.
The quintessential example of cryptic/unhelpful alerts are Nagios disk space alerts.
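A typical Nagios check_disk alert of this kind (the values here are invented to match the discussion below) looks something like:

```
Subject: PROBLEM Service: datanode1/Disk Space is WARNING

DISK WARNING - free space: /data 678912 MB (9% inode=99%)
```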
What does this alert mean? We can see that the filesystem /data has 678912 MB of disk space left, or 9%. Should we worry? How fast is it filling up? Is this likely to happen RSN or sometime in the future? What's on that filesystem? Do I care if it fills up? I already have five questions from a single alert and I haven't even started to diagnose WHY things might be wrong. Meh, I am going back to sleep.
Thankfully, in the middle of last year the estimable Ryan Frantz released Nagios Herald. Nagios Herald is a decorator for Nagios alerts. It allows you to add context or further information to alerts generated by Nagios.
For example, here is a decorated Nagios disk alert.
Much more useful. Nice big stacked bar. Helpful graph. Output from the df command. With this information I'm feeling a lot more comfortable about fixing the issue. (You can find a bunch of other example alerts here too.)
Very helpful if you're using Nagios; not so helpful otherwise. (Sensu and Uchiwa do support user-supplied attributes, as probably do some other tools, but nothing quite so well integrated and helpful yet.)
So in the spirit of recent Riemann posts I thought about what I could do quickly and simply to provide some context for alerts, specifically email alerts. Riemann does have one useful store of information: the index. Every event you index is stored in there until its TTL expires and the expiration reaper runs. So if you’re collecting useful events then some of those might help to color your alerts with helpful context.
In my environment Riemann receives events from collectd and does most of its alerting based on the values of collectd metrics. One of those plugins,
df, emits metrics that measure the size of your filesystems. It emits a metric like so:
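For example, a df event arriving in Riemann for the root filesystem might look something like this (the host, time, and metric values are illustrative):

```clojure
{:host "graphitea.example.com"
 :service "df-root/percent_bytes-used"
 :state "ok"
 :metric 90.5233
 :tags ["collectd"]
 :time 1426286598
 :ttl 60.0}
```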
We can use this event, through the :service field (for example :service df-root/percent_bytes-used), to identify when specific filesystems have exceeded a threshold.
We can create a configuration like so to do this:
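A minimal sketch of such a configuration, assuming an email destination and mailer settings of our own choosing, might be:

```clojure
(def email (mailer {:from "riemann@example.com"}))

(let [index (index)]
  (streams
    index
    ;; Match df percentage metrics at or above 90% used.
    (where (and (service #"df-(.*)/percent_bytes-used")
                (>= metric 90))
      (email "ops@example.com"))))
```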
This uses the where filter stream to select all df-generated metrics matching the regular expression df-(.*)/percent_bytes-used. This should find the percent bytes used for every filesystem we're monitoring; for the / filesystem, for example, the matching service would be df-root/percent_bytes-used. The where filter also matches on the metric, triggering when the percentage used is greater than or equal to 90%. If an event matches, it sends an email using the mailer plugin.
It's inside our email alerting that we're going to add the additional context, specifically via the :body option to the mailer plugin. (We've defined that plugin inside our Riemann configuration.) The :body option takes a function and passes it an events argument. The events argument contains one or more events in a sequence that our function, here format-body, will then parse and format. Our format-body function will look pretty similar to the default Riemann email formatting.
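One way to wire this up is sketched below; the exact body layout is up to you, and it relies on the print-context and round helpers described later in this post:

```clojure
(defn format-body
  "Format a sequence of events into a notification body,
   appending context looked up from the Riemann index."
  [events]
  (clojure.string/join
    "\n\n"
    (for [event events]
      (str "Host: " (:host event) "\n"
           "Service: " (:service event) "\n"
           "Metric: " (round (:metric event)) "\n"
           "Additional context:\n"
           (print-context (:host event))))))

(def email
  (mailer {:from "riemann@example.com"
           :body format-body}))
```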
We take the events argument and loop through the sequence of events inside it to produce a notification. Where the function starts to differ from the default is when we begin to populate our additional insights. The insight is generated by looking up events in the Riemann index. To do this we use a third function called print-context.
The print-context function takes a host, here the host of the current event from the :host field, and uses the search function to return all of the other events from that host in the index.
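A sketch of print-context under these assumptions, rendering one service and metric per line:

```clojure
;; Gather every indexed event for a host and render its
;; service name and (rounded) metric, one per line.
(defn print-context
  [host]
  (clojure.string/join
    "\n"
    (for [event (search host)]
      (str (:service event) " " (round (:metric event))))))
```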
The search function uses the riemann.index/search function to query the index. It constructs a query using the host argument and then uses that query to retrieve all matching events for that host from the index; the index queried belongs to the currently running core. Any matching events in the index will be returned as a sequence of standard Riemann events.
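A sketch of the search function; it assumes the core var available in riemann.config holds the running core, whose :index we query with an AST built by riemann.query/ast:

```clojure
;; Query the running core's index for all events from a host.
(defn search
  [host]
  (riemann.index/search
    (:index @core)
    (riemann.query/ast (str "host = \"" host "\""))))
```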
We then pass this sequence back into print-context, which iterates through it and prints out a list of services and associated metrics.
The contextual example is a little silly because you probably don't want all of these services and their metrics, but you could easily select something more elegant. (In the example code we've also included a lookup function which uses the other index query function, riemann.index/lookup. The lookup function uses a host/service pair to look up a specific event inside the index.)
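A sketch of that lookup function:

```clojure
;; Fetch the single indexed event for a host/service pair.
(defn lookup
  [host service]
  (riemann.index/lookup (:index @core) host service))
```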
We also run our events through a round function which uses clojure.pprint to round any numeric metrics to 2 decimal places.
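A sketch of the round function using clojure.pprint's cl-format directive for fixed-point output:

```clojure
(require '[clojure.pprint :refer [cl-format]])

;; Round numeric metrics to two decimal places; pass anything
;; else (for example nil metrics) through untouched.
(defn round
  [metric]
  (if (number? metric)
    (cl-format nil "~,2f" metric)
    metric))
```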
Phew! That’s a lot of background. So what actually happens when this alert triggers? In this case you will generate an email much like:
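An illustrative notification (hosts, services, and values invented):

```
Subject: WARNING graphitea.example.com df-root/percent_bytes-used

Host: graphitea.example.com
Service: df-root/percent_bytes-used
Metric: 90.52

Additional context:
df-root/percent_bytes-used 90.52
load/load/shortterm 0.15
memory/memory-used 1403.92
...
```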
You could easily modify this to only select specific, relevant events. You could also use any of Riemann's stream functions or Clojure's functions to manipulate those events.
You could also extend this example beyond the index to retrieve external information. For example to retrieve further information from the host, construct a graph, or link to an existing Graphite graph or data source. This could even be further extended to take some action on the host itself in addition to the notification. The possibilities are broad and exciting!
P.S. You can find a fully-functioning Riemann configuration for this example here.
[1] Becoming desensitized to alerts because you get so many.