Managing maintenance with Riemann

This is another post triggered by writing The Art of Monitoring. You can join the mailing list on that site for further information and updates.

Last week my phone danced its way across the floor after a flurry of notifications from our site. An upstream issue had caused a burst of false alarms to trigger. This is annoying but not nearly as annoying as when you accidentally trigger those notifications yourself. I think we’ve all had the experience of doing some maintenance only to discover a mess of notifications letting us know that we forgot to inform our monitoring system about that maintenance.

With a lot of monitoring systems you can avoid this by placing hosts and services in “maintenance mode”, which stops notifications from being triggered. In other systems cruder means, like stopping the notifications daemon or an SMTP service, are required.

How do we handle this with Riemann, though? Being event-driven and largely stateless, Riemann generally doesn’t have a repository of knowledge about our hosts and services to query (although you can hook Riemann up to services like ZooKeeper if you wish). Riemann does, however, have the index. If configured, the Riemann index contains a copy of the latest event for each host and service pair sent to it. When a new event appears for that pair it replaces the old event in the index. If an event reaches the end of its time to live (TTL) without being replaced, it is expired from the index and a new event is generated with a state of expired.
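To make the index behaviour concrete, here is a toy in-memory model written in Ruby. It is only a sketch of the semantics described above, not Riemann’s implementation; the ToyIndex class and its update, expire and lookup methods are names made up for this illustration.

```ruby
# Toy model of the index semantics described above (illustration only).
class ToyIndex
  def initialize
    @events = {} # latest event, keyed by [host, service]
  end

  # A new event for a host/service pair replaces the previous one.
  def update(event)
    @events[[event[:host], event[:service]]] = event
  end

  # Remove events whose TTL has elapsed and return an "expired" event for each.
  def expire(now)
    expired, live = @events.partition { |_, e| now - e[:time] > e[:ttl] }
    @events = live.to_h
    expired.map do |_, e|
      { host: e[:host], service: e[:service], state: "expired", time: now }
    end
  end

  def lookup(host, service)
    @events[[host, service]]
  end
end

index = ToyIndex.new
index.update({ host: "webserver", service: "apache2", state: "ok", time: 0, ttl: 60 })
index.update({ host: "webserver", service: "apache2", state: "critical", time: 30, ttl: 60 })
# An infinite TTL means the event is never expired.
index.update({ host: "webserver", service: "maintenance", state: "active",
               time: 0, ttl: Float::INFINITY })

latest  = index.lookup("webserver", "apache2")[:state]
expired = index.expire(120)
```

The event with the infinite TTL survives the expiry pass, which is the property the maintenance events rely on.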

We can take advantage of the Riemann index to help us manage maintenance and downtime for our hosts and services by injecting maintenance events. A maintenance event is a normal Riemann event that we identify by host, service or a specific tag. The event will have an infinite TTL or time to live. If we want to start a maintenance window we send Riemann one of these maintenance events with a :state of active. If we want to end the maintenance window we send another event with a :state of anything but active.

To check for maintenance events we’re going to build a check that will execute before notifications. The check will search the Riemann index for any maintenance events. If it finds an event that matches the host and service which has triggered the notification then it will check the event’s :state. For any events that have a :state of active it will abort the notification.

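(defn maintenance-mode?
  "Is it currently in maintenance mode?"
  [event]
  ;; Query for a maintenance event with the same host and service.
  (->> '(and (= host (:host event))
             (= service (:service event))
             (= (:type event) "maintenance-mode"))
       ;; Run the query against the index.
       (riemann.index/search (:index @core))
       ;; Take the first match and read its :state.
       first
       :state
       ;; True only if that state is "active".
       (= "active")))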

This new function, called maintenance-mode?, takes a single argument: an event. It then uses the ->> (thread-last) macro, which takes the result of each form and passes it as the last argument to the next form. You can think about this expression as being:

(= "active" (:state (first (riemann.index/search (:index (clojure.core/deref core)) (quote (and (= host (:host event)) (= service (:service event)) (= (:type event) "maintenance-mode")))))))

This series of expressions means: search the index for events that have a matching host and service and a custom field called :type with a value of maintenance-mode, take the first event returned, and return true if its :state field is active.

The index search itself is being done using the riemann.index/search function.
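If Clojure’s threading is unfamiliar, the same sequence of steps can be sketched in Ruby. This is an illustration only: the INDEX array is a hypothetical stand-in for the Riemann index, and maintenance_mode? here is a plain Ruby method, not part of Riemann or its client.

```ruby
# Hypothetical stand-in for the index: a list of the latest events.
INDEX = [
  { host: "webserver", service: "apache2", type: "maintenance-mode", state: "active" },
  { host: "dbserver",  service: "mysql",   type: "maintenance-mode", state: "inactive" }
].freeze

# Mirrors the steps of the threaded expression: search, first, :state, compare.
def maintenance_mode?(event, index = INDEX)
  # Step 1: search for maintenance events with the same host and service.
  matches = index.select do |e|
    e[:host] == event[:host] &&
      e[:service] == event[:service] &&
      e[:type] == "maintenance-mode"
  end
  # Step 2: take the first match (nil when there is none).
  first = matches.first
  # Step 3: read its state.
  state = first.nil? ? nil : first[:state]
  # Step 4: the notification is suppressed only when the state is "active".
  state == "active"
end

in_window  = maintenance_mode?({ host: "webserver", service: "apache2" })
out_window = maintenance_mode?({ host: "dbserver", service: "mysql" })
unknown    = maintenance_mode?({ host: "cache", service: "redis" })
```

A host and service pair with no maintenance event at all falls through to false, so notifications are sent by default.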

We then wrap our notifications with the new function.

(tagged "notification"
  (where (not (maintenance-mode? event))

    . . . notification logic . . .
))

Let’s schedule some maintenance. We can do this in a number of different ways, for example by manually submitting an event using one of the Riemann clients, like the Riemann Ruby client. Let’s install the Ruby client now.

$ sudo gem install riemann-client

And then use irb, the interactive Ruby shell, to send a manual event.

$ irb
irb(main):001:0> require 'riemann/client'
=> true
irb(main):002:0> client = Riemann::Client.new host: 'riemanna.example.com', port: 5555, timeout: 5
irb(main):003:0> client << {service: "apache2", host: "webserver", type: "maintenance-mode", state: "active", ttl: Float::INFINITY}
=> nil
irb(main):004:0>

We require riemann/client and then create a client that connects to our Riemann server. We then send a manual event with a :service of apache2 for the relevant host, webserver, and with a custom field of :type set to maintenance-mode. Our event also has a :state of active and an infinite TTL. If a notification were now to trigger for apache2 on the webserver host, Riemann would detect the active maintenance event and not send the notification.

Once the maintenance window is over we can disable it like so:

irb(main):004:0> client << {service: "apache2", host: "webserver", type: "maintenance-mode", state: "inactive", ttl: Float::INFINITY}
=> nil
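Since the start and end events differ only in their :state, a small helper can build both. This is a minimal sketch: maintenance_event is a hypothetical helper name, not part of the riemann-client gem, and it only builds the hash rather than sending it.

```ruby
# Build a maintenance event in the same shape as the hashes sent above.
# (Hypothetical helper for illustration; not part of riemann-client.)
def maintenance_event(host:, service:, active:)
  {
    host: host,
    service: service,
    type: "maintenance-mode",
    state: active ? "active" : "inactive",
    ttl: Float::INFINITY # never expire from the index
  }
end

start_event = maintenance_event(host: "webserver", service: "apache2", active: true)
end_event   = maintenance_event(host: "webserver", service: "apache2", active: false)

# With a connected client, as created in the irb session above:
#   client << start_event   # open the maintenance window
#   client << end_event     # close it again
```

Keeping the event construction separate from the sending makes it easy to check what will be submitted before touching a live Riemann server.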

Using the Riemann client directly is a little clumsy so I’ve actually written a tool to help automate this process. It’s a Ruby gem called maintainer. You can install it via the gem command.

$ sudo gem install maintainer

We then use it like so:

$ maintainer --host riemanna.example.com --event-service apache2

This will generate a maintenance event for the current host (or you can name a different host with the --event-host flag). The event will look something like:

{:host webserver, :service apache2, :state active,
:description Maintenance is active, :metric nil, :tags nil,
:time 1457278453, :type maintenance-mode, :ttl Infinity}

We can also disable maintenance events like so:

$ maintainer --host riemanna.example.com --event-service apache2 --event-state inactive

Which will generate an event like so:

{:host webserver, :service apache2, :state inactive,
:description Maintenance is inactive, :metric nil, :tags nil,
:time 1457278453, :type maintenance-mode, :ttl Infinity}

We can then wrap this binary inside a configuration management tool, a cron job or whatever else triggers your maintenance windows. And then we have a basic maintenance and downtime scheduling system for Riemann.
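As one possible shape for such a wrapper, here is a Ruby sketch that opens a maintenance window, runs some work and always closes the window afterwards. The maintainer_command and with_maintenance names are made up for this example. It uses only the flags shown in this post and assumes --event-host takes the host name as its value. The runner argument is injectable so the sketch can be tried without the gem installed.

```ruby
# Build the maintainer command line using only the flags shown in this post.
def maintainer_command(riemann_host:, service:, active:, event_host: nil)
  cmd = ["maintainer", "--host", riemann_host, "--event-service", service]
  # Assumption: --event-host is followed by the host name.
  cmd += ["--event-host", event_host] if event_host
  # The default state is active, so only the inactive case needs a flag.
  cmd += ["--event-state", "inactive"] unless active
  cmd
end

# Open a maintenance window, run the block, and close the window even if
# the block raises. By default the runner shells out via Kernel#system.
def with_maintenance(riemann_host:, service:, runner: ->(cmd) { system(*cmd) })
  runner.call(maintainer_command(riemann_host: riemann_host, service: service, active: true))
  yield
ensure
  runner.call(maintainer_command(riemann_host: riemann_host, service: service, active: false))
end

# Dry run: record the commands instead of executing them.
calls = []
with_maintenance(riemann_host: "riemanna.example.com", service: "apache2",
                 runner: ->(cmd) { calls << cmd }) do
  calls << :maintenance_work
end
```

Putting the closing call in an ensure clause means a failed maintenance task does not leave notifications silenced indefinitely.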