Connecting Riemann and Zookeeper
One of my pet hates is having to maintain configuration inside monitoring tools. Not only large pieces like host definitions but smaller pieces like service and component definitions. Using a configuration management tool makes this much easier but it still generally requires some convergence to update your monitoring configuration when a host is added or removed or a service changes.
An example might be HAProxy. I have a HAProxy running with multiple back-end nodes. I want to know about issues if the node count drops below a threshold, potentially if it drops at all. With auto-scaling or just adding and subtracting nodes I need to keep this count up-to-date in my monitoring system to ensure I am correctly alerted when something goes wrong and to avoid false positives. I could do that with configuration management and converge the configuration when I deploy, using Puppet’s exported resources for example. But in a dynamic and fast-moving environment I’d really prefer not to wait for any convergence.
(Note: This is a somewhat artificial and very pets v. cattle example. I don’t overly care if individual nodes die because they are disposable and easily replaced. I could apply the same logic to any host or service threshold that I wanted to query.)
Instead I want my monitoring system to be able to look up my threshold in some source of truth about the state of my infrastructure. That source of truth could be something like Apache Zookeeper, Consul, or a configuration management store like PuppetDB.
In this post I’m going to combine Zookeeper and my Riemann monitoring stack. Let’s start with some code to connect to Zookeeper. It makes use of the Zookeeper-clj Clojure client.
The first part of our code loads the zookeeper-clj client. We then define a namespace called zookeep and require the client and the Zookeeper client's data functions, each under an alias. We've defined a var called client that is a connection to a local Zookeeper server. We could easily specify a remote server instead. We've created a very simple function named get_data that retrieves the contents of a specific Zookeeper node specified by its argument.
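The code described above might look something like the following. This is a minimal sketch under stated assumptions, not the exact listing: it assumes zookeeper-clj is already on the classpath, that the aliases are zk and data (illustrative names), that Zookeeper is listening on 127.0.0.1:2181, and that the node stores its count as text. The text is parsed into a number so that it can later be compared with a metric.

```clojure
;; zookeep.clj: a sketch, assuming zookeeper-clj is on the classpath.
(ns zookeep
  (:require [zookeeper :as zk]            ; alias names are assumptions
            [zookeeper.data :as data]
            [clojure.string :as string]))

;; Connection to a local Zookeeper server. Swap in a remote
;; "host:port" string to use a different server.
(def client (zk/connect "127.0.0.1:2181"))

(defn get_data
  "Return the contents of the Zookeeper node at `node` as a number.
  Assumes the node stores a count as UTF-8 text, such as \"3\"."
  [node]
  (-> (zk/data client node)   ; map with :data (byte array) and :stat
      :data
      data/to-string          ; bytes -> string
      string/trim
      Long/parseLong))        ; string -> long, so it works with <
```

Parsing inside get_data keeps the calling code simple: a caller can use the result directly in a numeric comparison without knowing how the value is stored.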
Let’s now create a riemann.config file to make use of our Zookeeper function.
In our configuration we’ve included our Zookeeper functions using the include function and bound Riemann to all the interfaces on our host. We’ve also configured email notification.
Next we’ve defined some streams including a where filter on an event generated from collectd called haproxy-backend.web-backend/gauge-active_servers. This is the active back-end server count from HAProxy. The where filter matches this service, if it is tagged with app1, and if the value of the metric field is less than the value derived from the (zookeep/get_data "/app1/haproxy/nodes") function. This function, zookeep/get_data, takes the node name /app1/haproxy/nodes and looks it up in Zookeeper.
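A riemann.config along these lines would match that description. Treat it as a sketch: the include path, the mailer settings, and both email addresses are placeholders assumed for illustration, and it relies on zookeep/get_data returning a number.

```clojure
; riemann.config: a sketch; path and addresses are placeholders.
(include "zookeep.clj")   ; assumed location of the Zookeeper functions

; Bind Riemann to all interfaces on the host.
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server  {:host host}))

; Assumed mailer setup; the from address is a placeholder.
(def email (mailer {:from "riemann@example.com"}))

(streams
  (where (and (service "haproxy-backend.web-backend/gauge-active_servers")
              (tagged "app1")
              ; Evaluated for each event that passes the two checks
              ; above, so the threshold is read from Zookeeper at
              ; match time rather than when the config is loaded.
              (< metric (zookeep/get_data "/app1/haproxy/nodes")))
    (email "ops@example.com")))
```

Because and short-circuits, the Zookeeper lookup only runs for events that already match the service and the tag, which keeps lookups off the path of unrelated events.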
Inside Zookeeper we’ve created this node and populated it with the count of HAProxy back-end nodes running for this specific application. That population of the node or its update would normally take place during deployment.
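As an illustration of that deployment step, a node could be populated with zookeeper-clj like this. The deploy namespace and the set-node-count! helper are hypothetical names introduced for this sketch, and the count of 3 is only an example value.

```clojure
;; deploy.clj: a hypothetical deployment-time helper (sketch only).
(ns deploy
  (:require [zookeeper :as zk]))

(defn set-node-count!
  "Store `n` as UTF-8 text in the Zookeeper node at `path`,
  creating the node and any missing parents first."
  [client path n]
  (when-not (zk/exists client path)
    (zk/create-all client path :persistent? true))
  ;; set-data needs the node's current version, taken from its stat.
  (let [version (:version (zk/exists client path))]
    (zk/set-data client path (.getBytes ^String (str n) "UTF-8") version)))

(comment
  ;; Example: record three HAProxy back-end nodes for app1.
  (def client (zk/connect "127.0.0.1:2181"))
  (set-node-count! client "/app1/haproxy/nodes" 3))
```

Storing the count as plain text keeps the node readable from any Zookeeper client, at the cost of the reader having to parse it back into a number.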
Now when the metric arrives in Riemann, the lookup is triggered and Riemann compares the value of the metric field with the value from the Zookeeper node. If the metric value is less than the node value then Riemann sends out an email containing the specific event. Our monitoring system no longer needs any changes when our HAProxy configuration changes, so we don’t have to wait for our deployment changes to converge in our monitoring environment. That means less risk of missing an alert or of generating a false positive.
… end interlude)