
Monitoring Survey 2015 - Effectiveness

In the previous posts I covered monitoring environments, metrics, the tools people use for monitoring and the demographics of the survey.

In this post I am going to look at the questions around the effectiveness of monitoring, how people handle alerting and the use of configuration management software.

As I’ve mentioned in previous posts, the survey received 1,116 responses, of which 884 were complete. My analysis only includes the complete responses.

This post will cover the questions:

12. When do you most commonly add monitoring checks or graphs to your environment?
13. Do you ever have unanswered alerts in your monitoring environment?
14. How often does something go wrong that IS NOT detected by your monitoring?
15. Do you use a configuration management tool like Chef, Puppet, Salt or Ansible to manage your monitoring infrastructure?

When do you add monitoring checks or graphs to your environment?

Question 12 attempts to identify when in the product and infrastructure lifecycle you add monitoring checks to your environment. This is designed to tease out whether your monitoring is proactive or reactive.

The question had the following choices:

  • When something goes wrong and we want to monitor for that problem in future.
  • When we build new infrastructure or deploy new applications.

I’ve provided a graph showing the distribution of answers.

/images/posts/2015/7/whencheck.png

We can see that most people, 62.7% of them, add checks when infrastructure or applications are deployed, leaving 37.3% who add checks reactively, after something has gone wrong. That’s largely unchanged from last year’s response.

I’ve also broken the answers down by organization size.

/images/posts/2015/7/whencheckorg.png

We can see that very small and very large organizations are slightly more reactive.

Do you ever have unanswered alerts in your monitoring environment?

Question 13 looks at alerting hygiene, that is, how people respond to alerts. I wanted to see how many respondents had outstanding alerts and how many actioned them immediately.

Each respondent had the option to answer the question with:

  • No - we action them all immediately
  • Yes - we usually have a few
  • Yes - we usually have some
  • Yes - we usually have a lot

I’ve provided a graph showing the distribution of answers.

/images/posts/2015/7/alert.png

We can see that the largest group of respondents, 401 or 45%, usually have a few unanswered alerts. This is identical to last year’s results for this category. The next largest group, 196 or 22% of respondents, action all alerts immediately. A further 19% usually have some unanswered alerts and 13% usually have a lot of unanswered alerts.
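As a quick sanity check on the percentages above, here is a minimal sketch of how a raw count becomes a share of the 884 complete responses. The function name `share` is my own for illustration and is not part of the survey tooling. Only the two counts quoted in the text (401 and 196) are used, since the post does not give counts for the other categories.

```python
# Number of complete responses, as stated at the start of the post.
COMPLETE_RESPONSES = 884


def share(count, total=COMPLETE_RESPONSES):
    """Return count as a percentage of total, rounded to a whole percent."""
    return round(100 * count / total)


# Counts quoted in the text for Question 13.
few = share(401)        # "Yes - we usually have a few"
immediate = share(196)  # "No - we action them all immediately"

print(f"A few unanswered alerts: {few}%")
print(f"All actioned immediately: {immediate}%")
```

Because each share is rounded independently, the four rounded figures quoted for this question (45%, 22%, 19% and 13%) add up to 99% rather than 100%.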

I also broke down alert behavior by organization size.

/images/posts/2015/7/alertorg.png

The patterns in this breakdown again felt very familiar. Like last year, fewer alerts are actioned immediately as the organization grows, and the volume of alerts that are not actioned increases.

I was also planning to add a question about alert fatigue in this year’s survey but was unable to frame one that provided viable data.

How often does something go wrong that IS NOT detected by your monitoring?

Question 14 asked about outages and failures in environments that are NOT detected via monitoring. The respondents had the option of answering:

  • Frequently
  • Occasionally
  • Never

I’ve graphed the responses here:

/images/posts/2015/7/monfail.png

We can see that 81% of respondents had something occasionally go wrong that wasn’t detected by monitoring. 11% stated that failures frequently occurred that were not detected by monitoring. 8% stated that there were never undetected failures in their environments. This is very close to last year’s results.

I further analyzed the response by organization size.

/images/posts/2015/7/monfailorg.png

Again we see some familiar patterns with more frequent unmonitored failures in larger organizations.

Do you use a configuration management tool?

The last question, Question 15, asked respondents if they used Configuration Management to manage their monitoring environment.

/images/posts/2015/7/cm.png

This year 71.7% of respondents used configuration management to manage their monitoring, which is in line with last year’s results.

Three respondents, or 0.3%, did not know what configuration management was.

I also analyzed the responses by organization size.

/images/posts/2015/7/cmorg.png

Again this year we see less use of configuration management in larger organizations.

P.S. I am also writing a book about monitoring.
