A Monitoring Maturity Model

I’ve been thinking a lot about monitoring maturity. Based on some research I did last year and a number of conversations with people in the industry, I’ve documented a simple monitoring maturity model. I present it largely because some folks might be interested rather than as any sweeping revelation.

The three-level maturity model reflects the stages of monitoring evolution I’ve seen organizations experience. The three stages are:

  • Manual
  • Reactive
  • Proactive

On to the details of the stages.

Manual or None

Monitoring is largely done manually or not at all. Where monitoring is performed, you will commonly see checklists, simple scripts and other non-automated processes. Much of the monitoring is cargo cult behaviour: the components that are monitored are those that have broken in the past, and faults in these components are remediated by repeatedly following rote steps that have also “worked in the past”.
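
Purely as an illustration of what those “simple scripts” tend to look like (the hosts, ports and the advice in the output are invented for this sketch), something like the following gets run by hand and eyeballed, with nothing recording the result:

    #!/usr/bin/env python3
    # A hypothetical "checklist" script for the manual stage: run by hand,
    # output eyeballed, nothing recorded and nothing alerted on.
    # Host names and ports are invented for illustration.
    import socket

    HOSTS = [("www1.example.com", 80), ("db1.example.com", 5432)]

    for host, port in HOSTS:
        try:
            # If we can open a TCP connection, call it "up".
            socket.create_connection((host, port), timeout=3).close()
            print(f"{host}:{port} is up")
        except OSError:
            print(f"{host}:{port} is DOWN - try the usual reboot")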

The focus is entirely on minimizing downtime and managing assets. Monitoring provides little or no value in measuring quality of service and provides little or no data that helps IT justify budgets, costs or new projects.

This is typical in small organizations with limited IT staffing, where there are no dedicated IT staff or where the IT function is run or managed by non-IT staff, such as a Finance team.

Reactive

Monitoring is mostly automatic with some remnants of manual or unmonitored components. Tooling of varying sophistication has been deployed to perform the monitoring. You will commonly see tools like Nagios with stock checks of basic concerns like disk, CPU and memory. Some performance data may be collected. Most alerting will be simple and via email or messaging services. There may be one or more centralized consoles displaying monitoring status.
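
For illustration, a stock check at this stage usually follows the Nagios plugin exit-code convention (0 for OK, 1 for WARNING, 2 for CRITICAL) so the scheduler can fire an email alert; the thresholds and default mount point in this sketch are assumptions rather than recommendations:

    #!/usr/bin/env python3
    # A minimal sketch of a Nagios-style disk check. Exit codes follow the
    # standard plugin convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
    # Thresholds and the default mount point are illustrative assumptions.
    import shutil
    import sys

    WARN, CRIT = 80, 90  # percent used
    path = sys.argv[1] if len(sys.argv) > 1 else "/"

    usage = shutil.disk_usage(path)
    percent_used = usage.used / usage.total * 100

    if percent_used >= CRIT:
        print(f"DISK CRITICAL - {path} at {percent_used:.0f}% used")
        sys.exit(2)
    elif percent_used >= WARN:
        print(f"DISK WARNING - {path} at {percent_used:.0f}% used")
        sys.exit(1)
    else:
        print(f"DISK OK - {path} at {percent_used:.0f}% used")
        sys.exit(0)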

There is a broad focus on measuring availability and managing IT assets. There may be some movement towards using monitoring data to measure customer experience. Monitoring provides some data that measures quality of service and provides some data that helps IT justify budgets, costs or new projects. Most of this data needs to be manipulated or transformed before it can be used, though. A small number of operationally-focussed dashboards exist.

This is typical in small to medium enterprises and common in divisional IT organizations inside larger enterprises. Monitoring here is typically built and deployed by an operations team. You’ll often find large backlogs of alerts and stale check configuration and architecture. Updates to monitoring systems tend to be reactive, made in response to incidents and outages. New monitoring checks are usually the last step in application or infrastructure deployments.

Proactive

Monitoring is considered core to managing infrastructure and the business. Monitoring is automatic and often driven by configuration management tooling. You’ll see tools like Nagios, Sensu, and Graphite with widespread use of metrics and graphing. Checks will tend to be more application-centric, with many applications being instrumented as part of development. Checks will also focus on measuring application performance and business outcomes rather than stock concerns like disk and CPU. Performance data will be collected and frequently used for analysis and fault resolution. Alerting will be annotated with context and likely include escalations and automatic responses.
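
As a sketch of what that application-centric instrumentation can look like (assuming a Graphite/Carbon plaintext listener on its default port, 2003, and metric names invented for the example), an application might time a transaction and ship both performance and business measurements directly:

    #!/usr/bin/env python3
    # A minimal sketch of application instrumentation: time a unit of work and
    # ship the measurements to Graphite over its plaintext protocol
    # ("metric.path value timestamp\n" sent to the Carbon listener, port 2003
    # by default). Host, port and metric names are illustrative assumptions.
    import socket
    import time

    CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003

    def send_metric(path, value, timestamp=None):
        """Send one metric to Graphite's plaintext (Carbon) listener."""
        line = f"{path} {value} {int(timestamp or time.time())}\n"
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=3) as sock:
            sock.sendall(line.encode("utf-8"))

    def checkout(order_total):
        """A pretend business transaction, instrumented during development."""
        start = time.time()
        # ... the real checkout work would happen here ...
        send_metric("shop.checkout.duration_ms", (time.time() - start) * 1000)
        # A business outcome, not just a stock concern like disk or CPU.
        send_metric("shop.checkout.order_total", order_total)

    checkout(order_total=49.95)

In practice you’d usually batch these through a client library or a local relay like statsd rather than opening a socket per metric, but the point is that the measurements come from inside the application and cover business outcomes, not just host-level resources.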

There is a focus on measuring quality of service and customer experience. Monitoring provides data that measures quality of service and provides data that helps IT justify budgets, costs or new projects. Much of this data is provided directly to business units, application teams and other interested parties via dashboards and reports.

This is typical in web-centric organizations and many mature startups. Monitoring will still largely be managed by an operations team, but responsibility for ensuring new applications and services are monitored may be devolved to application developers. Products will not be considered feature-complete or ready for deployment without monitoring and instrumentation.

Summary

I don’t believe or claim this model is perfect (or overly scientific). It’s also largely designed so I can quantify some work I am conducting. The evolution of monitoring in organizations varies dramatically, or, to paraphrase William Gibson, the future is not evenly distributed. The stages I’ve identified are broad, and organizations may sit at varying points on the spectrum within each stage.

Additionally, what makes measuring this maturity difficult is that I don’t think all organizations experience this evolution linearly or holistically. This can be the consequence of having employees with varying levels of skill and experience over different periods. Or it can be that different segments, business units or divisions of an organization have quite different levels of maturity. Or both.