Sensu: Finally the Nagios Replacement I Have Been Looking For!


Nagios, the popular open source system and network monitoring service, is awesome. It gives you a level of flexibility that expensive commercial software like Solarwinds and Spiceworks just doesn’t have. However, after I moved my infrastructure to Chef, Nagios was constantly giving me issues. Its configuration scheme just doesn’t fit well in an automated environment.

So I set out to find a replacement that worked well with Chef, and eventually found Sensu. After just 2 days of testing, I was ready to Old Yeller Nagios.


Sensu

Sensu is a health monitoring platform for servers, networks, and applications (it’s also a metrics platform, but I’ll get into that in another post). Sensu works in conjunction with your Chef or Puppet infrastructure by reading the roles of a node and running the checks that are assigned to those roles.

For instance, let’s say you added a node to Chef called apache-webapp-01 with roles: linux_base and webapp. Sensu has the following checks in its configuration:

{
  "checks": {
    "check_cpu": {
      "command": "check-cpu.rb -c 99 -w 79",
      "handlers": [
        "mailer"
      ],
      "subscribers": [
        "linux_base"
      ],
      "standalone": false,
      "interval": 30,
      "occurrences": 5,
      "refresh": 1800
    },
    "apache_check": {
      "command": "/etc/sensu/plugins/check-apache.rb",
      "interval": 60,
      "subscribers": [ "webapp" ],
      "handlers": [
        "slack"
      ]
    },
    "redis_check": {
      "command": "/etc/sensu/plugins/check-redis.rb",
      "interval": 60,
      "subscribers": [ "redis_db" ],
      "handlers": [
        "slack"
      ]
    }
  }
}

Sensu will automatically add the node to its clients list, then add the check_cpu and apache_check checks since the check subscribers match the node’s roles in Chef. The key thing about this is it’s AUTOMATIC! You only configure additional checks, handlers, and custom plugins for the Sensu server.
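
The matching rule can be pictured with a few lines of Ruby. This is only an illustrative sketch, not Sensu's actual code: the checks_for helper and the hash layout are hypothetical, and the data simply mirrors the example configuration above.

```ruby
# Hypothetical sketch of subscription matching; not Sensu's real implementation.
CHECKS = {
  'check_cpu'    => { 'subscribers' => ['linux_base'] },
  'apache_check' => { 'subscribers' => ['webapp'] },
  'redis_check'  => { 'subscribers' => ['redis_db'] }
}.freeze

# Return the names of the checks whose subscribers overlap the node's roles.
def checks_for(roles, checks = CHECKS)
  checks.select { |_name, check| (check['subscribers'] & roles).any? }.keys
end

# apache-webapp-01 has the roles linux_base and webapp.
puts checks_for(%w[linux_base webapp]).inspect
```

The redis_check entry is skipped because the node has no redis_db role.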

Checks

Checks are individual monitoring configurations that tell the clients which plugins to run and tell the server what to do when a check produces an event. Like Nagios, the result of a check is an exit status of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 and above (UNKNOWN). This should be quite familiar to anyone who has written Nagios plugins. In fact, your Nagios scripts and plugins will work with Sensu without any changes!
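
To make the status-code convention concrete, here is a minimal sketch of the logic a plugin might use without any helper library. The disk_status function and its default thresholds are made up for illustration; a real plugin would measure something itself and then exit with the returned code.

```ruby
# Illustrative only: maps a usage percentage to a Nagios-style status code.
STATUS_NAMES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL', 3 => 'UNKNOWN' }.freeze

# Hypothetical helper: returns [status_code, message] for a usage percentage.
def disk_status(used_percent, warn: 80, crit: 95)
  return [3, 'UNKNOWN: no usage data'] if used_percent.nil?

  code =
    if used_percent >= crit
      2
    elsif used_percent >= warn
      1
    else
      0
    end
  [code, "#{STATUS_NAMES[code]}: disk is #{used_percent}% full"]
end

code, message = disk_status(87)
puts message # the human-readable part goes to STDOUT
# A real plugin would finish with `exit code` so the exit status carries the result.
```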

Let’s go over a simple check called check-disk line-by-line:

{
  "checks": {
    "check-disk": {
      "command": "/opt/sensu/bin/check-disk.rb -w 80 -c 95",
      "handlers": [
        "mailer"
      ],
      "subscribers": [
        "base"
      ],
      "interval": 30,
      "occurrences": 5,
      "refresh": 1800
    }
  }
}

——————————————-

"command": "/opt/sensu/bin/check-disk.rb -w 80 -c 95",

Run the script check-disk.rb. If disk usage is more than 80%, send a WARNING event back to the Sensu server; if it is more than 95%, send a CRITICAL event.


"handlers": [
        "mailer"
      ],

Use the “mailer” handler. This particular handler takes the output (STDOUT) of the check and emails it to a list of users.


"subscribers": [
        "base"
      ],

Run the check only on nodes in the base role.


"interval": 30,

The client will run the plugin every 30 seconds.


"occurrences": 5,

The number of times the event must occur before it is handled for the check.


"refresh": 1800

Time in seconds until the event occurrence count is considered reset for the purpose of counting occurrences, to allow an event for the check to be handled again. For example: a check with a refresh of 1800 will have its events (recurrences) handled every 30 minutes, to remind users of the issue.
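
How occurrences and refresh interact is easier to see in code. The sketch below is a simplified approximation of the repeat filtering, loosely modeled on how Sensu handlers have commonly applied these settings; the handle_event? function is hypothetical and the exact behavior may differ between Sensu versions.

```ruby
# Simplified sketch of occurrence/refresh filtering; an approximation, not Sensu's code.
# Returns true when an event with the given occurrence count should be handled.
def handle_event?(event_occurrences, occurrences: 5, interval: 30, refresh: 1800)
  return false if event_occurrences < occurrences # not enough occurrences yet
  return true if event_occurrences == occurrences # first alert

  every = refresh / interval # number of check runs per refresh window
  every.zero? || ((event_occurrences - occurrences) % every).zero?
end

# Which of the first 130 occurrences would reach the handler?
handled = (1..130).select { |n| handle_event?(n) }
puts handled.inspect
```

With the values from check-disk, the first alert goes out on the fifth occurrence and a reminder follows every 60 check runs after that, which at a 30-second interval is every 30 minutes.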

Handlers

Now the fun part: let’s review how Sensu handles events. In the past an email or SMS message was the popular way to receive an event saying something like CRITICAL: Disk /var is full! Nowadays, with our many collaboration tools we want more options. Let’s look at a handler that will send an event to Slack.

Here is the slack handler, line-by-line.

{
  "handlers": {
    "slack": {
      "type": "pipe",
      "command": "/opt/sensu/embedded/bin/handler-slack.rb",
      "severities": "critical"
    }
  },
  "slack": {
    "webhook_url": "https://hooks.slack.com/services/xxxx/ooooooooo/xxxxxxx",
    "channel": "#sensu"
  }
}

——————————————-

"type": "pipe",

The event data is passed to the process via STDIN. The other types are tcp, udp, transport, and set.
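
As a rough picture of what a pipe handler does with that input, here is a small sketch that reads one JSON event and builds a one-line message. The format_event function is hypothetical, the sample event is heavily trimmed, and the field names (client.name, check.name, check.output) are assumed to follow the usual Sensu event layout.

```ruby
require 'json'
require 'stringio'

# Sketch of a pipe handler: read one JSON event from an IO and build a message.
# format_event is a hypothetical helper, not part of any Sensu library.
def format_event(io)
  event = JSON.parse(io.read)
  client = event.dig('client', 'name')
  check  = event.dig('check', 'name')
  output = event.dig('check', 'output').to_s.strip
  "#{client}/#{check}: #{output}"
end

# A real handler would call format_event($stdin); a StringIO stands in here.
sample = StringIO.new(
  '{"client":{"name":"apache-webapp-01"},' \
  '"check":{"name":"check-disk","output":"CRITICAL: Disk /var is full!\n","status":2}}'
)
puts format_event(sample)
```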


"command": "/opt/sensu/embedded/bin/handler-slack.rb"

Runs the handler-slack.rb plugin (installed via the sensu-plugins-slack gem), which sends the event message to Slack.


"severities": "critical"

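Only run the handler if the event is critical. Other options are “ok”, “warning”, and “unknown”. One caveat: as far as I can tell, the Sensu handler reference describes severities as an array, so depending on your version you may need to wrap the value in square brackets.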


"slack": {
    "webhook_url": "https://hooks.slack.com/services/xxxx/ooooooooo/xxxxxxx",
    "channel": "#sensu"
  }

The global configuration the handler uses to send messages to Slack: the webhook URL and the channel to post to.


Here is what the event message would look like in Slack:

Event message in slack - image



Creating Sensu Plugins

As with Nagios, creating plugins for Sensu is easy and the system is VERY flexible. You can use any scripting language that can print a message to STDOUT and exit with a status code. Sensu also provides a Ruby gem, sensu-plugin, whose sensu-plugin/check/cli library makes creating plugins even easier.

Let’s look at an apcupsd plugin I made that checks the battery time left on an APC UPS.

#!/usr/bin/env ruby
# check-apcupsd-timeleft.rb
#
# Sensu plugin that checks the battery time (in minutes) using the apcupsd daemon.
#
# Examples:
# check-apcupsd-timeleft.rb -w 5 -c 1
#
# Send warning if the battery time is 5 minutes or less,
# critical if 1 minute or less.
require 'rubygems' if RUBY_VERSION < '1.9.0'
require 'sensu-plugin/check/cli'
class CheckApcupsd < Sensu::Plugin::Check::CLI
  option :warn,
    :short => '-w WARN',
    :proc => proc {|a| a.to_i },
    :default => 5
  option :crit,
    :short => '-c CRIT',
    :proc => proc {|a| a.to_i },
    :default => 1
  def run
    apcaccess = '/sbin/apcaccess'
    # Each line ends with a pipe so the shell reads this as a single pipeline.
    results = (%x[#{apcaccess} status | grep -i timeleft |
                                        awk '{print $3}' |
                                        awk -F'.' '{print $1}' ]).to_i
    # Test the critical threshold first so it takes precedence over warning.
    if results <= config[:crit]
      critical "UPS Battery time is #{results} minutes"
    elsif results <= config[:warn]
      warning "UPS Battery time is #{results} minutes"
    else
      ok "UPS Battery time is #{results} minutes"
    end
  end
end

-------------------------------------------

require 'sensu-plugin/check/cli'

The library from the sensu-plugin Ruby gem. It gives you helper methods that print the result to STDOUT and exit with the correct status code for the Sensu server.


def run
....
end

The required method for sensu-plugin/check/cli. In the run method, create your logic for the plugin then use the ok, warning, critical or unknown functions with a string argument to send the appropriate event to the Sensu Server. For example:

if cpu > 90
   critical "Wake up! CPU is going nuts!!!"
end

Resources