Compute a 15 minute average of a metric easily with Ganglia

April 6th, 2012

This is a quick way to extract a 15 minute average of a metric in Ganglia. It utilizes Ganglia's CSV export function to fetch the values, then uses awk to compute the average.

First, find a metric graph you want to compute an average from. Right-click the image and copy the image location. Then append &csv=1 to the URL, and pass the UNIX timestamp from 15 minutes ago as the &cs= argument. This simple shell script illustrates it:

# GRAPH_URL is the image location you copied above (placeholder shown here)
GRAPH_URL="http://your-ganglia-host/graph.php?..."
MIN15AGO=`date --date="15 minutes ago" "+%s"`
curl --silent "${GRAPH_URL}&cs=${MIN15AGO}&csv=1" | \
   awk -F, 'NR > 1 { sum += $2; n++ } END { print "Average =", sum/n }'
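To sanity-check the pipeline without hitting a live Ganglia, the awk averaging step can be fed made-up CSV. The header line and values below are illustrative, and the NR > 1 guard assumes the CSV export prints a header row first:

```shell
# feed sample CSV (made-up values) through the same awk averaging step;
# NR > 1 skips the header row the CSV export is assumed to emit
printf 'Timestamp,load_one\n10:00,1.0\n10:05,2.0\n10:10,3.0\n' | \
  awk -F, 'NR > 1 { sum += $2; n++ } END { printf "Average = %g\n", sum / n }'
# prints: Average = 2
```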

Adding context to your alerts

March 29th, 2012

I am a big believer in adding context to alerts. It allows the recipient to make a better decision about how to deal with the alert. Alerts are often hard to classify, so providing as much context as possible is extremely helpful. For instance, if I am alerting on the value of a metric, I like to attach an image of that metric for the past hour. That way, if I am out and about with only my mobile phone, I have the graph of the alerting metric right there without needing to open another window or start up my laptop.

In more recent versions of Ganglia there is an option to add overlay events to hosts, which show up as vertical lines on the graph. I figured that would be great context to add to alerts. Since I'm using Nagios, I decided to extend a mail handler I had used before to query Ganglia's events database and include any events connected to the matching host in the last 24 hours. This helps keep the team on the same page and well informed in a number of scenarios, e.g.

  • There was a code push/config change, but the host/service was not scheduled for maintenance
  • A recent code push is causing issues, e.g. web servers are crashing

This is an example of the e-mail you get:

As an added bonus, the mail handler sends all alerts to a Nagios bot :-). Now all you need to do is make sure events are recorded for any major change. You could do a lot of this automatically, e.g. by

  • Adding hooks to your startup scripts so that purposeful service restarts are logged
  • Watching logs and inserting the proper events into the timeline, e.g. when an app stopped
  • Querying external services, e.g. Dynect provides an API to query zone changes
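A startup-script hook could be as small as a curl call against ganglia-web's events API. The host name, API path and parameters below are assumptions based on my setup, so adjust to taste:

```shell
# build the event-submission URL (host, path and host_regex are illustrative)
# and print it; "start_time=now" timestamps the event at submission time
GANGLIA="ganglia.example.com"
SUMMARY=$(echo "restarted apache" | sed 's/ /%20/g')   # naive URL-encoding
URL="http://${GANGLIA}/ganglia/api/events.php?action=add&start_time=now&summary=${SUMMARY}&host_regex=web01"
echo "$URL"              # curl --silent "$URL" would record the event
```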

You can download the mail handler from here



Monitoring NetApp Fileservers with Ganglia

March 29th, 2012

In our environment we use NFS on NetApp fileservers a lot. They hold home directories, build directories (don't ask), DB data directories, etc. This is done mostly for reliability and data integrity. However, it leads to a number of problems, since the filers are shared by different groups of users and are a "black box" to them. Frequently we'll get reports of machines or builds being unusually slow. This causes lots of confusion, since in most cases of slowness the machines involved appear "idle": CPU utilization is unremarkable, i.e. < 10%, yet CPU I/O wait is significantly elevated. We would then posit that the cause was external, e.g. NFS related.

To avoid the guesswork, I decided to start monitoring the NetApp fileservers to get insight into what is going on. My team doesn't manage the NetApps, but we do use Ganglia. I found check_netappfiler and, using it as a template, built a script that gathers metrics from the NetApp and sends them to Ganglia. You can download the script from here

Basically it queries a list of NetApp servers and injects their metrics into Ganglia. So far the metric gathering has been invaluable. For example, on one occasion we got a report of slowness from a couple of users. I observed that CPU utilization on the NetApp their project was using was at 100 percent. That may "explain" the slowness.
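The injection side boils down to one gmetric call per counter. A minimal sketch, assuming gmetric is on the PATH and the CPU-busy value has already been pulled from the filer (the metric name and value are made up):

```shell
# push one sample, the filer's CPU-busy percentage, into Ganglia;
# the command is echoed here so the sketch is copy-paste safe --
# drop the echo to actually emit the metric
CPU_BUSY=87   # would come from the filer query
CMD="gmetric --name netapp01_cpu_busy --value ${CPU_BUSY} --type uint16 --units percent"
echo "$CMD"
```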

That wasn't all, either. Preceding the event there was heavy NFS utilization.

However, as soon as CPU utilization hits 100%, the number of NFS ops plummets.

And so does the number of network bytes sent.

And the bytes read from disk. It sure looks like a NetApp bug. Luckily, with all these metrics available, we could figure out much more quickly what was going on and what further steps to take, i.e. contact the vendor.

RESTful way to manage your databases

January 2nd, 2012

In my development environment I need an easy way to create/drop MySQL databases and users. Initially I was going to implement a simple, hacky HTTP GET method, but Ben Black dissuaded me from doing so. He suggested I write a proper RESTful interface. Without further ado, I present to you dbrestadmin.

It is my first foray into writing RESTful services, so things may be rough around the edges. However, it allows you to do the following:

  • manage multiple database servers
  • create/drop databases
  • list databases
  • create/drop users
  • list users
  • give user grants
  • view grants given to the user
  • view database privileges on a particular database given to a user

For example, to create a database called testdb on the database server with ID=0, use this cURL command:

curl -X POST http://myhost/dbrestadmin/v1/databases/0/dbs/testdb

Create a user test2 with password test

curl -X POST "http://localhost:8000/dbrestadmin/v1/databases/0/users/test2@localhost" -d "password=test"

Give test2 user all privileges on testdb

curl -X POST "http://localhost:8000/dbrestadmin/v1/databases/0/users/test2@localhost/grants" -d "grants=all privileges&database=testdb"
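Dropping what you created presumably maps to the DELETE verb on the same resource paths. The calls below are a guess from the routes shown above, not verified API calls, and are echoed so the sketch is safe to run anywhere (drop the echo to execute):

```shell
# guessed drop counterparts of the create calls above (routes not verified)
echo curl -X DELETE "http://localhost:8000/dbrestadmin/v1/databases/0/dbs/testdb"
echo curl -X DELETE "http://localhost:8000/dbrestadmin/v1/databases/0/users/test2@localhost"
```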

There is more. You can see all of the methods here

Improvements and constructive criticism welcome

Operating on Dell RAID arrays cheatsheet

November 23rd, 2011

Every so often I have to add new drives to Dell RAID arrays, e.g. behind the H700 controller. For some reason it takes me a couple of searches to find the right commands, so I am recording them here where I can find them later.

List all drives

/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL

Create a RAID array (e.g. RAID 0 across the drives with enclosure:slot IDs 32:4 and 32:5)

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:4, 32:5] -aALL

List working RAID arrays

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL

Confirm you got the right RAID array e.g. Virtual Disk 1

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -L1 -aALL

Delete RAID array

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -aALL
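One more invocation I find handy, sketched with the binary path in a variable. The rebuild-progress flags are recalled from the MegaCLI documentation rather than tested here, so verify them on your controller:

```shell
# stash the long binary path once
MEGACLI=/opt/MegaRAID/MegaCli/MegaCli64
# rebuild progress for one physical drive (enclosure:slot 32:4); echoed so the
# sketch runs anywhere -- drop the echo on a box that has the controller
echo "$MEGACLI -PDRbld -ShowProg -PhysDrv [32:4] -aALL"
```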


Use fantomTest to test web pages from multiple locations

September 27th, 2011

In my previous post I introduced Testing your web pages with fantomtest. I have recently added the ability to test the same page from multiple sites within the same interface. You simply install a copy of fantomTest on a remote site and then configure your primary site to access it. For example, this is a test of Google from my laptop.

It looks like my network connection is really slow :-(. Changing the testing site to Croatia, where I have a server, I get:

The results are slightly different, since Google redirects me to its localized site, but it leads me to believe that it's my connection that is slow, not Google.

Any number of "remotes" can be added. Want it? Get it @GitHub

Using Jenkins as a Cron Server

August 22nd, 2011

There are a number of problems with cron that cause lots of grief for system administrators, the big ones being manageability, cron spam and auditability. To fix some of these issues I have lately started using Jenkins. Jenkins is an open source continuous integration server, and it has lots of features that make it a great cron replacement for a number of uses. These are some of the problems it solves for me:


Log retention

Jenkins can be configured to retain the logs of every job it runs. You can set it up to keep the last 10 runs, or only the last 2 weeks of logs. This is incredibly useful, since jobs sometimes fail silently, and it's much better to have the output on hand than to have sent it to /dev/null.

Centralized management

I have my most important jobs centralized. I can export all Jenkins jobs as XML and check them into a repository. If I need to execute jobs on remote hosts, I simply have Jenkins ssh in and execute the command remotely. Alternatively, you can use Jenkins slaves.

Cron Spam

Cron spam is a common problem with solutions such as this, this and this. To avoid it, I have Jenkins alert me only when a particular job fails, i.e. exits with a return code other than 0. In addition, you can use the awesome Jenkins Text Finder plugin, which lets you specify words or regular expressions to look for in the console output and mark a job unstable when they match. For example, in the Text Finder config I checked

X Also search the console output

and specified

Regular expression ([Ee]rror*).*

This has saved our bacon, since we were using a script that "swallowed" the error codes from the mysqldump command and exited normally. Text Finder caught this:

mysqldump: Error 2020: Got packet bigger than 'max_allowed_packet' bytes when dumping table `users` at row: 234

Happily, we caught this one in time.
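The failure mode is easy to reproduce: in a pipeline, the shell reports the exit status of the last command, so a backup wrapper can exit 0 even though mysqldump failed. The error then only survives in the console output, which is exactly where Text Finder looks. A sketch with a mocked mysqldump (names and the error text are illustrative):

```shell
# mock of a failing mysqldump: prints an error to stderr and exits non-zero
dump_db() { echo "mysqldump: Error 2020: Got packet bigger than 'max_allowed_packet'" >&2; return 2; }

dump_db | gzip > /dev/null     # pipeline status is gzip's, not dump_db's
echo "pipeline exit=$?"
# stdout: pipeline exit=0  (the mysqldump error only appears on stderr)
```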

Job dependency

Often you will have job dependencies, e.g. a main backup job where you first dump a database locally and then upload it somewhere off-site or to the cloud. The way we did this in the past was to leave a sufficiently large window between the first job and the subsequent one, to be reasonably sure the first job had finished. This says nothing about what to do if the first job fails; likely the second one will too. With Jenkins I no longer have to do that. I can simply tell Jenkins to trigger "backup to the cloud" once the local DB backup concludes successfully.

Test immediately

While you are adding a job, it's useful to test whether it runs properly. With cron you often had to wait until the job executed at, say, 3 am to discover that PATH wasn't set properly or that there was some other problem with your environment. With Jenkins I can click Build Now and the job runs immediately.

Easy setup

Setting up jobs is easy. I have engineers set up their own jobs by copying an existing job and modifying it to do what they need. I don't remember the last time someone asked me how to do it :-).

What I don't use Jenkins for

I don't use Jenkins to run jobs that collect metrics, or anything else that has to run very frequently.


Testing your web pages with fantomtest

August 2nd, 2011

Coming from a web operations background, my web site/page monitoring has largely focused on metrics such as average request duration, 90th percentile request duration, etc. These are all great metrics; however, through the Velocity conferences I have come to appreciate that there is a lot more to web performance than simply knowing how long it takes to load the HTML of a web page. As a result I have been looking for ways to get better metrics by utilizing real browsers instead of Perl/Ruby/Python scripts. For some time I played with Selenium RC to give me an easy way to test and time my web application; unfortunately, I found it heavy and slow.

At the last Velocity conference I was fortunate enough to see a demo of PhantomJS. PhantomJS is a semi-headless WebKit browser with JavaScript support. What I really appreciate about it is that it is lightweight, fast and very easy to instrument using JavaScript. In addition, it includes a number of useful examples such as netsniff.js, which outputs an HTTP Archive (HAR) of the requests to a given web page. From a HAR file you can build, among other things, waterfall charts. There are a number of services you can use to have your site tested for free. The limitation is that they can't test your intranet infrastructure, since that is usually behind a firewall, nor remote sites that are connected to your intranet via a VPN.

That is why I'm introducing fantomTest: a simple web application that generates waterfall graphs using PhantomJS. It will also take a screenshot of the rendered page. Here is what that looks like

What's interesting in this particular case is that Google is not following web performance recommendations, since it answers with an HTTP redirect.

Anyway, fantomTest is available on GitHub.

Monitoring links and monitoring anti-patterns video

June 5th, 2011

John Vincent, a.k.a. lusis, has started an interesting conversation about monitoring on Freenode, on a channel he named ##monitoringsucks. He has also done an awesome job of starting a GitHub project of the same name that is shaping up to be a nice collection of links to tools and blog posts. Check it out

We also just got hold of the monitoring anti-patterns Ignite talk from Devopsdays Boston by Alexis Lê-Quôc, a.k.a. @alq. It is a short video (5 minutes), so it's definitely worth seeing.

Use your trending data for alerting

April 19th, 2011

This post will deal with helping you use the data you already have to do alerting. It is most helpful for people running Nagios or its variants, such as Icinga, Netreo, etc. It could likely be used with other decoupled alerting systems (though not Zabbix or Zenoss, since they do their own trending).

Recently I came to the realization that lots of sysadmins are unaware that they could easily use the trending data they already capture with systems such as Ganglia, Graphite, collectd, Munin, etc. to do alerting. The standard way of doing health checks of remote nodes in Nagios is to install the Nagios Remote Plugin Executor, a.k.a. NRPE, which allows you to execute Nagios plugins on remote nodes and pipe their output to the Nagios server. NRPE does the job; however, it has three major disadvantages:

  1. It is another daemon that needs to run on the remote host, possibly introducing security concerns
  2. Depending on the load of the machine, it can be slow, bogging down the Nagios server
  3. Last and most important, it is commonly used to alert on metrics such as disk, load, CPU and swap, which you should be trending anyway

Instead, what you ought to be doing is using your trending data for alerting. I can think of at least four reasons to do so:

  1. You may already be collecting the pertinent data, e.g. system load, swap, CPU utilization
  2. If you are alerting on a particular metric, you should likely be trending it anyway
  3. It's fast
  4. It allows you to do more sophisticated checks easily, e.g. alert me if more than 5 hosts have a load greater than 5
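As a sketch of that last point, a check like "alert if more than 5 hosts have a load greater than 5" is a short awk program once your trending system can dump host/value pairs. The CSV format, host names and values below are illustrative:

```shell
# hosts and load_one values (made up); exit 2 is Nagios CRITICAL, exit 0 is OK
printf 'web01,7.2\nweb02,1.0\nweb03,6.5\ndb01,0.4\n' | \
  awk -F, '$2 > 5 { n++ }
           END { if (n > 5) { print "CRITICAL: " (n+0) " hosts with load > 5"; exit 2 }
                 print "OK: " (n+0) " hosts with load > 5" }'
# prints: OK: 2 hosts with load > 5
```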

Years ago I used the Ganglia web PHP code to write my own generic Nagios Ganglia plugin, and it has served me well. More recently, Michael Conigliaro rewrote the script in Python, making it more versatile and more powerful. You can download it from here

In a nutshell, it downloads the whole metrics tree, i.e. the list of all hosts with their associated metrics, caches it for a configurable amount of time, and then uses NagAconda to support threshold reporting as defined in the Nagios developer guidelines.

Another alternative, if you have a very large site, is Ganglios, which was open sourced by the folks at Linden Lab. Their problem is/was that they have thousands of hosts, and downloading the whole metrics tree takes ~15 seconds, so they separated the logic that downloads the metrics tree from the logic that does the alerting. Ganglios is available for download as well.

This can easily be adapted to work with your trending system of choice.