Monitoring NetApp Fileservers with Ganglia

March 29th, 2012

In our environment we use NFS on NetApp fileservers a lot. They hold home directories, build directories (don't ask), DB data directories etc. This is done mostly for reliability and data integrity. However it leads to a number of problems, since the filers are shared by a number of different groups of users and are a "black box" to them. Frequently we get reports of machines or builds being unusually slow. This results in lots of confusion, since in most past cases of slowness the machines involved looked "idle": CPU utilization was unremarkable, i.e. < 10%, yet CPU wait I/O was significantly elevated. We would then posit that the cause was external, e.g. NFS related. To avoid the guesswork I decided to start monitoring the NetApp fileservers to get insight into what is going on. My team doesn't manage the NetApps, however we do use Ganglia. I found check_netappfiler and, using it as a template, built a script to gather metrics from a NetApp and send them to Ganglia. You can download the script from here

https://github.com/ganglia/gmetric/tree/master/netapp

Basically it queries a list of NetApp servers and injects those metrics into Ganglia. So far metric gathering has been invaluable. For example, on one occasion we got a report of slowness from a couple of users. I observed that CPU utilization on the NetApp that project was using was at 100 percent. That may "explain" the slowness.

However that wasn't all. Preceding the event there was heavy NFS utilization.

However, as soon as CPU utilization goes to 100% the number of NFS ops plummets,

and so do the network bytes sent,

and the bytes read from disk. It sure looks like a NetApp bug. Luckily, since we have all these metrics available, we could figure out much more quickly what was going on and what further steps to take, i.e. contact the vendor.
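
The "inject into Ganglia" step can be sketched roughly as follows. This is not the script linked above, just a minimal illustration that assumes the stock gmetric command line tool and its --spoof option; the filer name and the sample values are made up, and in real use the values would come from polling the filer.

```python
import shlex


def build_gmetric_command(filer, name, value, units, metric_type="float"):
    """Return the argv list for one gmetric call that reports a metric for `filer`."""
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", metric_type,
        "--units", units,
        # --spoof takes "IP:hostname" so the metric is filed under the filer
        # rather than under the machine that ran the collection.
        "--spoof", f"{filer}:{filer}",
    ]


if __name__ == "__main__":
    # Hypothetical sample values; a real collector would query the filer here.
    sample = {"cpu_busy": (97.0, "percent"), "nfs_ops": (1250.0, "ops/sec")}
    for metric, (value, units) in sample.items():
        argv = build_gmetric_command("netapp1.example.com", metric, value, units)
        print(shlex.join(argv))
```

The sketch only prints the commands. Passing each argv list to subprocess.run would actually send the metric, provided gmetric is installed and configured on the collecting host.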

RESTful way to manage your databases

January 2nd, 2012

I have a need in my development environment to easily create/drop MySQL databases and users. Initially I was going to implement a simple hacky HTTP GET method but was dissuaded from doing so by Ben Black. He suggested I write a proper RESTful interface. Without further ado I present to you dbrestadmin

https://github.com/vvuksan/dbrestadmin

It is my first foray into writing RESTful services so things may be rough around the edges. However it allows you to do the following

  • manage multiple database servers
  • create/drop databases
  • list databases
  • create/drop users
  • list users
  • give user grants
  • view grants given to the user
  • view database privileges on a particular database given to a user

For example, if you need to create a database called testdb on dbserver ID=0, use this cURL command

curl -X POST http://myhost/dbrestadmin/v1/databases/0/dbs/testdb

Create a user test2 with password test

curl -X POST "http://localhost:8000/dbrestadmin/v1/databases/0/users/test2@localhost" -d "password=test"

Give test2 user all privileges on testdb

curl -X POST "http://localhost:8000/dbrestadmin/databases/0/users/test2@'localhost'/grants" -d "grants=all privileges&database=testdb"
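
If you would rather script these calls than type cURL commands, the first two can be sketched in Python with nothing but the standard library. The base URL is simply the one from the cURL examples and the helper names are mine, not part of dbrestadmin. The requests are only built here, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: same host, port and /v1 prefix as in the cURL examples.
BASE_URL = "http://localhost:8000/dbrestadmin/v1"


def create_database_request(server_id, db_name):
    """Build the POST request that creates `db_name` on the given server ID."""
    url = f"{BASE_URL}/databases/{server_id}/dbs/{quote(db_name)}"
    return Request(url, method="POST")


def create_user_request(server_id, user, host, password):
    """Build the POST request that creates user@host with a form-encoded password."""
    url = f"{BASE_URL}/databases/{server_id}/users/{quote(user)}@{quote(host)}"
    body = urlencode({"password": password}).encode("ascii")
    return Request(url, data=body, method="POST")


if __name__ == "__main__":
    for req in (create_database_request(0, "testdb"),
                create_user_request(0, "test2", "localhost", "test")):
        print(req.get_method(), req.full_url, req.data)
```

To actually send a request, pass it to urllib.request.urlopen and check the response status.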

There is more. You can see all of the methods here

https://github.com/vvuksan/dbrestadmin/blob/master/API.md

Improvements and constructive criticism welcome

Operating on Dell RAID arrays cheatsheet

November 23rd, 2011

I infrequently have to add new drives to Dell RAID arrays like the H700. For some reason it takes me a couple of searches to find the info, so I am putting it here so I can find it later.

List all drives

/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL

Create a RAID array (e.g. RAID 0)

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:4,32:5] -aALL

List working RAID arrays

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL

Confirm you got the right RAID array e.g. Virtual Disk 1

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -L1 -aALL

Delete RAID array

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -aALL

				

Use fantomTest to test web pages from multiple locations

September 27th, 2011

In my previous post I introduced Testing your web pages with fantomtest. I have recently added the ability to test the same page from multiple sites within the same interface. You simply install a copy of fantomTest on a remote site, then configure your primary site to access it. For example this is a test of Google from my laptop.

Looks like my network connection is really slow :-( . Changing the testing site to Croatia where I have a server I get

Slightly different, since Google redirects me to their localized Google site, however it leads me to believe that it's my connection that is slow, not Google.

Any number of "remotes" can be added. Want it? Get it @GitHub

https://github.com/vvuksan/fantomtest

Using Jenkins as a Cron Server

August 22nd, 2011

There are a number of problems with cron which cause lots of grief for system administrators, the big ones being manageability, cron-spam and auditability. To fix some of these issues I have lately started using Jenkins. Jenkins is an open source Continuous Integration server, and it has lots of features that make it a great cron replacement for a number of uses. These are some of the problems it solves for me

Auditability

Jenkins can be configured to retain logs of all jobs that it has run. You can set it up to keep the last 10 runs or to keep only the last 2 weeks of logs. This is incredibly useful since jobs can sometimes fail silently, so it helps to have the output instead of sending it to /dev/null.

Centralized management

I have my most important jobs centralized. I can export all Jenkins jobs as XML and check them into a repository. If I need to execute jobs on remote hosts I simply have Jenkins ssh in and execute the command remotely. Alternatively you can use Jenkins slaves.

Cron Spam

Cron spam is a common problem with solutions such as this, this and this. To avoid it I only have Jenkins alert me when a particular job fails, i.e. a job exits with a return code other than 0. In addition you can use the awesome Jenkins Text Finder plugin, which allows you to specify words or regular expressions to look for in console output. They can be used to mark a "job" unstable. For example in the Text Finder config I checked

X Also search the console output

and specified

Regular expression ([Ee]rror*).*

This has saved our bacon since we used the automysqlbackup.sh script, which "swallows" the error codes from the mysqldump command and exits normally. Text Finder caught this

mysqldump: Error 2020: Got packet bigger than 'max_allowed_packet' bytes when dumping table `users` at row: 234

Happily we caught this one in time.
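
To see what that regular expression does and does not catch, here is a small Python sketch. Jenkins evaluates the pattern with Java's regex engine, not Python's, but for a pattern this simple the two behave the same. The function name is mine and only mimics the "search the console output" idea.

```python
import re

# Same pattern as in the Text Finder config above.
TEXT_FINDER_PATTERN = re.compile(r"([Ee]rror*).*")


def is_unstable(console_output):
    """Return True if any line of the console output matches the pattern."""
    return any(TEXT_FINDER_PATTERN.search(line)
               for line in console_output.splitlines())


if __name__ == "__main__":
    failing = ("mysqldump: Error 2020: Got packet bigger than "
               "'max_allowed_packet' bytes when dumping table `users` at row: 234")
    print(is_unstable(failing))
    print(is_unstable("Backup finished, 42 tables dumped"))
```

Note that the trailing `r*` means "zero or more r", so the pattern also matches a bare "Erro". If that is too loose for your jobs, `[Ee]rror` is the tighter choice.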

Job dependency

Often you will have job dependencies, i.e. a main backup job where you first dump a database locally, then upload it somewhere off-site or to the cloud. The way we have done this in the past is to leave a sufficiently large window between the first job and the subsequent job to be sure the first job has finished. This says nothing about what to do if the first job fails. Likely the second one will too. With Jenkins I no longer have to do that. I can simply tell Jenkins to trigger "backup to the cloud" once the local DB backup concludes successfully.
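
The logic Jenkins gives you here boils down to "run the next step only if the previous one exited with 0". A minimal Python sketch of that idea follows. This is not how Jenkins implements it, and the two commands are harmless stand-ins for the real dump and upload steps.

```python
import subprocess
import sys


def run_chain(commands):
    """Run commands in order; stop at the first non-zero exit code.

    Returns the list of exit codes of the commands that actually ran.
    """
    codes = []
    for argv in commands:
        code = subprocess.run(argv).returncode
        codes.append(code)
        if code != 0:
            break  # do not start downstream steps after a failure
    return codes


if __name__ == "__main__":
    # Placeholder steps; real jobs would call the dump and upload scripts.
    dump = [sys.executable, "-c", "print('dumping database locally')"]
    upload = [sys.executable, "-c", "print('uploading dump off-site')"]
    print(run_chain([dump, upload]))
```

Compared to spacing two cron entries an hour apart, the second step never runs against a missing or half-written dump.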

Test immediately

While you are adding a job it's useful to test whether the job runs properly. With cron you often had to wait until the job executed at e.g. 3 am to discover that PATH wasn't set properly or there was some other problem with your environment. With Jenkins I can click Build Now and the job will run immediately.

Easy setup

Setting up jobs is easy. I have engineers set up their own jobs by copying an existing job and modifying it to do what they need. I don't remember the last time someone asked me how to do it :-) .

What I don't use Jenkins for

I don't use Jenkins to run jobs that collect metrics or anything that has to run too often.

 

Testing your web pages with fantomtest

August 2nd, 2011

Coming from a web operations background, my web site/page monitoring had largely focused on metrics such as average request duration, 90th percentile request duration etc. These are all great metrics, however through the Velocity Conferences I have come to appreciate that there is a lot more to web performance than simply knowing how long it takes to load the HTML of a web page. As a result I have been looking for ways to get better metrics by utilizing real browsers instead of Perl/Ruby/Python scripts. For some time I have been playing with Selenium RC to give me an easy way to test and time my web application. Unfortunately I found it heavy and slow.

At the last Velocity conference I was fortunate enough to see a demo of PhantomJS. PhantomJS is a semi-headless WebKit browser with JavaScript support. What I really appreciated about it is that it is lightweight, fast and very easy to instrument using JavaScript. In addition it includes a number of useful examples such as netsniff.js, which outputs an HTTP Archive (HAR) of the requests made for a given web page. From a HAR file you can build, among other things, waterfall charts.

There are a number of services you can use to have your site tested for free, e.g. webpagetest.org. The limitation is that they can't test your intranet infrastructure, since that is usually behind a firewall, nor can they test remote sites that are connected to your intranet via a VPN.

That is why I'm introducing fantomTest, a simple web application that allows you to generate waterfall graphs using PhantomJS. It will also take a screenshot of the rendered page. Here is what that looks like

What's interesting in this particular case is that Google is not following web performance recommendations, since it uses an HTTP redirect from google.com to www.google.com.
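
To give an idea of how little is needed to go from a HAR file to a waterfall, here is a rough Python sketch. It is not fantomTest's code. It relies only on the standard HAR fields log.entries[].startedDateTime, time and request.url, and the text rendering is a toy stand-in for a real chart.

```python
from datetime import datetime


def parse_har_time(stamp):
    """Parse a HAR ISO 8601 timestamp; a trailing 'Z' is normalised to +00:00."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def waterfall_rows(har):
    """Return (url, offset_ms, duration_ms) tuples ordered by start offset."""
    entries = har["log"]["entries"]
    if not entries:
        return []
    starts = [parse_har_time(e["startedDateTime"]) for e in entries]
    first = min(starts)
    rows = []
    for entry, start in zip(entries, starts):
        offset_ms = (start - first).total_seconds() * 1000.0
        rows.append((entry["request"]["url"], offset_ms, float(entry["time"])))
    return sorted(rows, key=lambda row: row[1])


def render(rows, ms_per_char=50):
    """Draw one text bar per request, indented by its start offset."""
    lines = []
    for url, offset, duration in rows:
        pad = " " * int(offset // ms_per_char)
        bar = "#" * max(1, int(duration // ms_per_char))
        lines.append(f"{pad}{bar} {url}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up two-request HAR: a redirect followed by the real page.
    sample = {"log": {"entries": [
        {"startedDateTime": "2011-08-02T10:00:00.000Z", "time": 120,
         "request": {"url": "http://google.com/"}},
        {"startedDateTime": "2011-08-02T10:00:00.250Z", "time": 300,
         "request": {"url": "http://www.google.com/"}},
    ]}}
    print(render(waterfall_rows(sample)))
```

The sample data mimics the redirect case: the second bar cannot start until the first request has come back, which is exactly the kind of thing a waterfall makes visible.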

Anyways to get fantomTest go to

https://github.com/vvuksan/fantomtest

Monitoring links and monitoring anti-patterns video

June 5th, 2011

John Vincent aka. lusis has started an interesting conversation surrounding monitoring on Freenode, in a channel he named ##monitoringsucks. He has also done an awesome job of starting up a GitHub project of the same name that is shaping up to be a nice collection of links to tools and blog posts. Check it out

https://github.com/monitoringsucks/

Also we just got a hold of the monitoring anti-patterns Ignite Talk from Devopsdays Boston by Alexis Lê-Quôc aka. @alq. It is a short video (5 minutes) so it's definitely worth seeing.

Use your trending data for alerting

April 19th, 2011

This post will deal with helping you use the data you already have to do alerting. It is most helpful for people running Nagios or its variants such as Icinga, Netreo etc. It could likely be used with other decoupled alerting systems (not Zabbix or Zenoss though, since they do their own trending).

Recently I came to the realization that lots of sysadmins are unaware that they could easily use the trending data they already capture with systems such as Ganglia, Graphite, Collectd, Munin etc. to do alerting. The standard way of doing health checks of remote nodes in Nagios is to install the Nagios Remote Plugin Executor aka. NRPE, which allows you to execute Nagios plugins on remote nodes and pipe the output to the Nagios server. NRPE does the job, however it has three major disadvantages

  1. It is another daemon that needs to run on the remote host possibly introducing security concerns
  2. Depending on the load of the machine it can be slow, thus bogging down the Nagios server
  3. Last and most important is that commonly it's used to alert on common metrics such as disk, load, CPU, swap which you should be trending anyways.

Instead what you ought to be doing is use trending data for alerting. I can think of at least 4 reasons to do so

  1. You may already be collecting pertinent data ie. system load, swap, CPU utilization
  2. If you are alerting on a particular metric you should likely be trending it
  3. It's fast
  4. Allows you to do more sophisticated checks easily ie. alert me if more than 5 hosts have a load greater than 5 etc.
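
The "more than 5 hosts with a load greater than 5" check from the last point can be sketched in a few lines of Python once the trending data is in hand. This is an illustration, not the plugin discussed in this post. The function name and defaults are made up; the exit codes are the standard Nagios plugin ones.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def check_hosts_over(loads, load_threshold=5.0, max_hosts=5):
    """Go CRITICAL when more than `max_hosts` hosts exceed `load_threshold`.

    `loads` maps hostname -> current load, e.g. taken from trending data.
    Returns (exit_code, status_line) in the usual Nagios plugin shape.
    """
    over = sorted(host for host, load in loads.items() if load > load_threshold)
    if len(over) > max_hosts:
        hosts = ", ".join(over)
        return CRITICAL, f"CRITICAL - {len(over)} hosts with load > {load_threshold}: {hosts}"
    return OK, f"OK - {len(over)} hosts with load > {load_threshold}"


if __name__ == "__main__":
    # Hypothetical load values for eight hosts.
    sample = {f"web{i}": load for i, load in
              enumerate([0.4, 6.1, 7.3, 5.5, 9.0, 6.6, 8.2, 1.1])}
    code, message = check_hosts_over(sample)
    print(code, message)
```

A real plugin would print the status line and call sys.exit with the code. Doing the same check with NRPE would mean polling every host individually.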

Years ago I used Ganglia Web PHP code to write my own generic Nagios Ganglia plugin. This has served me well. Most recently Michael Conigliaro rewrote the script in Python making it more versatile and more powerful. You can download it from here

https://github.com/ganglia/ganglia_contrib/tree/master/nagios

In a nutshell, it downloads the whole metrics tree, i.e. the list of all hosts with their associated metrics, caches it for a configurable amount of time, then uses NagAconda to support all the threshold reporting as defined in the Nagios developer guidelines.

Another alternative if you have a very large site is Ganglios, which was open sourced by the guys at Linden Lab. Their problem is/was that they have thousands of hosts and downloading the whole metrics tree takes ~15 seconds, so they have separated the logic that downloads the metric tree from the logic that does alerting. You can download Ganglios from
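
The download-and-cache idea can be sketched as follows. This is not the plugin's code. It assumes the XML shape gmetad serves (HOST elements containing METRIC elements with NAME and VAL attributes), and the fetch function is injected so that the sketch runs without a live gmetad. The sample XML is a trimmed-down, made-up document.

```python
import time
import xml.etree.ElementTree as ET


def parse_metric_tree(xml_text):
    """Turn gmetad-style XML into {hostname: {metric_name: value_string}}."""
    tree = {}
    for host in ET.fromstring(xml_text).iter("HOST"):
        metrics = tree.setdefault(host.get("NAME"), {})
        for metric in host.iter("METRIC"):
            metrics[metric.get("NAME")] = metric.get("VAL")
    return tree


class CachedTree:
    """Re-download the metric tree only when the cached copy is too old."""

    def __init__(self, fetch, max_age_seconds=60, clock=time.time):
        self._fetch = fetch            # callable returning the XML as a string
        self._max_age = max_age_seconds
        self._clock = clock
        self._tree = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._tree is None or now - self._fetched_at > self._max_age:
            self._tree = parse_metric_tree(self._fetch())
            self._fetched_at = now
        return self._tree


SAMPLE_XML = """<GANGLIA_XML><GRID NAME="demo"><CLUSTER NAME="Databases">
<HOST NAME="db1.domain.com"><METRIC NAME="load_one" VAL="0.42"/></HOST>
</CLUSTER></GRID></GANGLIA_XML>"""

if __name__ == "__main__":
    cache = CachedTree(lambda: SAMPLE_XML, max_age_seconds=60)
    print(cache.get()["db1.domain.com"]["load_one"])
```

With the cache in place, hundreds of Nagios checks per minute result in one download of the tree per cache period instead of one per check.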

https://bitbucket.org/maplebed/ganglios

This can easily be adapted to work with your trending system of choice.

JSON representation for graphs in Ganglia

February 20th, 2011

Recently, thanks to work done by Alex Dean aka. @mostlyalex, the Ganglia UI supports defining custom graphs using JSON. Prior to this the only way to create custom graphs was by writing custom PHP code. This has two major problems: first, lots of people are not comfortable writing or modifying PHP code, and second, you have to target a particular graphing engine, e.g. rrdtool. As I have written in the past, we are going to be supporting both rrdtool and Graphite for graphing, so having a common way to describe graphs has been one of our goals.

To describe a custom graph you would create a JSON file similar to this one

{
 "report_name" : "network_report",
 "report_type" : "standard",
 "title" : "Network Report",
 "vertical_label" : "Bytes/sec",
 "series" : [
 { "metric": "bytes_in", "color": "33cc33", "label": "In", "line_width": "2", "type": "line" },
 { "metric": "bytes_out", "color": "5555cc", "label": "Out", "line_width": "2", "type": "line" }
 ]
}

This will create a line graph with the bytes_in and bytes_out metrics. Since hostname and cluster are not specified, it is assumed that we want metrics for the host we are currently viewing. You could however specify a particular host and metric you want to graph by adding hostname and clustername attributes to a series entry, i.e.

{
 "report_name" : "our_load_report",
 "report_type" : "standard",
 "title" : "Load Report vs. Database Load",
 "vertical_label" : "Loads",
 "series" : [
 { "metric": "load_one", "color": "3333bb", "label": "Load 1", "line_width": "2", "type": "line" },
 { "hostname": "db1.domain.com", "clustername": "Databases", "metric": "load_one", "color": "44ddbb", "label": "DB1 Load 1", "line_width": "2", "type": "line" }
 ]
}

To use the reports all you have to do is put them in the $GANGLIA_WEB_ROOT/graph.d directory. Name each one something_report.json and it will be available for any host in the cluster. There is one important thing to note. By default the graphing function will look for PHP definitions of graphs first, as those in theory provide more power and flexibility, and only if those are not available will it use the JSON definition.
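
If you maintain more than a handful of these reports, generating them can be less error prone than hand editing, since strict JSON parsers reject small slips such as a trailing comma. Here is a minimal Python sketch. The helper names are mine, and the keys are just the ones used in the examples in this post.

```python
import json


def line(metric, color, label, **extra):
    """One series entry; `extra` can carry hostname/clustername overrides."""
    item = {"metric": metric, "color": color, "label": label,
            "line_width": "2", "type": "line"}
    item.update(extra)
    return item


def build_report(report_name, title, vertical_label, series):
    """Serialise a report definition to a JSON string."""
    report = {
        "report_name": report_name,
        "report_type": "standard",
        "title": title,
        "vertical_label": vertical_label,
        "series": series,
    }
    return json.dumps(report, indent=1)


if __name__ == "__main__":
    text = build_report(
        "our_load_report", "Load Report vs. Database Load", "Loads",
        [line("load_one", "3333bb", "Load 1"),
         line("load_one", "44ddbb", "DB1 Load 1",
              hostname="db1.domain.com", clustername="Databases")],
    )
    print(text)  # save as our_load_report.json in the graph.d directory
```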

Types of graphs

Currently both line and stacked graphs are supported. Look in graph.d/ directory for additional examples.

Future

I am particularly excited about this feature as it allows us to define aggregate graphs easily. There is even an alpha implementation of functionality which would allow you to specify a metric and a regex host entry and you would end up with an aggregate graph :-) .

Download location

Latest version of the UI can be downloaded either from Ganglia Monitor Web 2.0 SVN branch or you can get it on Github.

Misconceptions about RRD storage

December 14th, 2010

I want to address the misconceptions about RRD (Round-Robin Database) that seem to crop up often even among seasoned sysadmins. Complaints can be summarized with these two points

  • RRD doesn't offer high resolution, i.e. after about an hour it's all averages and I want to know what the metric value was last year at this hour and minute
  • Data drops off/is destroyed after a year - I want to keep my data forever, disk is cheap etc.

Those are valid points, however neither of them is the fault of RRD. RRD is a circular buffer, so in order to be able to write into it you have to precreate it (otherwise it wouldn't be a circular buffer :-) ). Obviously the more data points you store, the bigger the RRD file will be. To illustrate the point, Ganglia Monitoring uses the following defaults to create RRDs

RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" "RRA:AVERAGE:0.5:5760:374"

This will create multiple circular buffers within the same RRD database file. In order to make sense of this you need to know what the polling interval is, i.e. how often you write into the RRDs. In Ganglia's case the default is 15 seconds so

  • "RRA:AVERAGE:0.5:1:244" says write actual values (:1:) for every polling interval. Save last 244 of those so in our case we'll have 61 minutes worth of actual data points. Since it's a circular buffer data older than 61 minutes will be "dropped"
  • "RRA:AVERAGE:0.5:24:244" says average 24 values (:24:), 24 * 15 seconds = 360 seconds = 6 minutes. 244 of those times 6 minutes is 1464 minutes, i.e. a little more than a whole day
  • You can do the next two :-)
  • Last one "RRA:AVERAGE:0.5:5760:374" says average whole day (5760 * 15 seconds = 1440 minutes = 1 day) worth of values and store it in 374 points ie. little more than a year
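
The arithmetic in the list above is simply steps * rows * polling interval. Here is a small Python sketch that works it out for any RRA definition, including the two left as an exercise. The function name is mine, and the 15 second default is the Ganglia polling interval mentioned above.

```python
def rra_retention_seconds(rra, step_seconds=15):
    """How far back an RRA reaches: steps-per-row * rows * polling interval."""
    kind, _cf, _xff, steps, rows = rra.split(":")
    if kind != "RRA":
        raise ValueError(f"not an RRA definition: {rra!r}")
    return int(steps) * int(rows) * step_seconds


GANGLIA_DEFAULTS = [
    "RRA:AVERAGE:0.5:1:244",
    "RRA:AVERAGE:0.5:24:244",
    "RRA:AVERAGE:0.5:168:244",
    "RRA:AVERAGE:0.5:672:244",
    "RRA:AVERAGE:0.5:5760:374",
]

if __name__ == "__main__":
    for rra in GANGLIA_DEFAULTS:
        seconds = rra_retention_seconds(rra)
        resolution = int(rra.split(":")[3]) * 15
        print(f"{rra}: one point per {resolution} s, "
              f"covers {seconds / 86400:.2f} days")
```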

When graphing, RRDtool is smart enough to use the buffer which gives you the most data points. To store all this data the RRD file will use about 12 kBytes. Thus if you want higher resolution you will need to change the definition, e.g. you could do this

"RRA:AVERAGE:0.5:1:2137440"

which will give you one year's worth of data points at a 15 second interval with no averaging. Trouble is the size of this RRD file is 17 MBytes. This may not seem that bad, but one of the RRD drawbacks is that every time you add data to an RRD the whole file is written over, so if you have 1000 metrics you can potentially be writing 17 GBytes of data every 15 seconds. This may be a problem depending on how many metrics you are keeping track of. There are alternatives which increase throughput, such as storing RRDs on a RAM disk or using rrdcached. Alternatively you can opt to keep 2 weeks' worth of data points with e.g.

"RRA:AVERAGE:0.5:1:81984"

which will result in a size of about 650 kBytes per RRD file. Or you can do something else altogether. The flip side of RRD is that there are no indexes to maintain and no tables that need to be rotated.

Update: I was wrong about the whole RRD file needing to be updated. In retrospect it makes sense and I apologize for providing the wrong info. You can read the comment from Tobi Oetiker (creator of rrdtool) in the comments below for more detail. This is actually awesome news since there is very little downside in making larger RRDs.
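
The file sizes quoted above can be approximated by counting rows: every stored data point is an 8-byte double, so the data area is roughly rows * 8 bytes per data source. The Python sketch below ignores the file header and per-RRA bookkeeping, which add a little on top, and the function name is mine.

```python
BYTES_PER_VALUE = 8  # each stored data point is an 8-byte double


def rra_data_bytes(rras, data_sources=1):
    """Approximate size of the data area: total rows * data sources * 8 bytes."""
    total_rows = sum(int(rra.split(":")[4]) for rra in rras)
    return total_rows * data_sources * BYTES_PER_VALUE


if __name__ == "__main__":
    layouts = {
        "Ganglia defaults": ["RRA:AVERAGE:0.5:1:244", "RRA:AVERAGE:0.5:24:244",
                             "RRA:AVERAGE:0.5:168:244", "RRA:AVERAGE:0.5:672:244",
                             "RRA:AVERAGE:0.5:5760:374"],
        "one year at 15 s": ["RRA:AVERAGE:0.5:1:2137440"],
        "two weeks at 15 s": ["RRA:AVERAGE:0.5:1:81984"],
    }
    for name, rras in layouts.items():
        print(f"{name}: about {rra_data_bytes(rras) / 1000:.1f} kB of data points")
```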

As far as Ganglia goes, you can modify the defaults in the /etc/ganglia/gmetad.conf file. You can also use gmetad-python, which allows you to write your own plugins and store metric data in RRD format, SQL or any other storage engine of your choice.

More on RRDtool can be found here

http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html