My programming language beat your honor roll language

April 10th, 2012

For a while I have been observing a tendency among technologists/engineers to describe technology in black-and-white terms, i.e. X technology sucks, use Y technology. The most recent example is an article by someone going by the name of Eevee titled "PHP: a fractal of bad design". It is a damning exposé of PHP's failings, bad decisions, inconsistencies etc. Unfortunately, like most articles of this type, it involves a number of ad hominem attacks like these

It’s so broken, but so lauded by every empowered amateur who’s yet to learn anything else, as to be maddening. It has paltry few redeeming qualities and I would prefer to forget it exists at all.

or

I assert that the following qualities are important for making a language productive and useful, and PHP violates them with wild abandon. If you can’t agree that these are crucial, well, I can’t imagine how we’ll ever agree on much.

This irritates me on many levels since it makes so many misguided assumptions e.g.

- Everyone's mind is the same therefore everyone should like or hate language X

Of course not. Your mind is different than my "defective" mind. I quite prefer writing in PHP. I have written/write code in Ruby/Python/Perl and PHP is my preferred language. That may change but at this point it's my preference. You may disagree with my choice and that's OK.

- The issues we are trying to solve are similar/identical and we have the same resource constraints, aka one size fits all

Of course not. If most coders on my team are well versed with PHP and we have a tight schedule you bet we are most likely to choose PHP. Technical merits are not the only consideration to be taken. They are certainly important but they often pale in comparison to other considerations such as people and culture.

- PHP core developers are incompetent

I have on a number of occasions disagreed with and been frustrated by decisions made by PHP core developers, however I do assume that in most/all cases they are well intentioned and are making the best decision under the circumstances. PHP has been around for a long time, so I imagine making major changes is tough and involves making significant tradeoffs. If those tradeoffs become a show stopper for me I'll use a different language.

Anyways the most bothersome part of the whole post is the technology "tribalism" which results in things like this (from Kibana README file)

Q: Why is this in PHP instead of Java, Ruby, etc?
A: Because PHP is what I know. The total PHP is less than 200 lines. If you want it in something else, it shouldn't be too hard to port it to your language of choice

That makes me pretty sad.

Compute a 15 minute average of a metric easily with Ganglia

April 6th, 2012

This is a quick way to extract a 15 minute average of a metric in Ganglia. It utilizes Ganglia's CSV export function to get the values then uses awk to actually compute the average.

First, find the metric graph you want to average. Right click on the image and copy the image location. Then append &csv=1 to the URL, and pass the UNIX timestamp from 15 minutes ago as the &cs= argument. This simple shell script illustrates it

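# UNIX timestamp from 15 minutes ago (GNU date syntax)
MIN15AGO=$(date --date="15 minutes ago" "+%s")
# Fetch the graph data as CSV and average the second column
curl --silent "http://ganglia.domain.com/ganglia/graph.php?c=NetApp&h=host1&v=&m=netapp_cpuutil&cs=$MIN15AGO&csv=1" | \
   awk -F, '{sum+=$2} END { print "Average = ",sum/NR}'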
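One caveat with dividing by NR is that every line is counted, including any header row or rows with non-numeric values. I am assuming here (not verified against every Ganglia version) that the CSV export may contain such rows. A slightly more defensive sketch counts only rows whose second column is a plain number. The function name csv_average is made up for illustration.

```shell
#!/bin/sh
# csv_average: read CSV on stdin and average the second column.
# Rows whose second column is not a plain number (for example a header
# row or NaN samples) are skipped, so they do not skew the result.
csv_average() {
    awk -F, '
        $2 ~ /^-?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/ { sum += $2; n++ }
        END {
            if (n > 0) printf "Average = %.2f\n", sum / n
            else print "Average = n/a"
        }
    '
}
```

To use it, pipe the curl output from the script above into csv_average instead of the inline awk command.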


Adding context to your alerts

March 29th, 2012

I am a big believer in adding context to alerts. This allows the recipient of an alert to make a better decision on how to deal with it. It's often hard to classify alerts, so providing as much context as possible is extremely helpful. For instance, if I am alerting on the value of a metric I like to attach an image of that metric for the past hour. This way, if I am on my mobile phone and out and about, I have the alerting metric graph right there without needing to open up another window or start up my laptop.

In more recent versions of Ganglia there is an option to add overlay events to hosts, which show up as vertical lines on the graph. I figured that would be great context to add to alerts. Since I'm using Nagios I decided to extend a mail handler I used before to query the Ganglia events database and include any events connected to the matching host in the past 24 hours. This helps in a number of scenarios to keep the team on the same page and well informed e.g.

  • There was a code push/config change however host/service was not scheduled for maintenance
  • Recent code push is causing issues ie. web servers are crashing

This is an example e-mail you get

As an added bonus the mail handler sends all alerts to a Nagios Bot :-). Now all you need to do is make sure you record events for any major changes. You could do a lot of this automatically by e.g.

  • Adding hooks to your startup scripts so that when you purposely restart services it is logged
  • Watching logs then inserting proper events in the timeline. App stoppe
  • Querying external services e.g. Dynect provides an API to query zone changes
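As a rough sketch of the first idea, a startup script hook could record an event through the Ganglia web events API. I am assuming that api/events.php is available in your Ganglia web install and accepts the action, start_time, summary and host_regex parameters. The GANGLIA_URL value and the record_event name are placeholders for illustration.

```shell
#!/bin/sh
# Assumption: GANGLIA_URL points at your Ganglia web front end and its
# api/events.php endpoint is enabled. CURL can be overridden for dry runs.
GANGLIA_URL="${GANGLIA_URL:-http://ganglia.domain.com/ganglia}"
CURL="${CURL:-curl}"

# record_event SUMMARY [HOST_REGEX]
# Adds an overlay event starting now for hosts matching HOST_REGEX
# (defaults to this machine's short hostname).
record_event() {
    summary="$1"
    host_regex="${2:-$(hostname -s)}"
    "$CURL" --silent --get \
        --data-urlencode "action=add" \
        --data-urlencode "start_time=now" \
        --data-urlencode "summary=$summary" \
        --data-urlencode "host_regex=$host_regex" \
        "$GANGLIA_URL/api/events.php"
}
```

In an init script you would call something like record_event "apache restart" right before the service is restarted on purpose, so the restart shows up on the graphs and in the alert e-mail.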

You can download the mail handler from here

https://github.com/vvuksan/misc-stuff/blob/master/nagios/send_nagios_email.php

Monitoring NetApp Fileservers with Ganglia

March 29th, 2012

In our environment we use NFS on NetApp fileservers a lot. They are used for home directories, build directories (don't ask), DB data directories etc. This is done mostly for reliability and data integrity. However it leads to a number of problems, since the fileservers are shared by a number of different groups of users and are a "black box" to them.

Frequently we'll get reports of machines or builds being unusually slow. This results in lots of confusion, since in most cases of slowness the machines involved look "idle": CPU utilization is unremarkable, i.e. < 10%, yet CPU wait I/O is significantly elevated. We would then posit the cause was external, e.g. NFS related.

To avoid the guesswork I decided to start monitoring the NetApp fileservers to get insight into what is going on. My team doesn't manage the NetApps, however we do use Ganglia. I found check_netappfiler and, using it as a template, built a script to gather metrics from NetApp and send them to Ganglia. You can download the script from here

https://github.com/ganglia/gmetric/tree/master/netapp

Basically it queries a list of NetApp servers and injects those metrics into Ganglia. So far metric gathering has been invaluable. For example, on one occasion we got a report of slowness from a couple of users. I observed that CPU utilization on the NetApp that the project was using was at 100 percent. That may "explain" the slowness.

However that wasn't all. Preceding the event there was heavy NFS utilization.

However as soon as CPU utilization goes to 100% the number of NFS ops plummets.

and so do the network bytes sent.

and bytes read from the disk. It sure looks like a NetApp bug. Luckily, since we have all these metrics available, we could figure out much more quickly what was going on and what further steps to take, i.e. contact the vendor.
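The "query the filer, inject into Ganglia" step described above can be sketched roughly as follows. This is not the actual script from the repository, just a minimal illustration using gmetric's spoofing option so the metric shows up under the fileserver's name. get_cpu_util is a hypothetical placeholder that returns a fixed value, and the IP address in the usage note is made up.

```shell
#!/bin/sh
# GMETRIC can be overridden for dry runs.
GMETRIC="${GMETRIC:-gmetric}"

# Hypothetical placeholder: replace with a real query against the filer
# (the original script uses check_netappfiler as its template).
get_cpu_util() {
    echo 42
}

# send_filer_metrics FILER_IP FILER_NAME
# Reports CPU utilization to Ganglia as if it came from the filer itself.
send_filer_metrics() {
    filer_ip="$1"
    filer_name="$2"
    "$GMETRIC" --name "netapp_cpuutil" \
        --value "$(get_cpu_util "$filer_name")" \
        --type "float" --units "percent" \
        --spoof "$filer_ip:$filer_name"
}
```

Run from cron every minute or so, a call such as send_filer_metrics 10.0.0.5 host1 would keep the netapp_cpuutil graph for host1 up to date.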

RESTful way to manage your databases

January 2nd, 2012

I have a need in my development environment to easily create/drop MySQL databases and users. Initially I was gonna implement a simple hacky HTTP GET method but was dissuaded by Ben Black from doing so. He suggested I write a proper RESTful interface. Without further ado I present to you dbrestadmin

https://github.com/vvuksan/dbrestadmin

It is my first foray into writing RESTful services so things may be rough around the edges. However it allows you to do the following

  • manage multiple database servers
  • create/drop databases
  • list databases
  • create/drop users
  • list users
  • give user grants
  • view grants given to the user
  • view database privileges on a particular database given to a user

For example, to create a database called testdb on dbserver ID=0, use this cURL command

curl -X POST http://myhost/dbrestadmin/v1/databases/0/dbs/testdb

Create a user test2 with password test

curl -X POST "http://localhost:8000/dbrestadmin/v1/databases/0/users/test2@localhost" -d "password=test"

Give test2 user all privileges on testdb

curl -X POST "http://localhost:8000/dbrestadmin/v1/databases/0/users/test2@'localhost'/grants" -d "grants=all privileges&database=testdb"

There is more. You can see all of the methods here

https://github.com/vvuksan/dbrestadmin/blob/master/API.md
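The three calls shown above can be chained into one helper for a development environment. This is only a sketch: the provision_db name is made up, BASE_URL is a placeholder, and I am assuming the /v1 prefix applies to all three endpoints and that the user is always created at localhost.

```shell
#!/bin/sh
# Assumption: all endpoints live under the /v1 prefix of this base URL.
BASE_URL="${BASE_URL:-http://localhost:8000/dbrestadmin/v1}"
# CURL can be overridden for dry runs.
CURL="${CURL:-curl}"

# provision_db SERVER_ID DB_NAME USER PASSWORD
# Creates the database, creates the user, then grants the user all
# privileges on the new database. Stops at the first failing request.
provision_db() {
    server_id="$1"; db="$2"; user="$3"; password="$4"
    "$CURL" --fail --silent --show-error -X POST \
        "$BASE_URL/databases/$server_id/dbs/$db" &&
    "$CURL" --fail --silent --show-error -X POST \
        "$BASE_URL/databases/$server_id/users/${user}@localhost" \
        -d "password=$password" &&
    "$CURL" --fail --silent --show-error -X POST \
        "$BASE_URL/databases/$server_id/users/${user}@localhost/grants" \
        -d "grants=all privileges&database=$db"
}
```

Calling provision_db 0 testdb test2 test would repeat the example above in one step. The --fail flag makes curl return a non-zero exit code on HTTP errors, so the chain stops if any request is rejected.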

Improvements and constructive criticism welcome

Operating on Dell RAID arrays cheatsheet

November 23rd, 2011

Every once in a while I have to add new drives to Dell RAID arrays like the H700. For some reason it takes me a couple of searches to find the info, so here it is so I can find it later.

List all drives

/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL

Create a RAID array (e.g. RAID 0)

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:4,32:5] -aALL

List working RAID arrays

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL

Confirm you got the right RAID array e.g. Virtual Disk 1

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -L1 -aALL

Delete RAID array

/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -aALL

Use fantomTest to test web pages from multiple locations

September 27th, 2011

In my previous post, Testing your web pages with fantomtest, I introduced fantomTest. I have recently added the ability to test the same page from multiple sites within the same interface. You simply install a copy of fantomTest on a remote site, then configure your primary site to access it. For example this is a test of Google from my laptop.

Looks like my network connection is really slow :-(. Changing the testing site to Croatia where I have a server I get

The result is slightly different since Google redirects me to their localized Google site, however it leads me to believe that it's my connection that is slow, not Google.

Any number of "remotes" can be added. Want it? Get it @GitHub

https://github.com/vvuksan/fantomtest

Using Jenkins as a Cron Server

August 22nd, 2011

There are a number of problems with cron which cause lots of grief for system administrators, the big ones being manageability, cron spam and auditability. To fix some of these issues I have lately started using Jenkins. Jenkins is an open source Continuous Integration server, and it has lots of features that make it a great cron replacement for a number of uses. These are some of the problems it solves for me

Auditability

Jenkins can be configured to retain logs of all jobs that it has run. You can set it up to keep last 10 runs or you can set it up to keep only last 2 weeks of logs. This is incredibly useful since sometimes jobs can fail silently so it's useful to have the output instead of sending it to /dev/null.

Centralized management

I have my most important jobs centralized. I can export all Jenkins jobs as XML and check them into a repository. If I need to execute jobs on remote hosts I simply have Jenkins ssh in and execute the command remotely. Alternatively you can use Jenkins slaves.

Cron Spam

Cron spam is a common problem with a number of proposed solutions. To avoid it I only have Jenkins alert me when a particular job fails, i.e. a job exits with a return code other than 0. In addition you can use the awesome Jenkins Text Finder plugin, which allows you to specify words or regular expressions to look for in console output. They can be used to mark a job "unstable". For example in the Text Finder config I checked

X Also search the console output

and specified

Regular expression ([Ee]rror*).*

This has saved our bacon since we used the automysqlbackup.sh script, which "swallows" the error codes from the mysqldump command and exits normally. Text Finder caught this

mysqldump: Error 2020: Got packet bigger than 'max_allowed_packet' bytes when dumping table `users` at row: 234

Happily we caught this one on time.
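If a job has to stay in plain cron, or you want the same safety net inside the script itself, the Text Finder idea can be approximated with a small wrapper. This is a minimal sketch and not how the plugin is implemented. The run_and_scan name is made up, and the pattern is a simplified version of the one above.

```shell
#!/bin/sh
# run_and_scan COMMAND [ARGS...]
# Runs the command, prints its combined output, and returns non-zero if
# either the command fails or its output contains "Error"/"error".
run_and_scan() {
    output=$("$@" 2>&1)
    status=$?
    printf '%s\n' "$output"
    if [ "$status" -ne 0 ]; then
        return "$status"
    fi
    # The command claimed success; look for swallowed errors in the output.
    if printf '%s\n' "$output" | grep -Eq '[Ee]rror'; then
        return 1
    fi
    return 0
}
```

For example, run_and_scan ./automysqlbackup.sh would exit non-zero on the mysqldump message above even though the backup script itself exits normally.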

Job dependency

Often you will have job dependencies, e.g. a main backup job where you first dump a database locally, then upload it somewhere off-site or to the cloud. The way we have done this in the past is to leave a sufficiently large window between the first job and the subsequent job to be sure the first job has finished. This says nothing about what to do if the first job fails. Likely the second one will too. With Jenkins I no longer have to do that. I can simply tell Jenkins to trigger "backup to the cloud" once the local DB backup concludes successfully.

Test immediately

While you are adding a job it's useful to test whether the job runs properly. With cron you often had to wait until the job executed at e.g. 3 am to discover that PATH wasn't set properly or there was some other problem with your environment. With Jenkins I can click Build Now and the job will run immediately.

Easy setup

Setting up jobs is easy. I have engineers set up their own jobs by copying an existing job and modifying it to do what they need. I don't remember the last time someone asked me how to do it :-).

What I don't use Jenkins for

I don't use Jenkins to run jobs that collect metrics or anything that has to run too often.

Testing your web pages with fantomtest

August 2nd, 2011

Coming from a web operations background, my web site/page monitoring had largely focused on metrics such as average request duration, 90th percentile request duration etc. These are all great metrics, however through Velocity Conferences I have come to appreciate that there is a lot more to web performance than simply knowing how long it takes to load the HTML of a web page. As a result I have been looking for ways to get better metrics by utilizing real browsers instead of Perl/Ruby/Python scripts.

For some time I have been playing with Selenium RC to give me an easy way to test and time my web application. Unfortunately I found it heavy and slow. At the last Velocity conference I was fortunate enough to see a demo of PhantomJS. PhantomJS is a semi-headless WebKit browser with JavaScript support. What I really appreciated about it is that it is lightweight, fast and very easy to instrument using JavaScript. In addition it includes a number of useful examples such as netsniff.js, which outputs an HTTP Archive (HAR) of the requests made for a given web page. From a HAR file you can build, among other things, waterfall charts.

There are a number of services you can use to have your site tested for free e.g. webpagetest.org. The limitation is that they can't test your intranet infrastructure, since that is usually behind a firewall, nor can they test remote sites that are connected to your intranet via a VPN.

That is why I'm introducing fantomTest, a simple web application that allows you to generate waterfall graphs using PhantomJS. It will also take a screenshot of the rendered page. Here is what that looks like

What's interesting in this particular case is that Google is not following web performance recommendations, since it uses an HTTP redirect from google.com to www.google.com.

Anyways to get fantomTest go to

https://github.com/vvuksan/fantomtest
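If you just want the raw HAR data without the web interface, the netsniff.js example mentioned above can be run by hand. This is a small sketch, not part of fantomTest itself. The capture_har name is made up, and the NETSNIFF path is an assumption that depends on where your PhantomJS examples directory lives.

```shell
#!/bin/sh
# PHANTOMJS and NETSNIFF can be overridden. The default netsniff.js path
# is an assumption; point it at the examples directory of your install.
PHANTOMJS="${PHANTOMJS:-phantomjs}"
NETSNIFF="${NETSNIFF:-examples/netsniff.js}"

# capture_har URL OUTPUT_FILE
# Loads URL in PhantomJS and writes the resulting HAR to OUTPUT_FILE.
capture_har() {
    "$PHANTOMJS" "$NETSNIFF" "$1" > "$2"
}
```

For example, capture_har http://www.google.com google.har leaves a HAR file you can feed into any waterfall viewer.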

Monitoring links and monitoring anti-patterns video

June 5th, 2011

John Vincent aka lusis has started an interesting conversation surrounding monitoring on Freenode, on a channel he named ##monitoringsucks. He has also done an awesome job of starting up a GitHub project of the same name that is shaping up to be a nice collection of links to tools and blog posts. Check it out

https://github.com/monitoringsucks/

Also we just got hold of the monitoring anti-patterns Ignite Talk from Devopsdays Boston by Alexis Lê-Quôc aka @alq. It is a short video (5 minutes) so it's definitely worth seeing.