Archive for the ‘Uncategorized’ Category

GangliaView – automatically rotate Ganglia metrics

Wednesday, June 16th, 2010

GangliaView is a simple web app that allows you to automatically rotate selected Ganglia metrics. We use it to rotate key metrics, with large graphs showing the last hour and last day and smaller graphs showing the last week and last month. A sample screen looks like this

GangliaView is derived from CactiView with a number of changes to make it work with Ganglia and removal of frames. You can download it from here

Vonage the new Baby Bell

Thursday, May 13th, 2010

It is sometimes amazing to me how new upstarts morph into their own arch enemies. Case in point is Vonage. For years I used to have Vonage service at home as a backup phone service. I was on a 500 minute plan for $14.99+taxes. This was a great plan for me as I didn't use the phone much. However at some point they decided that was too little money and they hiked up the price to $16.99 (something like that). It may seem like a small difference but I figured I may be better off elsewhere. I ended up switching to Galaxy Voice, which I am using to this day, since they had more flexible calling plans.

We recently expanded our office space and we needed a phone line added to a conference room. Since I had my old Vonage adapter at home I figured I would bring it and we'd use it. I thought it would be as easy as going to Vonage's web site, supplying the phone adapter ID and my credit card number and I would be set. It wasn't so. After entering the phone ID I got this message

The MAC address you entered is associated with an existing Vonage account. Please call our Customer Care department at 1-866-293-5676 for immediate assistance.

I called the number and spoke to someone in customer service. This took about 20 minutes while the person kept re-asking for the same data and concluded that they couldn't help me and that I would have to talk to tech support. The tech support guy was equally unhelpful. Basically I could not activate a device that was ever used before since the system "knew" about it. Talk about having a piece of useless technological trash. At that point I was sufficiently frustrated to end the call. I tweeted about my experience and a day later I was contacted by Vonage's Twitter team about having someone at customer service contact me. I thought I'd give it a go. I got a call and this experience was not a whole lot better than the previous ones. The person kept asking me for my personal information including name, billing address, the credit card number I used for paying bills and the e-mail address I used. Since this was more than a year ago and I have dozens of e-mail addresses I said I couldn't remember. At that point I ended the call since I was sufficiently frustrated. I was willing to give these people money yet they were making me jump through all these hoops. I don't get it.

It occurred to me later that this was very similar to experiences that I had with a local phone company when I would move and I would have to get through all these bureaucratic hoops to make sure all my features stayed the same after I moved.

Devops homebrew

Friday, April 9th, 2010

There has been quite a bit of discussion about Devops and what it means. @blueben has suggested we start a Devops patterns cookbook so people can learn what worked or didn't work. This is the description of the environment we implemented at a previous job. Some of these things may or may not work for you. I will try to keep it short.

Environment background

7 distinct applications/products that had to be deployed and tested ie. base/core application, messaging platform, reporting app etc. All applications were Java based, running on either Tomcat or JBoss.

Application design for deployment

These are some of the key points

  1. Application should have sane default configuration options. Any option should be overrideable by an external file. In most cases you only need to override database credentials (host, username, password). The goal is to be able to use the same binary across multiple environments.
  2. Application should expose key internal metrics. We, for instance, asked for a simple web page of key/value pairs ie. JMSenqueue=OK etc. This is important because there are lots of things that can break inside the application which external monitoring may miss, like a JMS message that can't be enqueued, etc.
  3. Keep release notes actions to a minimum. Release notes are often not followed or only partially followed, thus make sure point 1. is followed and/or try to automate everything else.
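Point 1 can be sketched roughly as follows. This is a minimal Python illustration, not the application code described here: the option names, the key=value override format and the helper names `parse_overrides` and `effective_config` are all assumptions made for the example.

```python
# Hypothetical defaults that would ship inside the application binary.
DEFAULTS = {
    "db.host": "localhost",
    "db.username": "app",
    "db.password": "changeme",
    "jms.enabled": "true",
}


def parse_overrides(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    overrides = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        overrides[key.strip()] = value.strip()
    return overrides


def effective_config(override_text=""):
    """Return the defaults with any external overrides applied on top."""
    config = dict(DEFAULTS)
    config.update(parse_overrides(override_text))
    return config
```

With an override file that contains only the database host and password, every other option keeps its default, so the same binary can move between environments unchanged.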

Continuous Integration

We used CruiseControl for Continuous Integration. It was used solely to make sure that someone didn't break the build.

Creating releases

Developers are in charge of building and packaging releases. This is primarily because QA or Ops will not know what to do if a build fails (this is Java, remember). Each release has to be clearly labeled with the version and tagged in the repository. For example Location 1.1.5 will be packaged as location-1.1.5.tar.gz. Archives should contain only WAR (Tomcat) or EAR (JBoss) files and DB patch files. Releases are to be deposited into an appropriate file share ie. /share/releases/location.
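A strict naming convention like this is easy to check mechanically. The sketch below is a Python illustration under assumed rules (lowercase application name, dotted numeric version); the helper names and the regular expression are not from the original process.

```python
import re

# Matches names such as location-1.1.5.tar.gz: application name, then version.
RELEASE_RE = re.compile(
    r"^(?P<app>[A-Za-z][A-Za-z0-9_]*)-(?P<version>\d+(?:\.\d+)*)\.tar\.gz$"
)


def parse_release(filename):
    """Split a release archive name into (app, version tuple)."""
    match = RELEASE_RE.match(filename)
    if match is None:
        raise ValueError(f"not a release archive name: {filename!r}")
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("app"), version


def release_path(app, version, share="/share/releases"):
    """Build the expected location of a release on the file share."""
    dotted = ".".join(str(part) for part in version)
    return f"{share}/{app}/{app}-{dotted}.tar.gz"
```

Keeping the version as a tuple of integers means 1.1.10 sorts after 1.1.5, which plain string comparison would get wrong.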


In order to eliminate most manual deployment steps and support all the different applications we decided to write our own deployment tool. First we started off with a data model which roughly broke down to

  1. Applications – can use different app server containers ie. Tomcat/JBoss, may/will have configuration files that can be either key/value pairs or templates. For every application we also specified a start and stop script (hotdeploy was not an option due to bad experiences with our code).
  2. Domains/Customers – we wanted a single Dashboard that would allow us to deploy to multiple environments e.g. QA staging (current release), QA development (next scheduled release), Dev playbox, etc. Each of these domains had their own set of applications they could deploy with their own configuration options

First we wrote a command line tool that was capable of doing something like this

$ deployer --version 1.2.5 --server web10 --domain joedev --app base --action deploy

What this would do is

  1. Find and unpack the proper app server container e.g. jboss-4.2.3.tar.gz
  2. Overlay WAR/EAR files for the named version e.g. base-1.2.5.tar.gz
  3. Build configuration files and scripts
  4. Stop the server on the remote box (if it's running)
  5. Rsync the contents of the packaged release
  6. Make sure Apache AJP proxy is configured to proxy traffic and do Apache reload
  7. Start up the server
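The sequence above can be sketched as a dry-run planner. This is a Python illustration only, not the original deployer: it parses the same flags as the command line shown earlier and returns the ordered steps as text instead of executing anything. The function names and the default container are assumptions.

```python
import argparse


def parse_args(argv):
    """Parse the deployer flags used in the example invocation."""
    parser = argparse.ArgumentParser(prog="deployer")
    parser.add_argument("--version", required=True)
    parser.add_argument("--server", required=True)
    parser.add_argument("--domain", required=True)
    parser.add_argument("--app", required=True)
    parser.add_argument("--action", required=True, choices=["deploy"])
    return parser.parse_args(argv)


def deploy_plan(args, container="jboss-4.2.3"):
    """Return the deployment steps in the order they would run."""
    release = f"{args.app}-{args.version}.tar.gz"
    return [
        f"unpack app server container {container}.tar.gz",
        f"overlay WAR/EAR files from {release}",
        f"build configuration files and scripts for domain {args.domain}",
        f"stop app server on {args.server} if it is running",
        f"rsync packaged release to {args.server}",
        f"check Apache AJP proxy configuration and reload Apache on {args.server}",
        f"start app server on {args.server}",
    ]
```

Producing the plan as data before running it makes it easy to print, review or loop over a list of servers from a batch script.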

One of the main reasons we started off with a command line tool is that we could easily write batch scripts to upgrade a whole set of machines. This was borne out of the pain of having to upgrade 200 instances via a web GUI at another job.

Once deployer was working we wrote a web GUI that interfaced with it. You could do things like View running config (what config options are actually on the appserver), Stop, Restart, Deploy (particular version), Reconfig (apply config changes) and Undeploy. We also added the ability to change or add configuration options in the application specific override files. A picture is worth a thousand words. This is a tiny snippet of how it approximately looked for one domain

This was a big win since QA or developers no longer needed to have someone from ops deploy software.

DB patching

Another big win was "automated" DB patching. Every application would have a table called Patch with a list of DB patches that were already applied. We also agreed that every app would have dbpatches directory in the app archive which would contain a list of patches named with version and order in which they should be applied e.g.

  • 2.54.01-addUserColumn.sql
  • 2.54.02-dropUidColumn.sql

During deployment the startup script would compare the contents of the Patch table with the list of files in dbpatches and apply any missing ones. If a patch script failed an e-mail would be sent to the QA or dev in charge of the particular domain.
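The comparison step can be sketched like this in Python. It is a simplified illustration, not the original startup script: it assumes the applied patch names have already been read from the Patch table and that every patch file follows the version-prefix naming shown above.

```python
import re

# Patch files are named <dotted version>-<description>.sql
PATCH_RE = re.compile(r"^(\d+(?:\.\d+)*)-.+\.sql$")


def patch_order(filename):
    """Turn the numeric prefix into a tuple so patches sort numerically."""
    match = PATCH_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected patch file name: {filename!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def missing_patches(available, applied):
    """Return patches present in dbpatches but not yet applied, in apply order."""
    applied_set = set(applied)
    pending = [name for name in available if name not in applied_set]
    return sorted(pending, key=patch_order)
```

Sorting on the parsed prefix rather than on the raw file name keeps 2.9.01 ahead of 2.10.01.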

A slightly modified process was used in production to try to reduce down time ie. things like adding a column could be done at any time. Automated process was largely there to make QA's job easier.

QA and testing

When a release was ready QA would deploy the release themselves. If there was a deployment problem they would attempt to troubleshoot it themselves, then contact the appropriate person. Most of the time it was an app problem ie. a particular library didn't get committed etc. This was a huge win since we avoided a lot of "waterfall" problems by allowing QA to self-service.


The production environment was strictly controlled. Only ops and a couple of key engineers had access to it. The reason was that we tried to keep the environment as stable as possible. Thus ad hoc changes were frowned upon. If you needed to make a change you would either have to commit a change into the configuration management system (puppet) or use the deployment tool.

Production deployment

The day before the release QA would open up a ticket listing all the applications and versions that needed to be deployed. On the morning of the deployment (that was our low time) someone from ops, someone from development and the whole QA team would deploy the apps and resolve any observed issues.


Regular metrics such as CPU utilization, load etc. were collected. In addition we kept track of internal metrics and set up adequate alerts. This is an ongoing process since over time you discover what your key metrics are and what their thresholds are ie. number of threads, number of JDBC connections etc.
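Checking such a key/value page against thresholds might look like the Python sketch below. The metric names other than JMSenqueue, the threshold values and both function names are made up for illustration; real thresholds only emerge over time, as noted above.

```python
def parse_status_page(body):
    """Parse a page of key=value lines into a dict of strings."""
    metrics = {}
    for line in body.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            metrics[key.strip()] = value.strip()
    return metrics


def check_metrics(metrics, max_thresholds):
    """Return alert strings for failed status flags and exceeded limits."""
    alerts = []
    for key, value in sorted(metrics.items()):
        try:
            number = float(value)
        except ValueError:
            # Non-numeric values are treated as status flags.
            if value != "OK":
                alerts.append(f"{key} is {value}")
            continue
        limit = max_thresholds.get(key)
        if limit is not None and number > limit:
            alerts.append(f"{key}={value} exceeds {limit}")
    return alerts
```

Numeric metrics without a configured threshold are simply ignored, so new metrics can be exposed before anyone knows what a sensible limit is.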

Things that didn't work so well or were challenging

  1. One of the toughest parts was getting developers' attention to add "goodies" for ops. Specifically, exposing application internals was often put off until eventually we would have an outage and the lack of the metric resulted in an extended outage.
  2. Deployment tool took a couple of tries to get right. Even as it was, there were a couple of things I would have done differently ie. not relying on a relational database for the data model since it made it difficult to create diffs (you had to dump the whole DB). I'd likely go with JSON so that diffs could be easily reviewed and committed.
  3. Other issues I can't recall right now :-)
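To illustrate point 2, here is a small Python sketch of why a JSON data model diffs well. The structure of the model (domain, then application, then settings) is an assumption for the example, not the schema the deployment tool used.

```python
import difflib
import json


def render(model):
    """Serialize with sorted keys and indentation so output is stable and line-based."""
    return json.dumps(model, indent=2, sort_keys=True) + "\n"


def model_diff(old, new):
    """Return a unified diff between two versions of the data model."""
    return "".join(
        difflib.unified_diff(
            render(old).splitlines(keepends=True),
            render(new).splitlines(keepends=True),
            fromfile="before.json",
            tofile="after.json",
        )
    )
```

Because keys are sorted and each value sits on its own line, bumping one application version shows up as a one line change that can be reviewed and committed like any other file.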


This is the shortest description I could write. There are a number of things I glossed over or omitted so that this is not too long. I may write about those on another occasion. Perhaps the key takeaway should be that Ops should focus on developing tools that either automate things or allow its customers (QA, dev, technical support, etc.) to self-service.

Update: There is a second part to this post

Devops religion wars

Tuesday, April 6th, 2010

I have been trying to stay out of the devops arguments but it seems that they are slowly devolving into religious wars. It seems that each group ie. devops and non-devops is convinced that they are in possession of "eternal self-evident truths" and that everyone else is an unenlightened hater or similar. Case in point is the following post

Brian describes their devops process which seems reasonable to me. What is most important is that it works for him, his group and his site.

Unfortunately the comments devolve from there. A non-devops person raises a good point about the process, however does it with poor style and insulting language. The response is to compare the devops and non-devops approaches with giving a man a fish vs. teaching someone to fish. It goes from there. It's all just too silly. Firstly, I am not aware of a definitive devops definition. Secondly, every environment is different. What may work for you may not work everywhere else. I really doubt that continuous deployment would work if your web app was used in providing emergency medical care. That said, things have changed and availability expectations have increased so cooperation between development and ops is critical. Therefore let's try to stop with the silly arguments and try to learn from each other. Most of all avoid insulting language. I realize we all get frustrated at times but it really devalues your view.

Password complexity madness

Friday, January 22nd, 2010

You know the pitch. Each time you create an account for a "secure" site you are forced to come up with a complex password ie. you need to have a number, a capitalized letter, perhaps a special character such as + or -. Trouble is policies differ: one site enforces a minimum length, another a maximum length, some don't allow special characters etc. The thing is at one point in time this made sense and was required to keep basic security but it may not make sense today.

Ages ago computer systems (in particular UNIX systems) used to store passwords in a hashed format. You can read more on cryptographic hashes on Wikipedia. The trouble is that these hashes were available for any user to see ie. you could copy the password file (/etc/passwd) or use YP/NIS tools to get a list of all password hashes in an organization. Once you have the password file you do not know what the passwords are. However, since a particular password will always convert to the same hash, you can take a word dictionary, compute the hash of every word and check whether any of them match the hashes in your password file. If you find a match you have now "discovered" the user's password. This is often referred to as off-line password cracking since it allows you to derive passwords without interacting with the target system. This has many advantages since you can try millions of passwords quickly and the target system's administrator will not be alerted. Based on this fact password policies were instituted that mandated password complexity, since complex passwords ie. 9pc_miu would be very hard or nearly impossible to break (it may take years). This made sense then.
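The reason a dictionary works is simply that hashing is deterministic. The toy Python sketch below shows the idea under a loud simplification: it uses unsalted SHA-256 from the standard library as a stand-in, not the crypt(3) scheme that old UNIX password files actually used, and the user names and word list are invented.

```python
import hashlib


def toy_hash(password):
    """Stand-in hash for illustration only (unsalted SHA-256)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def dictionary_matches(password_file, words):
    """password_file maps user -> stored hash; return user -> matching dictionary word."""
    by_hash = {toy_hash(word): word for word in words}
    return {
        user: by_hash[stored]
        for user, stored in password_file.items()
        if stored in by_hash
    }
```

A password like 123456 falls out immediately if it is in the word list, while something like 9pc_miu never matches unless the dictionary happens to contain it, which is the whole argument behind complexity rules.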

However it doesn't make much sense now since on most systems regular users have no access to the password hashes. On UNIX systems "shadow" (/etc/shadow) is used to hide them, or you may be using LDAP which has the capability of hiding password hashes, etc. The only users that have access to those hashes are administrators, however they have other ways of acquiring your passwords. Thus your real exposures in order of importance are

  • Trivial or easily guessable passwords ie. 123456, 1234, date of birth
  • Using the same password across different sites ie. this is a problem if e.g. one site gets hacked and the hackers are able to determine your password and log into another site

I actually feel that password complexity breeds poor security since people will write down complex passwords instead of remembering them. Just remember how many times you have seen passwords on post-it notes on someone's monitor. Perhaps it is time to scrap password complexity and use something simpler.

Keeping an eye on binary log growth

Thursday, October 1st, 2009

Recently I got a report that some pages on the site were extremely slow. Looking at the web server metrics didn't show anything new, however MySQL DB metrics showed a definite change

MySQL server CPU utilization

ie. at the end of Week 38 there is an increase in CPU utilization, nearly a 60% increase. Interestingly enough there was a new software release at the end of Week 38, which pointed to either a bug or a new feature. Luckily I have been collecting MySQL metrics using this gmetric script. This led me to these two graphs



So nearly double the number of inserts and nearly triple the updates. Using mysqlbinlog I analyzed the update and insert statements, identified the two culprit INSERT and UPDATE statements and sent them off to the developers.

I also observed that had I watched the binary log growth I might have identified this earlier since there were a lot more binary logs for the period since the release. Thus the MySQL average binary log growth rate gmetric was born :-). Now all I need to do is find out what the normal growth rate is and, if it goes outside of that norm, use Nagios to send me a non-urgent alert.
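One rough way to estimate that growth rate is to add up the sizes of the binary logs written within a recent window. The Python sketch below is not the gmetric script mentioned above; the `mysql-bin` file prefix, the one hour window and both function names are assumptions, and counting a whole file as soon as its modification time falls inside the window is only an approximation.

```python
import glob
import os


def list_binlogs(directory, prefix="mysql-bin"):
    """Return (mtime, size) for every numbered binary log in a directory."""
    pattern = os.path.join(glob.escape(directory), prefix + ".[0-9]*")
    return [(os.path.getmtime(path), os.path.getsize(path)) for path in glob.glob(pattern)]


def binlog_growth_rate(files, now, window_seconds=3600):
    """Approximate bytes per second written to binary logs within the window."""
    cutoff = now - window_seconds
    written = sum(size for mtime, size in files if mtime >= cutoff)
    return written / window_seconds
```

The resulting number could be fed to gmetric on a schedule and compared against whatever range turns out to be normal.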

Software doesn’t run itself

Sunday, September 13th, 2009

Perhaps I should no longer be surprised but I am by the article mentioned in this blog post

In particular this

Once it went bankrupt, the staff who supported these systems “evaporated”, according to Steven O’Hanlon, president of Numerix, a pricing and valuation company which is working with Lehman Brothers Holding Inc to unwind the derivatives portfolio.

These days computer systems are the blood of your company so allowing critical technical staff to simply "evaporate" is mind boggling. Granted the company imploded, but still I would think that someone should have figured out going into bankruptcy that they should set aside money to pay for their maintenance.

The ultimate problem, as pointed out in the blog post on Naked Capitalism, is that documentation is usually skimped on since it "doesn't provide value". I would also add that when people say "code is documented" they don't usually mention whether their systems infrastructure is documented. That can sometimes be an even bigger impediment. At a previous job there was a Perl CGI script that most people didn't know about and even fewer understood. If that script didn't work our whole load balancing infrastructure would "mysteriously" fail since app servers wouldn't register themselves with the web servers, leading to a full blown outage. It was such an obscure "feature" that you could literally spend weeks chasing other avenues since this was so non-obvious.

Also I would not take comfort in having source code to an application. A lot of customers of startups will write in their contracts that if a startup goes bust they get access to the source code. That may sound nice but it doesn't mean you will necessarily be able to run it. There are so many "secret" recipes and undocumented workarounds involved in running most complex pieces of software that you should really be cautious.

In closing, if you care that your software runs, make sure you keep around at least a couple of folks who have run it.


Simple “web service” for Ganglia metrics

Friday, September 11th, 2009

Here is a simple PHP script to allow you to get current Ganglia metrics. You will need a Ganglia web installation. Drop this script somewhere on the web server. Then invoke it via e.g.


Where server is the name of the server for which you want metrics and metric_name is the exact name of the metric you are looking for e.g. load_one, disk_free etc. The only thing that is returned is either an ERROR message or the actual value.



<?php
# Assumed path to the Ganglia web installation; adjust for your setup.
$GANGLIA_WEB = "/var/www/html/ganglia";

include_once "$GANGLIA_WEB/conf.php";
include_once "$GANGLIA_WEB/get_context.php";
# Set up for cluster summary
$context = "cluster";
include_once "$GANGLIA_WEB/functions.php";
include_once "$GANGLIA_WEB/ganglia.php";
include_once "$GANGLIA_WEB/get_ganglia.php";

# Get a list of all hosts
$ganglia_hosts_array = array_keys($metrics);

$found = 0;

# Find a FQDN of a supplied server name (first host containing the name wins).
for ( $i = 0 ; $i < sizeof($ganglia_hosts_array) ; $i++ ) {
 if ( strpos( $ganglia_hosts_array[$i], $_GET['server'] ) !== false ) {
  $fqdn = $ganglia_hosts_array[$i];
  $found = 1;
  break;
 }
}

# Print the metric value, or an ERROR message if host or metric is unknown.
if ( $found == 1 ) {
 if ( isset($metrics[$fqdn][$_GET['metric_name']]['VAL']) ) {
  echo($metrics[$fqdn][$_GET['metric_name']]['VAL']);
 } else {
  echo("ERROR: Metric value not found");
 }
} else {
 echo "ERROR: Host not found";
}
?>


Nothing fancy. It contains rudimentary error checking so please be gentle :-). Feel free to extend it to satisfy your needs. Also this is likely not scalable if you have hundreds of hosts and tons of requests.
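On the calling side, a client only needs to build the query string and tell an ERROR message apart from a value. The Python sketch below assumes the script is reachable at some URL of your choosing (the one in the usage note is a placeholder, not a real endpoint); the `server` and `metric_name` parameters are the ones the script reads, everything else is illustrative.

```python
from urllib.parse import urlencode


def metric_url(script_url, server, metric_name):
    """Build the request URL for the metric script."""
    query = urlencode({"server": server, "metric_name": metric_name})
    return script_url + "?" + query


def parse_metric_response(body):
    """Return the metric value as text, or raise LookupError on an ERROR reply."""
    text = body.strip()
    if not text:
        raise LookupError("empty response")
    if text.startswith("ERROR"):
        raise LookupError(text)
    return text
```

For example `metric_url("http://ganglia.example.com/metric.php", "web10", "load_one")` gives a URL you could fetch with `urllib.request.urlopen` and hand the decoded body to `parse_metric_response`.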

Broken hostname resolution and PAM don’t mix

Wednesday, September 9th, 2009

I don't mean PAM the cooking spray but Pluggable Authentication Modules. I was asked to change some DNS settings for a set of hosts ie. move them from one domain to another. At the end of the process the head node all of a sudden started refusing logins with the following error message

fatal: Access denied for user vvuksan by PAM account configuration

It took some hair pulling but after a while I concluded that the head node's hostname was still set to the old name, which was no longer resolvable. As soon as the hostname was changed ie.

% hostname

Things automagically started working again. Hope this prevents someone from going bald :-).
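A quick sanity check for this situation is to ask whether the machine's own hostname still resolves. The Python sketch below is a generic check using the standard `socket` module, not something PAM itself does; the function names and the injectable `resolver` argument are choices made for the example.

```python
import socket


def hostname_resolves(name, resolver=socket.getaddrinfo):
    """Return True if the name resolves, False if the lookup fails."""
    try:
        resolver(name, None)
    except socket.gaierror:
        return False
    return True


def own_hostname_resolves():
    """Check the hostname this machine currently reports for itself."""
    return hostname_resolves(socket.gethostname())
```

Running `own_hostname_resolves()` after a DNS change would have flagged the stale hostname before logins started failing.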

Cloud computing’s Achilles Heel

Tuesday, September 1st, 2009

I have touched upon this issue before, however here are some illustrations of what I think is cloud computing's Achilles heel. It has to do with shared hardware and virtualization. In my case I have a Drupal site running in a Xen guest running on top of a Xen host. For whatever reason, while being indexed by a Google bot, Apache went "crazy" allocating tons and tons of memory and swapping like crazy. At this point the Xen guest is nearly unusable since the load is close to 100.


Now let's look at what is happening to the underlying Xen host ie. one that runs the Xen guest


Yikes. If you had another instance on this particular Xen host you can bet that instance would be severely affected. The trouble is that you may not be really aware of it since you do not have access to the underlying hardware. You may be scratching your head why all of a sudden you are getting subpar performance. Also if you are a cloud provider how do you deal with situations like this? Do you simply shut down machines that exceed certain performance thresholds? What if this happens to be a production database server which is doing a database dump and should be "allowed" to thrash the disk? What if you shut it down and you corrupt a customer's database? It gets real tricky real quick.

Also forget about oversubscription. You need only one poorly behaving guest to ruin it for everyone else. And the more you oversubscribe, the greater the risk of performance degradation.