Archive for the ‘Uncategorized’ Category

Vonage the new Baby Bell

Thursday, May 13th, 2010

It sometimes amazes me how new upstarts morph into their own arch-enemies. Case in point: Vonage. For years I had Vonage service at home as a backup phone line. I was on a 500-minute plan for $14.99 plus taxes. This was a great plan for me since I didn't use the phone much. However, at some point they decided that was too little money and hiked the price to $16.99 (something like that). It may seem like a small difference, but I figured I might be better off elsewhere. I ended up switching to Galaxy Voice, which I am still using today since they had more flexible calling plans.

We recently expanded our office space and we needed a phone line added to a conference room. Since I had my old Vonage adapter at home, I figured I would bring it in and we'd use it. I thought it would be as easy as going to Vonage's web site, supplying the phone adapter ID and my credit card number, and I would be set. It wasn't so. After entering the phone ID I got this message:

The MAC address you entered is associated with an existing Vonage account. Please call our Customer Care department at 1-866-293-5676 for immediate assistance.

I called the number and spoke to someone in customer service. This took about 20 minutes, during which the person kept re-asking for the same data and concluded that they couldn't help me and that I would have to talk to tech support. The tech support guy was equally unhelpful. Basically, I could not activate a device that had ever been used before, since the system "knew" about it. Talk about having a piece of useless technological trash. At that point I was sufficiently frustrated to end the call. I tweeted about my experience and a day later I was contacted by Vonage's Twitter team about having someone from customer service contact me. I thought I'd give it a go. I got a call, and this experience was not a whole lot better than the previous ones. The person kept asking me for my personal information, including my name, billing address, the credit card number I used to pay my bills and the e-mail address I used. Since this was more than a year ago and I have dozens of e-mail addresses, I said I couldn't remember. At that point I was frustrated enough to end the call. I was willing to give these people money, yet they were making me jump through all these hoops. I don't get it.

It occurred to me later that this was very similar to the experiences I had with the local phone company whenever I moved and had to jump through all kinds of bureaucratic hoops to make sure all my features stayed the same after the move.

Devops homebrew

Friday, April 9th, 2010

There has been quite a bit of discussion about Devops and what it means. @blueben has suggested we start a Devops patterns cookbook so people can learn what has and hasn't worked. This is a description of the environment we implemented at a previous job. Some of these things may or may not work for you. I will try to keep it short.

Environment background

Seven distinct applications/products that had to be deployed and tested, i.e. a base/core application, a messaging platform, a reporting app, etc. All applications were Java-based, running on either Tomcat or JBoss.

Application design for deployment

These are some of the key points:

  1. The application should have sane default configuration options. Any option should be overridable by an external file. In most cases you only need to override database credentials (host, username, password). The goal is to be able to use the same binary across multiple environments.
  2. The application should expose key internal metrics. We, for instance, asked for a simple key/value-pair web page, e.g. JMSenqueue=OK. This is important because there are lots of things that can break inside the application which external monitoring may miss, such as a JMS message that can't be enqueued (see the sketch after this list).
  3. Keep release-notes actions to a minimum. Release notes are often not followed, or only partially followed, so make sure point 1 is honored and/or try to automate everything else.
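
To make point 2 concrete, here is a minimal sketch of an external check that scrapes such a key/value status page and flags anything that is not OK. The URL, the page format and the metric names are assumptions for the example, not what we actually ran.

#!/bin/bash
# Hypothetical status page returning lines such as "JMSenqueue=OK".
STATUS_URL="http://app01:8080/status"

curl -s --max-time 5 "$STATUS_URL" | while IFS='=' read -r key value; do
  if [ "$value" != "OK" ]; then
    echo "WARNING: $key reports $value"
  fi
done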

Continuous Integration

We used CruiseControl for Continuous Integration. It was used solely to make sure that someone didn't break the build.

Creating releases

Developers are in charge of building and packaging releases, primarily because QA or Ops will not know what to do if a build fails (this is Java, remember). Each release has to be clearly labeled with its version and tagged in the repository. For example, Location 1.1.5 would be packaged as location-1.1.5.tar.gz. Archives should contain only WAR (Tomcat) or EAR (JBoss) files and DB patch files. Releases are to be deposited into the appropriate file share, e.g. /share/releases/location.
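
As an illustration, the packaging step for such a release could be as small as the following sketch (the app name, version and paths are made up for the example):

#!/bin/bash
# Hypothetical packaging of the "location" application, version 1.1.5.
APP=location
VERSION=1.1.5
RELEASE_DIR=/share/releases/$APP

mkdir -p "$RELEASE_DIR"
# The archive contains only the deployable artifacts and the DB patches.
tar czf "$RELEASE_DIR/$APP-$VERSION.tar.gz" "$APP.war" dbpatches/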

Deployment

In order to eliminate most manual deployment steps and support all the different applications, we decided to write our own deployment tool. First we started off with a data model, which roughly broke down into:

  1. Applications – can use different app server containers, i.e. Tomcat or JBoss, and may/will have configuration files that can be either key/value pairs or templates. For every application we also specified a start and a stop script (hot deploy was not an option due to bad experiences with our code).
  2. Domains/Customers – we wanted a single dashboard that would allow us to deploy to multiple environments, e.g. QA staging (current release), QA development (next scheduled release), dev playbox, etc. Each of these domains had its own set of applications it could deploy, with its own configuration options.

First we wrote a command-line tool that was capable of doing something like this:

$ deployer --version 1.2.5 --server web10 --domain joedev --app base --action deploy

What this would do is:

  1. Find and unpack the proper app server container e.g. jboss-4.2.3.tar.gz
  2. Overlay WAR/EAR files for the named version e.g. base-1.2.5.tar.gz
  3. Build configuration files and scripts
  4. Stop the server on the remote box (if it's running)
  5. Rsync the contents of the packaged release
  6. Make sure the Apache AJP proxy is configured to proxy traffic and do an Apache reload
  7. Start up the server

One of the main reasons we started off with a command-line tool is that we could easily write batch scripts to upgrade whole sets of machines. This was born out of the pain of having to upgrade 200 instances via a web GUI at another job. A rough sketch of such a batch upgrade follows.
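
For example, upgrading a set of servers could be driven by a loop like this (the server names, domain and version are illustrative; it assumes the deployer CLI shown above):

#!/bin/bash
# Hypothetical batch upgrade of the "base" app across several servers.
VERSION=1.2.5
DOMAIN=joedev

for server in web10 web11 web12; do
  echo "Deploying base $VERSION to $server"
  deployer --version "$VERSION" --server "$server" --domain "$DOMAIN" \
           --app base --action deploy || { echo "Deploy to $server failed"; exit 1; }
done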

Once the deployer was working we wrote a web GUI that interfaced with it. You could do things like view the running config (what config options are actually on the app server), stop, restart, deploy (a particular version), reconfig (apply config changes) and undeploy. We also added the ability to change or add configuration options in the application-specific override files. A picture is worth a thousand words; this is a tiny snippet of roughly how it looked for one domain.

This was a big win since QA or developers no longer needed to have someone from ops deploy software.

DB patching

Another big win was "automated" DB patching. Every application had a table called Patch with a list of DB patches that had already been applied. We also agreed that every app would have a dbpatches directory in the app archive containing the patch files, named with the version and the order in which they should be applied, e.g.

  • 2.54.01-addUserColumn.sql
  • 2.54.02-dropUidColumn.sql

During deployment the startup script would compare the contents of the Patch table with the list of files in dbpatches and apply any missing ones, as sketched below. If a patch script failed, an e-mail would be sent to the QA or dev person in charge of the particular domain.
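
Here is a minimal sketch of that compare-and-apply step, assuming a Patch table with a single name column and the mysql command-line client; the database name, schema and mail address are made up:

#!/bin/bash
# Hypothetical DB patch step: apply any dbpatches/*.sql not yet recorded
# in the Patch table.
DB="myapp"
PATCH_DIR="dbpatches"

for patch in "$PATCH_DIR"/*.sql; do
  name=$(basename "$patch")
  applied=$(mysql -N -e "SELECT COUNT(*) FROM Patch WHERE name='$name'" "$DB")
  if [ "$applied" -eq 0 ]; then
    echo "Applying $name"
    if mysql "$DB" < "$patch"; then
      mysql -e "INSERT INTO Patch (name) VALUES ('$name')" "$DB"
    else
      echo "Patch $name failed" | mail -s "DB patch failure: $DB" qa-owner@example.com
      exit 1
    fi
  fi
done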

A slightly modified process was used in production to try to reduce downtime, i.e. things like adding a column could be done at any time. The automated process was largely there to make QA's job easier.

QA and testing

When a release was ready, QA would deploy it themselves. If there was a deployment problem they would attempt to troubleshoot it themselves and then contact the appropriate person. Most of the time it was an app problem, i.e. a particular library didn't get committed, etc. This was a huge win since we avoided a lot of "waterfall" problems by allowing QA to self-service.

Production

The production environment was strictly controlled. Only ops and a couple of key engineers had access to it. The reason was that we tried to keep the environment as stable as possible, so ad hoc changes were frowned upon. If you needed to make a change you either had to commit it to the configuration management system (Puppet) or use the deployment tool.

Production deployment

The day before the release, QA would open a ticket listing all the applications and versions that needed to be deployed. On the morning of the deployment (that was our low-traffic time) someone from ops, development and the whole QA team engaged in deploying the app and resolving any observed issues.

Monitoring

Regular metrics such as CPU utilization, load, etc. were collected. In addition, we kept track of internal metrics and set up appropriate alerts. This is an ongoing process, since over time you discover what your key metrics are and what their thresholds should be, i.e. number of threads, number of JDBC connections, etc.

Things that didn't work so well or were challenging

  1. One of the toughest parts was getting developers' attention to add "goodies" for ops. Specifically, exposing application internals was often put off until we eventually had an outage, and the lack of that metric turned it into an extended outage.
  2. The deployment tool took a couple of tries to get right. Even as it was, there were a couple of things I would have done differently, i.e. not relying on a relational database for the data model, since that made it difficult to create diffs (you had to dump the whole DB). I'd likely go with JSON so that diffs could be easily reviewed and committed.
  3. Other issues I can't recall right now :-)

Wrapup

This is the shortest description I could write. There are a number of things I glossed over or omitted so that this doesn't get too long. I may write about those on another occasion. Perhaps the key takeaway should be that ops should focus on developing tools that either automate things or allow their customers (QA, dev, technical support, etc.) to self-service.

Update: There is a second part to this post.

Devops religion wars

Tuesday, April 6th, 2010

I have been trying to stay out of the devops arguments, but it seems that they are slowly devolving into religious wars. Each group, i.e. devops and non-devops, seems convinced that it is in possession of "eternal self-evident truths" and that everyone else is an unenlightened hater or similar. Case in point is the following post:

http://brian.moonspot.net/devops-dealnews

Brian describes their devops process which seems reasonable to me. What is most important is that it works for him, his group and his site.

Unfortunately the comments devolve from there. A non-devops person raises a good point about the process, but does so with poor style and insulting language. The response is to compare the devops and non-devops approaches to giving a man a fish vs. teaching him to fish. It goes downhill from there. It's all just too silly. Firstly, I am not aware of a definitive devops definition. Secondly, every environment is different. What works for you may not work everywhere else. I really doubt that continuous deployment would work if your web app were used to provide emergency medical care. That said, things have changed and availability expectations have increased, so cooperation between development and ops is critical. Therefore let's try to stop the silly arguments and learn from each other. Most of all, avoid insulting language. I realize we all get frustrated at times, but it really devalues your point of view.

Password complexity madness

Friday, January 22nd, 2010

You know the pitch. Each time you create an account for a "secure" site you are forced to come up with a complex password, i.e. you need to have a number, a capitalized letter, perhaps a special character such as + or -. The trouble is that the policies differ, so on one site the password has to be a minimum length, on another a maximum length, some don't allow special characters, etc. The thing is, at one point in time this made sense and was required for basic security, but it may not make sense today.

Ages ago, computer systems (in particular UNIX systems) used to store passwords in a hashed format (you can read more about cryptographic hashes on Wikipedia). The trouble is that these hashes were available for any user to see, i.e. you could copy the password file (/etc/passwd) or use YP/NIS tools to get a list of all password hashes in an organization. Once you have the password file you do not know what the passwords are; however, you can take a word dictionary and start computing hashes, since a particular password will always convert to the same hash, and compare them against your password file for matches. If you find a match you have "discovered" a user's password. This is often referred to as off-line password cracking, since it allows you to derive passwords without interacting with the target system. It has many advantages: you can try millions of passwords quickly and the target system's administrator will not be alerted. Based on this, password policies were instituted that mandated password complexity, since a complex password, e.g. 9pc_miu, would be nearly impossible or very hard to break (it might take years). This made sense then.
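
For illustration, here is a minimal sketch of such an off-line dictionary attack against MD5-crypt hashes (the $1$salt$hash format). The hash file, its layout and the wordlist path are assumptions for the example:

#!/bin/bash
# Hypothetical input: hashes.txt contains lines of the form user:$1$salt$hash
HASHFILE="hashes.txt"
WORDLIST="/usr/share/dict/words"

while IFS=: read -r user hash; do
  salt=$(echo "$hash" | cut -d'$' -f3)
  while read -r word; do
    # Hash the candidate word with the same salt and compare.
    if [ "$(openssl passwd -1 -salt "$salt" "$word")" = "$hash" ]; then
      echo "$user: $word"
      break
    fi
  done < "$WORDLIST"
done < "$HASHFILE"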

However, it doesn't make much sense now, since on most systems regular users have no access to the password hashes. On UNIX systems a "shadow" file (/etc/shadow) is used to hide them, or you may be using LDAP, which can also hide password hashes, etc. The only users that have access to those hashes are administrators, and they have other ways of acquiring your passwords. Thus your real exposures, in order of importance, are:

  • Trivial or easily guessable passwords, i.e. 123456, 1234, your date of birth
  • Using the same password across different sites, i.e. a problem if, e.g., site A.com gets hacked and the hackers are able to determine your password and log into site B.com

I actually feel that password complexity breeds poor security, since people will write down complex passwords instead of remembering them. Just think of how many times you have seen passwords on post-it notes on someone's monitor. Perhaps it is time to scrap password complexity and use something simpler.

Keeping an eye on binary log growth

Thursday, October 1st, 2009

Recently I got a report that some pages on the site were extremely slow. Looking at the web server metrics didn't show anything new; however, the MySQL DB metrics showed a definite change.

[Figure: MySQL server CPU utilization]

i.e. at the end of week 38 there is a nearly 60% increase in CPU utilization. Interestingly enough, there was a new software release at the end of week 38, which pointed to either a bug or a new feature. Luckily I have been collecting MySQL metrics using this gmetric script, which led me to these two graphs:

[Figure: MySQL UPDATE statements]

[Figure: MySQL INSERT statements]

So nearly double the number of inserts and nearly triple the updates. Using mysqlbinlog I analyzed the UPDATE and INSERT statements, identified the two culprit statements, and sent them off to the developers.
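
If you want to do something similar, a rough first pass over statement-based binary logs can be as simple as counting writes per table; the log location and naming below are assumptions:

# Count INSERT/UPDATE statements per table across the binary logs.
mysqlbinlog /var/lib/mysql/mysql-bin.0000* \
  | grep -iE '^(insert|update)' \
  | awk '{ if (toupper($1) == "INSERT") print "INSERT", $3; else print "UPDATE", $2 }' \
  | sort | uniq -c | sort -rn | head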

I also observed that, had I watched the binary log growth, I might have identified this earlier, since there were a lot more binary logs for the period after the release. Thus a MySQL average binary log growth rate gmetric was born :-). Now all I need to do is find out what the normal growth rate is and, if it goes outside that norm, have Nagios send me a non-urgent alert.
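
The gmetric itself can be quite small. Here is a sketch of the idea, assuming the binary logs live in /var/lib/mysql and the script runs from cron every few minutes; the paths and the metric name are illustrative:

#!/bin/bash
# Report how many bytes the MySQL binary logs grew since the last run.
STATE=/var/tmp/binlog_size.last
CUR=$(du -cb /var/lib/mysql/mysql-bin.* 2>/dev/null | awk '/total/ {print $1}')
CUR=${CUR:-0}
LAST=$(cat "$STATE" 2>/dev/null || echo "$CUR")
echo "$CUR" > "$STATE"

# Log rotation/purging can make the delta negative; clamp it to zero.
GROWTH=$(( CUR - LAST ))
[ "$GROWTH" -lt 0 ] && GROWTH=0

gmetric --name mysql_binlog_growth --value "$GROWTH" --type uint32 --units bytes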

Software doesn’t run itself

Sunday, September 13th, 2009

Perhaps I should no longer be surprised, but I am, by the article mentioned in this blog post:

http://www.nakedcapitalism.com/2009/09/another-lehman-mess-no-one-can-run-the-software.html

In particular this:

Once it went bankrupt, the staff who supported these systems “evaporated”, according to Steven O’Hanlon, president of Numerix, a pricing and valuation company which is working with Lehman Brothers Holding Inc to unwind the derivatives portfolio.

These days computer systems are the lifeblood of your company, so allowing critical technical staff to simply "evaporate" is mind-boggling. Granted, the company imploded, but I would still think that someone going into bankruptcy should have figured out that they needed to set aside money to pay for the systems' maintenance.

The ultimate problem, as pointed out in the blog post on Naked Capitalism, is that documentation is usually skimped on since it "doesn't provide value". I would also add that when people say "the code is documented" they usually don't mean that their systems infrastructure is documented. That can sometimes be an even bigger impediment. At a previous job there was a Perl CGI script that most people didn't know about and even fewer understood. If that script didn't work, our whole load-balancing infrastructure would "mysteriously" fail, since app servers wouldn't register themselves with web servers, leading to a full-blown outage. It was such an obscure "feature" that you could literally spend weeks chasing other avenues, since it was so non-obvious.

I would also not take comfort in having the source code to an application. A lot of customers of startups will write into their contracts that if the startup goes bust they get access to the source code. That may sound nice, but it doesn't mean you will necessarily be able to run it. There are so many "secret" recipes and undocumented workarounds involved in running most complex pieces of software that you should really be cautious.

In closing, if you care that your software runs, make sure you keep around at least a couple of the folks who have run it.


Simple “web service” for Ganglia metrics

Friday, September 11th, 2009

Here is a simple PHP script that allows you to get current Ganglia metrics. You will need a Ganglia web installation. Drop this script somewhere in it, then invoke it via e.g.

http://mygangliaserver/ganglia/metric.php?server=web1&metric_name=load_one

Where server is the name of the server for which you want metrics and metric_name is the exact name of the metric you are looking for, e.g. load_one, disk_free, etc. The only thing returned is either an ERROR message or the actual value.

<?php

$GANGLIA_WEB="/var/www/html/ganglia";

include_once "$GANGLIA_WEB/conf.php";
include_once "$GANGLIA_WEB/get_context.php";
# Set up for cluster summary
$context = "cluster";
include_once "$GANGLIA_WEB/functions.php";
include_once "$GANGLIA_WEB/ganglia.php";
include_once "$GANGLIA_WEB/get_ganglia.php";

# Get a list of all hosts
$ganglia_hosts_array = array_keys($metrics);

$found = 0;

# Find a FQDN of a supplied server name.
for ( $i = 0 ; $i < sizeof($ganglia_hosts_array) ; $i++ ) {
  if ( strpos( $ganglia_hosts_array[$i], $_GET['server'] ) !== false ) {
    $fqdn = $ganglia_hosts_array[$i];
    $found = 1;
    break;
  }
}

if ( $found == 1 ) {
  if ( isset($metrics[$fqdn][$_GET['metric_name']]['VAL']) ) {
    echo($metrics[$fqdn][$_GET['metric_name']]['VAL']);
  } else {
    echo("ERROR: Metric value not found");
  }
} else {
  echo "ERROR: Host not found";
}

?>

Nothing fancy. It contains only rudimentary error checking, so please be gentle :-). Feel free to extend it to satisfy your needs. Also, this is likely not scalable if you have hundreds of hosts and tons of requests.
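
If it helps, here is how you might consume it from a shell script, e.g. as a quick check; the host name and URL are just the example from above:

load=$(curl -s "http://mygangliaserver/ganglia/metric.php?server=web1&metric_name=load_one")
case "$load" in
  ERROR*) echo "lookup failed: $load" ;;
  *)      echo "web1 load_one is $load" ;;
esac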

Broken hostname resolution and PAM don’t mix

Wednesday, September 9th, 2009

I don't mean PAM the cooking spray, but Pluggable Authentication Modules. I was asked to change some DNS settings for a set of hosts, i.e. move them from one domain to another, e.g. from domain.com to domain.net. At the end of the process the head node all of a sudden started refusing logins with the following error message:

fatal: Access denied for user vvuksan by PAM account configuration

It took some hair pulling, but after a while I concluded that the head node's hostname was set to the old name, e.g. server5.domain.com, which was no longer resolvable. As soon as the hostname was changed, i.e.

% hostname server5.domain.net

Things automagically started working again. Hope this prevents someone from going bald :-).
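
If you run into something similar, a quick sanity check is to confirm that whatever hostname the box thinks it has actually resolves (standard commands, nothing specific to my setup):

hostname                                      # what the box thinks it is called
getent hosts "$(hostname)" || echo "hostname does not resolve; PAM may refuse logins"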

Cloud computing’s Achilles Heel

Tuesday, September 1st, 2009

I have touched upon this issue before; however, here are some illustrations of what I think is cloud computing's Achilles heel. It has to do with shared hardware and virtualization. In my case I have a Drupal site running in a Xen guest on top of a Xen host. For whatever reason, while being indexed by the Google bot, Apache went "crazy", allocating tons and tons of memory and swapping like mad. At this point the Xen guest is nearly unusable since the load is close to 100.

[Figure: Xen guest load]

Now let's look at what is happening on the underlying Xen host, i.e. the one that runs the Xen guest:

[Figure: Xen host load]

Yikes. If you had another instance on this particular Xen host, you can bet that instance would be severely affected. The trouble is that you may not really be aware of it, since you do not have access to the underlying hardware; you may just be scratching your head wondering why all of a sudden you are getting subpar performance. Also, if you are a cloud provider, how do you deal with situations like this? Do you simply shut down machines that exceed certain performance thresholds? What if this happens to be a production database server which is doing a database dump and should be "allowed" to thrash the disk? What if you shut it down and corrupt the customer's database? It gets real tricky real quick.

And forget about oversubscription. It only takes one poorly behaving guest to ruin it for everyone else, and the more you oversubscribe, the greater the risk of performance degradation.

Trouble with cloud computing

Sunday, August 23rd, 2009

While we are on the subject of cloud computing, the real problem with it is that, rightfully or not, it has been portrayed as the computing infrastructure "savior". Just check out the description on Wikipedia (and I am not really picking on Wikipedia, just using it as a representative quote):

Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.[1][2] Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.[3]

Now that you have heard that kind of "pitch", it is hardly surprising to hear how great an idea it would be to move everything off to the cloud, thus avoiding any capital expenses on equipment, avoiding having to maintain the hardware, gaining "unlimited" scalability, etc. The trouble is that this is only a small piece of the whole infrastructure puzzle. In reality, the only thing clouds let me do is easily "create" and "destroy" hardware (I guess they abstract hardware). That is certainly a nice feature and no doubt has some value. However, clouds don't "automatically" scale (they need some type of middleware to do that), they don't automatically configure themselves (configuration management or manual work does that), nor do they automatically monitor or alert on stuff in YOUR application. You have to do that. Lots of it.

This actually reminds me of the "managed hosting" pitch a lot of colocation providers will try to sell you on. In some cases they would scoff at the fact that you just wanted plain-Jane colocation, i.e. a cabinet, some amount of power, dual network drops, etc. No, they wanted to sell you on stuff like managing your OS and updates, doing backups, etc., so that you could spend your time on more "productive" tasks. That is all nice, but it provides very little to no value to me. I can install an OS and all the updates through a mixture of PXE boot, console access and configuration management in less than 10 minutes, and I know exactly what needs to be backed up. Do I really need /usr/ backed up on every web machine? No.

In closing, while evaluating cloud computing make sure you look hard at the problem you are really trying to solve. Clouds do not in themselves provide you with magical ponies. You still have to do most of the work.