
Integrating Graphite with Ganglia

September 29th, 2010

Some time ago I saw a demo of Graphite (http://graphite.wikidot.com/). I was impressed by the ease of creating custom graphs and by their quality and visual appeal. The trouble was that Graphite uses its own storage engine instead of RRD, and I figured it might be too much work to figure out how to inject my existing Ganglia metrics.

A couple of days ago I saw a tweet from Mike Brittain at Etsy on how Graphite is becoming one of his favorite graphing tools. I know that they use Ganglia at Etsy so I asked if/how they integrate Graphite with Ganglia. He pointed me in the direction of Erik Kastner, who has done the Ganglia Graphite integration. I asked him if he could post the patches and he was gracious enough to do so. In a nutshell, he uses the RRD files directly and rsyncs them every few minutes. While trying to install Graphite I realized that injecting metrics into Graphite is really simple. For example, graphite-web contains a simple client example that injects system load. All it does is connect to port 2003 of the Graphite installation and send the following payload

system.loadavg_1min 0.08 1285763852
system.loadavg_5min 0.02 1285763852
system.loadavg_15min 0.01 1285763852

That's simple :-) i.e. a metric name, a value and what looks like the current UNIX timestamp. I then remembered that Kostas Georgiou showed me a Ruby script that connects to gmond, retrieves the XML for the host, parses it and adds the metrics to Facter. Unfortunately that didn't seem to have much value until now :-) . What I did is change Kostas' script to send metrics to Graphite instead of adding them to Facter. You can find the result in the Ganglia Add-Ons GitHub repository. You can run the script either from cron or as a daemon.
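
The payload format described above can be sketched in a few lines. This is a minimal illustration in Python (not the Ruby script from the repository); the function names are made up for this example, and the host and port defaults assume a Graphite listener on localhost:2003.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Return one payload line: '<metric name> <value> <unix timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)


def send_metrics(lines, host="localhost", port=2003):
    """Open a TCP connection to the Graphite listener and send all lines."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

Calling send_metrics([format_metric("system.loadavg_1min", 0.08)]) would push a single data point, assuming something is listening on that port.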

There are two ways to do this. I have tested only the first way. I am not sure if the graphite receiver would freak out if it gets too many metrics in a payload. Let me know if you know :-) .

1. Run this script on every host that runs gmond. This may be somewhat tricky since I usually set up gmond to only send metrics and turn off receiving by setting deaf = yes. For this approach to work you have to turn on receiving. To make it more secure we'll listen only on loopback. In the globals section make sure you have these settings

  mute = no
  deaf = no

In the rest of the file make sure you add/have

udp_send_channel {
  host = 127.0.0.1
  port = 8649
  ttl = 1
}
udp_recv_channel {
  bind = 127.0.0.1
  port = 8649
}
tcp_accept_channel {
  bind = 127.0.0.1
  port = 8649
}

2. Run this on the main gmond collector daemon. The main gmond collector daemon will have metrics from all hosts. The trouble is that I haven't tested injecting thousands of metrics in a single payload. I'm sure there is a way around it and perhaps someone can post a patch :-D .
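
One possible way around the large-payload worry is to split the metrics into smaller batches before sending. The sketch below is an assumption-laden illustration in Python, not the script from the repository: it assumes gmond dumps its XML on a TCP connect to port 8649, that HOST elements carry NAME and REPORTED attributes and METRIC elements carry NAME and VAL, and the batch size of 500 is an arbitrary, untested number.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="127.0.0.1", port=8649):
    """Read the XML dump gmond writes to its TCP accept channel."""
    chunks = []
    with socket.create_connection((host, port), timeout=5) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def gmond_xml_to_lines(xml_data):
    """Turn gmond XML into '<hostname>.<metric_name> <value> <timestamp>' lines."""
    root = ET.fromstring(xml_data)
    lines = []
    for host in root.iter("HOST"):
        hostname = host.get("NAME")
        reported = host.get("REPORTED")
        for metric in host.iter("METRIC"):
            value = metric.get("VAL")
            try:
                float(value)  # skip string metrics such as os_name
            except (TypeError, ValueError):
                continue
            lines.append("%s.%s %s %s\n" % (hostname, metric.get("NAME"), value, reported))
    return lines


def batches(lines, size=500):
    """Yield the lines in chunks so no single payload gets too large."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]
```

Each batch could then be written to port 2003 in its own connection or with a short pause in between; whether that is needed at all depends on how the Graphite receiver copes, which remains untested here.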

Future Improvements

I can think of a couple of possible improvements

  1. There is a rewrite of gmetad written in Python. It supports plugins. I don't think it would be a stretch to add a plugin where gmetad sends data to Graphite when it updates the RRDs.
  2. Currently metrics are sent as <hostname>.<metric_name>. It may make sense to send them into the appropriate part of the tree i.e. <type_of_metric>.<hostname>.<metric_name> e.g. database.web1.mysql_selects
  3. Better integrate the Ganglia Web UI and Graphite. Graphite supports flexible URL parameters so this should be doable.
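
The naming idea in the second item could look roughly like this. It is only a sketch in Python with a made-up function name; replacing dots in the hostname with underscores is an extra assumption made here because Graphite uses dots to separate levels of the metric tree.

```python
def graphite_path(hostname, metric_name, metric_type=None):
    """Build '<type_of_metric>.<hostname>.<metric_name>' (type is optional)."""
    # Dots separate tree levels in Graphite, so flatten a dotted hostname.
    safe_host = hostname.replace(".", "_")
    parts = [safe_host, metric_name]
    if metric_type:
        parts.insert(0, metric_type)
    return ".".join(parts)
```

For example, graphite_path("web1", "mysql_selects", "database") gives database.web1.mysql_selects.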

And obligatory screenshots. This is the stacked graph I created in 20 seconds :-)

Graphite view of Ganglia Metrics

EC2 micro instances cost analysis

September 9th, 2010

Amazon today announced the addition of EC2 micro instances, their smallest instance size, which come with 613 MB of RAM and are priced at $0.02/hour. You can read more about the announcement here

http://aws.typepad.com/aws/2010/09/new-amazon-ec2-micro-instances.html

There is a wrinkle though. There is no local (ephemeral) storage so you need to use EBS-backed volumes. EBS is charged at $0.10/GB per month plus $0.10 per 1 million I/O requests to the volume. That is actually a reasonably good idea since it likely cuts down on I/O subsystem abuse: if you start abusing I/O it will cost you. That said, I thought I would run a quick cost analysis to determine how much it would actually cost to run an instance. I have a personal server I use for handling my family's e-mail, blog and personal web sites. It gets little traffic. I use roughly 30 GB of storage. To find out the number of I/O ops I ran the following command

> cat /proc/diskstats | egrep "sd[a-b] " | awk '{print $4" "$8}'
154756 3576927
773387 1844813

This lists the number of read and write ops for both drives in my machine. It adds up to about 6 million I/O ops. The machine was last rebooted 7 days ago, which makes this roughly 25 million I/O ops per month. On the outbound network traffic side I have consumed 5 GB of traffic so far, so about 20 GB per month (charged at $0.15/GB).
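
The same counting and extrapolation can be sketched in Python. This is an illustration only: the function names are made up, and it assumes the /proc/diskstats layout where the fourth and eighth fields are reads completed and writes completed (the same fields the awk command prints).

```python
import re


def total_io_ops(diskstats_text, device_pattern=r"sd[a-b]$"):
    """Sum completed reads (field 4) and writes (field 8) for whole disks."""
    total = 0
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields[2] is the device name; skip partitions such as sda1
        if len(fields) < 8 or not re.match(device_pattern, fields[2]):
            continue
        total += int(fields[3]) + int(fields[7])
    return total


def per_month(amount, days_observed, days_in_month=30):
    """Scale a counter observed over some days to a 30-day month."""
    return amount * days_in_month / days_observed
```

On a Linux box you could feed it open("/proc/diskstats").read(). With the exact counters shown above (6,349,883 ops in 7 days) the extrapolation comes to about 27 million per month, a little above the rounded 25 million used in this post.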

Thus the cost breaks down like this per month

Instance cost (30*24*$0.02) = $14.40

EBS storage charge ( 30 * $0.10) = $3.00

EBS I/O ops charge ( 25 * $0.10)  = $2.50

Outbound network traffic ( 20 * $0.15 ) = $3.00

Total: $22.90
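
To play with the assumptions, the breakdown can be wrapped in a small function. This is a sketch in Python with made-up names; the default rates are the September 2010 prices quoted in this post.

```python
def monthly_cost(storage_gb, io_millions, outbound_gb, hours=30 * 24,
                 instance_rate=0.02, storage_rate=0.10,
                 io_rate=0.10, transfer_rate=0.15):
    """Return the monthly cost breakdown in dollars as a dict."""
    items = {
        "instance": hours * instance_rate,
        "ebs_storage": storage_gb * storage_rate,
        "ebs_io": io_millions * io_rate,
        "outbound_traffic": outbound_gb * transfer_rate,
    }
    items["total"] = sum(items.values())
    return items
```

monthly_cost(30, 25, 20) reproduces the $22.90 total above; doubling the I/O ops to 50 million, as heavy swapping might, raises it to $25.40.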

Not too bad. A word of warning though. Since these micro instances come with only 613 MB of RAM, if you load even a handful of services such as a MySQL database, web or app server you may end up swapping, causing your EBS I/O ops charges to go up. I doubt these would be enormous, however depending on the level of swapping they could be 25%, 50% or even 100% higher than what you planned for. Obviously EBS has some nice features such as persistence, snapshotting and the ability to boot instances automatically after a failure, however it may come with unanticipated cost.

Update: Some have pointed out that instance costs can be even lower if you reserve (assuming a 1-yr commit, micro instances are $115/year vs. $172 non-reserved). That is true, however as I point out the biggest X factor in the whole equation is EBS charges. It's nothing that will break the bank, however I prefer having an idea up front of what the cost is. If your use case is a DNS server, mail server or Nagios checker then this fits the bill well, however if you plan to run a ticketing system or a wiki that uses a DB backend you will likely exceed the memory footprint and start swapping.

Install OpenStack Nova easily using Chef and Nova-Solo

September 1st, 2010

Inspired by Cloudscaling's Swift-Solo and excited about being able to create my own cloud, I am announcing the Nova-Solo project. OpenStack Nova is the Compute portion of the project, which is trying to build an open source stack for running an Amazon EC2 type of service. Nova-Solo is a set of Opscode Chef recipes that allow you to quickly get most parts of the Nova stack up and running. You can fetch it from GitHub at

http://github.com/vvuksan/nova-solo

At this time Nova-Solo is targeted at Ubuntu 10.04 and it relies on Soren Hansen's package repository to install all of the necessary packages. The following Nova services are installed

  • Cloud controller
  • Object store
  • Volume store
  • API server
  • Compute Server

Soren's package archive is a bit outdated so some of the things don't work. For example you can create users, generate credentials, upload files into buckets but you can't register the image. Soren has said he is in the process of building new packages and I am also in the process of doing the same so hopefully things improve quickly. Nova code is definitely alphaish so beware. To get started use git to clone the nova-solo repository and off you go

git clone git://github.com/vvuksan/nova-solo.git

In the future, as things stabilize, we'll be making adjustments to support multiple compute servers (pieces for it are already in Nova-Solo), support other distributions like RHEL/CentOS, etc.

Slides from the Boston DevOps meetup

August 25th, 2010

Here are slides from the August 3rd, 2010 Boston DevOps meetup where Jeff Buchbinder and I spoke about deployment and other helpful hints

http://www.scribd.com/doc/35757228/Deploying-Yourself-Into-Happiness

The slides have been slightly modified based on the feedback we received at the meetup. If you have any questions please post them in the comments and I'll attempt to answer them.

Tunnel all your traffic on “hostile” networks with OpenVPN

August 20th, 2010

I am often on wireless networks that are unsecured i.e. they either don't use encryption or, if they do, I may not trust that nobody will tamper with my data (you never know). To protect my traffic on such networks I decided to tunnel nearly all of it through an OpenVPN server. I will show you how you can do it yourself on your Linux or Mac laptop. You should be able to do something similar on Windows but it may be a bit more work on the client.

OpenVPN server setup

Set up OpenVPN on a network you trust e.g. home, work, cloud etc. You can either use the Community Edition of OpenVPN, which is free (http://openvpn.net/index.php/open-source/downloads.html), or you may want to pay OpenVPN for their OpenVPN appliance package. I prefer using pfSense, which is a customized FreeBSD distribution geared for firewalls/routers with a superb Web GUI. If you are gonna use the Community Edition follow the Quickstart guide.

One last step is to make sure that the VPN network i.e. 10.8.0.0/16 is NATed e.g. on a Linux OpenVPN server you could do

iptables -t nat -A POSTROUTING -s 10.8.0.0/16 -o eth0 -j MASQUERADE

OpenVPN client setup

Configure the OpenVPN client to connect to your OpenVPN server. You can find the client HOWTO here.

Make sure you can access your home/work network. This will in general provide you with "split-tunnel" access i.e. only traffic intended for your home/work network will be tunneled through the VPN and everything else will go the normal "insecure" way.

Tunnel all traffic

Update: Shame on me. Someone has already posted the directions on how to do this at

http://manoftoday.wordpress.com/2006/12/03/openvpn-20-howto/

Thanks to @somic for pointing this out.

The tricky part in all this is that OpenVPN uses a simple TUN/TAP interface through which it tunnels all the traffic. The temptation is to simply add an entry in the OpenVPN file that sets a default route through OpenVPN. This will likely fail as you will now have competing default routes. Instead, what you need to do is add a route to your VPN server that uses the wireless network's default gateway and make your VPN interface the default route. This way all the traffic goes into the VPN interface and OpenVPN takes care of tunneling it through.

For this you will need to configure an external script that fires off once the VPN tunnel is up. To enable the post-up script put the following two lines in your ovpn file

script-security 3 system
up /usr/local/bin/set_up_routes.sh

Your set_up_routes.sh would look something like this. Please change the VPN_SERVER_IP variable to the IP of your OpenVPN server.

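#!/bin/sh
# Note the wireless network default gateway
DEFAULT_GATEWAY=`netstat -nr | grep ^0.0.0.0 | awk '{ print $2 }'`
# Find the gateway used by the VPN (tun) interface
VPN_GATEWAY=`netstat -nr | grep tun | grep -v 0.0.0.0 | awk '{ print $2 }' | sort | uniq`
# Alternative lookup that overrides the value above: it reads the address from the
# ifconfig line matching 172.16, so adjust that prefix to your VPN subnet or drop this line
VPN_GATEWAY=`ifconfig  | grep 172.16 | cut -f3 -d:  | cut -f1 -d" "`
VPN_SERVER_IP="1.2.3.4"
# Remove the current default route (via the wireless gateway)
sudo /sbin/route del default
# Reach the VPN server itself through the wireless gateway
sudo /sbin/route add -host $VPN_SERVER_IP gw $DEFAULT_GATEWAY
# Don't tunnel traffic to 2.3.4.5 since it's already SSLized
sudo /sbin/route add -host 2.3.4.5 gw $DEFAULT_GATEWAY
# Send everything else through the VPN
sudo /sbin/route add default gw $VPN_GATEWAY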

This script was tested under Ubuntu Linux but should work the same under Mac OS X. On Windows you may need to use PowerShell or use Cygwin.

Tunneling traffic for specific IPs

If you only wish to tunnel traffic for a particular set of IPs you only need to add those routes to your ovpn file e.g.

route 72.0.0.0 255.0.0.0
route 75.0.0.0 255.0.0.0

You do NOT need to go through the exercise of setting up a script.
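
To double check which destinations such route lines cover, here is a small sketch in Python using the standard ipaddress module. The function names are made up, and the parser only understands the "route <network> <netmask>" form shown above.

```python
import ipaddress


def parse_routes(config_text):
    """Collect the networks named by 'route <network> <netmask>' lines."""
    networks = []
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "route":
            networks.append(ipaddress.ip_network("%s/%s" % (parts[1], parts[2])))
    return networks


def is_tunneled(ip, networks):
    """Return True if traffic to this IP would match one of the routes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

The two routes above parse to 72.0.0.0/8 and 75.0.0.0/8, so an address such as 72.14.204.99 would go through the tunnel while 8.8.8.8 would not.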

If you are looking for other OpenVPN guides, Sam Johnston has a guide on how to set up OpenVPN in a VPS

http://samj.net/2010/01/howto-set-up-openvpn-in-vps.html

Skipping MySQL replication errors

August 19th, 2010

I was talking to my buddy Jeff Buchbinder and he mentioned that he recently added the following to his MySQL configuration in order to reduce MySQL replication breakages

slave-skip-errors=1062,1053,1146,1051,1050

What this does is keep replication running when the following errors are encountered

Error: 1050 SQLSTATE: 42S01 (ER_TABLE_EXISTS_ERROR)

Message: Table '%s' already exists

Error: 1051 SQLSTATE: 42S02 (ER_BAD_TABLE_ERROR)

Message: Unknown table '%s'

Error: 1053 SQLSTATE: 08S01 (ER_SERVER_SHUTDOWN)

Message: Server shutdown in progress

Error: 1062 SQLSTATE: 23000 (ER_DUP_ENTRY)

Message: Duplicate entry '%s' for key %d

Error: 1146 SQLSTATE: 42S02 (ER_NO_SUCH_TABLE)

Message: Table '%s.%s' doesn't exist

This will avoid the very common primary key collisions and "temporary tables aren't there" problems. Writing this down for posterity. Use with caution.

Marius Ducea has a post about it as well

http://www.ducea.com/2008/02/13/mysql-skip-duplicate-replication-errors/

PHP 5.3 namespace separator

August 16th, 2010

I am posting this to help others that may encounter a similar problem.

I have been doing some PHP development recently using Predis, a PHP Redis library. While instantiating the Redis\Client object I got

Warning: Unexpected character in input:  '\' (ASCII=92) state=1 in .....

The problem is explained in this issue

http://github.com/nrk/predis/issues/closed#issue/11

If you are still running on PHP 5.2 you should use the backported version of Predis and not the mainline library which targets only PHP >= 5.3 (the backslash is the namespace separator in PHP 5.3).

More discussion of this change can be found here.

http://giorgiosironi.blogspot.com/2009/09/introspection-of-php-namespaces.html

Deployment rollback

August 12th, 2010

This is a question that often comes up in deployment discussions. How do you roll back in case of a "bad" deploy? A bad deploy can be any of the following

  • Site completely broken
  • Significant performance degradation
  • Key feature(s) broken

There are obviously a number of ways to deal with this issue. You could put up a notice on the site that feature x or y is broken while you work to fix it. Same with performance degradation. Let's however deal with rollback i.e. you decided (based on a number of different factors) that the stuff you just deployed is broken and you should roll back to the last known good version. In such a case you would

  • Undo any configuration changes you may have applied (often none)
  • Deploy the last known good version. This is one of the reasons why I prefer using labelled binary packages. I simply instruct the deployment tool to install version 1.5.2, which was the last good version, and off we go.

The only caveat is database changes. In general you can't easily undo DB changes, especially when you discover a deployment problem a couple of hours after the deployment has taken place, since by then users may have added new posts, changed their profiles etc. It would be a major effort to undo all DB changes and to evaluate newly added data and whether it needs to be changed. That said, DB changes are usually not a problem if you follow these easy steps

  1. Don't do any column drops immediately after the release. You can do those in QA but in production they can wait. In most cases they only take up space. I have heard of places that would first zero out and then drop "unused" columns once a quarter or so.
  2. Related to 1., never ever use SELECT * since if you drop or add a column your code may break during rollback.
  3. If there are data changes you have to do i.e. update carrier set name="AT&T" where name="Cingular", have the reverse SQL statement ready as an insurance policy. Those are quite easy to implement.
  4. You don't have to worry about added tables since the older version will not use them.
  5. You don't have to worry about added columns provided you follow 2. (no SELECT *) and have not placed constraints on them i.e. NOT NULL. If you have, you may need to adjust or drop those constraints during rollback.
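
The "reverse statement" idea from step 3 can be sketched as follows. This is an illustration in Python with sqlite3, under the assumption that the carrier table has an id column (the post only mentions name). Recording which rows were touched is an extra precaution added here, so that the reverse does not also rename rows that were already called "AT&T" before the deploy.

```python
import sqlite3


def apply_rename(conn, old, new):
    """Rename carriers from old to new and return the ids that were changed."""
    ids = [row[0] for row in
           conn.execute("SELECT id FROM carrier WHERE name = ?", (old,))]
    conn.executemany("UPDATE carrier SET name = ? WHERE id = ?",
                     [(new, i) for i in ids])
    conn.commit()
    return ids


def revert_rename(conn, old, ids):
    """Put the old name back on exactly the rows changed by apply_rename."""
    conn.executemany("UPDATE carrier SET name = ? WHERE id = ?",
                     [(old, i) for i in ids])
    conn.commit()
```

You would store the returned ids somewhere durable at deploy time, then call revert_rename(conn, "Cingular", ids) if the deploy has to be rolled back.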

The wildcard in all this is added or removed constraints i.e. new foreign keys. There is no single solution for this one. Perhaps the right policy is to discuss constraints prior to deployment and have a plan ready for what to do. Good luck.

Bootstrapping your cloud environment with puppet and mcollective

July 28th, 2010

This is a "recipe" on how to bootstrap your whole environment in case of a disaster i.e. your data center goes dark, or if you are migrating from one environment to another. This guide differs from others in that it uses mcollective and DNS to provide you with greater flexibility in deploying and bootstrapping environments. Some of the alternatives are ec2-boot-init by R.I. Pienaar and Grig Gheorghiu's Bootstrapping EC2 images as Puppet clients.

Intro

You will need two disk images, your code repository and your DB backup, and with those you can rebuild your whole environment from scratch in a relatively short period of time. This could be adapted to generic cloud provisioning, however the use case I'm trying to address is disaster recovery. We are using DNS so that we can keep hostnames consistent between environments i.e. mail01 will be a mail server in all environments instead of domU-1-2-3-4 in one, rack-2345 in another etc.

Set up a master node image

The master node is the node that controls all the other nodes. Most importantly it contains all your configuration management data. You will need to install the following

  • mcollective with ActiveMQ
  • dnsmasq
  • Puppet from Puppet Labs

1. You will need to get a DNS name from a dynamic DNS provider such as DynDNS. Once you have that you will need to write a shell script that runs at boot and points that DNS name at your EC2 private IP. Let's say we want our controller station to be known as controller.ec2.domain.com; we can do something like this

IP=`facter ipaddress`
change_my_dns_ip controller.ec2.domain.com
# Delete any entries from hosts
sed -i "/controller.ec2.domain.com/d" /etc/hosts
echo "${IP}     controller.ec2.domain.com" >> /etc/hosts

2. Set up ActiveMQ to be used with mcollective http://code.google.com/p/mcollective/wiki/GettingStarted
3. Set up mcollective

Configure controller.ec2.domain.com as the stomp host in your mcollective configuration for both client and server configuration.

4. Install dnsmasq. You don't need to configure anything since by default dnsmasq will read /etc/hosts and serve those names over DNS

5. Install puppetmaster, configure it any way you want

6. Image it

Set up a generic/worker node image

You will need to install the following

  • mcollective
  • puppet agent

1. On the worker node you need to configure the server piece of mcollective and make sure the stomp.host is pointed to the master i.e. controller.ec2.domain.com.

2. Create a reboot agent (we'll discuss later how to use it). Please visit http://code.google.com/p/mcollective/wiki/SimpleRPCIntroduction for an example. Create a new file i.e. reboot.rb. Paste this code in it

module MCollective
  module Agent
    # Minimal SimpleRPC agent exposing a single "reboot" action
    class Reboot < RPC::Agent
      def reboot_action
        `/sbin/shutdown -r now`
      end
    end
  end
end

Copy the resulting file to the mcollective agents directory.

3. Add the following script to the bootup

# Look up the master's IP via its dynamic DNS name
MASTER=`host controller.ec2.domain.com | grep address | cut -f4 -d" "`
# Only rewrite resolv.conf if that has not been done already
IS_ALREADY_SET=`grep -c ec2.domain.com /etc/resolv.conf`
if [ $IS_ALREADY_SET -lt 1 ]; then
  sed -i "s/^search .*/search ec2.domain.com/g" /etc/resolv.conf
  sed -i "s/^nameserver/nameserver ${MASTER}\nnameserver/g" /etc/resolv.conf
fi
# Set Hostname
IP=`facter ipaddress`
MY_HOST=`/bin/ipcalc --silent --hostname ${IP} | cut -f2 -d=`
hostname ${MY_HOST}

What that does is tell your worker nodes to use the controller's DNS for resolving names, and it sets your hostname.

4. Get the mcollective puppet plugin from github

5. Image it
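
The resolv.conf rewrite in the step 3 boot script can also be expressed as a small text transformation, which is easier to test than sed. The sketch below is in Python with a made-up function name. Unlike the sed version it puts the search line at the top and inserts the master only once, before the first existing nameserver, rather than before every nameserver line.

```python
def point_resolver_at_master(resolv_text, master_ip, search_domain):
    """Return resolv.conf text that searches search_domain and asks master_ip first."""
    lines = resolv_text.splitlines()
    # Same guard as the grep -c check: do nothing if already configured
    if any(search_domain in line for line in lines):
        return resolv_text
    out = ["search %s" % search_domain]
    master_added = False
    for line in lines:
        if line.startswith("search "):
            continue  # replaced by the search line added above
        if line.startswith("nameserver") and not master_added:
            out.append("nameserver %s" % master_ip)
            master_added = True
        out.append(line)
    if not master_added:
        out.append("nameserver %s" % master_ip)
    return "\n".join(out) + "\n"
```

Running it a second time on its own output leaves the text unchanged, so it is safe to call on every boot.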

Bringing up the environment

You will need to start the master instance first since that's the instance that everyone will be talking to. As soon as it's up you can start up as many instances as you'd like.

While you wait, rsync your puppet manifests and configurations to the master node.

To find out what nodes are up and available issue mc-ping from the master and you should get a response similar to this

# mc-ping
controller.ec2.domain.com               time=77.21 ms
domu-12-31-55-11-22-18.compute-1.internal time=188.76 ms

Trouble is that hostnames on the worker nodes are set to Amazon names. We want to make them recognizable e.g. mail01.

To do so simply add the IP of the worker instance and its name to /etc/hosts on the master e.g.

echo "10.1.2.3      mail01.ec2.domain.com" >> /etc/hosts
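
Appending with echo adds a second line if the name is already present, for instance when a replacement instance comes up with a new IP. A hedged sketch of an idempotent variant, in Python with a made-up function name, that works on the file's text:

```python
def set_hosts_entry(hosts_text, ip, name):
    """Drop any existing line for name, then append '<ip>\t<name>'."""
    kept = [line for line in hosts_text.splitlines()
            if name not in line.split()]
    kept.append("%s\t%s" % (ip, name))
    return "\n".join(kept) + "\n"
```

You would read /etc/hosts, pass its contents through set_hosts_entry and write the result back before reloading dnsmasq.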

Reload the dnsmasq configuration i.e.

/etc/init.d/dnsmasq reload

What this has bought you is reverse DNS resolution of the node. For it to take effect you will need to reboot the worker node. We already have the reboot agent on the worker nodes so all we have to do is run the following command on the master node

./mc-rpc -F hostname=domu-12-31-55-11-22-18 reboot reboot

This will seek out the domU-1-2-3-4 host and reboot it (--arg is irrelevant so put anything). Once the machine is up it will advertise its new name :-) i.e. running mc-ping will show you this

# mc-ping
controller.ec2.domain.com           time=47.59 ms
mail01.ec2.domain.com               time=80.71 ms

Now let's activate puppet. From master node run

# mc-puppetd -F hostname=mail01 runonce

 * [ ============================================================> ] 1 / 1

Finished processing 1 / 1 hosts in 1051.23 ms

Once that is done puppetca should give you this

# puppetca --list
mail01.ec2.domain.com

Sign it

# puppetca --sign mail01.ec2.domain.com

Now you can simply run

# mc-puppetd -F hostname=mail01 enable

and off you go. Now lather, rinse, repeat to get the rest of the instances going. You would certainly want to automate this further but I leave that exercise to you :-) .

If you are looking for an easy cross-cloud API check out my "Provision to cloud in 5 minutes using fog".

Next Boston DevOps meetup

July 21st, 2010

At the next Boston DevOps meetup we'll try something new: Jeff Buchbinder of FreeMed Software fame and I will talk about "Deploying your way into happiness". If you want a flavor of the kinds of things we'll talk about you can check out my Devops homebrew post. We will go into much more detail, with actual code snippets and some of the omitted nitty gritty details. We will also open the floor for questions.

The meetup is on August 3rd, 2010 from 6-8 pm and we'll be meeting at Microsoft's New England R&D center. I expect we'll start presenting around 6:45 or so.

Please register at

http://www.eventbrite.com/event/770217742

since we need to provide building security at NERD with the list of people attending.

