Archive for the ‘Systems Management’ Category

PHP HTTP caching defaults

Tuesday, May 22nd, 2012

I have recently moved this blog to be hosted on Fastly, a CDN service with a bunch of great features like dynamic content caching with instant purges. Fastly uses HTTP headers to determine what to cache, as described in their Cache Control document. While configuring my service I noticed that my WordPress install (the origin server) kept returning HTTP headers like these:

Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache

I looked through the WordPress code and couldn't see where those values were set. After some internet searching I discovered that they come from the session.cache_limiter option in php.ini. In most distributions this defaults to nocache, which produces the headers above. You can read more on the cache limiter options here:

http://www.php.net/manual/en/function.session-cache-limiter.php

What we need is session.cache_limiter = public, which results in headers like these:

Expires: (sometime in the future, according to session.cache_expire)
Cache-Control: public, max-age=(sometime in the future, according to session.cache_expire)
Last-Modified: (the timestamp of when the session was last saved)

e.g.

Expires: Tue, 22 May 2012 16:38:33 GMT
Cache-Control: public, max-age=10800
Last-Modified: Tue, 04 Oct 2005 00:55:59 GMT

If you want to adjust the max-age you can set session.cache_expire in php.ini, e.g.

; http://php.net/session.cache-expire
session.cache_expire = 180
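
To confirm the change took effect you can inspect the headers your origin returns with curl. This is just a quick sanity check; the hostname is a placeholder for your own blog's URL.

# Fetch only the response headers and look at the caching-related ones
curl -sI http://blog.example.com/ | grep -i -E '^(expires|cache-control|pragma|last-modified):'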

 

RESTful way to manage your databases

Monday, January 2nd, 2012

I have a need in my development environment to easily create/drop MySQL databases and users. Initially I was going to implement a simple, hacky HTTP GET method, but was dissuaded from doing so by Ben Black. He suggested I write a proper RESTful interface. Without further ado I present to you dbrestadmin:

https://github.com/vvuksan/dbrestadmin

It is my first foray into writing RESTful services so things may be rough around the edges. However, it allows you to do the following:

  • manage multiple database servers
  • create/drop databases
  • list databases
  • create/drop users
  • list users
  • give user grants
  • view grants given to the user
  • view database privileges on a particular database given to a user

For example, to create a database called testdb on database server ID=0, use this cURL command:

curl -X POST http://myhost/dbrestadmin/v1/databases/0/dbs/testdb

To create a user test2 with password test:

curl -X POST "http://localhost:8000/dbrestadmin/v1/databases/0/users/test2@localhost" -d "password=test"

To give user test2 all privileges on testdb:

curl -X POST "http://localhost:8000/dbrestadmin/databases/0/users/test2@'localhost'/grants" -d "grants=all privileges&database=testdb"

There is more. You can see all of the methods here:

https://github.com/vvuksan/dbrestadmin/blob/master/API.md

Improvements and constructive criticism are welcome.

Using Jenkins as a Cron Server

Monday, August 22nd, 2011

There are a number of problems with cron that cause lots of grief for system administrators, the big ones being manageability, cron spam and auditability. To fix some of these issues I have lately started using Jenkins. Jenkins is an open source continuous integration server, and it has lots of features that make it a great cron replacement for a number of uses. These are some of the problems it solves for me.

Auditability

Jenkins can be configured to retain logs of all jobs that it has run. You can set it up to keep the last 10 runs, or only the last 2 weeks of logs. This is incredibly useful since jobs sometimes fail silently, and having the output beats sending it to /dev/null.

Centralized management

I have my most important jobs centralized. I can export all Jenkins jobs as XML and check them into a repository. If I need to execute jobs on remote hosts I simply have Jenkins ssh in and execute the command remotely. Alternatively you can use Jenkins slaves.
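
Each job's definition is available as config.xml over Jenkins' HTTP API, so a small sketch like this can pull definitions down for check-in. The hostname, job name and credentials are placeholders, and a secured install will need authentication:

# Fetch a job definition from Jenkins and save it for version control
curl -s -u admin:apitoken http://jenkins.example.com:8080/job/nightly-db-backup/config.xml -o nightly-db-backup.xml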

Cron Spam

Cron spam is a common problem for which a number of workarounds exist. To avoid it I only have Jenkins alert me when a particular job fails, i.e. a job exits with a return code other than 0. In addition you can use the awesome Jenkins Text Finder plugin, which allows you to specify words or regular expressions to look for in the console output and use them to mark a job unstable. For example, in the Text Finder config I checked

[x] Also search the console output

and specified

Regular expression ([Ee]rror*).*

This has saved our bacon, since we use the automysqlbackup.sh script which "swallows" the error codes from the mysqldump command and exits normally. Text Finder caught this:

mysqldump: Error 2020: Got packet bigger than 'max_allowed_packet' bytes when dumping table `users` at row: 234

Happily we caught this one in time.

Job dependency

Often you will have job dependencies, i.e. a main backup job where you first dump a database locally and then upload it somewhere off-site or to the cloud. The way we have done this in the past was to leave a sufficiently large window between the first job and the subsequent job to be sure the first one had finished. That says nothing about what to do if the first job fails; the second one will likely fail too. With Jenkins I no longer have to do that. I can simply tell Jenkins to trigger "backup to the cloud" once the local DB backup concludes successfully.

Test immediately

While you are adding a job it's useful to test whether it runs properly. With cron you often had to wait until the job executed at, say, 3 a.m. to discover that PATH wasn't set properly or that there was some other problem with your environment. With Jenkins I can click Build Now and the job runs immediately.
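
You can also kick a job off remotely over Jenkins' HTTP API, which is handy for scripting a quick test. A rough sketch follows; the hostname and job name are placeholders, and depending on your security setup you may need credentials, a build token or a CSRF crumb:

# Trigger the job immediately instead of waiting for its schedule
curl -s -X POST -u admin:apitoken http://jenkins.example.com:8080/job/nightly-db-backup/build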

Easy setup

Setting up jobs is easy. I have engineers set up their own jobs by copying an existing job and modifying it to do what they need. I don't remember the last time someone asked me how to do it :-).

What I don't use Jenkins for

I don't use Jenkins to run jobs that collect metrics or anything that has to run too often.

 

Misconceptions about RRD storage

Tuesday, December 14th, 2010

I want to address some misconceptions about RRD (Round-Robin Database) that seem to crop up often, even among seasoned sysadmins. The complaints can be summarized with these two points:

  • RRD doesn't offer high resolution, i.e. after about an hour it's all averages, and I want to know what the metric value was last year at this hour and minute
  • Data drops off/is destroyed after a year - I want to keep my data forever, disk is cheap etc.

Those are valid points; however, neither of them is the fault of RRD. RRD is a circular buffer, so in order to be able to write into it you have to pre-create it (otherwise it wouldn't be a circular buffer :-)). Obviously, the more data points you store, the bigger the RRD file will be. To illustrate the point, Ganglia Monitoring uses the following defaults to create RRDs:

RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" "RRA:AVERAGE:0.5:5760:374"

This will create multiple circular buffers within the same RRD database file. In order to make sense of this you need to know what the polling interval is, i.e. how often you write into the RRDs. In Ganglia's case the default is 15 seconds, so (the equivalent rrdtool create command is sketched after this list):

  • "RRA:AVERAGE:0.5:1:244" says write actual values (:1:) for every polling interval. Save last 244 of those so in our case we'll have 61 minutes worth of actual data points. Since it's a circular buffer data older than 61 minutes will be "dropped"
  • "RRA:AVERAGE:0.5:24:244" says average 24 values (:24:), 24 * 15 seconds = 360 seconds = 6 minutes. 244 of those times 6 is a whole day
  • You can do the next two :-)
  • The last one, "RRA:AVERAGE:0.5:5760:374", says average a whole day's worth of values (5760 * 15 seconds = 1440 minutes = 1 day) and keep 374 of those, i.e. a little more than a year
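
For reference, here is roughly what creating such a file by hand with rrdtool looks like. The data source name and heartbeat below are illustrative, not Ganglia's exact values:

# Create an RRD with a 15-second step and Ganglia's default RRAs
rrdtool create sample_metric.rrd --step 15 \
  DS:sum:GAUGE:120:U:U \
  RRA:AVERAGE:0.5:1:244 RRA:AVERAGE:0.5:24:244 RRA:AVERAGE:0.5:168:244 \
  RRA:AVERAGE:0.5:672:244 RRA:AVERAGE:0.5:5760:374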

When graphing, RRDtool is smart enough to use the buffer that gives you the most data points. To store all this data the RRD file will use about 12 kBytes. Thus if you want higher resolution you will need to change the definition, e.g. you could do this:

"RRA:AVERAGE:0.5:1:2137440"

which will give you one year's worth of data points at the 15-second interval with no averaging. The trouble is that the size of this RRD file is 17 MBytes. This may not seem too bad, but one of the RRD drawbacks is that every time you add data to an RRD the whole file is written over, so if you have 1000 metrics you could potentially be writing 17 GBs of data every 15 seconds. This may be a problem depending on how many metrics you are keeping track of. There are alternatives that increase throughput, such as storing RRDs on a RAM disk or using rrdcached. Alternatively you can opt to keep 2 weeks worth of data points with e.g.

"RRA:AVERAGE:0.5:1:81984"

which will result in a size of about 650 kBytes per RRD file. Or you can do something else altogether. The flip side of RRD is that there are no indexes to maintain and no tables that need to be rotated.

Update: I was wrong about the whole RRD file needing to be rewritten on every update. In retrospect it makes sense and I apologize for providing the wrong info. You can read the comment from Tobi Oetiker (creator of RRDtool) in the comments below for more detail. This is actually awesome news since there is very little downside to making larger RRDs.

As far as Ganglia goes, you can modify the defaults in the /etc/ganglia/gmetad.conf file. You can also use gmetad-python, which allows you to write your own plugins and store metric data in RRD format, SQL, or any other storage engine of your choice.
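
For example, the gmetad.conf RRAs directive could be changed to keep two weeks of full-resolution 15-second samples alongside coarser archives. The exact mix below is just a suggestion, and bear in mind that changed RRAs only apply to newly created RRD files:

# /etc/ganglia/gmetad.conf
RRAs "RRA:AVERAGE:0.5:1:81984" "RRA:AVERAGE:0.5:672:244" "RRA:AVERAGE:0.5:5760:374"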

More on RRDtool can be found here

http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html

Rethinking Ganglia Web UI

Friday, December 10th, 2010

I have been a long-time fan of Ganglia. Ganglia is a scalable distributed monitoring system initially developed for high-performance computing systems such as clusters and grids. Today Ganglia is used by some of the largest web properties, such as Facebook, Twitter and Etsy, as well as tons of smaller organizations. Some of Ganglia's benefits are

  • Push-based metrics, i.e. a lightweight agent on the hosts that need to be monitored
  • Lots of basic metrics by default, such as load, CPU utilization and memory utilization
  • Trivial to add new metrics, i.e. execute the gmetric command with a metric value and a graph automatically shows up
  • Decent web interface that allows you to easily drill down when troubleshooting problems

I have used other monitoring systems such as Cacti, Zenoss and Zabbix and found them lacking, since they were overly complicated and hard to configure and customize. That said, I have also had misgivings about certain parts of the Ganglia UI. Specifically, what I missed were the following features:

  1. Ability to search hosts and metrics - looking for a specific host or metric gets cumbersome even on clusters with 20-30 hosts
  2. Ability to create arbitrary groupings of host metrics on one page, i.e. a page with web response time for each web server plus MySQL lock time is something you'd have to write custom code for
  3. Easy way to create custom graphs, i.e. either aggregate line graphs or stacked graphs
  4. Easy way to add custom graphs to either clusters or hosts, i.e. I have a stacked Apache report showing the number of GETs vs. POSTs, and it's hard or impossible to show that graph only on web servers but not on MySQL servers
  5. Mobile (WebKit) optimized experience - minimize zooming/panning etc.

A couple of months ago on the #ganglia Freenode IRC channel we were discussing some of the pitfalls of the UI, and the idea of rewriting the Ganglia UI was born. As I had been doing quite a bit of work with jQuery in the preceding months, I decided to give it a shot.

Goals

My initial goals were

  1. Implement basic search functionality, i.e. one search term that shows matching hosts and metrics
  2. Add a way to add "optional" graphs on a per-cluster/per-host basis, i.e. have a default set of graphs and allow those to be overridden using cluster or host override config files
  3. Add Views, i.e. the ability to group hosts/metrics
  4. Add Mobile/Webkit View
  5. Store view and optional-graph config information in a format that can be easily manipulated by the web UI, a config management system or by hand - this is one of the key omissions in most monitoring setups, where adding/removing hosts requires either manual intervention or kludgy hacks. As someone who has had to spend hours manually clicking around the Zabbix interface whenever we added a new server, this was of major importance to me

Implementation

Initially there was an idea to rebuild the whole interface from scratch, which we still may do, but I decided that would be too much work, especially since I wasn't absolutely sure whether my intended changes would make sense for most people. Thus I decided to modify the existing UI.

So far these are the features that have been implemented

Visual aids

In the cluster view you will now see each host's full hostname in text on top of its graph. The same goes for metric names in the host view. Now even if you have hundreds of metrics you can press CTRL-F in your browser and find the metric quickly. There is also a hidden anchor next to each metric which is used by the search tab.

Doesn't seem like much until you need it :-).

Search

The search tab allows you to type in a single term which will match hosts and metrics. It searches as you type: hosts first, metrics on hosts second. Clicking on a host opens a new window with the view of that host. Clicking on a particular metric takes you to the metric in question.

Views

Views are defined using JSON configuration files, one JSON file per view. There are two types of views: standard and regex views. For example, a standard view looks like this:

{  "view_name":"default",
   "items":[
      {"hostname":"host1.domain.com","graph":"cpu_report"},
      {"hostname":"host2.domain.com","graph":"apache_report"}
    ],
    "view_type":"standard"
}

It will group the cpu_report from host1 and the apache_report from host2. A regex view allows you to use regular expressions to define hosts (soon also metrics), e.g. you want to group all hosts that have imap, amavis or smtp in their names. That view definition would look something like this:

{  "view_name":"mailservers",
    "items":[
      {"hostname":"(imap|amavis|smtp)", "graph":"cpu_report"}
    ],
    "view_type":"regex"}

If you don't want to edit JSON config files by hand you can use the UI to create standard views, i.e. first create a view, then as you browse hosts there is a plus sign next to each graph. Clicking on it displays a dialog which allows you to add that particular host/metric to the view.

Automatic rotation

Allows you to automatically rotate a view. It is an integration of GangliaView with Views. What's especially nice is that if you have multiple monitors you can open up separate browser windows and select different views to rotate.

Mobile view

There is a functional mobile view which provides a mobile-optimized view of Views, Clusters and Search, i.e. there is very little panning or zooming. We also do a lot of preloading, i.e. the first page you open contains lots of hidden sub-pages in order to save on subsequent requests.

You can view some of the screenshots on Flickr.

Optional Graphs

You can specify which optional graphs you want displayed for each host or cluster. Similar to views, these are configured via JSON config files, e.g. this is the default list of graphs:

{
	"included_reports": ["load_report","mem_report","cpu_report","network_report","packet_report"]
}

You can exclude any of the default included graphs or include additional ones you want.
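
A per-host or per-cluster override might look something like the sketch below; the exact key names are an assumption on my part, so check the shipped example config files for the real ones:

{
	"included_reports": ["apache_report"],
	"excluded_reports": ["packet_report"]
}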

Screencast

If you would like to see some of these features in action you can look at these screencasts.

Download

Ready to try it? Wait no more and check it out from SVN at

http://ganglia.svn.sourceforge.net/svnroot/ganglia/branches/monitor-web-2.0/

Future

In the future we are looking into polishing the Graphite/Ganglia integration (perhaps more about that in a future post) and adding integrations with e.g. Nagios (you can see a hint of it in the add-metric-to-view screenshot above) and Logstash. Other upcoming features are aggregate metrics and quick views. The full TODO list can be found here

http://sourceforge.net/apps/trac/ganglia/browser/branches/monitor-web-2.0/TODO

Acknowledgements

I'd like to thank Erik Kastner for helping with the Graphite/Ganglia integration, and Ben Hartshorne for test-driving the UI and providing a number of good suggestions and ideas.

Install Openstack Nova easily using Chef and Nova-Solo

Wednesday, September 1st, 2010

Inspired by Cloudscaling's Swift-Solo, and excited about being able to create my own cloud, I am announcing the Nova-Solo project. OpenStack Nova is the compute portion of the OpenStack project, which aims to build an open source stack for running an Amazon EC2-type service. Nova-Solo is a set of Opscode Chef recipes that allow you to quickly get most parts of the Nova stack up and running. You can fetch it from GitHub at

http://github.com/vvuksan/nova-solo

At this time Nova-Solo targets Ubuntu 10.04 and it relies on Soren Hansen's package repository to install all of the necessary packages. The following Nova services are installed:

  • Cloud controller
  • Object store
  • Volume store
  • API server
  • Compute Server

Soren's package archive is a bit outdated so some things don't work. For example, you can create users, generate credentials and upload files into buckets, but you can't register the image. Soren has said he is in the process of building new packages and I am also in the process of doing the same, so hopefully things improve quickly. The Nova code is definitely alpha-ish, so beware. To get started, use git to clone the nova-solo repository and off you go:

git clone git://github.com/vvuksan/nova-solo.git
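
From there a typical Chef Solo run would look something like the sketch below; the config and attribute file names are assumptions on my part, so check the repository's README for the actual invocation:

# Run Chef Solo against the cloned recipes (file names are illustrative)
cd nova-solo
sudo chef-solo -c solo.rb -j node.json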

In the future, as things stabilize, we'll be making adjustments to support multiple compute servers (the pieces for it are already in Nova-Solo), other distributions like RHEL/CentOS, etc.

Slides from the Boston DevOps meetup

Wednesday, August 25th, 2010

Here are the slides from the August 3rd, 2010 Boston DevOps meetup where Jeff Buchbinder and I spoke about deployment and other helpful hints:

http://www.scribd.com/doc/35757228/Deploying-Yourself-Into-Happiness

Slides have been slightly modified based on the feedback we received at the meetup. If you have any questions please post them in comments and I'll attempt to answer them.

Tunnel all your traffic on “hostile” networks with OpenVPN

Friday, August 20th, 2010

I am often on wireless networks that are unsecured, i.e. they either don't use encryption or, even if they do, I may not trust that my data won't be tampered with (you never know). To protect my traffic on such networks I decided to tunnel nearly all of it through an OpenVPN server whenever I'm on them. I will show you how you can do it yourself on your Linux or Mac laptop. You should be able to do something similar on Windows, but it may be a bit more work on the client.

OpenVPN server setup

Set up OpenVPN on a network you trust, e.g. home, work or the cloud. You can either use the Community Edition of OpenVPN, which is free (http://openvpn.net/index.php/open-source/downloads.html), or you may want to pay OpenVPN money for their OpenVPN appliance package. I prefer using pfSense, which is a customized FreeBSD distribution geared towards firewalls/routers with a superb web GUI. If you are going to use the Community Edition, follow the Quickstart guide.
One last step is to make sure that the VPN network, i.e. 10.8.0.0/16, is NATed, e.g. on a Linux OpenVPN server you could do

iptables -t nat -A POSTROUTING -s 10.8.0.0/16 -o eth0 -j MASQUERADE
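
For the NAT rule to actually pass traffic, the server also needs IP forwarding turned on; on Linux that is typically something like:

# Enable IPv4 forwarding (add net.ipv4.ip_forward=1 to /etc/sysctl.conf to make it permanent)
sysctl -w net.ipv4.ip_forward=1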

OpenVPN client setup

Configure the OpenVPN client to connect to your OpenVPN server. You can find the client HOWTO here.
Make sure you can access your home/work network. This will in general give you "split-tunnel" access, i.e. only traffic intended for your home/work network is tunneled through the VPN and everything else goes the normal "insecure" way.

Tunnel all traffic

Update: Shame on me. Someone has already posted the directions on how to do this at

http://manoftoday.wordpress.com/2006/12/03/openvpn-20-howto/

Thanks to @somic for pointing this out.

The tricky part in all this is that OpenVPN uses a simple TUN/TAP interface through which it tunnels all the traffic. The temptation is to simply add an entry to the OpenVPN config file that sets the default route through OpenVPN. This will likely fail, as you will now have competing default routes. Instead, what you need to do is add a route to your VPN server via the wireless network's default gateway and then make your VPN interface the default route. This way all the traffic goes into the VPN interface and OpenVPN takes care of tunneling it through.
For this you will need to configure an external script that fires once the VPN tunnel is up. To enable a post-up script, put the following two lines in your ovpn file:
script-security 3 system
up /usr/local/bin/set_up_routes.sh

Your set_up_routes.sh would look something like this. Please change the VPN_SERVER_IP variable to the IP of your OpenVPN server.

#!/bin/sh
# Remember the wireless network's default gateway before we change any routes
DEFAULT_GATEWAY=`netstat -nr | grep ^0.0.0.0 | awk '{ print $2 }'`
# Find the gateway IP on the tun interface that OpenVPN created
VPN_GATEWAY=`netstat -nr | grep tun | grep -v 0.0.0.0 | awk '{ print $2 }' | sort | uniq`
# IP of your OpenVPN server - change this
VPN_SERVER_IP="1.2.3.4"
# Remove the wireless network's default route
sudo /sbin/route del default
# Keep reaching the VPN server itself via the wireless gateway
sudo /sbin/route add -host $VPN_SERVER_IP gw $DEFAULT_GATEWAY
# Don't tunnel traffic to 2.3.4.5 since it's already SSLized
sudo /sbin/route add -host 2.3.4.5 gw $DEFAULT_GATEWAY
# Send everything else through the VPN
sudo /sbin/route add default gw $VPN_GATEWAY

This script was tested under Ubuntu Linux but should work the same under Mac OS X. On Windows you may need to use PowerShell or Cygwin.

Tunneling traffic for specific IPs

If you only wish to tunnel traffic for a particular set of IPs, you only need to add those routes to your ovpn file, e.g.

route 72.0.0.0 255.0.0.0
route 75.0.0.0 255.0.0.0
You do NOT need to go through the exercise of setting up a script.
If you are looking for other OpenVPN guides, Sam Johnston has a guide on how to set up OpenVPN in a VPS:

http://samj.net/2010/01/howto-set-up-openvpn-in-vps.html

Skipping MySQL replication errors

Thursday, August 19th, 2010

I was talking to my buddy Jeff Buchbinder and he mentioned that he recently added the following to his MySQL configuration in order to reduce MySQL replication breakage:

slave-skip-errors=1062,1053,1146,1051,1050

What this does is tell the replication slave not to stop when the following errors are encountered:

  • Error 1050, SQLSTATE 42S01 (ER_TABLE_EXISTS_ERROR): Table '%s' already exists
  • Error 1051, SQLSTATE 42S02 (ER_BAD_TABLE_ERROR): Unknown table '%s'
  • Error 1053, SQLSTATE 08S01 (ER_SERVER_SHUTDOWN): Server shutdown in progress
  • Error 1062, SQLSTATE 23000 (ER_DUP_ENTRY): Duplicate entry '%s' for key %d
  • Error 1146, SQLSTATE 42S02 (ER_NO_SUCH_TABLE): Table '%s.%s' doesn't exist

This avoids the very common primary key collisions and "temporary tables aren't there" problems. Writing this down for posterity. Use with caution.
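
If you would rather not skip these errors permanently, the usual one-off alternative is to skip only the single offending statement on the slave; a quick sketch from the mysql client:

# Skip just the one statement that broke replication, then resume (run on the slave)
mysql -e "STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;"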

Marius Ducea has a post about it as well

http://www.ducea.com/2008/02/13/mysql-skip-duplicate-replication-errors/

Deployment rollback

Thursday, August 12th, 2010

This is a question that often comes up in deployment discussions: how do you roll back in case of a "bad" deploy? A bad deploy can be any of the following:

  • Site completely broken
  • Significant performance degradation
  • Key feature(s) broken

There are obviously a number of ways to deal with this issue. You could put up a notice on the site that features x and y are broken while you work to fix them; the same goes for performance degradation. Let's however deal with rollback, i.e. you have decided (based on a number of different factors) that the stuff you just deployed is broken and you should roll back to the last known good version. In such a case you would:

  • Undo any configuration changes you may have applied (often none)
  • Deploy the last known good version. This is one of the reasons why I prefer using labelled binary packages: I simply instruct the deployment tool to install version 1.5.2, which was the last good version, and off we go.

The only caveat is database changes. In general you can't easily undo DB changes, especially when you discover a deployment problem a couple of hours after the deployment has taken place, since by then users may have added new posts, changed their profiles, etc. It would be a major effort to undo all DB changes and evaluate whether newly added data needs to be changed. That said, DB changes are usually not a problem if you follow these easy steps:

  1. Don't do any column drops immediately after the release. You can do those in QA but in production those can wait. In most cases they only take up space. I have heard of places that would first zero out then drop "unused" columns once a quarter or so.
  2. Related to 1, never ever use SELECT *, since if you drop or add a column your code may break during rollback
  3. If there are data changes you have to make, e.g. update carrier set name="AT&T" where name="Cingular", have the reverse SQL statement ready as an insurance policy (see the sketch after this list). Those are quite easy to implement.
  4. You don't have to worry about added tables since older version will not use them.
  5. You don't have to worry about added columns provided you follow 2. and have not placed constraints, e.g. NOT NULL, on them. In that case you may need to adjust or drop those constraints during rollback.
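
To make point 3 concrete, here is a minimal sketch of such a forward/reverse pair, built from the carrier rename mentioned in the list above:

-- Forward change shipped with the deploy
UPDATE carrier SET name = 'AT&T' WHERE name = 'Cingular';
-- Reverse statement kept ready in case we need to roll back
UPDATE carrier SET name = 'Cingular' WHERE name = 'AT&T';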

The wildcard in all this is added or removed constraints, i.e. new foreign keys. There is no single solution for this one. Perhaps the right policy is to discuss constraints prior to deployment and have a plan ready for what to do. Good luck.