Vladimir Vuksan's Blog2019-10-24T21:01:44+00:00http://blog.vuksan.comVladimir Vuksanblah@email.testUbuntu 19.10 Chromium issues after 19.04 upgrade2019-10-24T18:00:00+00:00http://blog.vuksan.com/2019/10/24/ubuntu-19-10-chromium<p>Recently I upgraded from Ubuntu 19.04 to 19.10. The upgrade was uneventful except for Chromium losing all my
saved passwords and personal HTTPS certificates. The main cause of the issue is the new Chromium packaging. As of 19.10
Chromium uses snap packaging instead of a deb. You can read the rationale behind the change
<a href="https://ubuntu.com/blog/chromium-in-ubuntu-deb-to-snap-transition">on the Ubuntu blog</a>.</p>
<p>As a result, on first invocation <code class="language-plaintext highlighter-rouge">$HOME/.config/chromium</code> is copied into <code class="language-plaintext highlighter-rouge">$HOME/snap/chromium/.config/chromium</code>.
Unfortunately that is not sufficient, as you end up with the following error message</p>
<pre>
[23083:23264:1024/150717.374528:ERROR:token_service_table.cc(140)] Failed to decrypt token for service AccountId-19854958475897323
</pre>
<p>The solution is to run</p>
<p><code>
snap connect chromium:password-manager-service
</code></p>
<p>Thanks to <a href="https://discourse.ubuntu.com/t/call-for-testing-chromium-browser-deb-to-snap-transition/11179/164">this post</a> for providing the solution.</p>
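<p>To verify that the interface is actually connected afterwards, you can list the snap’s connections (this assumes a snapd recent enough to ship the <code class="language-plaintext highlighter-rouge">snap connections</code> subcommand, which any 19.10 install has):</p>
<pre>
snap connections chromium | grep password-manager-service
</pre>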
<p>As far as personal HTTPS certificates are concerned, your best course of action is to either export them prior to the
upgrade or, if you don’t have that option, <a href="https://www.chromium.org/getting-involved/download-chromium">download the latest Chromium build</a>, export the cert and reimport it into your shiny new Chromium :-).</p>
<p>If you are curious what Chromium is using for its config directory you can enter the following URL</p>
<pre>
chrome://version
</pre>
Android 4.x TLS v1.2 built-in browser secure connection issues2017-03-09T18:00:00+00:00http://blog.vuksan.com/2017/03/09/android-4-tls-v12-built-in-browser-secure-connection-issues<p>Recently at Fastly we have been gradually turning off TLS v1.0 and v1.1 support due to the PCI mandate to deprecate
them. You can read about the deprecation policy <a href="https://www.fastly.com/blog/phase-two-our-tls-10-and-11-deprecation-plan">here</a>.</p>
<p>We also recently received a couple of reports from customers about some Android 4.x users not being able to access some of these
endpoints. During the investigation I found the following SSLLabs issue</p>
<p><a href="https://github.com/ssllabs/ssllabs-scan/issues/258">https://github.com/ssllabs/ssllabs-scan/issues/258</a></p>
<p>which had a pointer to this post about different vendors packaging a version of Google Chrome as their own built-in browser</p>
<p><a href="http://www.quirksmode.org/blog/archives/2015/02/chrome_continue.html">http://www.quirksmode.org/blog/archives/2015/02/chrome_continue.html</a></p>
<p>Unfortunately it appears that some vendors, notably Samsung, standardized on a version of Chrome that did not have TLS v1.2 support, e.g.
Chrome 28. The Can I Use site has a nice table of TLS v1.2 support</p>
<p><a href="http://caniuse.com/#search=tls%201.2">http://caniuse.com/#search=tls%201.2</a></p>
<p>This is clearly a major hassle as it may force you to keep TLS 1.0/1.1 around for longer than you’d like, or to educate users to install
the latest Google Chrome from the Play Store. To get a better understanding of what the experience may look like, I tested it on my Android
4.2 tablet and this is what it looks like.</p>
<p>This is what the built-in browser capabilities are</p>
<p><a href="/assets/android_4.2_built_in_browser_capability.png"><img src="/assets/android_4.2_built_in_browser_capability.png" alt="Android 4.2 Built-in Browser capabilities" /></a></p>
<p>Unfortunately this will result in a very nasty error that says secure connection cannot be established</p>
<p><a href="/assets/android_4.2_built_in_browser_tlsv12_error.png"><img src="/assets/android_4.2_built_in_browser_tlsv12_error.png" alt="Android 4.2 Built-in Browser error" /></a></p>
<p>Same device with Google Chrome installed passes the capability test with flying colors</p>
<p><a href="/assets/android_4.2_chrome_browser_capability.png"><img src="/assets/android_4.2_chrome_browser_capability.png" alt="Android 4.2 Chrome browser capabilities" /></a></p>
Setup Minecraft Server on Google Cloud Engine with terraform2016-05-11T12:00:00+00:00http://blog.vuksan.com/2016/05/11/minecraft-server-on-google-cloud-engine-with-terraform<p>My children like to play Minecraft and they often like to play with their friends and cousins who are remote. To do so in the past I would set up my laptop at the house, set up port forwarding on the
router, etc. This would often not work, as the router would not accept the changes, my laptop firewall was on, etc. Instead I decided to shift all this to the cloud.
In this particular example I will be using Google Cloud Engine since it allows you to have persistent disks. To minimize costs I will automate creation and destruction of the Minecraft server(s) using Hashicorp’s <a href="https://terraform.io">Terraform</a>.</p>
<p>All the Terraform templates and files can be found in this GitHub repo</p>
<p><a href="https://github.com/vvuksan/terraform-playground/tree/master/minecraft-server/google_cloud">https://github.com/vvuksan/terraform-playground</a></p>
<p>You will need to sign up for a Google Cloud account. You may also optionally buy a domain name from a registrar so that you don’t need
to enter IP addresses in your Minecraft client. If you do so, rename dns.tf.disabled to dns.tf and change this section</p>
<pre>
variable "domain_name" {
description = "Domain Name"
default = "change_to_the_domain_name_you_bought.xyz"
}
</pre>
<p>As described in the README, what this set of templates will do is create a persistent disk where you will store your gameplay and spin up
a Minecraft server only for the time you are playing. When you want to play
you will need to type</p>
<pre>
make create
</pre>
<p>and when you are done playing you will type</p>
<pre>
make destroy
</pre>
<p>The cost of this should be minimal. In the TF template I’m setting a persistent disk size of 10 GB (change that in main.tf if you need to). That will cost you approximately $0.40 per month. On top of that you’d be paying the
g1-small instance cost, which is about $0.02 per hour. You can certainly opt for a faster instance by adjusting the instance size in the main.tf file.
Also, if you are using DNS there will be DNS query costs but those should be minimal.</p>
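<p>For reference, a typical session looks roughly like this (a sketch, assuming you have Terraform installed and have already set up your Google Cloud credentials; adjust the path to your checkout):</p>
<pre>
git clone https://github.com/vvuksan/terraform-playground.git
cd terraform-playground/minecraft-server/google_cloud
# spin up the persistent disk and a minecraft server instance
make create
# ... play ...
# tear the server down when you are done playing
make destroy
</pre>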
<p>Have fun.</p>
Rsyslog server TLS termination2016-05-10T12:00:00+00:00http://blog.vuksan.com/2016/05/10/rsyslog-server-tls-termination<p>I was working with a customer trying to configure <a href="https://docs.fastly.com/guides/streaming-logs/setting-up-remote-log-streaming">Fastly’s Log Streaming</a>
and ship logs to their Rsyslog server. Fastly supports sending syslog over TLS, however it appeared that the TLS handshake was not succeeding as we
would end up with gibberish in the logs, e.g.</p>
<pre>
May 3 13:22:08 192.168.0.10 #001#000#000M#033#000#020#023#000#001#000#000#016log.domain.com#000#002#000#005#001#000#000#000#000
</pre>
<p>I looked over a number of different guides with no luck. After trying a number of different things I ended up with the following configuration. This was
tested on Rsyslog 7 and 8.</p>
<pre>
auth,authpriv.* /var/log/auth.log
*.*;auth,authpriv.none -/var/log/openandclick.log
kern.* -/var/log/kern.log
mail.* -/var/log/mail.log
#
# Emergencies are sent to everybody logged in.
#
*.emerg :omusrmsg:*
# Setup disk assisted queues
$WorkDirectory /var/log/spool # where to place spool files
$ActionQueueFileName fwdRule1 # unique name prefix for spool files
$ActionQueueMaxDiskSpace 1g # 1gb space limit (use as much as possible)
$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
$ActionQueueType LinkedList # run asynchronously
$ActionResumeRetryCount -1 # infinite retries if host is down
#RsyslogGnuTLS
# CA certificate store. Uses generic Debian/Ubuntu CA store
$DefaultNetstreamDriverCAFile /etc/ssl/certs/ca-certificates.crt
$DefaultNetstreamDriverCertFile /etc/letsencrypt/archive/log.domain.com/fullchain1.pem
$DefaultNetstreamDriverKeyFile /etc/letsencrypt/archive/log.domain.com/privkey1.pem
$DefaultNetstreamDriver gtls
module(load="imtcp"
streamdriver.mode="1"
streamdriver.authmode="anon")
input(type="imtcp" port="5144" name="tcp-tls")
</pre>
<p>It will use the TLS certificate from /etc/letsencrypt and listen for TLS requests on port 5144. There is no client
authentication, i.e. authmode=anon. If you want to authenticate clients you will need to change the authmode, e.g.</p>
<pre>
streamdriver.authMode="name"
streamdriver.permittedpeer=["test1.example.net", "test.example.net"]
</pre>
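<p>Once the listener is up you can do a quick handshake test from another machine (substituting your own hostname); if TLS is terminating correctly you should see the certificate chain for log.domain.com printed instead of gibberish:</p>
<pre>
echo | openssl s_client -connect log.domain.com:5144
</pre>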
Ganglia Web frontend in Ubuntu 16.04 install issue2016-05-03T18:00:00+00:00http://blog.vuksan.com/2016/05/03/ganglia-webfrontend-ubuntu-1604-install-issue<p>Ubuntu 16.04 Xenial comes with Ganglia Web Frontend 3.6.1 included, however it doesn’t pull in all the
dependencies. If you get an error like this</p>
<pre>
Sorry, you do not have access to this resource. "); } try { $dwoo = new Dwoo($conf['dwoo_compiled_dir'], $conf['dwoo_cache_dir']); } catch (Exception $e) { print "
</pre>
<p>You are missing mod_php and the PHP7 XML module. To correct that, execute the following commands</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get install libapache2-mod-php7.0 php7.0-xml ; sudo /etc/init.d/apache2 restart
</code></pre></div></div>
<p>If you don’t have the Ganglia web frontend enabled, all you need to do is type</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo ln -s /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/001-ganglia.conf
sudo /etc/init.d/apache2 restart
</code></pre></div></div>
Google Compute Engine Load balancer Let's Encrypt integration2016-04-18T18:00:00+00:00http://blog.vuksan.com/2016/04/18/google-compute-load-balancer-lets-encrypt-integration<p><a href="https://letsencrypt.org/">Let’s Encrypt</a> (LE) is a new service started by the Internet Security Research Group (ISRG)
to offer free SSL certificates. It’s intended to be automated so that you can obtain a certificate quickly and
easily. Currently, however, LE requires installation of their client software, which makes a request to their API
for the domain you want to secure and then places a challenge token at a well-known web path for the
domain so that LE backend servers can verify it. In a nutshell, to get a certificate for the domain <em>myhost.mydomain.xyz</em>
the LE client will require you to add predetermined text at a URL they provide, e.g.</p>
<blockquote>
<p>http://myhost.mydomain.xyz/.well-known/jdoiewerhwkejhrwehrheuwhruewh</p>
</blockquote>
<p>If that matches, you have validated that you are the owner of the domain and LE
issues you a certificate. <a href="https://letsencrypt.org/how-it-works/">More detail on how it works can be found here</a>.</p>
<p>The difficulty is that in order to automate this process you either:</p>
<ul>
<li>have to allow the LE client to control your web server (currently only Apache) - this may disrupt your traffic in case of any issues</li>
<li>allow it to drop files into a web root, which may be problematic if your domain is behind a load balancer and you need
to copy the validation content to all nodes</li>
<li>use the standalone method where LE spins up its own standalone server, but that requires you to shut down your web server</li>
<li>devise a different method</li>
</ul>
<p>In the following section I will describe a method to do this with the Google Cloud Engine (GCE) load balancer,
since it supports conditional URL path matching. You could also do something very similar with other load balancers
such as Varnish or HAProxy.</p>
<p>Conceptually what we’ll do is</p>
<ul>
<li>Modify the GCE Load balancer URL map to send all traffic intended for LE to a special
backend e.g. any URL with /.well-known/ will be sent to a custom backend</li>
<li>Spin up a minimal VM with Apache on GCE</li>
<li>Use the LE client Docker image to manage the signing process or simply install the LE client</li>
</ul>
<p>To make configuration easy I will be using <a href="https://www.terraform.io">Terraform</a> since it greatly
simplifies this process. This process also assumes you are already running a GCE load balancer in front of
the domain you are trying to secure.</p>
<p>First we’ll need to create an instance template. I am using the Google Container Engine
images as they already come with Docker installed.</p>
<pre>
variable "gce_image_le" {
description = "The name of the image for Let's Encrypt."
default = "google-containers/container-vm-v20160321"
}
resource "google_compute_instance_template" "lets-encrypt" {
name = "lets-encrypt"
machine_type = "f1-micro"
can_ip_forward = false
tags = [ "letsencrypt", "no-ip" ]
disk {
source_image = "${var.gce_image_le}"
auto_delete = true
}
network_interface {
network = "${var.gce_network}"
# No ephemeral IP. Use bastion to log into the instance
}
metadata {
startup-script = "${file("scripts/letsencrypt-init")}"
}
}
</pre>
<p>You will notice I am using a startup script (scripts/letsencrypt-init) inside this instance template which
looks like this</p>
<pre>
#!/bin/bash
# install Apache and pre-pull the Let's Encrypt client Docker image
apt-get update
apt-get install -y apache2
rm -f /var/www/index.html
touch /var/www/index.html
docker pull quay.io/letsencrypt/letsencrypt:latest
mkdir /root/ssl-keys
echo "email = myemail@mydomain.com" > /root/ssl-keys/cli.ini
</pre>
<p>Basically I’m just preinstalling Apache and pulling the Let’s Encrypt Client Docker Image.</p>
<p>The next step is to create an Instance Group Manager (IGM) and an autoscaler. The instance group manager defines
which instance template is going to be used and the base instance name, whereas the autoscaler starts up instances in
the IGM and makes sure there is one replica running. The last step is to define the backend service and
attach the IGM to it.</p>
<pre>
resource "google_compute_instance_group_manager" "lets-encrypt-instance-group-manager" {
name = "lets-encrypt-instance-group-manager"
instance_template = "${google_compute_instance_template.lets-encrypt.self_link}"
base_instance_name = "letsencrypt"
zone = "${var.gce_zone}"
named_port {
name = "http"
port = 80
}
}
resource "google_compute_autoscaler" "lets-encrypt-as" {
name = "lets-encrypt-as"
zone = "${var.gce_zone_1_fantomtest}"
target = "${google_compute_instance_group_manager.lets-encrypt-instance-group-manager.self_link}"
autoscaling_policy = {
max_replicas = 1
min_replicas = 1
cooldown_period = 60
cpu_utilization = {
target = 0.5
}
}
}
resource "google_compute_backend_service" "lets-encrypt-backend-service" {
name = "lets-encrypt-backend-service"
port_name = "http"
protocol = "HTTP"
timeout_sec = 10
region = "us-central1"
backend {
group = "${google_compute_instance_group_manager.lets-encrypt-instance-group-manager.instance_group}"
}
health_checks = ["${google_compute_http_health_check.fantomtest.self_link}"]
}
</pre>
<p>The next thing we’ll need to do is change the URL map for the load balancer. Basically we’ll
send anything matching /.well-known/* to our LE backend service. My URL map is called fantomtest
and by default it uses the fantomtest backend service. This means any requests that don’t match
/.well-known/ will end up on my default backend service (which is what we want).</p>
<pre>
resource "google_compute_url_map" "fantomtest" {
name = "fantomtest-url-map"
description = "Fantomtest URL map"
default_service = "${google_compute_backend_service.fantomtest.self_link}"
# Add Letsencrypt
host_rule {
hosts = ["*"]
path_matcher = "letsencrypt-paths"
}
path_matcher {
default_service = "${google_compute_backend_service.fantomtest.self_link}"
name = "letsencrypt-paths"
path_rule {
paths = ["/.well-known/*"]
service = "${google_compute_backend_service.lets-encrypt-backend-service.self_link}"
}
}
}
</pre>
<p>Run terraform apply and, if you have been successful, you should see the letsencrypt backend service become healthy.</p>
<p>Now log into the instance running the LE client and run</p>
<pre>
docker run -it -v "$(pwd)/ssl-keys:/etc/letsencrypt" -v "/var/www:/var/www" quay.io/letsencrypt/letsencrypt:latest \
certonly --webroot -w /var/www -d www.mydomain.xyz
</pre>
<p>If you get</p>
<pre>
- Congratulations! Your certificate and chain have been saved at
/etc/letsencrypt/live/www.mydomain.xyz/fullchain.pem. Your
cert will expire on 2016-07-17. To obtain a new version of the
</pre>
<p>You are done and your certificate will be found in ssl-keys/live/www.mydomain.xyz/fullchain.pem. By default LE issues
certificates with a validity of 90 days and they will start nagging you 30 days before expiration to renew them. I will
leave automating this as an exercise to the reader. Do note that if you are going to automate pushing certificates,
make sure you validate the full chain to confirm things look good.</p>
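<p>As a rough starting point for that automation (a sketch only, assuming a client image recent enough to ship the <em>renew</em> subcommand and the same volume layout as above), the renewal can be run periodically from cron on the LE instance:</p>
<pre>
# renews any certificate under /etc/letsencrypt that is close to expiration
docker run -i -v "$(pwd)/ssl-keys:/etc/letsencrypt" -v "/var/www:/var/www" \
  quay.io/letsencrypt/letsencrypt:latest renew
</pre>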
Signing AWS Lambda API calls with Varnish2016-04-15T18:00:00+00:00http://blog.vuksan.com/2016/04/15/signing-aws-lambda-api-calls-with-varnish<p>A number of months ago Stephan Seidt <a href="https://twitter.com/evilhackerdude">@evilhackerdude</a> asked
on Twitter whether it was <a href="https://twitter.com/evilhackerdude/status/667315959242162176">possible to
use Fastly to sign requests going to AWS Lambda</a>.
For those who do not know what AWS Lambda is, here is <a href="https://en.wikipedia.org/wiki/Amazon_Lambda">Wikipedia’s succinct explanation</a></p>
<blockquote>
<p>AWS Lambda is a compute service that runs code in response to events and
automatically manages the compute resources required by that code. The purpose
of Lambda, as opposed to AWS EC2, is to simplify building smaller, on-demand
applications that are responsive to events and new information. AWS targets
starting a Lambda instance within milliseconds of an event.</p>
<p>AWS Lambda was designed for use cases such as image upload, responding to
website clicks or reacting to output from a connected device. AWS Lambda
can also be used to automatically provision back-end services triggered by custom requests.</p>
<p>Unlike Amazon EC2, which is priced by the hour, AWS Lambda is metered in increments of 100 milliseconds.</p>
</blockquote>
<p>Initially I thought this was not going to be possible since I thought I could only make asynchronous
calls, however Stephan pointed out that there was a way to invoke synchronous calls as well, since that is what the
<a href="https://aws.amazon.com/api-gateway/">AWS API Gateway</a> does to expose Lambda functions.</p>
<p>In order to send requests to Lambda you need to sign them. AWS has gone through a number of versions of their
signing API, however for most services today you will need to use
<a href="http://docs.aws.amazon.com/general/latest/gr/sigv4-signed-request-examples.html">signature version 4</a>.
The SIGV4 API relies on a number of HMAC and hashing functions that are not in stock Varnish but are available in
the <a href="https://github.com/varnish/libvmod-digest">libvmod-digest VMOD</a>. If you are deploying your VCL on Fastly, this
VMOD is already built in.</p>
<h3 id="code">Code</h3>
<p>You can find full VCL for signing requests to Lambda here</p>
<p><a href="https://github.com/vvuksan/misc-stuff/blob/master/lambda/lambda.vcl">https://github.com/vvuksan/misc-stuff/blob/master/lambda/lambda.vcl</a></p>
<p>This code has some Fastly-specific macros and functions which you can upload as custom VCL, however most of the heavy lifting is done inside the aws4_lambda_sign_request subroutine, so if you are using stock Varnish copy that. Things to change in vcl_recv are</p>
<pre>
set req.http.access_key = "CHANGEME";
set req.http.secret_key = "CHANGEME";
</pre>
<p>Change those to your AWS credentials that have access to Lambda. You can also change the region where your functions run. In addition you will
need to come up with a way to map incoming URLs to Lambda functions. In my sample VCL I am using
<a href="https://docs.fastly.com/guides/edge-dictionaries/about-edge-dictionaries">Fastly’s Edge Dictionaries</a>, e.g.</p>
<pre>
table url_mapping {
"/": "/2015-03-31/functions/homePage/invocations",
"/test": "/2015-03-31/functions/test/invocations",
}
# If no match req.url will be set to /LAMBDA_Not_Found
set req.url = table.lookup(url_mapping, req.url.path, "/LAMBDA_Not_Found" );
# If page has not been found we just throw out a 404
if ( req.url == "/LAMBDA_Not_Found" ) {
error 404 "Page not found";
}
</pre>
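<p>To sanity check the mapping you can hit the service with curl (www.mydomain.xyz below stands in for whatever domain your Fastly service or Varnish instance answers on):</p>
<pre>
# invokes the "test" Lambda function via the mapping above
curl -i -X POST -d '{"name": "value"}' http://www.mydomain.xyz/test
# an unmapped path should return the 404 from vcl_recv
curl -i http://www.mydomain.xyz/no-such-path
</pre>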
<h3 id="pros-and-cons">Pros and Cons</h3>
<p>Pros:</p>
<ul>
<li>You get the power of VCL to route requests to different backends including Lambda</li>
<li>You may be able to cache some of the requests coming out of Lambda</li>
<li>Lower costs since API Gateway can be pricey</li>
</ul>
<p>Cons:</p>
<ul>
<li>Only POST requests with a payload of up to 2 kbytes and GET requests with no query arguments are supported
<ul>
<li>In order to compute the signature we need to calculate a hash of the payload. Unfortunately Varnish exposes
only 2 kbytes of the payload inside the VCL. This is tunable if you run your own Varnish. You can adjust it by
running
<pre>
varnishadm param.set form_post_body 16384
</pre>
</li>
<li>Any request other than a POST needs to be rewritten as a POST, hence a GET can carry no query arguments</li>
</ul>
</li>
<li>You can output straight HTML, however the returned payload will end up with leading and trailing ‘ characters. You will
also need to fix up the returned Content-Type since it comes back as application/json. You can set the Content-Type in VCL by
doing the following in vcl_deliver, e.g.
<pre>
set resp.http.Content-Type = "text/html";
</pre>
</li>
<li>Currently it’s impossible to craft a POST request from scratch</li>
</ul>
<h3 id="future-work">Future work</h3>
<p>Look into using something like <a href="https://github.com/varnish/libvmod-curl">libvmod-curl VMOD</a> to create POST requests on the
fly.</p>
Howto speed up your monitoring system with Varnish2015-04-03T18:00:00+00:00http://blog.vuksan.com/2015/04/03/howto-speed-up-your-monitoring-system-with-varnish<p>If you use a monitoring system of any kind you are looking at lots of graphs. It also happens that
as the size of your team grows you are looking at more and more graphs, and oftentimes members of your
team are looking at the same graphs. In addition, as you grow, graphs become more complex and you may have
fairly complicated aggregated graphs with hundreds of data sources, which can become quite a burden
on your metrics system. This resulted in complaints about the slowness of our monitoring. To speed it up
we figured our best bang for the buck would be to cache page fragments. Since we run a CDN
based on <a href="https://www.varnish-cache.org/">Varnish</a> it was obvious what we were going to use :-).</p>
<h2 id="assumptions">Assumptions</h2>
<ul>
<li>Most metric systems poll on a fixed time interval, e.g. 10-15 seconds. If you make a graph you can safely cache it for 10 seconds or
longer since the graph is not going to change</li>
<li>There are a number of static dashboard pages we can cache for longer since they don’t change. Only the dependent images change</li>
<li>Even if we don’t cache, or cache for a really short time e.g. 1-2 seconds, Varnish supports <a href="http://wiki.squid-cache.org/Features/CollapsedForwarding">collapsed forwarding</a>,
which collapses multiple requests for the same resource into one, i.e. if 5 clients request the same resource /img/graph1.png
at the same time, Varnish will send only one request to the backend and then respond to all 5 clients with the same resource. This
is a huge win.</li>
</ul>
<p>You can find an example Varnish configuration in this repo. This is Ganglia-specific, however you can adapt it to suit
your needs</p>
<p><a href="https://github.com/ganglia/ganglia_contrib/tree/master/varnish_web">Ganglia contrib repository</a></p>
<p>The key file you need is <strong>default.vcl</strong>, which you need to put in /etc/varnish/default.vcl</p>
<h2 id="notes">Notes</h2>
<p>Your caching rules should be put in the vcl_fetch function. For example</p>
<pre>
if (req.url ~ "^/(ganglia2/)?$" ) {
set beresp.ttl = 1200s;
unset beresp.http.Cache-Control;
unset beresp.http.Expires;
unset beresp.http.Pragma;
unset beresp.http.Set-Cookie;
}
</pre>
<p>This is a regex match that will match /ganglia2/ or / and cache it for 20 minutes (1200 seconds).
The resulting object will also be stripped of any Cache-Control, Expires, Pragma or Set-Cookie
headers since we don’t want to send those to browsers.</p>
<pre>
if (req.url ~ "/(ganglia2/)?graph.php") {
set beresp.ttl = 15s;
set beresp.http.Cache-Control = "public, max-age=10";
unset beresp.http.Pragma;
unset beresp.http.Expires;
unset beresp.http.Set-Cookie;
}
</pre>
<p>Similar to the rule above, we set the cache time to 15 seconds and unset all the headers except for Cache-Control,
which we set to 10 seconds. This means that Varnish will cache the object for 15 seconds, however
we’ll instruct the browser to cache it for 10 seconds.</p>
<p>You could also get creative and do things based on the content type of the resulting object</p>
<pre>
if ( beresp.http.Content-Type ~ "image/png" ) {
set beresp.ttl = 15s;
}
</pre>
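<p>To confirm the caching is actually helping, you can watch the hit/miss counters on the Varnish host (counter names vary slightly between Varnish versions):</p>
<pre>
varnishstat -1 | egrep 'cache_(hit|miss)'
</pre>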
<p>Have fun.</p>
Adventures with Arduino part 22015-02-04T18:00:00+00:00http://blog.vuksan.com/2015/02/04/adventures-with-arduino-2<p>In my last post <a href="/2015/01/01/adventures-with-arduino-part1/">Adventures with Arduino part 1</a> I discussed
some of the options for wiring up and getting metrics with Arduino. Here is the work in progress</p>
<p><a href="/assets/arduino1.jpg"><img src="/assets/arduino1.jpg" alt="Arduino Wiring image" /></a></p>
<p>It includes a <a href="http://www.adafruit.com/product/386">DHT11</a> humidity and temperature sensor, a
<a href="http://www.instructables.com/id/Water-Level-Sensor-Module-for-Arduino-AVR-ARM-STM3/">water sensor</a> and a
<a href="http://www.adafruit.com/product/375">reed switch</a>. The way things are set up is that it polls the sensors periodically, e.g.</p>
<ul>
<li>Reed switch (to see if the door is open or closed) every 2-10 seconds</li>
<li>Humidity and temperature every minute</li>
</ul>
<p>It then sends those values as a simple comma-separated payload. The data format I’m using is</p>
<p><code class="language-plaintext highlighter-rouge">device uptime,device name,metric_name=value</code></p>
<p>with multiple metric values possibly sent in the same packet. On the receiving side I have a Raspberry
Pi that follows this workflow.</p>
<ul>
<li>Uses a modified raspberryfriends daemon from <a href="https://github.com/riyas-org/nrf24pihub/">nrf24pihub</a></li>
<li>The daemon receives and parses the payload and ships it off to <a href="https://github.com/sivy/pystatsd/">Statsd</a> using a gauge data type (see the sketch at the end of this post)</li>
<li>Statsd rolls up any metrics and sends them over to <a href="http://ganglia.info">Ganglia</a>. Ganglia is used for trending
and data collection, e.g. this shows temperature and humidity in one of my bedrooms. You can notice the effect of a
room humidifier on the humidity in the room :-)</li>
</ul>
<p><a href="/assets/arduino_humidity_temperature.png"><img src="/assets/arduino_humidity_temperature.png" alt="Arduino metrics" /></a></p>
<ul>
<li>I can also see if I left my garage door open :-)</li>
</ul>
<p><a href="/assets/garage_door.png"><img src="/assets/garage_door.png" alt="Arduino metrics" /></a></p>
<ul>
<li>In addition I have set up alerting using <a href="https://github.com/ganglia/ganglia-web/wiki/Nagios-Integration">Nagios/Ganglia integration</a>
and <a href="http://www.opsgenie.com/">OpsGenie</a> which alerts me if I leave my garage door open</li>
</ul>
<p><a href="/assets/garage_door_alerts.png"><img src="/assets/garage_door_alerts.png" alt="Arduino metrics" /></a></p>
<ul>
<li>In this particular instance that alert has a dual meaning, since this particular Arduino is driven by one of those “lip-stick” USB battery packs
and Ganglia will expire a particular metric if it hasn’t been reported for a defined amount of time (in my case 1 minute).
So an alert state of UNKNOWN tells me that most likely the battery is drained and I need to recharge it.</li>
</ul>
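<p>For illustration, here is a minimal sketch of the kind of gauge update the receiving daemon ends up sending to Statsd (assuming a statsd instance listening on the default UDP port 8125 on localhost; the metric name and value are made up):</p>
<pre>
# report a humidity reading of 40 as a gauge to statsd over UDP
echo "home.bedroom.humidity:40|g" | nc -u -w1 127.0.0.1 8125
</pre>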
Raspberry Pi Revision B+ NRF24L01 wiring2015-01-31T18:00:00+00:00http://blog.vuksan.com/2015/01/31/raspberry-pi-nrf24L01-wiring<p>The Raspberry Pi Revision B+ has 40 pins, unlike the original Raspberry Pi. To connect
an NRF24L01 module to it you need to wire it up as follows. This is what the NRF24L01 looks
like; pins start with 1 in the bottom right.</p>
<p><a href="/assets/24L01Pinout-800.jpg"><img src="/assets/24L01Pinout-800.jpg" alt="NRF24L01 Pinout" /></a></p>
<table>
<thead>
<tr>
<th>NRF Pin</th>
<th>Raspberry Pi</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Ground PIN 4</td>
</tr>
<tr>
<td>2</td>
<td>3.3 V PIN 1</td>
</tr>
<tr>
<td>3</td>
<td>PIN 22</td>
</tr>
<tr>
<td>4</td>
<td>PIN 24</td>
</tr>
<tr>
<td>5</td>
<td>PIN 23</td>
</tr>
<tr>
<td>6</td>
<td>PIN 19</td>
</tr>
<tr>
<td>7</td>
<td>PIN 21</td>
</tr>
</tbody>
</table>
Adventures with Arduino part 12015-01-01T18:00:00+00:00http://blog.vuksan.com/2015/01/01/adventures-with-arduino-1<p>A few months ago I got myself an <a href="https://www.adafruit.com/products/50">Arduino Uno board from Adafruit</a>.
I had a couple of use cases I was going to try to use it for, in order of importance</p>
<ol>
<li>Water leak detection e.g. put a sensor in crawl space or attic for early detection of water leaks</li>
<li>Garage door open/close state - find out if the garage door has been left open</li>
<li>Humidity and temperature monitoring around the house</li>
</ol>
<p>One nice thing about Arduino is that some of the small boards are really inexpensive.
For example the <a href="http://arduino.cc/en/Main/arduinoBoardNano">Arduino Nano board</a> or the
<a href="https://www.adafruit.com/products/1501">Adafruit Trinket</a> can run you anywhere between $3.50 and $7.</p>
<p>After receiving the board and playing with some of the basic examples I figured it was time to work out
how to push sensor data to a central location. Some of the options I discovered were as follows</p>
<ol>
<li><a href="https://www.adafruit.com/product/128">XBee modules</a>. Approx cost per module ~ $20</li>
<li>Bluetooth - Approx cost per module $7</li>
<li>315/433 MHz remote control - Approx cost per module ~ $1</li>
<li>Nordic Semiconductor nRF24L01+ modules - Approx cost per module ~ $1</li>
</ol>
<p>XBee and Bluetooth are really nice options, however I considered them too overpriced for this particular
use case so I went ahead and bought a pair of 433 MHz and nRF24L01+ modules.</p>
<h2 id="315433mhz-modules">315/433MHz modules</h2>
<p>The first thing I tried was the 433 MHz modules. They were easy to wire and configure since the transmitter
has only 3 pins and the receiver 4 pins (although you only use 3), and using the
<a href="https://code.google.com/p/rc-switch/">rc-switch project libraries</a> I was able to communicate between
my Arduino and a Raspberry Pi. The drawback is that the bandwidth is fairly low and the payload size maxes
out at 24 bits, so it is pretty limiting.</p>
<p>That said, an interesting side benefit of these modules is that a large number of remote-controlled power
outlets out there use the 315/433 MHz bands, e.g. <a href="http://www.etekcity.com/c-59-outlets.aspx">Etekcity outlets</a>.
If you have a remote-controlled device in your house you can look up what frequency it uses with the
<a href="http://transition.fcc.gov/oet/ea/fccid/">FCC ID search</a>. There is also the 303 MHz frequency, however
I have not been able to find modules for it yet.</p>
<p>As a result of this tinkering I am now able to turn outlets around my house on and off with my phone :-).</p>
<h2 id="nordic-semiconductor-nrf24l01">Nordic Semiconductor nRF24L01+</h2>
<p>These are a lot trickier to get going as they have a total of 8 pins, with one unused, and it is
easy to mis-wire things. It took me a lot of trying, however I finally got it
going and was able to pass data between the Arduino and a Raspberry Pi. The max payload on these
is 32 bytes, which should be enough for shipping out metric data, and you can ship packets at a pretty
rapid rate. The libraries I ended up using were these</p>
<p><a href="https://github.com/stanleyseow/RF24">RF24 library</a> and <a href="https://github.com/riyas-org/nrf24pihub/">NRF24PiHub</a></p>
<p>Do note that these may not work with Adafruit’s Trinket.</p>
<h2 id="next-steps">Next steps</h2>
<p>I have ordered some <a href="https://learn.adafruit.com/dht">temperature and humidity sensors</a>,
<a href="http://www.instructables.com/id/Water-Level-Sensor-Module-for-Arduino-AVR-ARM-STM3/">water sensors</a>
and some Arduino Nanos and will be wiring it all up :-), shipping metrics to <a href="http://ganglia.info/">Ganglia</a>
and alerting on them.</p>
Bosnia and Serbia Floods aid2014-05-18T13:51:30+00:00http://blog.vuksan.com/2014/05/18/bosnia-and-serbia-floods-aid<p>As you may have heard, Bosnia, Serbia and to a smaller extent Croatia are facing the worst floods in their recorded history</p>
<p><a href="http://www.bbc.com/news/world-africa-27439139">http://www.bbc.com/news/world-africa-27439139</a></p>
<p>There are a number of ways to donate. Here are few that are being posted on Twitter from</p>
<p><a href="https://twitter.com/ncerovac/status/467785693427941376/photo/1">https://twitter.com/ncerovac/status/467785693427941376/photo/1</a></p>
<p><img src="https://pbs.twimg.com/media/Bn3om6aCQAAPdh7.jpg:large" alt="" /></p>
<p><a href="https://twitter.com/DJ_Miss_AKA/status/468024802272616448">https://twitter.com/DJ_Miss_AKA/status/468024802272616448</a></p>
<p><img src="https://pbs.twimg.com/media/Bn7CETgCUAAp74h.jpg" alt="" /></p>
<p>I have not seen an easy way on those sites for people in the US to make a donation. However, the Croatian Red Cross allows on-line donations with proceeds being transferred to the Bosnian and Serbian Red Cross. You can read about it here</p>
<p><a href="http://www.hck.hr/en/page/emergency-appeal-for-flood-affected-people-in-bosnia-and-herzegovina-and-serbia-414">http://www.hck.hr/en/page/emergency-appeal-for-flood-affected-people-in-bosnia-and-herzegovina-and-serbia-414</a></p>
<p>The only drawback is that the donation page is in Croatian :-( so here is a quick guide.</p>
<p><a href="https://secure.webteh.hr/donate/79">https://secure.webteh.hr/donate/79</a></p>
<p>On this page you will need to pick who you are donating to</p>
<ul>
<li>
<p>Pomoć za poplavljena područja u Bosni i Hercegovini - Aid for flooded regions of Bosnia and Herzegovina</p>
</li>
<li>
<p>Pomoć za poplavljena područja u Srbiji - Aid for flooded regions of Serbia</p>
</li>
<li>
<p>Pomoć za poplavljena područja u Hrvatskoj - Aid for flooded regions of Croatia</p>
</li>
</ul>
<p>Pick the amount and currency e.g. 25 USD.</p>
<p>Click on Autorizacija Kreditne kartice (credit card authorization). The next screen will include your confirmation as well as the currency exchange into Croatian Kunas, e.g.</p>
<p>10 CAD (50.90 KN)</p>
<p>1 Canadian Dollar is about 5 Kunas so don’t despair :-).</p>
<p>Payment info will look like this</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2014/05/hck_donacija.png"><img src="http://blog.vuksan.com/wp-content/uploads/2014/05/hck_donacija.png" alt="hck_donacija" /></a></p>
<ul>
<li>
<p>Ime - First name</p>
</li>
<li>
<p>Prezime - Last name</p>
</li>
<li>
<p>Adresa - Address</p>
</li>
<li>
<p>Grad - City</p>
</li>
<li>
<p>Poštanski Broj - Postal Code</p>
</li>
<li>
<p>Država - Country</p>
</li>
<li>
<p>Telefonski broj - telephone number</p>
</li>
</ul>
<p>Click Doniraj (donate) and that should be it.</p>
Building your own binary packages for Cumulus Linux (PowerPC)2014-02-07T21:04:11+00:00http://blog.vuksan.com/2014/02/07/building-your-own-binary-packages-for-cumulus-linux-powerpc<p><a href="http://cumulusnetworks.com/">Cumulus Networks</a> is a new entrant in the network gear space. What separates them from other players is that they are not selling hardware but their own network-focused Linux distribution called Cumulus Linux. Basically you buy a switch from one of their resellers or ODMs then pay Cumulus a yearly support license. There are a number of interesting things you can do, like run your own code on the switch as well as use common Linux commands to configure the switch, e.g. brctl; ports are exposed as Linux network interfaces, etc.</p>
<p>One of the first things we ended up doing was installing the <a href="http://ganglia.info/">Ganglia agent</a> so that we can monitor what’s going on on the switch. The Cumulus switch we had was running a PowerPC-based control plane, so that made things a bit tricky since we couldn’t use any of the amd64-built packages. One way to build PowerPC packages would be to get an old PowerPC-based Mac and install Linux on it. Unfortunately that seemed like a lot of work and overkill. I realized we could just use <a href="http://wiki.qemu.org/Main_Page">QEMU</a>, which is an open source machine emulator, so I could run a PowerPC machine on my own laptop :-). The quickest way to get up and running is as follows.</p>
<p>On Ubuntu you will need to install the following packages</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get install qemu-system-ppc openbios-ppc qemu-utils
</code></pre></div></div>
<p><strong>Warning</strong>: Under at least Ubuntu 13.10 openbios-ppc doesn’t seem to work well. If you get a blank yellow screen after you start the install you will need to get openbios from other places e.g. <a href="https://github.com/qemu/qemu/tree/master/pc-bios">https://github.com/qemu/qemu/tree/master/pc-bios</a></p>
<p>Once you get those you will need to download Debian Squeeze for PowerPC. You will need to download</p>
<ul>
<li>
<p>vmlinux</p>
</li>
<li>
<p>initrd.gz</p>
</li>
</ul>
<p>from</p>
<p><a href="http://ftp.debian.org/debian/dists/squeeze/main/installer-powerpc/current/images/powerpc64/netboot/">http://ftp.debian.org/debian/dists/squeeze/main/installer-powerpc/current/images/powerpc64/netboot/</a></p>
<p>as well as the netboot image e.g.</p>
<p><a href="http://cdimage.debian.org/cdimage/archive/6.0.8/powerpc/iso-cd/debian-6.0.8-powerpc-netinst.iso">http://cdimage.debian.org/cdimage/archive/6.0.8/powerpc/iso-cd/debian-6.0.8-powerpc-netinst.iso</a></p>
<p>The reason why you need initrd.gz and vmlinux is that if you try to do an install straight off the CD-ROM your install will hang here</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2014/02/powerpc_install.png"><img src="http://blog.vuksan.com/wp-content/uploads/2014/02/powerpc_install.png" alt="Power PC install QEMU" /></a></p>
<p>Once you have those pieces initiate the install with</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qemu-img create -f qcow2 squeeze-powerpc.img 10G
sudo qemu-system-ppc -m 256 -kernel vmlinux \
-cdrom debian-6.0.8-powerpc-netinst.iso \
-initrd initrd.gz -hda squeeze-powerpc.img -boot d -append "root=/dev/ram" \
-net nic,macaddr=00:16:3e:00:00:02 -net tap
</code></pre></div></div>
<p>Now follow the installation process as you would if you were installing Debian or Ubuntu from scratch. When you are done with the install, shut down the emulator. To invoke your PowerPC emulator again, execute</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo qemu-system-ppc -m 256 -hda squeeze-powerpc.img \
-net nic,macaddr=00:16:3e:00:00:02 -net tap
</code></pre></div></div>
<p>Congratulations, you are done. What you end up with is this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@debian:~# cat /proc/cpuinfo
processor : 0
cpu : 740/750
temperature : 62-64 C (uncalibrated)
revision : 3.1 (pvr 0008 0301)
bogomips : 33.14
timebase : 16570400
platform : PowerMac
model : Power Macintosh
machine : Power Macintosh
motherboard : AAPL,PowerMac G3 MacRISC
detected as : 49 (PowerMac G3 (Silk))
pmac flags : 00000000
pmac-generation : OldWorld
Memory : 256 MB
</code></pre></div></div>
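<p>From here, building packages works the same as on any other Debian system. As a rough sketch (assuming deb-src entries are enabled in your apt sources and using ganglia-monitor purely as an example), inside the emulator you would do something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apt-get update
apt-get install build-essential fakeroot devscripts
apt-get build-dep ganglia-monitor
apt-get source ganglia-monitor
cd ganglia-*
dpkg-buildpackage -us -uc
# resulting PowerPC .deb packages end up in the parent directory
</code></pre></div></div>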
Running Arista vEOS under Linux KVM2014-01-10T21:05:19+00:00http://blog.vuksan.com/2014/01/10/running-arista-veos-under-linux-kvm<p>At my current job we run a lot of <a href="http://www.ar">Arista</a> gear. They are great little boxes. You can also run Ganglia on them :-) since they are basically a Fedora Core 14 OS with some Arista proprietary sauce. You can find Arista-specific Ganglia gmetric scripts here</p>
<p><a href="https://github.com/ganglia/gmetric/tree/master/arista">https://github.com/ganglia/gmetric/tree/master/arista</a></p>
<p>On occasion I have wanted to test some things and Arista offers VM images you can run on your choice of virtualization. You can find more details here</p>
<p><a href="https://eos.aristanetworks.com/2011/11/running-eos-in-a-vm/">https://eos.aristanetworks.com/2011/11/running-eos-in-a-vm/</a></p>
<p>I use KVM on my Ubuntu laptop and although booting the image worked, I could not SSH into vEOS from my laptop. After a bit of testing I discovered that Arista’s document misses a very important option, i.e.</p>
<p>-net tap</p>
<p>So the full invocation is really</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kvm -cdrom Aboot-veos-2.0.8.iso -boot d -hda EOS-4.12.3-veos.vmdk -usb -m 1024 \
-net nic,macaddr=52:54:00:01:02:03,<wbr></wbr>model=e1000 \
-net nic,macaddr=52:54:00:01:02:04,<wbr></wbr>model=e1000 \
-net nic,macaddr=52:54:00:01:02:05,<wbr></wbr>model=e1000 \
-net nic,macaddr=52:54:00:01:02:06,<wbr></wbr>model=e1000 \
-net nic,macaddr=52:54:00:01:02:07,<wbr></wbr>model=e1000 \
-net tap
</code></pre></div></div>
<p>After that I was able to configure vlan 1 to e.g. 192.168.122.2/24.</p>
<p>Log in into the console you just fired up and type</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>localhost#configure
localhost(config)#interface vlan 1
localhost(config-if-Vl1)#ip address 192.168.122.2/24
localhost(config)#username admin secret 0 secret
</code></pre></div></div>
<p>You also want to set the password, e.g. here I set it to secret, and voila, you can now SSH into 192.168.122.2. If you have too many SSH private keys loaded, login may not work, so turn off public key authentication, e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh -o PubkeyAuthentication=no admin@192.168.122.2
</code></pre></div></div>
<p>One note: if you just installed libvirt, /etc/qemu-ifup doesn’t quite work since it determines which bridge to connect to based on the default route. To “fix” that, add</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>switch="virbr0"
</code></pre></div></div>
<p>Just above this section in /etc/qemu-ifup</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># only add the interface to default-route bridge if we
# have such interface (with default route) and if that
# interface is actually a bridge.
# It is possible to have several default routes too
for br in $switch; do
if [ -d /sys/class/net/$br/bridge/. ]; then
if [ -n "$ip" ]; then
ip link set "$1" master "$br"
else
brctl addif $br "$1"
fi
exit # exit with status of the previous command
fi
done
</code></pre></div></div>
Bring your own device cell service / VoIP2012-12-25T23:11:45+00:00http://blog.vuksan.com/2012/12/25/bring-your-own-device-cell-service-voip<p>Recently I had to get my own mobile phone service and decided to forgo the standard post-paid cell service and go prepaid. The decision was largely cost-based since I already had my own GSM phone and planned to buy a Nexus 4. I did quite a bit of research and ended up with the Straight Talk service</p>
<p><a href="https://www.straighttalk.com/">https://www.straighttalk.com/</a></p>
<p>Straight Talk is an <a href="http://en.wikipedia.org/wiki/Mobile_virtual_network_operator">MVNO</a> (Mobile virtual network operator) that leases network capacity from T-Mobile USA and AT&T. To sign up you either order a SIM from their web site or you can pick up a starter package at Walmart. I did the latter. In the package they provided me with 2 different mini-SIM cards and a micro-SIM card. The SIM cards are really unbranded T-Mobile and AT&T SIM cards. Pick the card that is supported by your phone, e.g. if it’s a locked phone like an AT&T one, use the AT&T SIM, change some of the phone settings (APN) and off you go. The quality of the signal is the same as if you used T-Mobile or AT&T directly. I picked the unlimited everything plan for $45/month with a T-Mobile SIM. If you sign up for auto refill they cut it down to $42.50. The drawbacks are lack of international roaming and iffy customer service, i.e. hold times can be 30-40 minutes.</p>
<p>Another option I considered was T-Mobile’s prepaid service called Go Smart which is similarly priced</p>
<p><a href="https://www.gosmartmobile.com/">https://www.gosmartmobile.com/ </a></p>
<p>I decided against it since the cost was similar, but with Straight Talk I have the option to switch to AT&T if I ever find T-Mobile coverage inadequate. That said, Go Smart does have a wider array of calling plans so it may still be a good choice.</p>
<p>While we are at it I can also recommend an inexpensive VoIP service called <a href="http://www.galaxyvoice.com/resphone.html">GalaxyVoice</a>. I use their free-tier which gives you up to 60 minutes of outgoing calls a month and all I pay is for taxes and 911 compliance ~ $3/month. You just need to pay the signup cost of $25 and get your own SIP device.</p>
<p>An extra bonus is that their web site is fairly unsophisticated and easy to automate for certain things :-) e.g. forwarding my home phone calls to my cell phone</p>
Mockupdata sample data creator2012-11-25T22:53:56+00:00http://blog.vuksan.com/2012/11/25/mockupdata-sample-data-creator<p>My friend <a href="http://twitter.com/mockupscreens">Igor Ješe</a> who is an all around awesome guy recently wrote a new piece of commercial software called <a href="http://www.MockupData.com">MockupData</a> that allows you to create realistic sample data you can use to test your applications. Without further ado you can read the <a href="http://www.mockupscreens.com/blog/?p=174">press release about MockupData here</a> and <a href="http://www.mockupscreens.com/blog/?p=180">newsletter announcement</a>. If you have a need for such a tool I would highly recommend Igor’s software.</p>
<p>This is not Igor’s first product. Prior to MockupData he wrote a tool called <a href="http://MockupScreens.com">MockupScreens</a> which lets you create screen mockups easily.</p>
Monitoring health of Dell/LSI RAID arrays with Ganglia2012-09-28T19:55:25+00:00http://blog.vuksan.com/2012/09/28/monitoring-health-of-delllsi-raid-arrays-with-ganglia<p>I have a couple hundred Dell systems with LSI RAID arrays, however we lacked hardware monitoring, which would occasionally result in situations where multiple disk failures would not get caught. Some time ago I read this post about using MegaCli to monitor Dell’s RAID controller</p>
<p><a href="http://timjacobs.blogspot.com/2008/05/installing-lsi-logic-raid-monitoring.html">http://timjacobs.blogspot.com/2008/05/installing-lsi-logic-raid-monitoring.html</a></p>
<p>One thing I did not like about this approach is that it may generate too many e-mails, as it will send e-mails every hour until the disk has been fixed. Instead I used Ganglia with Nagios to provide me with a similar type of functionality.</p>
<p>Create a file called analysis.awk with the following content</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> /Device Id/ { counter += 1; device[counter] = $3 }
/Firmware state/ { state_drive[counter] = $3 }
/Inquiry/ { name_drive[counter] = $3 " " $4 " " $5 " " $6 }
END {
for (i=1; i<=counter; i+=1) printf ( "Device %02d (%s) status is: %s\n", device[i], name_drive[i], state_drive[i]);
}
</code></pre></div></div>
<p>Get the MegaCli utilities from e.g. <a href="http://rpmfind.net/linux/rpm2html/search.php?query=megacli">RPMFind</a>. Copy the following BASH script into a file and run it from cron at the frequency you need, e.g. every 30 minutes, every hour, etc.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
GMETRIC_BIN="/usr/bin/gmetric -d 7200 "
MEGACLI_DIR="/opt/MegaRAID"
MEGACLI_PATH="/opt/MegaRAID/MegaCli/MegaCli64"
BAD_DISKS=`$MEGACLI_PATH -PDList -aALL | awk -f ${MEGACLI_DIR}/analysis.awk | grep -Ev "*: Online" | wc -l`
if [ $BAD_DISKS -eq 0 ]; then
STATUS="All RAID Arrays healthy"
else
STATUS=`$MEGACLI_PATH -PDList -aALL | awk -f ${MEGACLI_DIR}/analysis.awk | grep -Ev "*: Online"`
fi
$GMETRIC_BIN -t uint16 -n failed_unconfigured_disks -v $BAD_DISKS -u disks
$GMETRIC_BIN -t string -n raid_array_status -T "Raid Array Status" -v "$STATUS"
</code></pre></div></div>
<p>This will create two different Ganglia metrics. One is the number of failed or unconfigured disks and the other is a string value that gives you details on the failure, e.g. Disk 4 failed. Besides being able to alert on this metric, it also gives me a valuable data point I can use to correlate node behavior, i.e. when a disk failed, the load on the machine went up.</p>
<p>If you are using <a href="http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_nagios_integration">Ganglia with Nagios integration</a> you have two different options on how you want to alert.</p>
<ol>
<li>
<p>Create a separate check for every host you want monitored e.g.</p>
<pre>
define command{
  command_name check_ganglia_metric
  command_line /bin/sh /var/www/html/ganglia/check_ganglia_metric.sh host=$HOSTADDRESS$ metric_name=$ARG1$ operator=$ARG2$ critical_value=$ARG3$
}

define service{
  use generic-service
  host_name server1
  service_description Failed/unconfigured RAID disk(s)
  check_command check_ganglia_metric!failed_unconfigured_disks!more!0.1
}
</pre>
</li>
<li>
<p>Create a single check that gives you a single alert if at least one machine has a bad disk (this is how I do it :-)). For this purpose I’m utilizing check_host_regex, which allows me to specify a regular expression of matching hosts. In my case I check every single host. If a host doesn’t report the failed_disks metric I assume it doesn’t have a RAID array and I “ignore” those unknowns. My config is similar to this</p>
<pre>
define command{
  command_name check_host_regex_ignore_unknowns
  command_line /bin/sh /etc/icinga/objects/check_host_regex.sh hreg=$ARG1$ checks=$ARG2$ ignore_unknowns=1
}

define service{
  use generic-service
  host_name server2
  service_description Failed disk - RAID array
  check_command check_host_regex_ignore_unknowns!'.*'!failed_disks,more,0.5
}
</pre>
</li>
</ol>
<p>Which will give you something like this</p>
<pre>
Services OK = 236, CRIT/UNK = 2 : CRITICAL compute-4566.domain.com failed_disks = 1 disks, CRITICAL git-0341.domain.com failed_disks = 1 disks
</pre>
My monitoring setup2012-09-01T14:51:00+00:00http://blog.vuksan.com/2012/09/01/my-monitoring-setup<p>On Twitter <a href="https://twitter.com/griggheo">Grig Gheorghiu</a> posed a number of questions about monitoring tools and <a href="http://agiletesting.blogspot.com/2012/09/what-i-want-in-monitoring-tool.html">what he wants in a monitoring tool</a>. This is my attempt at describing what my setup looks like or has looked like in the past.</p>
<p><strong>1. Metrics acquisition / performance trending</strong></p>
<p>I use Ganglia to collect all my metrics, including string metrics. A base installation of the Ganglia gmond will give you over 100 metrics. There are a number of Python modules that are disabled by default, like MySQL and Redis, that you can easily enable to get more. If you need even more you can check out these two GitHub repositories.</p>
<ul>
<li>
<p><a href="https://github.com/ganglia/gmond_python_modules/">Gmond Python Modules</a></p>
</li>
<li>
<p><a href="https://github.com/ganglia/gmetric/">Gmetric</a> scripts</p>
</li>
</ul>
<p>Don’t worry about sending too many metrics. I have hosts that send in excess of 1100 metrics per host. Ganglia can handle it so don’t be shy :-). Also, when I say all metrics go into Ganglia I mean EVERYTHING. If I want to alert on it, it will be in Ganglia, so I have things like these</p>
<ul>
<li>
<p>NTP time offset</p>
</li>
<li>
<p>What version is particular key piece of software on e.g. deploy ID 123af58</p>
</li>
<li>
<p>Memory utilization/CPU utilization for key daemon processes</p>
</li>
<li>
<p><a href="http://blog.vuksan.com/2012/09/28/monitoring-health-of-delllsi-raid-arrays-with-ganglia/">Number of failed disks in a RAID array</a></p>
</li>
<li>
<p>Application uptime</p>
</li>
<li>
<p>Etc.</p>
</li>
</ul>
<p><strong>2. Alerting</strong></p>
<p>I use Nagios or Icinga for alerting. I don’t really use any Nagios plugins as all the checks are driven by data coming out of Ganglia. I have written a post in the past about <a href="http://blog.vuksan.com/2011/04/19/use-your-trending-data-for-alerting/">why you should use your trending data for alerting</a> which you can read for some background. About a year ago <a href="https://github.com/ganglia/ganglia-web/wiki/Nagios-Integration">Ganglia/Nagios integration</a> was added to Ganglia Web, which makes a number of things much easier, so for example I have</p>
<ul>
<li>
<p>A single check that checks all hosts in the cluster for failed disk in a RAID array</p>
</li>
<li>
<p>A single check that checks whether time is within a certain offset on all hosts, i.e. to make sure NTP is actually running</p>
</li>
<li>
<p>A single check that makes sure version of deployed code is the same everywhere</p>
</li>
<li>
<p>A single check for all file systems on a local system with each file system having their own thresholds</p>
</li>
<li>
<p>A single check for elevated rates of TCP errors - useful to get a quick idea if things are globally slow, affecting a certain set of hosts in a geographic area or an individual host</p>
</li>
</ul>
<p>The beauty of having all of the metric data in Ganglia is that you can also get creative by writing custom checks that have your own user-specified logic, e.g.</p>
<ul>
<li>Alert me only if 20% of my data nodes are down, e.g. in architectures where you can withstand a few nodes failing</li>
</ul>
<p>In addition I recommend adding as much <a href="http://blog.vuksan.com/2012/03/29/adding-context-to-your-alerts/">context to alerts</a> as possible so that you get as much information in the alert as possible.</p>
<p>I also heavily utilize something I wrote called the alerting controller</p>
<p><a href="https://github.com/vvuksan/alerting-controller">https://github.com/vvuksan/alerting-controller</a></p>
<p>which allows you to easily enable/disable/schedule downtime for services in Nagios, e.g. disable the process-alive check for the configuration servlet on all hosts while we are doing an upgrade, etc. In addition I have a tab with Naglite open most of the time to check on any outstanding alerts.</p>
<p><a href="https://github.com/saz/Naglite3">https://github.com/saz/Naglite3</a></p>
<p><strong>3. Notifications</strong></p>
<p>Beyond just alerting I am always interested to see what is happening with the infrastructure so we can act proactively. For that purpose I use IRC and a modified version of <a href="https://github.com/cluenet/cluemon/blob/master/nagios-bin/nagiosbot.py">NagiosBot</a> which echoes the following things to the channel</p>
<ul>
<li>
<p>Nagios alerts (same script that adds context to alerts above) - helpful for quick team coordination</p>
</li>
<li>
<p>Dynect DNS zone changes - those may be “invisible” so it’s a good idea to track them</p>
</li>
<li>
<p>Zendesk tickets - anyone can handle support requests</p>
</li>
<li>
<p>Twitter mentions of a particular keyword</p>
</li>
<li>
<p>Application configuration changes e.g. new version of the code deployed or in progress</p>
</li>
<li>
<p>Severe application errors - a short summary of the error and which node it occurred on</p>
</li>
</ul>
<p><strong>Closing</strong></p>
<p>This is by no means an exhaustive list and this may not be the best way to do things but it does work for me.</p>
WebOps Hackathon/(un)conference in Boston in October2012-08-20T20:41:31+00:00http://blog.vuksan.com/2012/08/20/webops-hackathonunconference-in-boston-in-october<p>A few weeks ago I posed a question on Twitter about whether there would be interest in organizing a webops hackathon. I am close to getting confirmation for conference space at <a href="http://microsoftcambridge.com/">Microsoft NERD</a> for October 11, 2012. Now comes the hard part :-). We need to figure out what format the hackathon and unconference should take as well as what topics we should talk about. There is going to be a monitoring conference at the end of March spearheaded by Jason Dixon (<a href="https://twitter.com/obfuscurity">@obfuscurity</a>) so we may want to leave out monitoring topics.</p>
<p>In a nutshell we need to decide</p>
<ol>
<li>
<p>How much of the event should be a hackathon and how much should be an (un)conference</p>
</li>
<li>
<p>If we want talks, what topics should we seek talks on, e.g. continuous integration, NoSQL, site scalability, etc.</p>
</li>
<li>
<p>Name of the event :-)</p>
</li>
</ol>
<p>If you have any ideas send me an e-mail at conference X vuksan.com (replace X with you know what).</p>
Parsing JSON POST in PHP2012-06-15T21:42:35+00:00http://blog.vuksan.com/2012/06/15/parsing-json-post-in-php<p>I have an application that uses HTTP POST to submit a JSON encoded array. Basically there are no variables that are being submitted, just JSON. This causes the $_REQUEST and $_POST arrays to get messed up, where “random” parts of the JSON will end up as the key and the rest as the value. Instead what you need to do is read the raw request body and decode it yourself, e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> $input = file_get_contents('php://input');
$my_array = json_decode( $input, TRUE);
</code></pre></div></div>
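<p>To test such an endpoint you can POST raw JSON with curl; the URL and payload below are just examples.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Example only: adjust the URL to wherever your script lives
curl -X POST -H "Content-Type: application/json" \
  -d '{"name":"test","values":[1,2,3]}' \
  "http://localhost/myapp/endpoint.php"
</code></pre></div></div>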
PHP HTTP caching defaults2012-05-22T13:42:34+00:00http://blog.vuksan.com/2012/05/22/php-caching-defaults<p>I have recently moved this blog to be hosted on <a href="http://www.fastly.com/">Fastly</a>, a CDN service with a bunch of great features like dynamic content caching with instant purges. Fastly utilizes HTTP headers to determine what to cache as described in the <a href="http://www.fastly.com/docs/tutorials#cache_control">Cache Control document</a>. While configuring my service I noticed that my Wordpress (origin server) kept returning HTTP headers like these</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
</code></pre></div></div>
<p>I looked through the Wordpress code and couldn’t see where such a value was set. After some internet searches I discovered that they are set using the session.cache_limiter option in php.ini. In most distributions this defaults to nocache, which ends up producing the above headers. You can read more on the cache limiter options here.</p>
<p><a href="http://www.php.net/manual/en/function.session-cache-limiter.php">http://www.php.net/manual/en/function.session-cache-limiter.php</a></p>
<p>What we need is session.cache_limiter = public, which results in headers like these</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Expires: (sometime in the future, according session.cache_expire)
Cache-Control: public, max-age=(sometime in the future, according to session.cache_expire)
Last-Modified: (the timestamp of when the session was last saved)
</code></pre></div></div>
<p>e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Expires: Tue, 22 May 2012 16:38:33 GMT
Cache-Control: public, max-age=10800
Last-Modified: Tue, 04 Oct 2005 00:55:59 GMT
</code></pre></div></div>
<p>If you want to adjust the max-age you can set session.cache_expire in php.ini e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>; http://php.net/session.cache-expire
session.cache_expire = 180
</code></pre></div></div>
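<p>To verify the change took effect you can inspect the headers the origin returns, for example with curl (the URL is illustrative).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -sI "http://www.example.com/" | egrep -i '^(expires|cache-control|pragma|last-modified)'
</code></pre></div></div>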
My programming language beat your honor roll language2012-04-10T19:32:58+00:00http://blog.vuksan.com/2012/04/10/my-programming-language-beat-your-honor-roll-language<p>For a while I have been observing a tendency of technologists/engineers to describe technology in either black or white terms ie. X technology sucks, use Y technology. The most recent example is an article by someone going by the name of <a href="https://twitter.com/#!/eevee">Eevee</a> in his <a href="http://me.veekun.com/blog/2012/04/09/php-a-fractal-of-bad-design/">PHP a fractal of bad design</a>. It is a damning expose of PHP’s failings/bad decisions/inconsistencies etc. Unfortunately, as with most articles of this type, it involves a number of ad-hominem attacks like these</p>
<p>It’s so broken, but so lauded by every empowered amateur who’s yet to learn anything else, as to be maddening. It has paltry few redeeming qualities and I would prefer to forget it exists at all.</p>
<p>or</p>
<p>I assert that the following qualities are <em>important</em> for making a language productive and useful, and PHP violates them with wild abandon. If you can’t agree that these are crucial, well, I can’t imagine how we’ll ever agree on much.</p>
<p>This irritates me on many levels since it makes so many misguided assumptions e.g.</p>
<ul>
<li>Everyone’s mind is the same therefore everyone should like or hate language X</li>
</ul>
<p>Of course not. Your mind is different than my “defective” mind. I quite prefer writing in PHP. I have written/write code in Ruby/Python/Perl and PHP is my preferred language. That may change but at this point it’s my preference. You may disagree with my choice and that’s OK.</p>
<ul>
<li>Issues we are trying to solve are similar/identical and we have same resource constraints aka one-size fits all</li>
</ul>
<p>Of course not. If most coders on my team are well versed with PHP and we have a tight schedule you bet we are most likely to choose PHP. Technical merits are not the only consideration to be taken. They are certainly important but they often pale in comparison to other considerations such as people and culture.</p>
<ul>
<li>PHP core developers are incompetent</li>
</ul>
<p>I have on a number of occasions disagreed and been frustrated with decisions made by PHP core developers however I do assume that in most/all cases they are well intentioned and are making the best decision under the available circumstances. PHP has been around for a long time so I imagine making major changes is tough and involves significant tradeoffs. If those tradeoffs become a show stopper for me I’ll use a different language.</p>
<p>Anyways the most bothersome part of the whole post is the technology “tribalism” which results in things like this (from <a href="https://github.com/rashidkpc/Kibana/blob/63e84c2f07214fadfc8586c2703830cb28d05f93/README">Kibana README file</a>)</p>
<p>Q: Why is this in PHP instead of Java, Ruby, etc?
A: Because PHP is what I know. The total PHP is less than 200 lines. If you want it in something else, it shouldn’t be too hard to port it to your language of choice</p>
<p>That makes me pretty sad.</p>
Compute a 15 minute average of a metric easily with Ganglia2012-04-06T23:12:05+00:00http://blog.vuksan.com/2012/04/06/compute-a-15-minute-average-of-a-metric-easily-with-ganglia<p>This is a quick way to extract a 15 minute average of a metric in Ganglia. It utilizes Ganglia’s CSV export function to get the values then uses awk to actually compute the average.</p>
<p>First of all find a metric graph you want to calculate the average from. Right click over the image and copy the image location. Then append &csv=1 to the URL, take the UNIX time stamp from 15 minutes ago, and pass it as the &cs= argument. This is a simple shell script that illustrates it</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MIN15AGO=`date --date="15 minutes ago" "+%s" ;
curl --silent "http://ganglia.domain.com/ganglia/graph.php?c=NetApp&h=host1&v=&m=netapp_cpuutil&<span style="color: #ff0000;">cs=$MIN15AGO&csv=1</span>" | \
awk -F, '{sum+=$2} END { print "Average = ",sum/NR}'
</code></pre></div></div>
Adding context to your alerts2012-03-29T19:54:28+00:00http://blog.vuksan.com/2012/03/29/adding-context-to-your-alerts<p>I am a big believer in adding context to alerts. This allows the recipient of an alert to make a better decision on how to deal with it. It’s often hard to classify alerts so providing as much context as possible is extremely helpful. For instance if I am alerting on a value of a metric I like to attach an image of that metric for the past hour. This way if I am on my mobile phone and out and about I have the alerting metric graph right there without needing to open up another window or having to start up my laptop.</p>
<p>In more recent versions of Ganglia there is an option to add <a href="http://ganglia.info/?p=382">overlay events</a> to hosts which show up as vertical lines on the graph. I figured that would be great context to add to alerts. Since I’m using Nagios I decided to extend a mail handler I used before to query the Ganglia events database and include any events connected to the matching host in the past 24 hours. This helps keep the team on the same page and well informed in a number of scenarios, e.g.</p>
<ul>
<li>
<p>There was a code push/config change however host/service was not scheduled for maintenance</p>
</li>
<li>
<p>Recent code push is causing issues ie. web servers are crashing</p>
</li>
</ul>
<p>This is an example e-mail you get</p>
<p><a href="https://github.com/vvuksan/misc-stuff/blob/master/nagios/send_nagios_email.php"><img src="http://blog.vuksan.com/wp-content/uploads/2012/03/event_context.png" alt="" /></a></p>
<p>As an added bonus the mail handler sends all alerts to a <a href="https://github.com/cluenet/cluemon/blob/master/nagios-bin/nagiosbot.py">Nagios Bot</a> :-). Now all you need to do is make sure events are recorded for any major changes. You could do a lot of these things automatically by e.g.</p>
<ul>
<li>
<p>Adding hooks to your startup scripts so that when you purposely restart services it is logged (see the sketch after this list)</p>
</li>
<li>
<p>Watching logs and then inserting the proper events in the timeline, e.g. app stopped or started</p>
</li>
<li>
<p>Querying external services e.g. Dynect provides an API to query zone changes</p>
</li>
</ul>
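<p>As a rough sketch, a startup-script hook could record an overlay event with a single curl call. Note that the events API path and parameter names below are assumptions from memory - check api/events.php in your Ganglia Web version before relying on them.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/bash
# Hypothetical hook: record a Ganglia overlay event when a service is restarted.
# URL and parameters are assumed - verify against your Ganglia Web install.
GANGLIA_EVENTS_URL="http://ganglia.domain.com/ganglia/api/events.php"
curl --silent "${GANGLIA_EVENTS_URL}?action=add&start_time=now&summary=Restarted+myapp&host_regex=$(hostname)" > /dev/null
</code></pre></div></div>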
<p>You can download the mail handler from here</p>
<p><a href="https://github.com/vvuksan/misc-stuff/blob/master/nagios/send_nagios_email.php">https://github.com/vvuksan/misc-stuff/blob/master/nagios/send_nagios_email.php</a></p>
Monitoring NetApp Fileservers with Ganglia2012-03-29T16:37:34+00:00http://blog.vuksan.com/2012/03/29/monitoring-netapp-fileservers-with-ganglia<p>In our environment we use NFS on Netapp fileservers a lot. They are used for home directories, build directories (don’t ask), DB data directories etc. This is done mostly for reliability and data integrity. However it leads to a number of problems since they are shared by a number of different groups of users and are a “black box” for users. Frequently we’ll get reports of machines or builds being unusually slow. This results in lots of confusion, since we have observed in the past that in most cases of slowness the machines involved appear “idle”: CPU utilization is unremarkable, ie. < 10%, yet CPU I/O wait is significantly elevated. We would then posit it was external, e.g. NFS related. To avoid the guesswork I decided to start monitoring the Netapp fileservers to get insight into what is going on. My team doesn’t manage the Netapps however we use <a href="http://ganglia.info/">Ganglia</a>. I found <a href="https://github.com/wAmpIre/check_netappfiler">check_netappfiler</a> and using it as a template I built a script to gather metrics from Netapp and send them to Ganglia. You can download the script from here</p>
<p><a href="https://github.com/ganglia/gmetric/tree/master/netapp">https://github.com/ganglia/gmetric/tree/master/netapp</a></p>
<p>Basically it queries a list of Netapp servers and injects those metrics to Ganglia. So far metric gathering has been invaluable. For example on one occasion we got a report of slowness from a couple of users. I observed that CPU utilization on the Netapp that project was using was 100 percent. That may “explain” the slowness.</p>
<p><img src="http://blog.vuksan.com/wp-content/uploads/2012/03/netapp_cpuutil.png" alt="" /></p>
<p>However that wasn’t all either. Preceding the event there was a heavy NFS utilization.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2012/03/netapp_nfsops.png"><img src="http://blog.vuksan.com/wp-content/uploads/2012/03/netapp_nfsops.png" alt="" /></a></p>
<p>However as soon as CPU utilization goes to 100% number of NFS ops plummets.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2012/03/netapp_net_sent.png"><img src="http://blog.vuksan.com/wp-content/uploads/2012/03/netapp_net_sent.png" alt="" /></a></p>
<p>and so does the Network bytes sent.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2012/03/netapp_readbytes.png"><img src="http://blog.vuksan.com/wp-content/uploads/2012/03/netapp_readbytes.png" alt="" /></a></p>
<p>and bytes read from the disk. It sure looks like a Netapp bug. Luckily, since we have all these metrics available, we could much more quickly figure out what was going on and what further steps to take, ie. contact the vendor.</p>
RESTful way to manage your databases2012-01-03T03:43:35+00:00http://blog.vuksan.com/2012/01/03/restful-way-to-manage-your-databases<p>I have a need in my development environment to easily create/drop mySQL databases and users. Initially I was gonna implement a simple hacky HTTP GET method but was dissuaded by <a href="https://twitter.com/b6n">Ben Black </a>from doing so. He suggested I write a proper RESTful interface. Without further ado I present to you dbrestadmin</p>
<p><a href="https://github.com/vvuksan/dbrestadmin">https://github.com/vvuksan/dbrestadmin</a></p>
<p>It is my first foray into writing RESTful services so things may be rough around the edges. However it allows you to do the following</p>
<ul>
<li>
<p>manage multiple database servers</p>
</li>
<li>
<p>create/drop databases</p>
</li>
<li>
<p>list databases</p>
</li>
<li>
<p>create/drop users</p>
</li>
<li>
<p>list users</p>
</li>
<li>
<p>give user grants</p>
</li>
<li>
<p>view grants given to the user</p>
</li>
<li>
<p>view database privileges on a particular database given to a user</p>
</li>
</ul>
<p>For example, to create a database called testdb on dbserver ID=0, use this cURL command</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST http://myhost/dbrestadmin/v1/databases/0/dbs/testdb
</code></pre></div></div>
<p>Create a user test2 with password test</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST "http://localhost:8000/dbrestadmin/v1/databases/0/users/test2@localhost" -d "password=test"
</code></pre></div></div>
<p>Give test2 user all privileges on testdb</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -X POST "http://localhost:8000/dbrestadmin/databases/0/users/test2@'localhost'/grants" -d "grants=all privileges&database=testdb"
</code></pre></div></div>
<p>There is more. You can see all of the methods here</p>
<p><a href="https://github.com/vvuksan/dbrestadmin/blob/master/API.md">https://github.com/vvuksan/dbrestadmin/blob/master/API.md</a></p>
<p>Improvements and constructive criticism welcome</p>
Operating on Dell RAID arrays cheatsheet2011-11-23T21:16:47+00:00http://blog.vuksan.com/2011/11/23/operating-on-dell-raid-arrays-cheatsheet<p>I infrequently have to add new drives to Dell RAID arrays like the H700. For some reason it takes me a couple of searches to find the info, so here it is so I can find it later.</p>
<p><strong>List all drives</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
</code></pre></div></div>
<p><strong>Create a RAID array (e.g. RAID 0)</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdAdd -r0 [32:4, 32:5] -aALL
</code></pre></div></div>
<p><strong>List working RAID arrays</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL
</code></pre></div></div>
<p>Confirm you got the right RAID array e.g. Virtual Disk 1</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -L1 -aALL
</code></pre></div></div>
<p><strong>Delete RAID array</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/opt/MegaRAID/MegaCli/MegaCli64 -CfgLdDel -L1 -aALL
</code></pre></div></div>
Use fantomTest to test web pages from multiple locations2011-09-27T23:39:49+00:00http://blog.vuksan.com/2011/09/27/fantomtest-multiple-locations<p>In my previous post I introduced <a href="http://blog.vuksan.com/2011/08/02/testing-your-web-pages-with-fantomtest/">Testing your web pages with fantomtest</a>. I have recently added the ability to test the same page from multiple sites within the same interface. You simply install a copy of fantomTest on a remote site and then configure your primary site to access it. For example this is a test of Google from my laptop.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2011/09/fantomtest-goog.png"><img src="http://blog.vuksan.com/wp-content/uploads/2011/09/fantomtest-goog.png" alt="" /></a></p>
<p>Looks like my network connection is really slow :-(. Changing the testing site to Croatia where I have a server I get</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2011/09/fantomtest-hr.png"><img src="http://blog.vuksan.com/wp-content/uploads/2011/09/fantomtest-hr.png" alt="" /></a></p>
<p>Slightly different since Google redirects me to their localized Google site however it leads me to believe that it’s my connection that is slow not Google.</p>
<p>Any number of “remotes” can be added. Want it ? Get it @GitHub</p>
<p><a href="https://github.com/vvuksan/fantomtest">https://github.com/vvuksan/fantomtest</a></p>
Using Jenkins as a Cron Server2011-08-22T21:33:49+00:00http://blog.vuksan.com/2011/08/22/using-jenkins-as-a-cron-server<p>There are a number of problems with cron which cause lots of grief for system administrators, with the big ones being manageability, cron-spam and auditability. To fix some of these issues I have lately started using <a href="http://jenkins-ci.org/">Jenkins</a>. <a href="http://jenkins-ci.org/">Jenkins</a> is an open source Continuous Integration server with lots of features that make it a great cron replacement for a number of uses. These are some of the problems it solves for me</p>
<h3 id="auditability">Auditability</h3>
<p>Jenkins can be configured to retain logs of all jobs that it has run. You can set it up to keep last 10 runs or you can set it up to keep only last 2 weeks of logs. This is incredibly useful since sometimes jobs can fail silently so it’s useful to have the output instead of sending it to /dev/null.</p>
<h3 id="centralized-management">Centralized management</h3>
<p>I have my most important jobs centralized. I can export all Jenkins jobs as XML and check them into a repository. If I need to execute jobs on remote hosts I simply have Jenkins ssh in and execute the command remotely. Alternatively you can use <a href="https://wiki.jenkins-ci.org/display/JENKINS/Distributed+builds">Jenkins slaves</a>.</p>
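<p>For example, a job’s build step that runs a script on a remote host is nothing more than an ssh one-liner; the host and script path below are made up for illustration. Because ssh propagates the remote exit code, Jenkins will mark the build failed if the remote script fails.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Example build step (hostname and script path are illustrative)
ssh backup@db1.domain.com "/usr/local/bin/nightly_db_backup.sh"
</code></pre></div></div>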
<h3 id="cron-spam">Cron Spam</h3>
<p>Cron spam is a common problem with solutions such as <a href="http://habilis.net/cronic/">this</a>, <a href="http://iamthewalr.us/blog/2007/10/howto-make-cron-not-spam-you-to-death/">this</a> and <a href="http://blog.dynamichosting.biz/2010/11/01/stop-crond-from-sending-e-mails/#more-83">this</a>. To avoid this condition I only have Jenkins alert me when a particular job fails ie. a job exits with return code other than 0. In addition you can use the awesome <a href="https://wiki.jenkins-ci.org/display/JENKINS/Text-finder+Plugin">Jenkins Text Finder</a> plugin which allows you to specify words or regular expressions to look for in console output. They can be used to mark a “job” unstable. For example in text finder config I checked</p>
<p>X Also search the console output</p>
<p>and specified</p>
<p>Regular expression <code class="language-plaintext highlighter-rouge">([Ee]rror*).*</code></p>
<p>This has saved our bacon since we used the automysqlbackup.sh script which “swallows” the error codes from the mysqldump command and exits normally. Text Finder caught this</p>
<p><code class="language-plaintext highlighter-rouge">mysqldump: Error 2020: Got packet bigger than 'max_allowed_packet' bytes when dumping table </code>users<code class="language-plaintext highlighter-rouge"> at row: 234
</code></p>
<p>Happily we caught this one in time.</p>
<h3 id="job-dependency">Job dependency</h3>
<p>Often you will have job dependencies, ie. a main backup job where you first dump a database locally then upload it somewhere off-site or to the cloud. The way we have done this in the past is to leave a sufficiently large window between the first job and the subsequent job, to be sure the first job has finished. This says nothing about what to do if the first job fails; likely the second one will too. With Jenkins I no longer have to do that. I can simply tell Jenkins to trigger “backup to the cloud” once the local DB backup concludes successfully.</p>
<h3 id="test-immediately">Test immediately</h3>
<p>While you are adding a job it’s useful to test whether job runs properly. With cron you often had to wait until the job executed at e.g. 3 am in the morning to discover that PATH wasn’t set properly or there was some other problem with your environment. With Jenkins I can click Build Now and job will run immediately.</p>
<h3 id="easy-setup">Easy setup</h3>
<p>Setting up jobs is easy. I have engineers set up their own job by copying an existing job and modifying it to do what they need to do. I don’t remember last time someone asked me how to do it :-).</p>
<h3 id="what-i-dont-use-jenkins-for">What I don’t use Jenkins for</h3>
<p>I don’t use Jenkins to run jobs that collect metrics or anything that has to run too often.</p>
Testing your web pages with fantomtest2011-08-02T13:28:32+00:00http://blog.vuksan.com/2011/08/02/testing-your-web-pages-with-fantomtest<p>Coming from a web operations background my web site/page monitoring had largely focused on metrics such as average request duration, 90th percentile request duration etc. These are all great metrics however through <a href="http://velocityconf.com/">Velocity Conferences</a> I have come to appreciate that there is a lot more to web performance than simply knowing how long it takes to load HTML in a web page. As a result I have been looking for ways to try to get better metrics by utilizing real browsers instead of Perl/Ruby/Python scripts. For some time I have been playing with Selenium RC to give me an easy way to test and time my web application. Unfortunately I found it heavy and slow. At the last Velocity conference I was fortunate enough to see a demo of <a href="http://phantomjs.org/">PhantomJS</a>. PhantomJS is a semi-headless webkit browser with Javascript support. What I really appreciated about it is that it is light weight, fast and very easy to instrument using Javascript. In addition it includes a number of useful examples such as netsniff.js which outputs a HTTP Archive (HAR) of requests to a certain web page. From a HAR file you can build, among other things, waterfall charts. There are a number of services you can use to have your site tested for free e.g. <a href="http://webpagetest.org/">webpagetest.org</a>. The limitation is that they can’t test your intranet infrastructure, since that is usually behind a firewall, nor can they test remote sites that are connected to your intranet via a VPN.</p>
<p>That is why I’m introducing fantomTest. A simple web application that allows you to generate waterfall graphs using PhantomJS. It will also take a screenshot of a rendered page. Here is what that looks like</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2011/08/screenshot2.png"><img src="http://blog.vuksan.com/wp-content/uploads/2011/08/screenshot2-1024x322.png" alt="" /></a></p>
<p>What’s interesting in this particular case is that Google goes against web performance recommendations by using an HTTP redirect from google.com to www.google.com.</p>
<p>Anyways to get fantomTest go to</p>
<p><a href="https://github.com/vvuksan/fantomtest">https://github.com/vvuksan/fantomtest</a></p>
Monitoring links and monitoring anti-patterns video2011-06-06T01:37:42+00:00http://blog.vuksan.com/2011/06/06/monitoring-links-and-monitoring-anti-pattern-video<p>John Vincent aka. <a href="https://twitter.com/lusis">lusis</a> has started an interesting conversation surrounding monitoring on Freenode on channel he named ##monitoringsucks. He has also done an awesome job of starting up a Github project of the same name that is shaping up to be a nice collection of links to tools and blog posts. Check it out</p>
<p><a href="https://github.com/monitoringsucks/">https://github.com/monitoringsucks/</a></p>
<p>Also we just got a hold of the monitoring anti-patterns Ignite Talk from <a href="http://devopsdays.org/">Devopsdays Boston</a> by Alexis Lê-Quôc aka. <a href="https://twitter.com/#!/alq">@alq</a>. It is a short video (5 minutes) so it’s definitely worth seeing.</p>
Use your trending data for alerting2011-04-19T19:59:49+00:00http://blog.vuksan.com/2011/04/19/use-your-trending-data-for-alerting<p>This post will deal with helping you use the data you already have to do alerting. It is most helpful for people running Nagios or its variants such as Icinga, Netreo etc. It could likely be used with other decoupled alerting systems (not Zabbix or Zenoss though since they do their own trending).</p>
<p>Recently I came to the realization that lots of sysadmins are unaware that they could easily use the trending data they already capture with systems such as Ganglia, Graphite, Collectd, Munin etc. to do alerting. The standard way of doing health checks of remote nodes in Nagios is to install the <a href="http://exchange.nagios.org/directory/Addons/Monitoring-Agents/NRPE-%252D-Nagios-Remote-Plugin-Executor/details">Nagios Remote Plugin Executor aka. NRPE</a>, which allows you to execute Nagios plugins on remote nodes and pipe the output to the Nagios server. NRPE does the job, however it has three major disadvantages</p>
<ol>
<li>
<p>It is another daemon that needs to run on the remote host possibly introducing security concerns</p>
</li>
<li>
<p>Depending on the load of the machine it can be slow, thus bogging down the Nagios server</p>
</li>
<li>
<p>Last and most important is that commonly it’s used to alert on common metrics such as disk, load, CPU, swap which you should be trending anyways.</p>
</li>
</ol>
<p>Instead what you ought to be doing is using trending data for alerting. I can think of at least 4 reasons to do so</p>
<ol>
<li>
<p>You may already be collecting pertinent data ie. system load, swap, CPU utilization</p>
</li>
<li>
<p>If you are alerting on a particular metric you should likely be trending it</p>
</li>
<li>
<p>It’s fast</p>
</li>
<li>
<p>Allows you to do more sophisticated checks easily ie. alert me if more than 5 hosts have a load greater than 5 etc.</p>
</li>
</ol>
<p>Years ago I used Ganglia Web PHP code to write my own generic <a href="http://vuksan.com/linux/nagios_scripts.html#check_ganglia_metrics">Nagios Ganglia plugin</a>. This has served me well. Most recently Michael Conigliaro rewrote the script in Python, making it more versatile and more powerful. You can download it from here</p>
<p><a href="https://github.com/ganglia/ganglia_contrib/tree/master/nagios">https://github.com/ganglia/ganglia_contrib/tree/master/nagios</a></p>
<p>In a nutshell what it does is download the whole metrics tree, ie. the list of all hosts with their associated metrics, cache it for a configurable amount of time, then use <a href="http://packages.python.org/NagAconda/plugin.html">NagAconda</a> to support all the threshold reporting as defined in the <a href="http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT">Nagios developer guidelines</a>.</p>
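<p>For illustration, invoking such a check from the command line might look like the example below. The option names shown are assumptions from memory rather than the plugin’s documented interface, so check the script’s --help output for the real ones.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Option names are illustrative - consult the plugin's --help for the real flags
./check_ganglia_metric.py --gmetad_host=ganglia.domain.com \
  --metric_host=web1.domain.com --metric_name=load_one \
  --warning=5 --critical=10
</code></pre></div></div>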
<p>Another alternative if you have a very large site is Ganglios, which was open-sourced by the folks at Linden Lab. Their problem is/was that they have thousands of hosts and downloading the whole metrics tree takes ~15 seconds, so they separated the logic that downloads the metric tree from the logic that does alerting. You can download Ganglios from</p>
<p><a href="https://bitbucket.org/maplebed/ganglios">https://bitbucket.org/maplebed/ganglios</a></p>
<p>This can easily be adapted to work with your trending system of choice.</p>
JSON representation for graphs in Ganglia2011-02-21T02:38:46+00:00http://blog.vuksan.com/2011/02/21/json-representation-for-graphs-in-ganglia<p>Recently, thanks to work done by Alex Dean aka. <a href="https://twitter.com/mostlyalex">@mostlyalex</a>, the Ganglia UI supports defining custom graphs using JSON. Prior to this the only way to create custom graphs was by writing custom PHP code. This had two major problems, ie. lots of people are not comfortable writing or modifying PHP code, and second, you have to target a particular graphing engine e.g. rrdtool. As I have written in the past we are gonna be supporting both rrdtool and graphite for graphing so having a common way to describe graphs has been one of our goals.</p>
<p>To describe a custom graph you would create a JSON file similar to this one</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"report_name" : "network_report",
"report_type" : "standard",
"title" : "Network Report",
"vertical_label" : "Bytes/sec",
"series" : [
{ "metric": "bytes_in", "color": "33cc33", "label": "In", "line_width": "2", "type": "line" },
{ "metric": "bytes_out", "color": "5555cc", "label": "Out", "line_width": "2", "type": "line" }
]
}
</code></pre></div></div>
<p>This will create a line graph with bytes_in and bytes_out metrics. Since hostname and cluster are not specified it is assumed that we want metrics for the current host we are viewing. You could however specify a particular host and metric you want to graph by adding hostname and cluster attributes to series ie.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"report_name" : "our_load_report",
"report_type" : "standard",
"title" : "Load Report vs. Database Load",
"vertical_label" : "Loads",
"series" : [
{ "metric": "load_one", "color": "3333bb", "label": "Load 1", "line_width": "2", "type": "line" },
{ "hostname": "db1.domain.com", "clustername": "Databases", "metric": "load_one", "color": "44ddbb", "label": "DB1 Load 1", "line_width": "2", "type": "line" },
]
}
</code></pre></div></div>
<p>To use the reports all you have to do is put the report in the $GANGLIA_WEB_ROOT/graph.d directory. Name them something_report.json and they will be available for any host in the cluster. There is one important thing to note: by default the graphing function will look for a PHP definition for a graph, as those in theory provide more power and flexibility, and only if one is not available will it use the JSON definition.</p>
<h3 id="types-of-graphs">Types of graphs</h3>
<p>Currently both line and stacked graphs are supported. Look in graph.d/ directory for additional examples.</p>
<h3 id="future">Future</h3>
<p>I am particularly excited about this feature as it allows us to define <a href="http://blog.vuksan.com/2010/06/05/beauty-of-aggregate-line-graphs/">aggregate graphs</a> easily. There is even an alpha implementation of functionality which would allow you to specify a metric and a regex host entry and you would end up with an aggregate graph :-).</p>
<h3 id="download-location">Download location</h3>
<p>Latest version of the UI can be downloaded either from <a href="http://ganglia.svn.sourceforge.net/svnroot/ganglia/branches/monitor-web-2.0/">Ganglia Monitor Web 2.0 SVN branch</a> or you can get it on <a href="https://github.com/vvuksan/ganglia-misc">Github</a>.</p>
Misconceptions about RRD storage2010-12-14T22:04:57+00:00http://blog.vuksan.com/2010/12/14/misconceptions-about-rrd-storage<p>I want to address the misconceptions about RRD (Round-Robin Database) that seem to crop up often even among seasoned sysadmins. Complaints can be summarized with these two points</p>
<ul>
<li>
<p>RRD doesn’t offer high resolution ie. after about an hour it’s all averages and I want to know what the metric value was last year at this hour and minute</p>
</li>
<li>
<p>Data drops off/is destroyed after a year - I want to keep my data forever, disk is cheap etc.</p>
</li>
</ul>
<p>Those are valid points however none of them are the fault of RRD. RRD is a circular buffer so in order to be able to write into it you have to precreate it (otherwise it wouldn’t be a circular buffer :-)). Obviously the more data points you store, the bigger the RRD file will be. To illustrate the point <a href="http://ganglia.info/">Ganglia Monitoring</a> uses the following defaults to create RRDs</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244" "RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" "RRA:AVERAGE:0.5:5760:374"
</code></pre></div></div>
<p>This will create multiple circular buffers within the same RRD database file. In order to make sense out of this you need to know what the polling interval is ie. how often do you write into RRDs. In Ganglia’s case the default is 15 seconds so</p>
<ul>
<li>
<p>“RRA:AVERAGE:0.5:1:244” says write actual values (:1:) for every polling interval. Save last 244 of those so in our case we’ll have 61 minutes worth of actual data points. Since it’s a circular buffer data older than 61 minutes will be “dropped”</p>
</li>
<li>
<p>“RRA:AVERAGE:0.5:24:244” says average 24 values (:24:), 24 * 15 seconds = 360 seconds = 6 minutes. 244 of those times 6 is a whole day</p>
</li>
<li>
<p>You can do the next two :-)</p>
</li>
<li>
<p>Last one “RRA:AVERAGE:0.5:5760:374” says average whole day (5760 * 15 seconds = 1440 minutes = 1 day) worth of values and store it in 374 points ie. little more than a year</p>
</li>
</ul>
<p>When graphing RRDtool is smart enough to use the buffer which gives you the most data points. To store all this data RRD file will use about 12kBytes. Thus if you want higher resolution you will need to change the definition e.g. you could do this</p>
<p>“RRA:AVERAGE:0.5:1:2137440”</p>
<p>which will give you one year worth of data points with no averaging with 15 second interval. Trouble is the size of this RRD file is 17 Mbytes. This may not seem as bad but one of the RRD drawbacks is that every time you add data to an RRD the whole file is written over so if you have 1000 metrics you can be potentially writing 17 GBs of data every 15 seconds. This may be a problem depending how many metrics you are keeping track of. There are alternatives which increase throughput such as storing RRDs in RAMdisk or using rrdcached. Alternatively you can opt to keep 2 weeks worth of data points with e.g.</p>
<p>“RRA:AVERAGE:0.5:1:81984”</p>
<p>which will result in size of about 650 kBytes per RRD file. Or you can do something else altogether. Flip side of RRD is that there are no indexes to maintain, no tables that need to be rotated.</p>
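<p>If you want to experiment with sizes yourself, you can create a standalone RRD with a custom RRA layout and simply look at how big the resulting file is. A minimal sketch (the data source name and step are just examples):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># 15-second step, one gauge data source, raw samples for ~2 weeks plus daily averages for ~a year
rrdtool create test_metric.rrd --step 15 \
  DS:value:GAUGE:60:U:U \
  RRA:AVERAGE:0.5:1:81984 \
  RRA:AVERAGE:0.5:5760:374
ls -lh test_metric.rrd
</code></pre></div></div>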
<p><strong>Update:</strong> I was wrong about the whole RRD file needing to be updated. In retrospect it makes sense and I apologize for providing the wrong info. You can read comment from Tobi Oetiker (creator of rrdtool) in comments below for more detail. This is actually awesome news since there is very little downside in making larger RRDs.</p>
<p>As far as Ganglia is concerned, you can modify the defaults in the /etc/ganglia/gmetad.conf file. You can also use gmetad-python which allows you to write your own plugins and store metric data in RRD format, SQL or any other storage engine of your choice.</p>
<p>More on RRDtool can be found here</p>
<p><a href="http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html">http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html</a></p>
Rethinking Ganglia Web UI2010-12-11T01:09:47+00:00http://blog.vuksan.com/2010/12/11/rethinking-ganglia-web-ui<p>I have been a long time fan of Ganglia. Ganglia is a scalable distributed monitoring system initially developed for high-performance computing systems such as clusters and Grids. Today Ganglia is being used by some of the largest web properties such as Facebook, Twitter, Etsy, etc. as well as tons of smaller organizations. Some of Ganglia’s benefits are</p>
<ul>
<li>
<p>Push based metrics ie. a lightweight agent on hosts that need to be monitored</p>
</li>
<li>
<p>Lots of basic metrics by default such as load, cpu utilization, memory utilization</p>
</li>
<li>
<p>Trivial to add new metrics ie. execute the gmetric command with metric value and graph automatically shows up</p>
</li>
<li>
<p>Decent web interface that allows you to easily drill down when troubleshooting problems</p>
</li>
</ul>
<p>I have used other monitoring systems such as Cacti, Zenoss and Zabbix and found them lacking since they were overly complicated, hard to configure and customize. That said I have also had misgivings about certain parts of the Ganglia UI. Specifically what I missed were the following features</p>
<ol>
<li>
<p>Ability to search hosts and metrics - looking for specific host or metric gets cumbersome even on clusters with 20-30 hosts</p>
</li>
<li>
<p>Ability to create arbitrary groupings of host metrics on one page ie. a page with web response time for each web server and mySQL lock time would be something you’d have to write custom code for</p>
</li>
<li>
<p>Easy way to create custom graphs ie. either aggregate line graphs or stacked graphs</p>
</li>
<li>
<p>Easy way to add custom graphs to either clusters or hosts ie. I have a stacked Apache report showing number of GETs vs. POSTs. It’s hard or impossible to show that graph only on webservers but not on mySQL servers.</p>
</li>
<li>
<p>Mobile (WebKit) optimized experience - minimize zooming/panning etc.</p>
</li>
</ol>
<p>A couple of months ago on the #ganglia Freenode IRC channel we were discussing some of the pitfalls of the UI and the idea of rewriting the Ganglia UI was born. As I have been doing quite a bit of work with jQuery in months past I decided to give it a shot.</p>
<h2 id="goals">Goals</h2>
<p>My initial goals were</p>
<ol>
<li>
<p>Implement basic search functionality ie. one search term that will show matching hosts and metrics</p>
</li>
<li>
<p>Add a way to add “optional” graphs on a per-cluster/per-host basis ie. have a default set of graphs and allow those to be overridden using cluster or host override config files</p>
</li>
<li>
<p>Add Views ie. ability to group host/metrics</p>
</li>
<li>
<p>Add Mobile/Webkit View</p>
</li>
<li>
<p>Store view and optional graphs config information in a format that can be easily manipulated by web UI, config management system or by hand - this is one of the key omissions in most monitoring setups where adding/removing hosts requires either manual intervention or kludgy hacks. As someone who has had to spend hours manually clicking around Zabbix interface whenever we added a new server this had major importance</p>
</li>
</ol>
<h2 id="implementation">Implementation</h2>
<p>Initially there was an idea to rebuild the whole interface from scratch which we still may do but I decided that that would be too much work especially since I wasn’t absolutely sure whether my intended changes would make sense for most people. Thus I decided to modify the existing UI.</p>
<p>So far these are the features that have been implemented</p>
<h3 id="visual-aides">Visual aides</h3>
<p>In cluster view next to each host now you’ll see the full hostname in text on top of the graph. Same goes for metric names in host view. Now even if you have hundreds of metrics you can click CTRL-F in your browser and find the metric quickly. Also there is a hidden anchor next to each metric which is used by the search tab.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/12/ganglia_visual_aides1.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/12/ganglia_visual_aides1.png" alt="" /></a></p>
<p>Doesn’t seem like much until you need it :-).</p>
<h3 id="search">Search</h3>
<p>Search tab allows you to type in a single term which will match hosts and metrics. It will search as you type. Hosts first, metrics on host second. Clicking on hosts opens a new window with the view of the host. Clicking on a particular metric takes you to the metric in question.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/12/ganglia_search.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/12/ganglia_search.png" alt="" /></a></p>
<h3 id="views">Views</h3>
<p>Views are defined using JSON configuration files. One JSON file per view. There are two types of views, standard and regex views. For example standard view will look like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ "view_name":"default",
"items":[
{"hostname":"host1.domain.com","graph":"cpu_report"},
{"hostname":"host2.domain.com","graph":"apache_report"}
],
"view_type":"standard"
}
</code></pre></div></div>
<p>It will group cpu report from host1 and apache_report for host2. Regex view allows you to use regular expressions to define hosts (soon also metrics) ie. you want to group all hosts that have imap, amavis or smtp in their names. That view definition would look something like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ "view_name":"mailservers",
"items":[
{"hostname":"(imap|amavis|smtp)", "graph":"cpu_report"}
],
"view_type":"regex"}
</code></pre></div></div>
<p>If you don’t want to edit JSON config files by hand you can use the UI to create standard views ie. first create a view then as you browse hosts there is a plus sign next to each graph. Clicking on it displays a dialog which allows you to add that particular host/metric to a view e.g.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/12/ganglia_add_metric_to_view.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/12/ganglia_add_metric_to_view.png" alt="" /></a></p>
<h3 id="automatic-rotation">Automatic rotation</h3>
<p>Allows you to automatically rotate a view. It is an integration of <a href="http://blog.vuksan.com/2010/06/16/gangliaview-automatically-rotate-ganglia-metrics/">GangliaView</a> with Views. What’s especially nice is that if you have multiple monitors you can open up separate browser windows and select different views to rotate.</p>
<h3 id="mobile-view">Mobile view</h3>
<p>There is a functional mobile view which provides mobile view of Views, Clusters and Search ie. there is very little panning or zooming. Also we are using lots of preloading ie. first page you open contains lots of hidden sub-pages in order to save on having to do subsequent requests.</p>
<p>You can view some of the <a href="http://www.flickr.com/photos/51166390@N05/sets/72157625551485278/">screenshots </a>on Flickr.</p>
<h3 id="optional-graphs">Optional Graphs</h3>
<p>You can specify which optional graphs you want displayed for each host or cluster. Similar to views these are configured via JSON config files e.g. this is the default list of graphs</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
"included_reports": ["load_report","mem_report","cpu_report","network_report","packet_report"]
}
</code></pre></div></div>
<p>You can exclude any of the default included graphs or include ones you want e.g.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/12/ganglia_edit_optional_graphs.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/12/ganglia_edit_optional_graphs.png" alt="" /></a></p>
<h3 id="screencast">Screencast</h3>
<p>If you would like to see some of these features in action you can look at <a href="http://vuksan.com/ganglia-ui.html">these screencasts</a>.</p>
<h3 id="download">Download</h3>
<p>Ready to try ? Wait no more and check it out from SVN at</p>
<p><a href="http://ganglia.svn.sourceforge.net/svnroot/ganglia/branches/monitor-web-2.0/">http://ganglia.svn.sourceforge.net/svnroot/ganglia/branches/monitor-web-2.0/ </a></p>
<p><strong>Future</strong></p>
<p>In the future we are looking into polishing the Graphite/Ganglia integration (perhaps more about that in a future post) and adding integrations with e.g. Nagios (you can see a hint of it in the add metric to view screenshot above) and Logstash. Another upcoming feature will be aggregate metrics and quick views. The full TODO list can be found here</p>
<p><a href="http://sourceforge.net/apps/trac/ganglia/browser/branches/monitor-web-2.0/TODO">http://sourceforge.net/apps/trac/ganglia/browser/branches/monitor-web-2.0/TODO</a></p>
<h3 id="acknowledgements">Acknowledgements</h3>
<p>I’d like to thank Erik Kastner for helping on the Graphite/Ganglia integration, and Ben Hartshorne for test driving the UI and providing a number of good suggestions/ideas.</p>
Integrating Graphite with Ganglia2010-09-29T14:32:56+00:00http://blog.vuksan.com/2010/09/29/integrating-graphite-with-ganglia<p>Some time ago I saw a demo on using Graphite (<a href="http://graphite.wikidot.com/">http://graphite.wikidot.com/</a>). I was impressed by the ease of creating custom graphs and the quality/visual appeal of the graphs. Trouble was that Graphite uses it’s own storage engine instead of RRD and I figured it may be too much work to figure out how to inject my existing <a href="http://ganglia.info/">Ganglia</a> metrics.</p>
<p>Couple days ago I saw a tweet from <a href="http://twitter.com/mikebrittain">Mike Brittain</a> at Etsy on how Graphite is becoming one of his favorite graphing tools. I know that they use Ganglia at Etsy so I asked if/how they use integration between Graphite and Ganglia. He pointed me in the direction of <a href="http://twitter.com/kastner">Erik Kastner</a> who has done Ganglia Graphite integration. I asked him if he could post the patches and he was gracious to do so. In a nutshell he uses RRD files directly and rsyncs them every few minutes. While trying to install Graphite I realized that injecting metrics into Graphite is really simple. For example graphite-web contains a simple client example that injects system load. All it does is connects to port 2003 of the graphite installation and sends a following payload</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>system.loadavg_1min 0.08 1285763852
system.loadavg_5min 0.02 1285763852
system.loadavg_15min 0.01 1285763852
</code></pre></div></div>
<p>That’s simple :-) ie. some type of a metric name, a value and what looks like the current UNIX timestamp. I then remembered that <a href="https://twitter.com/georgiou">Kostas Georgiou</a> showed me a ruby script that connects to gmond, retrieves the XML for the host, parses it and adds the values to <a href="http://www.puppetlabs.com/puppet/related-projects/facter/">Facter</a>. Unfortunately that didn’t seem to have much value until now :-). What I did was change Kostas’ script to send metrics to Graphite instead of adding them to facter. You can find the result at the <a href="http://github.com/ganglia/ganglia_contrib/tree/master/graphite_integration/">Ganglia Add-Ons GitHub repository</a>. You can run the script either from cron or as a daemon.</p>
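<p>To convince yourself how simple the protocol is, you can push a single data point from the shell with nothing more than nc (the Graphite host name is illustrative).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Send one data point to carbon's plaintext listener on port 2003
echo "system.loadavg_1min 0.08 $(date +%s)" | nc graphite.domain.com 2003
</code></pre></div></div>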
<p>There are two ways to do this. I have tested only the first way. I am not sure if the graphite receiver would freak out if it gets too many metrics in a payload. Let me know if you know :-).</p>
<ol>
<li>
<p>Run this script on every host that runs gmond. This may be somewhat tricky since I usually set up gmond to only send metrics and turn off receiving by setting deaf = yes. For this approach to work you have to turn on receiving. To make it more secure we’ll just listen on loopback. In global make sure you have these settings</p>
<p>mute = no
deaf = no</p>
</li>
</ol>
<p>In the rest of the section make sure you add/have</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>udp_send_channel {
host = 127.0.0.1
port = 8649
ttl = 1
}
udp_recv_channel {
bind = 127.0.0.1
port = 8649
}
tcp_accept_channel {
bind = 127.0.0.1
port = 8649
}
</code></pre></div></div>
<ol>
<li>Run this on the main gmond collector daemon. Main gmond collector daemon will have metrics from all hosts. Trouble is that I haven’t tested injecting thousands of metrics in a single payload. I’m sure there is a way around it and perhaps someone can post a patch :-D.</li>
</ol>
<h3 id="future-improvements">Future Improvements</h3>
<p>I can think of couple possible improvements</p>
<ol>
<li>
<p>There is a rewrite of gmetad written in Python. It supports plugins. I don’t think it would be a stretch to add a plug-in where gmetad sends data to Graphite when it updates the RRDs</p>
</li>
<li>
<p>Currently metrics are sent as <code class="language-plaintext highlighter-rouge">hostname.metric_name</code>. It may make sense to send them into the appropriate part of the tree ie. <code class="language-plaintext highlighter-rouge">type_of_metric.hostname.metric_name</code> e.g. database.web1.mysql_selects</p>
</li>
<li>
<p>Better integrate Ganglia Web UI and Graphite. Graphite supports flexible URL parameters so this should be doable.</p>
</li>
</ol>
<p>And obligatory screenshots. This is the stacked graph I created in 20 seconds :-)</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/09/graphite1.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/09/graphite1.png" alt="Graphite view of Ganglia Metrics" /></a></p>
EC2 micro instances cost analysis2010-09-09T22:02:25+00:00http://blog.vuksan.com/2010/09/09/ec2-micro-instances-cost-analysis<p>Amazon today announced the addition of EC2 micro instances, their smallest instance size, coming with 613 MB of RAM and priced at $0.02/hour. You can read more about the announcement here</p>
<p><a href="http://aws.typepad.com/aws/2010/09/new-amazon-ec2-micro-instances.html">http://aws.typepad.com/aws/2010/09/new-amazon-ec2-micro-instances.html</a></p>
<p>There is a wrinkle though. There is no local (ephemeral) storage so you need to use <a href="http://aws.typepad.com/aws/2009/12/new-amazon-ec2-feature-boot-from-elastic-block-store.html">EBS backed volumes</a>. EBS is charged at $0.10/GB per month along with a charge of $0.10 per 1 million I/O requests to the volume. That is actually a reasonably good idea since it likely cuts down on I/O subsystem abuse: if you start abusing I/O it will cost you. That said I thought I would run a quick cost analysis to determine how much it would cost to actually run an instance. I have a personal server I use for handling my family’s e-mail, blog and personal web sites. It gets little traffic. I use roughly 30 GB of storage. To find out the number of I/O ops I ran the following command</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> cat /proc/diskstats | egrep "sd[a-b] " | awk '{print $4" "$8}'
154756 3576927
773387 1844813
</code></pre></div></div>
<p>This lists the number of both read and write ops for both drives in my machine. It adds up to about 6 million iops. The machine was last rebooted 7 days ago, making this roughly 25 million iops per month. On the outbound network traffic side I have consumed 5 GB of traffic so far, so => 20 GB per month (charged at $0.15/GB).</p>
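<p>If you want the arithmetic done for you, the same numbers can be summed and extrapolated in one line; adjust the device pattern and the uptime in days to whatever applies to your box.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sum read + write ops (fields 4 and 8) and extrapolate to a 30-day month
cat /proc/diskstats | egrep "sd[a-b] " | \
  awk -v days=7 '{iops+=$4+$8} END {printf "%.1f million iops/month\n", iops/days*30/1000000}'
</code></pre></div></div>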
<p>Thus the cost breaks down like this per month</p>
<p>Instance cost (30 * 24 * $0.02) = $14.40</p>
<p>EBS storage charge ( 30 * $0.10) = $3.00</p>
<p>EBS I/O ops charge ( 25 * $0.10) = $2.50</p>
<p>Outbound network traffic ( 20 * $0.15 ) = $3.00</p>
<p><strong>Total: $22.90</strong></p>
<p>Not too bad. A word of warning though. Since these micro instances come with only 613 MB of RAM if you load even a handful of services such as a mySQL database, web or app server you may end up swapping causing your EBS I/O ops charges to go up. I doubt these would be enormous however depending on the level of swapping they could be 25, 50% or 100% higher than what you planned for. Obviously EBS has some nice features such as persistency, snapshotting and ability to boot instances automatically after a failure however it may come with unanticipated cost.</p>
<p><strong>Update</strong>: Some have pointed out that instance costs can be even lower if you reserve (assuming 1-yr commit micro instances are $115/year vs. $172 non-reserved). That is true, however as I point out the biggest X factor in the whole equation is the EBS charges. It’s nothing that will break the bank however I prefer having an idea upfront of what the cost is. If your use case is a DNS server, mail server or Nagios checker then this fits the bill well; however if you plan to run a ticketing system or a wiki that uses a DB backend you will likely exceed the memory footprint and start swapping.</p>
Install Openstack Nova easily using Chef and Nova-Solo2010-09-01T12:26:52+00:00http://blog.vuksan.com/2010/09/01/install-openstack-nova-easily-using-chef-and-nova-solo<p>Inspired by <a href="http://github.com/cloudscaling/swift-solo">Cloudscaling’s Swift-Solo</a> and being excited about being able to create my own cloud, I am announcing the Nova-Solo project. <a href="http://openstack.org/">Openstack</a> Nova is the Compute portion of the project trying to build an open source stack to run an Amazon EC2 type service. Nova-Solo is a set of <a href="http://Opscode.com/chef/">Opscode Chef</a> recipes that allow you to quickly get most parts of the Nova stack up and running. You can fetch it from Github at</p>
<p><a href="http://github.com/vvuksan/nova-solo">http://github.com/vvuksan/nova-solo</a></p>
<p>At this time Nova-Solo is targeted for Ubuntu 10.04 and it relies on <a href="https://wiki.ubuntu.com/SorenHansen">Soren Hansen’s</a> package repository to install all of the necessary packages. The following Nova services are installed</p>
<ul>
<li>
<p>Cloud controller</p>
</li>
<li>
<p>Object store</p>
</li>
<li>
<p>Volume store</p>
</li>
<li>
<p>API server</p>
</li>
<li>
<p>Compute Server</p>
</li>
</ul>
<p>Soren’s package archive is a bit outdated so some of the things don’t work. For example you can create users, generate credentials, upload files into buckets but you can’t register the image. Soren has said he is in the process of building new packages and I am also in the process of doing the same so hopefully things improve quickly. Nova code is definitely alphaish so beware. To get started use git to clone the nova-solo repository and off you go</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone git://github.com/vvuksan/nova-solo.git
</code></pre></div></div>
<p>In the future as things stabilize we’ll be making adjustments to support multiple compute servers (pieces for it are already in Nova-Solo), support other distributions like RHEL/Centos, etc.</p>
Slides from the Boston DevOps meetup2010-08-26T01:38:37+00:00http://blog.vuksan.com/2010/08/26/slides-from-the-boston-devops-meetu<p>Here are slides from the August 3rd, 2010 Boston DevOps meetup where Jeff Buchbinder and I spoke about deployment and other helpful hints</p>
<p><a href="http://www.scribd.com/doc/35757228/Deploying-Yourself-Into-Happiness">http://www.scribd.com/doc/35757228/Deploying-Yourself-Into-Happiness</a></p>
<p>Slides have been slightly modified based on the feedback we received at the meetup. If you have any questions please post them in comments and I’ll attempt to answer them.</p>
Tunnel all your traffic on "hostile" networks with OpenVPN2010-08-20T12:29:20+00:00http://blog.vuksan.com/2010/08/20/tunnel-all-your-traffic-on-hostile-networks-with-openvpn<p>I am often on wireless networks that are unsecured, ie. they either don’t use encryption or, even if they do, I may not trust them not to tamper with my data (you never know). To protect my traffic on such networks I decided to tunnel nearly all my traffic through an OpenVPN server while I’m on them. I will show you how you can do it yourself on your Linux or Mac laptops. You should be able to do something similar on Windows but it may be a bit more work on the client.</p>
<h2 id="openvpn-server-setup">OpenVPN server setup</h2>
<p>Set up OpenVPN on a network you trust e.g. home, work, cloud etc. You can either use the Community Edition of OpenVPN which is free <a href="http://openvpn.net/index.php/open-source/downloads.html">http://openvpn.net/index.php/open-source/downloads.html</a> or you may want to pay OpenVPN money for their OpenVPN appliance package. I prefer using <a href="http://pfsense.com/">pfSense</a>, which is a customized FreeBSD distribution geared towards firewalls/routers with a superb Web GUI. If you are gonna use the Community Edition follow the <a href="http://openvpn.net/index.php/open-source/documentation/howto.html#quick">Quickstart guide</a>.</p>
<p>One last step is to make sure that the VPN network, ie. 10.8.0.0/16, is NATed, e.g. on a Linux OpenVPN server you could do</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><code style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: #eeeeee; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, serif; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;"><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #333333; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pln">iptables </span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #888888; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pun">-</span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #333333; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pln">t nat </span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #888888; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pun">-</span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #333333; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pln">A POSTROUTING </span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #888888; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pun">-</span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #333333; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pln">s </span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; padding: 0px; margin: 0px; border: 0px initial initial;" class="pln"><span style="color: #4e0000;">10.8.0.0</span></span><span style="font-size: 14px; vertical-align: 
baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #888888; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pun">/</span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; padding: 0px; margin: 0px; border: 0px initial initial;" class="pun"><span style="color: #4e0000;">16</span></span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #333333; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pln"> </span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #888888; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pun">-</span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #333333; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pln">o eth0 </span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #888888; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pun">-</span><span style="font-size: 14px; vertical-align: baseline; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: transparent; color: #333333; background-position: initial initial; background-repeat: initial initial; padding: 0px; margin: 0px; border: 0px initial initial;" class="pln">j MASQUERADE</span></code>
</code></pre></div></div>
<h2 id="openvpn-client-setup">OpenVPN client setup</h2>
<p>Configure OpenVPN client to connect to your OpenVPN server. You can find the <a href="http://openvpn.net/index.php/openvpn-client/howto-openvpn-client.html">client HOWTO</a> here.</p>
<p>Make sure you can access your home/work network. This will in general provide you with “split-tunnel” access ie. only traffic intended for your home/work network will be tunneled through VPN and everything else will go the normal “insecure” way.</p>
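<p>To double-check what is and isn’t going through the tunnel once connected, you can inspect the routing table and ask the kernel which path a destination would take (a quick sanity check on Linux; on a Mac the netstat part works the same):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Show the current routing table
netstat -nr
# Ask which interface/gateway would be used for a given destination (Linux only)
ip route get 8.8.8.8
</code></pre></div></div>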
<h2 id="tunnel-all-traffic">Tunnel all traffic</h2>
<p><strong>Update</strong>: Shame on me. Someone has already posted the directions on how to do this at</p>
<p><a href="http://manoftoday.wordpress.com/2006/12/03/openvpn-20-howto/">http://manoftoday.wordpress.com/2006/12/03/openvpn-20-howto/</a></p>
<p>Thanks to <a href="http://twitter.com/somic">@somic</a> for pointing this out.</p>
<p>The tricky part in all this is that OpenVPN uses a simple TUN/TAP interface through which it tunnels all the traffic. The temptation is to simply add an entry to the OpenVPN config file that sets a default route through OpenVPN. This will likely fail as you will now have competing default routes. Instead what you need to do is add a route to your VPN server that uses the wireless network’s default gateway and make your VPN interface the default route. This way all the traffic goes into the VPN interface and OpenVPN takes care of tunneling it through.</p>
<p>For this you will need to configure an external script that fires off once the VPN tunnel is up. To enable a post-up script put the following two lines in your ovpn file</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span style="text-decoration: line-through;">script-security 3 system
up /usr/local/bin/set_up_routes.sh</span>
</code></pre></div></div>
<p>Your set_up_routes.sh would look something like this. Please change the VPN_SERVER_IP variable to the IP of your OpenVPN server.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span style="text-decoration: line-through;">#!/bin/sh
# Note the wireless network default gateway
DEFAULT_GATEWAY=`netstat -nr | grep ^0.0.0.0 | awk '{ print $2 }'`
# Find out what's the IP on the
VPN_GATEWAY=`netstat -nr | grep tun | grep -v 0.0.0.0 | awk '{ print $2 }' | sort | uniq`
VPN_GATEWAY=`ifconfig | grep 172.16 | cut -f3 -d: | cut -f1 -d" "`
VPN_SERVER_IP="1.2.3.4"
sudo /sbin/route del default
#
sudo /sbin/route add -host $VPN_SERVER_IP gw $DEFAULT_GATEWAY
# Don't tunnel traffic to 2.3.4.5 since it's already SSLized
sudo /sbin/route add -host 2.3.4.5 gw $DEFAULT_GATEWAY
sudo /sbin/route add default gw $VPN_GATEWAY</span>
</code></pre></div></div>
<p>This script was tested under Ubuntu Linux but should work the same under Mac OS X. On Windows you may need to use PowerShell or use Cygwin.</p>
<h2 id="tunneling-traffic-for-specific-ips">Tunneling traffic for specific IPs</h2>
<p>If you only wish to tunnel traffic for a particular set of IPs you only need to add those routes to your ovpn file e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>route 72.0.0.0 255.0.0.0
route 75.0.0.0 255.0.0.0
</code></pre></div></div>
<p>You do NOT need to go through the exercise of setting up a script.</p>
<p>If you are looking for other OpenVPN guides, <a href="https://twitter.com/samj">Sam Johnston</a> has an OpenVPN guide on how to set up OpenVPN in a VPS</p>
<p><a href="http://samj.net/2010/01/howto-set-up-openvpn-in-vps.html">http://samj.net/2010/01/howto-set-up-openvpn-in-vps.html</a></p>
Skipping MySQL replication errors2010-08-19T17:59:47+00:00http://blog.vuksan.com/2010/08/19/skippingmysql-replication-errors<p>I was talking to my buddy Jeff Buchbinder and he mentioned that he recently added the following to mySQL in order to reduce mySQL replication breakages</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>slave-skip-errors=1062,1053,1146,1051,1050
</code></pre></div></div>
<p>What this does is keep replication from stopping when the following errors are encountered</p>
<p>Error: <code class="language-plaintext highlighter-rouge">1050</code> SQLSTATE: <code class="language-plaintext highlighter-rouge">42S01</code> (<a href="http://dev.mysql.com/doc/refman/5.0/en/error-messages-server.html#error_er_table_exists_error"><code class="language-plaintext highlighter-rouge">ER_TABLE_EXISTS_ERROR</code></a>)</p>
<p>Message: Table ‘%s’ already exists</p>
<p>Error: <code class="language-plaintext highlighter-rouge">1051</code> SQLSTATE: <code class="language-plaintext highlighter-rouge">42S02</code> (<a href="http://dev.mysql.com/doc/refman/5.0/en/error-messages-server.html#error_er_bad_table_error"><code class="language-plaintext highlighter-rouge">ER_BAD_TABLE_ERROR</code></a>)</p>
<p>Message: Unknown table ‘%s’</p>
<p>Error: <code class="language-plaintext highlighter-rouge">1053</code> SQLSTATE: <code class="language-plaintext highlighter-rouge">08S01</code> (<a href="http://dev.mysql.com/doc/refman/5.0/en/error-messages-server.html#error_er_server_shutdown"><code class="language-plaintext highlighter-rouge">ER_SERVER_SHUTDOWN</code></a>)</p>
<p>Message: Server shutdown in progress</p>
<p>Error: <code class="language-plaintext highlighter-rouge">1062</code> SQLSTATE: <code class="language-plaintext highlighter-rouge">23000</code> (<a href="http://dev.mysql.com/doc/refman/5.0/en/error-messages-server.html#error_er_dup_entry"><code class="language-plaintext highlighter-rouge">ER_DUP_ENTRY</code></a>)</p>
<p>Message: Duplicate entry ‘%s’ for key %d</p>
<p>Error: <code class="language-plaintext highlighter-rouge">1146</code> SQLSTATE: <code class="language-plaintext highlighter-rouge">42S02</code> (<a href="http://dev.mysql.com/doc/refman/5.0/en/error-messages-server.html#error_er_no_such_table"><code class="language-plaintext highlighter-rouge">ER_NO_SUCH_TABLE</code></a>)</p>
<p>Message: Table ‘%s.%s’ doesn’t exist</p>
<p>This will avoid the very common primary key collisions and “temporary tables aren’t there” problems. Writing this down for posterity. Use with caution.</p>
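<p>For completeness, this is roughly where the option lives on the slave, assuming a stock my.cnf layout (a sketch; restart the slave after changing it):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[mysqld]
# Skip duplicate-key, missing-table and related errors on the replication slave
slave-skip-errors = 1062,1053,1146,1051,1050
</code></pre></div></div>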
<p>Marius Ducea has a post about it as well</p>
<p><a href="http://www.ducea.com/2008/02/13/mysql-skip-duplicate-replication-errors/">http://www.ducea.com/2008/02/13/mysql-skip-duplicate-replication-errors/</a></p>
PHP 5.3 name spaces separator2010-08-16T19:03:30+00:00http://blog.vuksan.com/2010/08/16/php-5-3-names-space-separator<p>I am posting this to help others that may encounter a similar problem.</p>
<p>I have been doing some PHP development recently using <a href="http://github.com/nrk/predis/">Predis</a>, a PHP <a href="http://code.google.com/p/redis/">Redis</a> library. While instantiating the Redis\Client object I get</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Warning: Unexpected character in input: '\' (ASCII=92) state=1 in .....
</code></pre></div></div>
<p>Problem was explained in this issue</p>
<p><a href="http://github.com/nrk/predis/issues/closed#issue/11">http://github.com/nrk/predis/issues/closed#issue/11</a></p>
<p>If you are still running on PHP 5.2 you should use the backported version of Predis and not the mainline library which targets only PHP >= 5.3 (the <em><strong>backslash is the namespace separator</strong></em> in PHP 5.3).</p>
<p>More discussion of this change can be found here.</p>
<p><a href="http://giorgiosironi.blogspot.com/2009/09/introspection-of-php-namespaces.html">http://giorgiosironi.blogspot.com/2009/09/introspection-of-php-namespaces.html</a></p>
Deployment rollback2010-08-12T17:55:36+00:00http://blog.vuksan.com/2010/08/12/deployment-rollback<p>This is a question that often comes up in deployment discussions. How do you roll back in case of a “bad” deploy? A bad deploy can be any of the following</p>
<ul>
<li>
<p>Site completely broken</p>
</li>
<li>
<p>Significant performance degradation</p>
</li>
<li>
<p>Key feature(s) broken</p>
</li>
</ul>
<p>There are obviously a number of ways to deal with this issue. You could put up a notice on the site that x and y feature is broken while you work to fix it. Same with performance degradation. Let’s however deal with rollback, ie. you decided (determined by a number of different factors) that the stuff you just deployed is broken and you should roll back to the last known good version. In such a case you would</p>
<ul>
<li>
<p>Undo any configuration changes you may have applied (often none)</p>
</li>
<li>
<p>Deploy last known good version that worked. This is one of the reasons why I prefer using labelled binary packages. I simply instruct the deployment tool to install version 1.5.2 which was last good version and off we go.</p>
</li>
</ul>
<p>The only caveat is database changes. In general you can’t easily undo DB changes, especially in situations where you discover a deployment problem a couple of hours after the deployment has taken place, since by then users may have added new posts, changed their profiles etc. It would be a major effort to undo all DB changes and evaluate whether newly added data needs to be changed. That said, DB changes are usually not a problem if you follow these easy steps</p>
<ol>
<li>
<p>Don’t do any column drops immediately after the release. You can do those in QA but in production those can wait. In most cases they only take up space. I have heard of places that would first zero out then drop “unused” columns once a quarter or so.</p>
</li>
<li>
<p>Related to 1., never ever use SELECT * since if you drop or add a column your code may break during rollback</p>
</li>
<li>
<p>If there are data changes you have to make, ie. update carrier set name="AT&T" where name="Cingular", have the reverse SQL statement ready as the insurance policy (see the sketch after this list). Those are quite easy to implement.</p>
</li>
<li>
<p>You don’t have to worry about added tables since older version will not use them.</p>
</li>
<li>
<p>You don’t have to worry about added columns provided you heed 2. (no SELECT *) and have not placed constraints, ie. NOT NULL. If you have, you may need to adjust or drop those during rollback.</p>
</li>
</ol>
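<p>As an illustration of point 3 above, the forward/reverse pair for the carrier rename could be kept as simply as this (a sketch using the mysql CLI; the database name mydb is a placeholder):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Forward change applied with the deploy
mysql mydb -e 'UPDATE carrier SET name="AT&T" WHERE name="Cingular"'
# Reverse statement kept ready in case of a rollback
mysql mydb -e 'UPDATE carrier SET name="Cingular" WHERE name="AT&T"'
</code></pre></div></div>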
<p>The wildcard in all this is added or removed constraints ie. new foreign keys. There is no single solution for this one. Perhaps the right policy is to discuss constraints prior to deployment and have a plan ready on what to do. Good luck.</p>
Bootstrapping your cloud environment with puppet and mcollective2010-07-29T01:31:49+00:00http://blog.vuksan.com/2010/07/29/bootstraping-your-cloud-environment-with-puppet-and-mcollective<p>This is a “recipe” on how to bootstrap your whole environment in case of a disaster, ie. your data center goes dark, or if you are migrating from one environment to another. This guide differs from others in that it uses mcollective and DNS to provide you with greater flexibility in deploying and bootstrapping environments. Some of the alternate ways are <a href="http://github.com/ripienaar/ec2-boot-init#readme">ec2-boot-init by R.I. Pienaar</a> or Grig Gheorghiu’s <a href="http://agiletesting.blogspot.com/2009/09/bootstrapping-ec2-images-as-puppet.html">Bootstrapping EC2 images as Puppet clients</a>.</p>
<h2 id="intro">Intro</h2>
<p>You will need two disk images, your code repository and your DB backup, and you can rebuild your whole environment from scratch in a relatively short period of time. This could be adapted to generic cloud provisioning, however the use case I’m trying to address is disaster recovery. We are using DNS so that we can keep hostnames consistent between environments, ie. mail01 will be a mail server in all environments instead of domU-1-2-3-4 in one, rack-2345 in another, etc.</p>
<h2 id="set-up-a-master-node-image">Set up a master node image</h2>
<p>The master node is the node that controls all the other nodes. Most importantly, it contains all your configuration management data. You will need to install the following:</p>
<ul>
<li>
<p>mcollective with ActiveMQ</p>
</li>
<li>
<p>DnsMasq</p>
</li>
<li>
<p>Puppet from <a href="http://www.puppetlabs.com/">Puppet Labs</a></p>
</li>
</ul>
<p>1. You will need to get a DNS name from a dynamic DNS provider such as DynDNS. Once you have that you will need to write a shell script that runs at boot and sets your EC2 private IP to that DNS name. Let’s say we want our controller station to be known as controller.ec2.domain.com we can do something like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>IP=`facter ipaddress`
change_my_dns_ip controller.ec2.domain.com
# Delete any entries from hosts
sed -i "/controller.ec2.domain.com/d" /etc/hosts
echo "${IP} controller.ec2.domain.com" >> /etc/hosts
</code></pre></div></div>
<p>2. Set up ActiveMQ to be used with mcollective <a href="http://code.google.com/p/mcollective/wiki/GettingStarted">http://code.google.com/p/mcollective/wiki/GettingStarted</a></p>
<p>3. Set up mcollective.</p>
<p>Configure controller.ec2.domain.com as the stomp host in your mcollective configuration for both client and server configuration.</p>
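<p>For reference, the relevant stomp settings in mcollective’s server.cfg and client.cfg looked roughly like this at the time (a sketch based on the stomp connector; user, password and port depend on how you set up ActiveMQ):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>connector = stomp
plugin.stomp.host = controller.ec2.domain.com
plugin.stomp.port = 6163
plugin.stomp.user = mcollective
plugin.stomp.password = secret
</code></pre></div></div>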
<p>4. Install dnsmasq. You don’t need to configure anything since by default dnsmasq will read /etc/hosts and serve those names over DNS.</p>
<p>5. Install puppetmaster, configure it any way you want.</p>
<p>6. Image it.</p>
<h2 id="set-up-a-genericworker-node-image">Set up a generic/worker node image</h2>
<p>You will need to install the following:</p>
<ul>
<li>
<p>Mcollective</p>
</li>
<li>
<p>puppet agent</p>
</li>
</ul>
<ol>
<li>
<p>On the worker node you need to configure the server piece of mcollective and make sure the stomp.host is pointed to the master ie. controller.ec2.domain.com.</p>
</li>
<li>
<p>Create a reboot agent (we’ll discuss later how to use it). Please visit <a href="http://code.google.com/p/mcollective/wiki/SimpleRPCIntroduction">http://code.google.com/p/mcollective/wiki/SimpleRPCIntroduction</a> for an example. Create a new file ie. reboot.rb. Paste this code in it</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>module MCollective
  module Agent
    class Reboot<RPC::Agent
      def reboot_action
        `/sbin/shutdown -r now`
      end
    end
  end
end
</code></pre></div></div>
</li>
</ol>
<p>Copy the resulting file to the mcollective agents directory</p>
<p>3. Add the following script to the bootup</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MASTER=`host controller.ec2.domain.com | grep address | cut -f4 -d" "`
IS_ALREADY_SET=`grep -c ec2.domain.com /etc/resolv.conf`
if [ $IS_ALREADY_SET -lt 1 ]; then
  sed -i "s/^search .*/search ec2.domain.com/g" /etc/resolv.conf
  sed -i "s/^nameserver/nameserver ${MASTER}\nnameserver/g" /etc/resolv.conf
fi
# Set hostname from the reverse DNS entry for our IP
IP=`facter ipaddress`
MY_HOST=`/bin/ipcalc --silent --hostname ${IP} | cut -f2 -d=`
hostname ${MY_HOST}
</code></pre></div></div>
<p>What that does is tell your worker nodes to use the controller’s DNS for resolving names, as well as set their hostname.</p>
<p>4. Get the mcollective puppet plugin from <a href="http://github.com/ripienaar/mcollective-plugins/tree/master/agent/puppetd/">github</a>.</p>
<p>5. Image it.</p>
<h2 id="bringing-up-the-environment">Bringing up the environment</h2>
<p>You will need to start the master instance first since that’s the instance that everyone will be talking to. As soon as it’s up you can start up as many instances as you’d like.</p>
<p>While you wait rsync your puppet manifests and configurations to the master node</p>
<p>To find out what nodes are up and available issue mc-ping from the master and you should get a response similar to this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mc-ping
controller.ec2.domain.com time=77.21 ms
domu-12-31-55-11-22-18.compute-1.internal time=188.76 ms
</code></pre></div></div>
<p>Trouble is that hostnames on the worker nodes are set to Amazon names. We want to make them recognizable e.g. mail01.</p>
<p>To do so simply add the IP of the worker instance and its name into /etc/hosts on the master e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo "10.1.2.3 mail01.ec2.domain.com" >> /etc/hosts
</code></pre></div></div>
<p>Reload dnsmasq configuration ie.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/etc/init.d/dnsmasq reload
</code></pre></div></div>
<p>What this has bought you is reverse DNS resolution of the node. For it to take effect you will need to reboot the worker node. We already have the reboot agent on the worker nodes so all we have to do is run the following command on the master node</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./mc-rpc -F hostname=domu-12-31-55-11-22-18 reboot reboot
</code></pre></div></div>
<p>This will seek out the domu-12-31-55-11-22-18 host and reboot it (--arg is irrelevant so put anything). Once the machine is up it will advertise its new name :-) ie. running mc-ping will show you this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mc-ping
controller.ec2.domain.com time=47.59 ms
mail01.ec2.domain.com time=80.71 ms
</code></pre></div></div>
<p>Now let’s activate puppet. From master node run</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mc-puppetd -F hostname=mail01 runonce
* [ ============================================================> ] 1 / 1
Finished processing 1 / 1 hosts in 1051.23 ms
</code></pre></div></div>
<p>Once that is done puppetca should give you this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># puppetca --list
mail01.ec2.domain.com
</code></pre></div></div>
<p>Sign it</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># puppetca –sign mail01.ec2.domain.com
</code></pre></div></div>
<p>Now you can simply run</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># mc-puppetd -F hostname=mail01 enable
</code></pre></div></div>
<p>and off you go. Now lather, rinse, repeat to get the rest of the instances going. You would certainly want to automate this further but I leave that exercise to you :-).</p>
<p>If you are looking for an easy cross-cloud API check out my “<a href="http://blog.vuksan.com/2010/07/20/provision-to-cloud-in-5-minutes-using-fog/">Provision to cloud in 5 minutes using fog</a>”.</p>
Next Boston DevOps meetup2010-07-21T23:35:16+00:00http://blog.vuksan.com/2010/07/21/next-boston-devops-meetup<p>At the next Boston DevOps meetup we’ll try something new: Jeff Buchbinder of <a href="http://freemedsoftware.org/">FreeMed Software</a> fame and I will talk about “Deploying your way into happiness”. If you want a flavor of the kinds of things we’ll talk about you can check out my <a href="http://blog.vuksan.com/2010/04/09/devops-homebrew/">Devops homebrew</a> post. We will go into much more detail with actual code snippets and some of the omitted nitty gritty details. We will also open the floor for questions.</p>
<p>Date for the meetup is August 3rd, 2010 from 6-8 pm and we’ll be meeting at Microsoft’s New England R&D center. I expect we’ll start presenting around 6:45 or so.</p>
<p>Please register at</p>
<p><a href="http://www.eventbrite.com/event/770217742">http://www.eventbrite.com/event/770217742</a></p>
<p>since we need to provide building security at NERD with the list of people attending.</p>
Provision to cloud in 5 minutes using fog2010-07-20T12:30:25+00:00http://blog.vuksan.com/2010/07/20/provision-to-cloud-in-5-minutes-using-fog<p>Most recently I have been working on a disaster recovery project where we are assembling documentation, processes and code to be able to fire up our whole environment in the cloud in case of a major disaster. At Velocity Conference I met <a href="http://twitter.com/geemus">Wesley Beary</a> who is the main developer of <a href="http://github.com/geemus/fog/">fog</a>, a Ruby cloud computing library. What appealed to me about fog is that it has varying support for different clouds so that we are not stuck with one provider due to non-portable code. Now off to a couple of quick examples to get you going.</p>
<p>To install fog you will need to install Ruby Gems. If you have them type</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> sudo gem install fog
</code></pre></div></div>
<p>The install may fail if you don’t have the libxslt and libxml2 dev libraries. On my Ubuntu laptop I resolved it by doing</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> sudo apt-get install libxslt1-dev libxml2-dev
</code></pre></div></div>
<p>On Centos/RHEL 5 I had to do</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> yum install libxslt-devel libxml2-devel
</code></pre></div></div>
<p>Create a file called config.rb which contains your credentials e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/ruby
@aws_access_key_id = "XXXXXXXXXXXXXXXXXX"
@aws_secret_access_key = "AXXZZZZZZZZZZZZZZZZZZ"
@aws_region = "us-east-1"
</code></pre></div></div>
<p>Let’s start with the basics. Let’s get our currently running instances and what images are available</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/ruby
require 'rubygems'
require 'fog'
# Import EC2 credentials e.g. @aws_access_key_id and @aws_secret_access_key
require './config.rb'
# Set up a connection
connection = Fog::AWS::EC2.new(
:aws_access_key_id => @aws_access_key_id,
:aws_secret_access_key => @aws_secret_access_key )
# Get a list of all the running servers/instances
instance_list = connection.servers.all
num_instances = instance_list.length
puts "We have " + num_instances.to_s() + " servers"
# Print out a table of instances with choice columns
instance_list.table([:id, :flavor_id, :ip_address, :private_ip_address, :image_id ])
###################################################################
# Get a list of our images
###################################################################
my_images_raw = connection.describe_images('Owner' => 'self')
my_images = my_images_raw.body["imagesSet"]
puts "\n###################################################################################"
puts "Following images are available for deployment"
puts "\nImage ID\tArch\t\tImage Location"
# List image ID, architecture and location
for key in 0...my_images.length
print my_images[key]["imageId"], "\t" , my_images[key]["architecture"] , "\t\t" , my_images[key]["imageLocation"], "\n";
end
</code></pre></div></div>
<p>Let’s spin up a m1.large instance</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/ruby
require 'rubygems'
require 'fog'
# Import EC2 credentials e.g. @aws_access_key_id and @aws_secret_access_key
require './config.rb'
# Set up a connection
connection = Fog::AWS::EC2.new(
:aws_access_key_id => @aws_access_key_id,
:aws_secret_access_key => @aws_secret_access_key )
server = connection.servers.create(:image_id => 'ami-1234567',
:flavor_id => 'm1.large')
# wait for it to be ready to do stuff
server.wait_for { print "."; ready? }
puts "Public IP Address: #{server.ip_address}"
puts "Private IP Address: #{server.private_ip_address}"
</code></pre></div></div>
<p>This may take a while so please be patient. You could obviously spin up a number of these instances without waiting for any of them to be available then use connection.servers.all to get a list of running instances.</p>
<p>Now let’s destroy a running instance</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/ruby
require 'rubygems'
require 'fog'
# Import EC2 credentials e.g. @aws_access_key_id and @aws_secret_access_key
require './config.rb'
# Set up a connection
connection = Fog::AWS::EC2.new(
:aws_access_key_id => @aws_access_key_id,
:aws_secret_access_key => @aws_secret_access_key )
instance_id = "i-123456"
server = connection.servers.get(instance_id)
puts "Flavor: #{server.flavor_id}"
puts "Public IP Address: #{server.ip_address}"
puts "Private IP Address: #{server.private_ip_address}"
server.destroy
</code></pre></div></div>
<p>There is tons more out there although this gets me going :-). Now off to playing with R.I. Pienaar’s <a href="http://github.com/ripienaar/ec2-boot-init">ec2-boot-init.</a></p>
<p>Thanks to Wesley Beary for answering questions about fog and Ian Meyer for pointing out <a href="http://github.com/opscode/chef/blob/master/chef/lib/chef/knife/ec2_server_create.rb">Chef Fog code</a>.</p>
<p>For reference, here is the full listing from above in one piece, with an extra section at the end that prints the available instance flavors:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/usr/bin/ruby
require 'rubygems'
require 'fog'
require 'pp'
# Import EC2 credentials e.g. @aws_access_key_id and @aws_secret_access_key
require './config.rb'
# Set up a connection
connection = Fog::AWS::EC2.new(
  :aws_access_key_id => @aws_access_key_id,
  :aws_secret_access_key => @aws_secret_access_key )
# Get a list of all the running servers/instances
instance_list = connection.servers.all
num_instances = instance_list.length
puts "We have " + num_instances.to_s() + " servers"
# Print out a table of instances with choice columns
instance_list.table([:id, :flavor_id, :ip_address, :private_ip_address, :image_id ])
###################################################################
# Get a list of our images
###################################################################
my_images_raw = connection.describe_images('Owner' => 'self')
my_images = my_images_raw.body["imagesSet"]
puts "\n###################################################################################"
puts "Following images are available for deployment"
puts "\nImage ID\tArch\t\tImage Location"
# List image ID, architecture and location
for key in 0...my_images.length
  print my_images[key]["imageId"], "\t" , my_images[key]["architecture"] , "\t\t" , my_images[key]["imageLocation"], "\n";
end
###################################################################
# Get a list of all instance flavors
###################################################################
flavors = connection.flavors()
print "\n\n============\nFlavors\n============\n"
# flavors.table([:bits, :cores, :disk, :ram, :name])
flavors.table
</code></pre></div></div>
Analyzing your backend web page response times2010-07-16T00:59:05+00:00http://blog.vuksan.com/2010/07/16/analyzing-your-web-page-response-times<p>I have blogged in the past about some of the ways you can monitor your web site performance, e.g. how to <a href="http://blog.vuksan.com/2010/01/15/monitoring-your-site-via-90th-percentile-response-time/">monitor your site using 90th percentile response times</a>, <a href="http://blog.vuksan.com/2010/06/05/beauty-of-aggregate-line-graphs/">beauty of aggregate line graphs</a> and <a href="http://blog.vuksan.com/2010/04/20/tracking-web-clients-in-real-time/">tracking web clients in real time</a>.</p>
<p>Most recently we wanted to get better insight into how our site and more specifically backend is performing. We wanted a tool that could provide us with per URL/page metrics such as</p>
<ul>
<li>
<p>total number of requests</p>
</li>
<li>
<p>aggregate compute time</p>
</li>
<li>
<p>average request time</p>
</li>
<li>
<p>90th percentile time (you can find more explanation what it means at <a href="http://blog.vuksan.com/2010/01/15/monitoring-your-site-via-90th-percentile-response-time/">monitor your site using 90th percentile response times</a>) - this eliminates most of the really slow response times that may really affect your averages</p>
</li>
</ul>
<p>The initial plan was to build a basic set of reports to tell us which pages have excessive response times or large total (aggregate) compute times. The next, yet to be implemented, portion was to be able to analyze data in real time so that we’d have another data point to use in troubleshooting in case there is a site slowdown.</p>
<p>Basic requirements for the tool were these</p>
<ul>
<li>
<p>Capable of crunching 100+ million daily entries</p>
</li>
<li>
<p>Real-time analysis</p>
</li>
<li>
<p>Produce multiple metrics with potential to add more down the line</p>
</li>
<li>
<p>Low footprint</p>
</li>
</ul>
<p>An obvious way to do this is to store all data in a heavy duty data store like a relational/SQL database or something MapReduce capable. Trouble is we may be logging in excess of 3,000 hits per second (all dynamic content, as static assets are served from the CDN). Doing that many inserts per second on a SQL-type database will be tricky unless you have powerful hardware. The next obvious problem is that scanning through hundreds of millions or billions of rows will be slow even with MapReduce, unless of course you throw tons of hardware at it. We wanted a low footprint, remember.</p>
<p>Instead we decided to go with a key/value store. The major pluses were that the footprint is relatively low and it performs very fast. The downside was I would not be able to run any sophisticated queries. Since we already have an app that uses memcached to give us a <a href="http://blog.vuksan.com/2010/04/20/tracking-web-clients-in-real-time/">real-time view of per-IP access counts</a> we ended up using it for this purpose as well.</p>
<h3 id="implementation">Implementation</h3>
<p>I have been working for a while now with <a href="http://bitbucket.org/maplebed/ganglia-logtailer/">ganglia-logtailer</a>, which is a Python framework to crunch log data and submit it to <a href="http://ganglia.info/">Ganglia</a>. There are a number of good pieces from it we could reuse and we did. What we ended up with is a two-part tool: a Python based log parsing piece and a PHP based web GUI and computation part. The division of “labor” was roughly this:</p>
<ul>
<li>
<p>The Python part parses the logs and creates entries/keys where the value in each key represents all the response times observed for a particular server and URL in a particular time period, ie. one hour</p>
</li>
<li>
<p>The PHP part takes the list once the time period has ended, calculates the total time, average time and 90th percentile times (see the sketch after this list for how a 90th percentile can be pulled out of raw response times) and stores the computed values in memcache so that later retrieval can be quicker.</p>
</li>
</ul>
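<p>If you just want to eyeball a 90th percentile without any of this tooling, you can get one straight from a log on the command line. This is only a rough sketch and assumes the response time is the last field of each log line:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sort the response times, then print the value sitting at the 90% mark
awk '{ print $NF }' access.log | sort -n | awk '{ a[NR]=$1 } END { print a[int(NR*0.9)] }'
</code></pre></div></div>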
<p>Graphing is achieved using simple CSS graphs while time based series are done using <a href="http://sourceforge.net/projects/openflashchart">OpenFlashChart</a>. I did look at <a href="http://www.danvk.org/dygraphs/">Dygraphs</a> for Javascript/DHTML based graphing, however I couldn’t figure out how to plot hourly values. I could only do daily values.</p>
<p>The tool is operational and so far it has led us to the realization that our mobile web pages are overall much slower than their corresponding web pages. This is due to the way we handle mobile ads: most feature phones don’t support Javascript so we have to download the ad, which introduces a slight delay. We did figure out that we could use Javascript on Webkit browsers similar to what we do for regular browsers, so that should help a bit. We are also chasing some of the other “leads” regarding inconsistent performance for particular pages on some of the servers.</p>
<p>Next steps are to adapt parsing code to work with ganglia-logtailer which would give us real-time reporting. I don’t expect too many problems with that. Also graphing could use some more love. Perhaps I’ll even do standard deviation calculations :-).</p>
<p>Anyways you can download source code from here</p>
<p><a href="http://github.com/vvuksan/pagetime-analyzer">http://github.com/vvuksan/pagetime-analyzer</a></p>
<p>You know what to do :-).</p>
<h2 id="obligatory-screenshots">Obligatory screenshots</h2>
<p>Hourly overview sorted by aggregate time in seconds (you can sort by any column)</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/07/pt_overview.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/07/pt_overview.png" alt="" /></a></p>
<p>This is the average response time (over an hour) for a particular URL on separate server instances</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/07/pt_url_breakdown.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/07/pt_url_breakdown.png" alt="" /></a></p>
<p>Daily view of performance for a particular URL</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/07/pt_graph.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/07/pt_graph.png" alt="" /></a></p>
CouchDB views creation problems2010-07-15T01:49:19+00:00http://blog.vuksan.com/2010/07/15/couchdb-views-creation-problems<p>I have had a frustrating time creating views in CouchDB using curl. Executing the following command I would get</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ curl -s -X PUT -H "text/plain;charset=utf-8" -d cronview.json http://localhost:5984/cronologger/_design/cronview
{"error":"bad_request","reason":"invalid UTF-8 JSON"}
</code></pre></div></div>
<p>I checked and rechecked JSON, used the same JSON using CouchDB’s Futon to no avail. Finally I found the answer here</p>
<p><a href="http://stackoverflow.com/questions/2461798/error-about-invalid-json-with-couchdb-view-but-the-jsons-fine">http://stackoverflow.com/questions/2461798/error-about-invalid-json-with-couchdb-view-but-the-jsons-fine</a></p>
<p>The <a href="http://curl.haxx.se/docs/manpage.html#-d--data"><code class="language-plaintext highlighter-rouge">-d</code> option of curl</a> expects the actual data as the argument!</p>
<p>If you want to provide the data in a file, you need to prefix it with <code class="language-plaintext highlighter-rouge">@</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><code>curl -X PUT -d @keys.json $CDB/_design/id
</code>
</code></pre></div></div>
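<p>Applied to the original example, the working command therefore looks roughly like this (note the @ in front of the file name and the properly formed Content-Type header):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl -s -X PUT -H "Content-Type: application/json" -d @cronview.json http://localhost:5984/cronologger/_design/cronview
</code></pre></div></div>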
Store your cron output for analysis and correlation with cronologger2010-07-06T12:32:41+00:00http://blog.vuksan.com/2010/07/06/store-your-cron-output-with-cronologger<p>For the longest time I have wanted to get rid of the dozen or so cron messages I receive every morning about things like DB backups, DB cleanups/vacuums, reporting etc. There are a number of solutions out there to help you manage the cron spam such as <a href="http://habilis.net/cronic/">cronic</a>, <a href="http://web.taranis.org/shush/">shush</a> and <a href="http://www.uow.edu.au/~sah/cronwrap.html">cronwrap</a>. They help by e-mailing you only if there is a problem, however they don’t store the cron output itself. To get around that issue I have developed cronologger which can be downloaded from</p>
<p><a href="http://github.com/vvuksan/cronologger">http://github.com/vvuksan/cronologger</a></p>
<p>Cronologger is a BASH script that stores all the cron output into a database. I am using <a href="http://couchdb.apache.org/">CouchDB</a> since it is a great document oriented database that allows me to add attachments (blobs) to a document. I assume it would not be hard to use MongoDB, Riak and others.</p>
<p>Some of the benefits of this utility are</p>
<ul>
<li>
<p>Reduce cron spam</p>
</li>
<li>
<p>Provide the ability to correlate adverse effects by overlaying cron events on e.g. Ganglia graphs</p>
</li>
<li>
<p>Provide a better report of all the batch jobs that ran, diff them with past jobs if they should look the same, etc.</p>
</li>
<li>
<p>Provide the ability to easily view what is currently running on the whole infrastructure ie. job_duration < 0</p>
</li>
<li>
<p>Review historical output</p>
</li>
</ul>
<p>I am still working on web GUI for most of these things. I will gladly accept patches and new contributions.</p>
<p>Tip: To view a list of documents in a CouchDB database you can use the _utils view, e.g. http://localhost:5984/_utils/</p>
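<p>If you prefer the command line over Futon, the same listing is available through CouchDB’s _all_docs endpoint, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List documents in the cronologger database; add include_docs=true to see the bodies
curl -s "http://localhost:5984/cronologger/_all_docs?limit=10"
</code></pre></div></div>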
Overlay deploy timeline on Ganglia graphs2010-06-28T15:55:54+00:00http://blog.vuksan.com/2010/06/28/overlay-deploy-timeline-on-your-ganglia-graphs<p>Don’t you sometimes wish you could have a visual indicator of when code has been deployed in production? Something like this :-)</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/06/deploy_timeline.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/06/deploy_timeline.png" alt="Shows deploy time line on a load graph" /></a></p>
<p>This is how you can add deploy timeline to your Ganglia graphs or for that matter to any tool that uses RRDs such as Cacti, Munin, Collectd etc.</p>
<h3 id="background">Background</h3>
<p>RRDtool supports so called <a href="http://oss.oetiker.ch/rrdtool/doc/rrdgraph_graph.en.html">VRULEs</a> which are</p>
<h4 id="vruletimecolorlegenddasheson_soff_son_soff_sdash-offsetoffset"><a href="http://oss.oetiker.ch/rrdtool/doc/rrdgraph_graph.en.html#___top"><strong>VRULE</strong><strong>:</strong><em>time</em><strong>#</strong><em>color</em>[<strong>:</strong><em>legend</em>][<strong>:dashes</strong>[<strong>=</strong><em>on_s</em>[,<em>off_s</em>[,<em>on_s</em>,<em>off_s</em>]…]][<strong>:dash-offset=</strong><em>offset</em>]]</a></h4>
<p>Draw a vertical line at <em>time</em>. Its color is composed from three hexadecimal numbers specifying the rgb color components (00 is off, FF is maximum) red, green and blue followed by an optional alpha. Optionally, a legend box and string is printed in the legend section. <em>time</em> may be a number or a variable from a <strong>VDEF</strong>. It is an error to use _vname_s from <strong>DEF</strong> or <strong>CDEF</strong> here. Dashed lines can be drawn using the <strong>dashes</strong> modifier. See <strong>LINE</strong> for more details.</p>
<p>What we want to do is add a VRULE for each deployment. For example those three lines above have been generated using these VRULEs</p>
<p>VRULE:1277731886#FF00FF:"Deploys" VRULE:1277721886#FF00FF VRULE:1277711886#FF00FF</p>
<h3 id="implementation">Implementation</h3>
<p>Easiest way to add these to Ganglia is to modify graph.php in Ganglia Web. You need to look for following two lines at the end of the file</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$command .= array_key_exists('extras', $rrdtool_graph) ? ' '.$rrdtool_graph['extras'].' ' : '';
$command .= " $rrdtool_graph[series]";
</code></pre></div></div>
<p>Then append your own VRULEs ie.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$command .= " VRULE:" . $time . "#FF00FF:\"Deploys\"";
</code></pre></div></div>
<p>Obviously you have to pull in the $time info from where you keep track of your deploy times. You can also get creative by using different colors for different deploys, change legend labels, add VRULEs to only certain graphs ie. load, CPU etc. This is a quick and dirty way to do it</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$deploy_times = array(1278082860,1279393200);
foreach ( $deploy_times as $key => $time ) {
# Put deploys label only once.
if ( $key == 0 )
$command .= " VRULE:" . $time . "#FF00FF:\"Deploys\"";
else
$command .= " VRULE:" . $time . "#FF00FF";
}
</code></pre></div></div>
<p>Now you just have to make sure you append deploy times to the array (see the sketch below for one way to capture them).</p>
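<p>One simple way to capture those timestamps, assuming your deploy script can append to a file that graph.php then reads (the file name and location are just an example):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># At the end of your deploy script, record the epoch time of the deploy
date +%s >> /var/lib/ganglia/deploy_times.txt
# Sanity check the last few recorded deploys
tail -3 /var/lib/ganglia/deploy_times.txt
</code></pre></div></div>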
<h3 id="alternate-implementations">Alternate implementations</h3>
<p>An alternate implementation is to create an RRD file whenever you do deploys and then overlay that graph on top of an existing graph. Trouble is you have to worry about scaling the graph. I never could get it quite right.</p>
<h3 id="credit">Credit</h3>
<p>Thanks goes to the <a href="http://circonus.com/">Circonus</a> guys :-) since they made me think of vertical lines instead of trying the RRD overlay. Also thanks to <a href="https://twitter.com/toredash">@toredash</a> for pointing me in the right RRDtool direction by suggesting HRULE.</p>
Velocity Conference 2010 takeaways2010-06-27T17:01:03+00:00http://blog.vuksan.com/2010/06/27/velocity-conference-2010-takeaways<p><a href="http://en.oreilly.com/velocity2010/">Velocity 2010</a> was an excellent conference. Following are my takeaways from the conference. There is tons more, but the following are some of the things that made a good impression and are likely not hard to do.</p>
<h3 id="web-performance-optimization">Web performance optimization</h3>
<ul>
<li>
<p>Look at your Javascript. That is one of the major reasons for page slowness since Javascript download and parsing blocks other elements of the page to be loaded. Use <a href="http://code.google.com/closure/compiler/">Google’s Closure Compiler</a> to optimize Javascript by merging code and eliminating duplicate or unneeded functions. Check out <a href="http://www.monkey.org/~annie/ProgressiveEnhancement.html">Anne Sullivan’s Progressive Enhancement slides</a>.</p>
</li>
<li>
<p>Yahoo released Boomerang, which they describe as a piece of javascript that you add to your web pages, where it measures the performance of your website from your end user’s point of view. It has the ability to send this data back to your server for further analysis. More details at <a href="http://github.com/yahoo/boomerang">http://github.com/yahoo/boomerang</a></p>
</li>
<li>
<p>Version your CSS/Javascript and set expire times of 10+ years (see the config sketch after this list). Check out slide 24 from <a href="http://www.slideshare.net/postwait/velocity-2010-scalable-internet-architectures">Theo Schlossnagle’s Scalable Internet Architectures slides</a>.</p>
</li>
<li>
<p>Use Cookies as a “distributed database”. If you are concerned about security or tampering encrypt the cookies.</p>
</li>
<li>
<p>Use JQuery sparingly. It takes 200-300 ms to parse it. This is even worse in the mobile world.</p>
</li>
<li>
<p>Google rewrote their show_ads.js with ASWIFT which causes script loading to be asynchronous and not block other elements from loading. More about “<a href="http://en.oreilly.com/velocity2010/public/schedule/detail/15412">Don’t Let Third Parties slow you down</a>” and <a href="http://www.royans.net/arch/speeding-up-3rd-party-widgets-using-iframes/">http://www.royans.net/arch/speeding-up-3rd-party-widgets-using-iframes/</a></p>
</li>
</ul>
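<p>For the far-future expires advice above, a minimal Apache mod_expires snippet would look something like this (a sketch; adjust the types and lifetimes, and only do it for versioned asset URLs):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><IfModule mod_expires.c>
  ExpiresActive On
  ExpiresByType text/css "access plus 10 years"
  ExpiresByType application/javascript "access plus 10 years"
</IfModule>
</code></pre></div></div>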
<h3 id="mobile-performance-optimization">Mobile performance optimization</h3>
<p>Most of the recommendations have been taken off <a href="http://firt.mobi/">Maximiliano Firtman</a>’s Mobile Web High Performance. You can <a href="http://www.mobilexweb.com/blog/mobile-web-high-performance">view slides here</a>.</p>
<ul>
<li>
<p>Avoid JQuery unless you really need it. Check out slide 90. It takes 1.8 seconds on iPhone and 4 seconds on Android to download and parse JQuery. Use mobile optimized frameworks such as baseJS and XUI</p>
</li>
<li>
<p>Avoid DNS lookups and minimize number of requests since they are slow</p>
</li>
<li>
<p>Embed CSS and Javascript on the home page. After onload download external CSS and JS.</p>
</li>
<li>
<p>Use inline images (slide 56) and <a href="http://pukupi.com/post/1964">pictograms</a></p>
</li>
<li>
<p>Avoid redirects</p>
</li>
<li>
<p>Use native constructs especially for Webkit browsers e.g. -webkit-text-stroke</p>
</li>
<li>
<p>Keynote announced their Mobile Testing tool for desktops that looks promising <a href="http://mite.keynote.com/">http://mite.keynote.com/</a></p>
</li>
</ul>
<h3 id="sslsecurity">SSL/Security</h3>
<ul>
<li>
<p>According to <a href="http://en.oreilly.com/velocity2010/public/schedule/detail/14217">Google SSL</a> overhead these days is pretty minimal. Around 1% on today’s servers.</p>
</li>
<li>
<p>Pet peeve about the presentation is they were advising everyone to use less secure key lengths ie. 1024 bits and RC4 cipher to improve performance. It is true that adding SSL to insecure connections is certainly an improvement but it should be qualified. E-mail probably fine. Financial sites probably bad.</p>
</li>
</ul>
<h3 id="scalability">Scalability</h3>
<ul>
<li><a href="http://en.oreilly.com/velocity2010/public/schedule/detail/13046">Hidden Scalability Gotchas in Memcached and Friends</a> by Neil Gunther (author of Guerilla Capacity Planning) and Shanti Subramanyam discussed their findings around memcached. They used quantitative analysis to analyze different memcache versions. Based on their analysis using Neil’s model memcache 1.4.5 has higher contention than 1.2.8.</li>
</ul>
<h3 id="culture">Culture</h3>
<ul>
<li>Thoroughly enjoyed John Rauser’s <a href="http://en.oreilly.com/velocity2010/public/schedule/detail/11793">Creating Cultural Change</a></li>
</ul>
GangliaView – automatically rotate Ganglia metrics2010-06-16T15:29:18+00:00http://blog.vuksan.com/2010/06/16/gangliaview-automatically-rotate-ganglia-metrics<p>GangliaView is a simple web app that allows you to automatically rotate selected Ganglia metrics. We use it to rotate key metrics with large graphs showing last hour and last day and smaller graphs showing last week and last month. A sample screen looks like this</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/06/gangliaview1.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/06/gangliaview1.png" alt="" /></a></p>
<p>GangliaView is derived from <a href="http://github.com/lozzd/CactiView">CactiView</a> with a number of changes to make it work with Ganglia and removal of frames. You can download it from here</p>
<p><a href="http://github.com/vvuksan/ganglia-misc">http://github.com/vvuksan/ganglia-misc</a></p>
Non-Dell SSDs/drives not supported until Q2 20112010-06-16T15:09:11+00:00http://blog.vuksan.com/2010/06/16/non-dell-ssdsdrives-not-supported-until-q2-2011<p>I am writing up this post so perhaps I can save some poor sysadmin from chasing their own tails. If you ever receive the following error message using PERC H700 or H800 controllers</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Jun 15 14:00:17 db07 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 04(e0x20/s4) is not supported: Controller 0 (PERC H700 Integrated)
Jun 15 14:00:18 db07 Server Administrator: Storage Service EventID: 2334 Controller event log: Inserted: PD 05(e0x20/s5): Controller 0 (PERC H700 Integrated)
Jun 15 14:00:18 db07 Server Administrator: Storage Service EventID: 2335 Controller event log: PD 05(e0x20/s5) is not supported: Controller 0 (PERC H700 Integrated)
</code></pre></div></div>
<p>It is due to the following</p>
<p><a href="http://www.standalone-sysadmin.com/blog/2010/04/dell-reverses-position-on-3rd-party-drives/">http://www.standalone-sysadmin.com/blog/2010/04/dell-reverses-position-on-3rd-party-drives/</a></p>
<p>Please note this will not be fixed until Q2 2011.</p>
Beauty of aggregate line graphs2010-06-05T15:27:41+00:00http://blog.vuksan.com/2010/06/05/beauty-of-aggregate-line-graphs<p>If you saw a graph like this</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/06/90thpercentile-consolidated-graph.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/06/90thpercentile-consolidated-graph.png" alt="90th percentile response time consolidated line graph" /></a></p>
<p>Would it mean anything to you :-)? The first time I was introduced to these I thought they were pointless since you couldn’t really see much. That was until I saw something like this</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/06/netstat-conn.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/06/netstat-conn.png" alt="Netstat consolidated line graph" /></a></p>
<p>This one was post-release. Can you spot something wrong :-)? Obviously the color scheme is somewhat off in the last graph, which we later reworked (visible in the top graph). We then have another set of graphs where you can drill down to per-host aggregations, as we are running multiple Resin instances on the same machine, so you can find the misbehaving instance.</p>
<p>You can make these graphs pretty easily by using Ganglia’s custom report graphs. I will try and post some of the ones we use in the next couple of days.</p>
<p>For those wondering what is 90th percentile response time you can read my <a href="http://blog.vuksan.com/2010/01/15/monitoring-your-site-via-90th-percentile-response-time/">Monitoring your website performance via 90th percentile response time</a>.</p>
Devops homebrew part deux2010-05-27T13:39:00+00:00http://blog.vuksan.com/2010/05/27/devops-homebrew-part-deux<p>This is the second part to the <a href="http://blog.vuksan.com/2010/04/09/devops-homebrew/">devops homebrew</a> post.</p>
<p>I forgot a couple of things in my first post so here are a couple of other observations.</p>
<h3 id="change-is-an-ongoing-process"><strong>Change is an ongoing process</strong></h3>
<p>All the changes I talked about in the first post took a long time. It took more than a year to get issues assessed, discussed, designed, implemented and tested so don’t expect quick progress. It’s like open heart surgery where you don’t have time to stop everything and start from scratch.</p>
<h3 id="no-hardcoded-paths"><strong>No hardcoded paths</strong></h3>
<p>Perhaps this one should be obvious however it is really important to make the app relocatable ie. the app should assume all the files it needs are within its container. This means that every file reference should be relative to the base container directory e.g. all the WARs and configuration files should be placed in /run/base and the startup script would pass that as a variable ie. -DBASEDIR=/run/base. The application should then use BASEDIR instead of /run/base.</p>
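<p>A minimal sketch of what such a startup wrapper might look like; the paths and the Tomcat layout are illustrative, not the actual scripts we used:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
# The only absolute path the wrapper knows about is the base container directory
BASEDIR=${1:-/run/base}

# Everything the app needs lives under BASEDIR; tell the JVM where that is
export CATALINA_BASE="$BASEDIR"
export JAVA_OPTS="$JAVA_OPTS -DBASEDIR=$BASEDIR"

exec "$BASEDIR/tomcat/bin/catalina.sh" run
</code></pre></div></div>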
<h3 id="tools-tools-tools">Tools, tools, tools</h3>
<p>One of the critical operations responsibilities is providing and building tools for use by other groups such as technical support, development, QA etc. This goes beyond using tools such as configuration management and deployment to also building tools that enable other groups to do their jobs more effectively. For instance at one job we used to interface to hundreds of external LDAP/IMAP sources for authentication/authorization purposes. This was fraught with problems since often these services would e.g. misconfigure firewalls (not whitelist the right IP), have expired or self-signed SSL certificates, use wrong LDAP base DNs etc. This would chew up a lot of professional services, dev and ops time since looking at the application logs often gave incomplete answers. Also it could take a couple of iterations to fix the problem, chewing up even more time. We ended up building a simple web page that enabled professional services to quickly validate the service ie. does DNS resolve, can I open a TCP connection to the target port, is the SSL certificate expired etc. This greatly reduced workload and time to resolution. In another job technical support would often need production settings however due to compliance reasons couldn’t have unfettered access to the systems. For them we built a web app that allowed a read-only view of the needed settings. I’m sure you can think of other cases where a little automation can yield huge efficiencies.</p>
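<p>The web page itself is long gone, but the checks behind it boil down to a few one-liners; a sketch with a made-up host and port:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
HOST=ldap.customer.example.com
PORT=636

# Does the hostname resolve?
host "$HOST" || echo "DNS lookup for $HOST failed"

# Can we open a TCP connection (firewall/whitelisting check)?
nc -z -w 5 "$HOST" "$PORT" || echo "cannot connect to $HOST:$PORT"

# When does the SSL certificate expire, and who issued it?
echo | openssl s_client -connect "$HOST:$PORT" 2>/dev/null \
  | openssl x509 -noout -enddate -issuer
</code></pre></div></div>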
<h3 id="use-underpowered-qa-environments">Use underpowered QA environments</h3>
<p>This may be controversial since lots of people are of the opinion that you should try to have as close to an exact replica of production in QA as possible. This is true if you are doing performance tests however if you have an underpowered environment some issues are likely to crop up that otherwise wouldn’t. It is very hard to simulate production load so having underpowered environments gives you valuable data points. For example our primary QA environment ran on a couple of virtualized servers with a modest disk space allocation ie. 10 GB. On more than one occasion we caught serious code deficiencies when the growing query log (turned on in QA) triggered low disk space alerts. If we had bigger disks we may have missed these. This doesn’t preclude having a separate environment just for running performance tests; just use the underpowered environment for everything else.</p>
<h3 id="dev-vs-ops">Dev vs ops</h3>
<p>There is often conflict between dev and ops due to stereotypes and poor communication, but very often misaligned business goals. For instance I have very often seen/experienced conflict with devs when they were under intense pressure to deliver a feature on a tight deadline. This often happens in startups that cater to large businesses, universities or government organizations where a large sales deal is contingent on a particular feature being implemented. It leads to poor implementation, QA, production issues etc. which coupled with poor division of labor causes frustration and resentment. Being woken up numerous times in the middle of the night due to a production issue quickly wears people out. Therefore it is important to strike a balance between ops and dev goals and overall business goals.</p>
<p>One of the possible approaches is to get together and discuss following issues</p>
<ul>
<li>
<p>Ops, dev and QA should jointly assess new product functionality and how it affects each of these groups. Very often product management and sales and marketing will discuss new features only with dev who may not appreciate the difficulty of certain ops decisions.</p>
</li>
<li>
<p>Division of responsibility - discuss whose responsibility it is to fix things when they break. There is a spectrum here, from ops doing first level troubleshooting and then handing off to developers, all the way to developers running and deploying in production with ops providing a supportive role, running services and tools that enable the application.</p>
</li>
<li>
<p>Off hours coverage - this is probably the most contentious one since no one likes being woken up at night however developers should be on the hook for “pager duty”. It doesn’t have to be regular but at least once in a while. That is really the only way for them to walk in ops’ shoes. For some organizations this may be a non-issue since their stuff never breaks in off hours ;-).</p>
</li>
<li>
<p>Ops should involve devs in running production by educating them about monitoring and performance gathering systems so that they can see the effect of their coding first hand. For instance you can implement “monitoring duty” where each week someone different from either the dev or ops team is tasked to review performance metrics looking for things that are out of whack.</p>
</li>
<li>
<p>Discuss how you can make each other’s lives easier. There are always areas where you can complement each other’s skills and create something that helps everyone.</p>
</li>
<li>
<p>Most important don’t forget that a dose of humility goes a long way :-).</p>
</li>
</ul>
Vonage the new Baby Bell2010-05-13T18:11:28+00:00http://blog.vuksan.com/2010/05/13/vonage-the-new-baby-bell<p>It is sometimes amazing to me how new upstarts morph into their own arch enemies. Case in point is Vonage. For years I used to have Vonage service at home as a backup phone service. I was on a 500 minute plan for $14.99+taxes. This was a great plan for me as I didn’t use the phone much. However at some point they decided that was too little money and they hiked up the price to $16.99 (something like that). It may seem like a small difference but I figured I may be better off elsewhere. I ended up switching to <a href="http://galaxyvoice.com/">Galaxy Voice</a> which I am using to this day since they had more flexible calling plans.</p>
<p>We recently expanded our office space and we needed a phone line added to a conference room. Since I had my old Vonage adapter at home I figured I would bring it and we’d use it. I thought it would be as easy as going to Vonage’s web site, supplying the phone adapter ID and my credit card number and I would be set. It wasn’t so. After entering the phone ID I got this message</p>
<p>The MAC address you entered is associated with an existing Vonage account. Please call our Customer Care department at 1-866-293-5676 for immediate assistance.</p>
<p>I called the number and spoke to someone in Customer service. This took about 20 minutes while the person kept re-asking for the same data and concluded that they couldn’t help me and that I would have to talk to tech support. The tech support guy was equally unhelpful. Basically I could not activate a device that was ever used before since the system “knew” about it. Talk about having a piece of useless technological trash. At that point I was sufficiently frustrated to end the call. I tweeted about my experience and a day later I was contacted by Vonage’s Twitter team about having someone at customer service contact me. I thought I’d give it a go. I got a call and this experience was not a whole lot better than the previous ones. The person kept asking me for my personal information including name, billing address, what was the credit card number I used for paying bills and the e-mail address I used. Since this was more than a year ago and I have dozens of e-mail addresses I said I couldn’t remember. At that point I ended the call since I was sufficiently frustrated. I was willing to give these people money yet they were making me jump through all these hoops. I don’t get it.</p>
<p>It occurred to me later that this was very similar to experiences that I had with a local phone company when I would move and I would have to get through all these bureaucratic hoops to make sure all my features stayed the same after I moved.</p>
Installing RedHat 6 Enterprise DomU under Xen2010-05-11T21:48:57+00:00http://blog.vuksan.com/2010/05/11/installing-redhat-6-enterprise-domu-under-xen<p>Recently I downloaded RedHat 6 Enterprise beta (RHEL6). I wanted to install it as a Xen guest (DomU) on top of an existing CentOS 5 Xen host. Unfortunately it did not work out of the box. I ran</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>virt-install --prompt
</code></pre></div></div>
<p>on the Xen host which let me install RHEL6 however when the install rebooted I was greeted with this error message</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fs = fsimage.open(file, get_fs_offset(file))
IOError: [Errno 95] Operation not supported
</code></pre></div></div>
<p>Fortunately Karanbir Singh had a blog post about this at</p>
<p><a href="http://www.karan.org/blog/index.php/2010/04/28/rhel6-xen-domu-on-a-centos-5-dom0#c5274">http://www.karan.org/blog/index.php/2010/04/28/rhel6-xen-domu-on-a-centos-5-dom0</a></p>
<p>Differences I found were that I had to make the root partition an ext2 filesystem as well. Also I found out that I couldn’t review the partition layout if I ran the installation in the text mode. I had to use VNC to be able to set proper partition types.</p>
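<p>For reference, a non-interactive invocation looks roughly like the following. Every value below is a placeholder and the exact flags vary by virt-install version, so treat this as a sketch rather than the command I actually ran; --vnc is what requests a graphical console so you can review and set the partition types instead of being stuck with the text installer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>virt-install --paravirt --name rhel6beta --ram 1024 \
  --file /var/lib/xen/images/rhel6beta.img --file-size 10 \
  --location http://mirror.example.com/rhel6/beta/x86_64/os/ \
  --vnc
</code></pre></div></div>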
Customizing iomega StorCenter ix4-200d with ipkg2010-04-28T14:16:25+00:00http://blog.vuksan.com/2010/04/28/customizing-iomega-storcenter-ix4-200d-with-ipkg<p>I have the iomega StorCenter ix4-200d. It is a nice little NAS with a number of decent features including an rsync server etc. Unfortunately there were a couple of things I wanted fixed since for example rsync was at version 2.6.9 which does not support incremental updates. The machine runs a custom Linux distribution so I figured someone must have figured out how to customize it. I found part of the answer here</p>
<p><a href="http://www.krausam.de/?p=33">www.krausam.de/?p=33</a></p>
<p>To enable SSH you need to log in as administrator to your StorCenter then go to https://&lt;storcenter IP&gt;/support.html. Turn on SSH access. The StorCenter will reboot. Then you will be able to ssh into the box as root, where the password is your admin password with soho prepended ie. if your web gui password is secret then the root password is sohosecret.</p>
<p>The post has a way to bootstrap Debian on the box however I found an easier solution ie. the StorCenter ships with the ipkg utility, which is similar to the apt-get and yum commands. To enable the proper repositories I searched and found them here</p>
<p><a href="http://forum.synology.com/enu/viewtopic.php?f=40&t=5823">http://forum.synology.com/enu/viewtopic.php?f=40&t=5823</a></p>
<p>An easy way to add them is to cut and paste the following</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat <<EOF > /etc/ipkg.conf
src cross http://ipkg.nslu2-linux.org/feeds/optware/cs08q1armel/cross/unstable
src native http://ipkg.nslu2-linux.org/feeds/optware/cs08q1armel/native/unstable
EOF
</code></pre></div></div>
<p>Then type</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ipkg update
</code></pre></div></div>
<p>After that you can check the list of available packages by typing</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ipkg list | less
</code></pre></div></div>
<p>To install packages type</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ipkg install <package_name>
</code></pre></div></div>
<p>Please note that packages are installed in /opt so adjust paths properly ie. screen is installed in</p>
<p>/opt/bin/screen</p>
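<p>If you don’t feel like typing full paths, one option (assuming your shell reads /etc/profile) is to append the /opt directories to PATH:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Make ipkg-installed tools findable without the /opt/bin prefix
echo 'export PATH=$PATH:/opt/bin:/opt/sbin' >> /etc/profile
</code></pre></div></div>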
<p>Hope this helps someone</p>
Tracking web clients in real time2010-04-21T01:35:42+00:00http://blog.vuksan.com/2010/04/21/tracking-web-clients-in-real-time<p>Most recently I have been working on being able to more quickly identify abusers of our service ie. spammers, crawlers etc. We already have a process that rotates web logs on all web servers hourly then processes them extracting per IP access info. On occasion abusers get quite aggressive and cause some of our alarms to go off by causing excessive number of log errors etc. Trouble is that due to logs being processed on the hour there is a window of time where we may spend extra time trying to track down the cause of log errors. I figured it would help if the IP tracker was real-time. Luckily we have already been using a package called Ganglia Logtailer</p>
<p><a href="http://bitbucket.org/maplebed/ganglia-logtailer/">http://bitbucket.org/maplebed/ganglia-logtailer/</a></p>
<p>which processes our web logs every minute and publishes metrics such as number of HTTP 200/300/400/500 hits, average and 90th percentile response time. All I had to do was send the IP data to a storage engine of my choice. Initially I thought I could use mySQL however decided against it due to following reasons</p>
<ol>
<li>
<p>Currently we can get up to 2500 hits/sec so processing them on the minute would result in roughly 150k inserts which mySQL may have some trouble processing in short amount of time.</p>
</li>
<li>
<p>I don’t need this data after couple hours.</p>
</li>
</ol>
<p>I looked at Redis which has some interesting features around sets however I decided to use memcached since we were already using it and if I ever wanted to use a more persistent storage engine I could replace it with memcachedb or Tokyo Cabinet with no changes to the code.</p>
<p><strong>Implementation</strong></p>
<p>Implementation consists of two pieces</p>
<ol>
<li>
<p>Modified Ganglia Logtailer class that inserts data into memcached. You can find a VarnishMemcacheLogtailer class on the Bitbucket logtailer site which implements this. All you have to do is modify the location of the memcached server (set to localhost). The current implementation aggregates data per hour ie. all the numbers are hourly numbers. It would be trivial to do it for 10 minute or 1 minute periods.</p>
</li>
<li>
<p>Client application that displays data from memcached. I wrote a PHP interface that shows the top 20 IPs from the web servers, which can be downloaded from here</p>
</li>
</ol>
<p><a href="http://bitbucket.org/vvuksan/realtime-iptracker">http://bitbucket.org/vvuksan/realtime-iptracker</a></p>
<p>Tracker looks something like this</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/04/iptracker.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/04/iptracker.png" alt="" /></a></p>
<p><strong>Update:</strong> I do realize Splunk would be great for this kind of a purpose. Trouble is that for the amount of logs we create we’d have to get a really large Splunk license and those are quite expensive.</p>
Devops homebrew2010-04-09T13:58:05+00:00http://blog.vuksan.com/2010/04/09/devops-homebrew<p>There has been quite a bit of discussion about Devops and what it means. <a href="http://twitter.com/blueben/status/11720129187">@blueben</a> has suggested we start a Devops patterns cookbook so people can learn what worked or didn’t work. This is the description of the environment we implemented at a previous job. Some of these things may or may not work for you. I will try to keep it short.</p>
<h3 id="environment-background">Environment background</h3>
<p>7 distinct applications/products that had to be deployed and tested ie. base/core application, messaging platform, reporting app etc. All applications were Java based running on either Tomcat or Jboss.</p>
<h3 id="application-design-for-deployment">Application design for deployment</h3>
<p>These are some of the key points</p>
<ol>
<li>
<p>The application should have sane default configuration options. Any option should be overrideable by an external file. In most cases you only need to override database credentials (host, username, password). The goal is to be able to use the same binary across multiple environments.</p>
</li>
<li>
<p>The application should expose key internal metrics. We for instance asked for a simple key/value pairs web page ie. JMSenqueue=OK etc. (a polling sketch follows this list). This is important because there are lots of things that can break inside the application which external monitoring may miss, like a JMS message that can’t be enqueued, etc.</p>
</li>
<li>
<p>Keep release notes actions to a minimum. Release notes are often not followed or only partially followed, so make sure point 1 is followed and/or try to automate everything else.</p>
</li>
</ol>
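<p>To illustrate point 2, the key/value page is trivial to poll from a monitoring script. The URL and metric name below are made up; this is just a sketch of the idea:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
# Poll the hypothetical key/value internals page and alert if JMS enqueueing is broken
URL="http://app01.internal:8080/internal/status"

status=$(curl -s "$URL" | awk -F= '$1 == "JMSenqueue" { print $2 }')

if [ "$status" != "OK" ]; then
  echo "CRITICAL: JMSenqueue=$status"
  exit 2
fi
echo "OK: JMSenqueue=OK"
exit 0
</code></pre></div></div>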
<h3 id="continuous-integration">Continuous Integration</h3>
<p>We used CruiseControl for Continuous Integration. It was used solely to make sure that someone didn’t break the build.</p>
<h3 id="creating-releases">Creating releases</h3>
<p>Developers are in charge of building and packaging releases. This is primarily because QA or Ops will not know what to do if a build fails (this is Java remember). Each release has to be clearly labeled with the version and tagged in the repository. For example Location 1.1.5 will be packaged as location-1.1.5.tar.gz. Archives should contain only WAR (Tomcat) or EAR (Jboss) files and DB patch files. Releases are to be deposited into an appropriate file share ie. /share/releases/location.</p>
<h3 id="deployment">Deployment</h3>
<p>In order to eliminate most manual deployment steps and support all the different applications we decided to write our own deployment tool. First we started off with a data model which roughly broke down to</p>
<ol>
<li>
<p>Applications – can use different app server containers ie. Tomcat/JBoss, may/will have configuration files that can be either key/value pairs or templates. For every application we also specified a start and stop script (hotdeploy was not an option due to bad experiences with our code).</p>
</li>
<li>
<p>Domains/Customers – we wanted a single Dashboard that would allow us to deploy to multiple environments e.g. QA staging (current release), QA development (next scheduled release), Dev playbox, etc. Each of these domains had their own set of applications they could deploy with their own configuration options</p>
</li>
</ol>
<p>First we wrote a command line tool that was capable of doing something like this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ deployer --version 1.2.5 --server web10 --domain joedev --app base --action deploy
</code></pre></div></div>
<p>What this would do is</p>
<ol>
<li>
<p>Find and unpack the proper app server container e.g. jboss-4.2.3.tar.gz</p>
</li>
<li>
<p>Overlay WAR/EAR files for the name version e.g. base-1.2.5.tar.gz</p>
</li>
<li>
<p>Build configuration files and scripts</p>
</li>
<li>
<p>Stop the server on the remote box (if it’s running)</p>
</li>
<li>
<p>Rsync the contents of the packaged release</p>
</li>
<li>
<p>Make sure Apache AJP proxy is configured to proxy traffic and do Apache reload</p>
</li>
<li>
<p>Start up the server</p>
</li>
</ol>
<p>One of the main reasons we started off with a command line tool is that we could easily write batch scripts to upgrade a whole set of machines, e.g. something like the sketch below. This was borne out of the pain of having to upgrade 200 instances via a web GUI at another job.</p>
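<p>A batch upgrade then becomes a trivial wrapper around the same CLI; the host list file here is hypothetical:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Upgrade every web server listed in a plain text file, one per line
for server in $(cat joedev-webservers.txt); do
  deployer --version 1.2.5 --server "$server" --domain joedev --app base --action deploy
done
</code></pre></div></div>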
<p>Once deployer was working we wrote a web GUI that interfaced with it. You could do things like View running config (what config options are actually on the appserver), Stop, Restart, Deploy (particular version), Reconfig (apply config changes) and Undeploy. We also added the ability to change or add configuration options to the application specific override files. A picture is worth a thousand words. This is a tiny snippet of how it approximately looked for one domain</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/04/deployer-11.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/04/deployer-11.png" alt="" /></a></p>
<p>This was a big win since QA or developers no longer needed to have someone from ops deploy software.</p>
<h3 id="db-patching">DB patching</h3>
<p>Another big win was “automated” DB patching. Every application would have a table called Patch with a list of DB patches that were already applied. We also agreed that every app would have dbpatches directory in the app archive which would contain a list of patches named with version and order in which they should be applied e.g.</p>
<ul>
<li>
<p>2.54.01-addUserColumn.sql</p>
</li>
<li>
<p>2.54.02-dropUidColumn.sql</p>
</li>
</ul>
<p>During deployment the startup script would compare the contents of the Patch table with the list of dbpatches and apply any missing ones, roughly like the sketch below. If a patch script failed an e-mail would be sent to the QA or dev person in charge of the particular domain.</p>
<p>A slightly modified process was used in production to try to reduce down time ie. things like adding a column could be done at any time. Automated process was largely there to make QA’s job easier.</p>
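<p>A rough sketch of what that startup step amounts to; the Patch table column, database name and notification address are all illustrative, not our actual implementation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
DB=appdb
applied=$(mysql -N -e "SELECT name FROM Patch" "$DB")

for patch in dbpatches/*.sql; do
  name=$(basename "$patch")
  # Skip patches already recorded in the Patch table
  if ! echo "$applied" | grep -qx "$name"; then
    echo "applying $name"
    if ! mysql "$DB" < "$patch"; then
      echo "patch $name failed" | mail -s "DB patch failure on $DB" qa-owner@example.com
      exit 1
    fi
    mysql -e "INSERT INTO Patch (name) VALUES ('$name')" "$DB"
  fi
done
</code></pre></div></div>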
<h3 id="qa-and-testing">QA and testing</h3>
<p>When a release was ready QA would deploy the release themselves. If there was a deployment problem they would attempt to troubleshoot it themselves then contact the appropriate person. Most of the time it was an app problem ie. a particular library didn’t get committed etc. This was a huge win since we avoided a lot of “waterfall” problems by allowing QA to self-service themselves.</p>
<h3 id="production">Production</h3>
<p>The production environment was strictly controlled. Only ops and a couple of key engineers had access to it. The reason was that we tried to keep the environment as stable as possible. Thus ad hoc changes were frowned upon. If you needed to make a change you would either have to commit a change into the configuration management system (Puppet) or use the deployment tool.</p>
<h3 id="production-deployment">Production deployment</h3>
<p>The day before the release QA would open up a ticket listing all the applications and versions that needed to be deployed. On the morning of the deployment (that was our low time) someone from ops, development and whole QA team engaged in deploying the app and resolving any observed issues.</p>
<h3 id="monitoring">Monitoring</h3>
<p>Regular metrics such as CPU utilization, load etc. were collected. In addition we kept track of internal metrics and set up adequate alerts. This is an ongoing process since over time you discover what your key metrics are and what their thresholds are ie. number of threads, number of JDBC connections etc.</p>
<h3 id="things-that-didnt-work-so-well-or-were-challenging">Things that didn’t work so well or were challenging</h3>
<ol>
<li>
<p>One of the toughest parts was getting developers’ attention to add “goodies” for ops. Specifically, exposing application internals was often put off until eventually we would have an outage, and the lack of the metric resulted in an extended outage.</p>
</li>
<li>
<p>The deployment tool took a couple of tries to get right. Even as it was there were a couple of things I would have done differently ie. not relying on a relational database for the data model since it made it difficult to create diffs (you had to dump the whole DB). I’d likely go with JSON so that diffs could be easily reviewed and committed.</p>
</li>
<li>
<p>Other issues I can’t recall right now :-)</p>
</li>
</ol>
<h3 id="wrapup">Wrapup</h3>
<p>This is the shortest description I could write. There are a number of things I glossed over and omitted so that this is not too long. I may write about those on another occasion. Perhaps the key take away should be that Ops should focus on developing tools that either automate things or allow its customers (QA, dev, technical support, etc.) to self-service themselves.</p>
<p><strong>Update</strong>: There is a <a href="http://blog.vuksan.com/2010/05/27/devops-homebrew-part-deux/">second part to this posts</a></p>
Devops religion wars2010-04-06T16:04:56+00:00http://blog.vuksan.com/2010/04/06/devops-religion-wars<p>I have been trying to stay out of the devops arguments but it seems that they are slowly devolving into religious wars. It seems that each group ie. devops and non-devops is convinced that they are in possession of “eternal self-evident truths” and that everyone else is an unenlightened hater or similar. Case in point is the following post</p>
<p><a href="http://brian.moonspot.net/devops-dealnews">http://brian.moonspot.net/devops-dealnews</a></p>
<p>Brian describes their devops process which seems reasonable to me. What is most important is that it works for him, his group and his site.</p>
<p>Unfortunately the comments devolve from there. A non-devops person raises a good point about the process however does it with poor style and insulting language. The response is to compare the devops and non-devops approaches with giving a man a fish vs. teaching someone to fish. It goes from there. It’s all just too silly. Firstly I am not aware of a definitive devops definition. Secondly every environment is different. What may work for you may not work everywhere else. I really doubt that continuous deployment would work if your web app was used in providing emergency medical care. That said things have changed and availability expectations have increased so cooperation between development and ops is critical. Therefore let’s try to stop with the silly arguments and try to learn from each other. Most of all avoid insulting language. I realize we all get frustrated at times but it really devalues your view.</p>
Building Redhat/CentOS KVM images on Ubuntu 9.102010-03-11T16:37:04+00:00http://blog.vuksan.com/2010/03/11/building-redhat-centos-kvm-images<p>This is a quick recipe on how to create a Redhat/CentOS KVM image on Ubuntu 9.10 (karmic). First make sure you have Virtualization (VT) turned on. For example Dell laptops will have it disabled by default. Go into BIOS and enable it. To check whether it is turned on run</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>egrep '(vmx|svm)' /proc/cpuinfo
</code></pre></div></div>
<p>If this comes out empty VT is not enabled and KVM will not work.</p>
<p>Install kvm packages</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get install qemu-kvm
</code></pre></div></div>
<p>Edit /etc/qemu-ifup to add <strong>virbr0</strong> as the bridge to which KVM guest should attach itself. Comment out line below and add lines below e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#/usr/sbin/brctl addif ${switch} $1
/usr/sbin/brctl addif virbr0 $1
</code></pre></div></div>
<p>Same change needs to be done in /etc/qemu-ifdown ie.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#/usr/sbin/brctl delif ${switch} $1
/usr/sbin/brctl delif virbr0 $1
</code></pre></div></div>
<p>Download CentOS 5.4 Boot ISO image e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://www.gtlib.gatech.edu/pub/centos/5.4/isos/x86_64/CentOS-5.4-x86_64-netinstall.iso
</code></pre></div></div>
<p>Create an empty image (last argument is the image size)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kvm-img create -f qcow2 centos5.img 10G
</code></pre></div></div>
<p>Launch install (-m is memory size)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo kvm -hda centos5.img -cdrom CentOS-5.4-x86_64-netinstall.iso -m 512 -boot d \
-net nic,vlan=0,model=e1000,macaddr=00:16:3e:de:00:01 -net tap
</code></pre></div></div>
<p>Install CentOS however you like. When you are done your CentOS install will reboot and try to boot off the CD-ROM. At this point shut down the KVM guest by closing the window. To run it remove the cdrom references and boot option e.g.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo kvm -hda centos5.img -m 512 \
-net nic,vlan=0,model=e1000,macaddr=00:16:3e:de:00:01 -net tap
</code></pre></div></div>
<p>Note: I am setting a fixed MAC address. You can leave it off and it will be generated randomly every time you start up kvm instance.</p>
Password complexity madness2010-01-22T13:38:19+00:00http://blog.vuksan.com/2010/01/22/password-complexity-madness<p>You know the pitch. Each time you create an account for a “secure” site you are forced to come up with a complex password ie. you need to have a number, a capitalized letter, perhaps a special character such as + or -. Trouble is policies differ so on one site password has to be a minimum length, maximum length, some don’t allow special characters etc. The thing is at one point in time this made sense and was required to keep basic security but it may not make sense today.</p>
<p>Ages ago computer systems (in particular UNIX systems) used to store passwords in a hashed format. You can read more on <a href="http://en.wikipedia.org/wiki/Cryptographic_hash_function">cryptographic hashes on Wikipedia</a>. The trouble is that these hashes were available for any user to see ie. you could copy a password file (/etc/passwd) or use YP/NIS tools to get a list of all password hashes in an organization. Once you have the password file you do not know what the passwords are, however you can take a word dictionary and start computing hashes, since a particular password will always convert to the same hash, and check whether there are any matches in your password file. If you find a match you now have “discovered” the user’s password. This is often referred to as off-line password cracking since it allows you to derive passwords without interacting with the target system. This has many advantages since you can try millions of passwords quickly and the target system’s administrator will not be alerted. Based on this fact password policies were instituted that mandated password complexity since passwords with complexity ie. 9pc_miu would be nearly impossible or very hard to break (it may take years to break one). This made sense then.</p>
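<p>To make the off-line part concrete, here is the whole “attack” in a few lines of shell. Everything is illustrative and the “stolen” hash is generated on the spot rather than actually lifted from a password file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Pretend this MD5-crypt hash came out of an old world-readable /etc/passwd
stolen=$(openssl passwd -1 -salt xyz letmein)

# The attacker never talks to the target system; they just hash dictionary words
# with the same salt and compare
for word in 123456 password qwerty letmein; do
  [ "$(openssl passwd -1 -salt xyz "$word")" = "$stolen" ] && echo "cracked: $word"
done
</code></pre></div></div>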
<p>However it doesn’t make much sense now since on most systems regular users have no access to the password hashes. On UNIX systems “shadow” (/etc/shadow) is used to hide them or you may be using LDAP which has the capability of hiding password hashes, etc. The only users that have access to those hashes are administrators, however they have other ways of acquiring your passwords. Thus your real exposures, in order of importance, are</p>
<ul>
<li>
<p>Trivial passwords or easily guessable password ie. 123456, 1234, date of birth</p>
</li>
<li>
<p>Using same password across different sites ie. this is a problem if e.g. site A.com gets hacked and hackers are able to determine your password and log into site B.com</p>
</li>
</ul>
<p>I actually feel that password complexity breeds poor security since people will write down complex passwords instead of remembering them. Just think of how many times you have seen passwords on post-it notes on someone’s monitor. Perhaps it is time to scrap password complexity and use something simpler.</p>
Cool DNS tricks you can't use for fail-overs2010-01-20T14:34:08+00:00http://blog.vuksan.com/2010/01/20/cool-dns-hacks-you-cant-use-for-fail-over<p>At a previous job for availability and business continuity reasons we set up a geographically redundant data center because even the best data centers will have outages. No matter what a vendor tells you processes are never followed fully. You can also have a major disaster with critical pieces of your hardware that may cripple or disable your whole infrastructure ie. switch goes crazy etc.</p>
<p>Service we provided was critical so highest availability was imperative. Management wanted an active-active set up ie. use both data centers in a load-balanced fashion however that would have entailed extensive application rewrite due to the nature of our application and the level of database transactions involved. Thus we settled on a hot-cold configuration where we would have an active site that was serving customers and a cold site that was kept up to date via replication. In case of trouble (as determined by ops) we would fail-over our hot site to the cold site. This is fairly straight forward except for the part where you are actually failing things over ie. your hot site is down, you break off replication, change DNS entries, start up all the necessary services however due to <a href="http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-fundamental-design-concept/#jsid-1254638526-674">DNS caching</a> some of your customers are still pointing to your “dead” site. Depending on your browser this could be 30 minutes+. Did I mention this service was critical ?</p>
<p>We went through the list of possible options on how to resolve this</p>
<ol>
<li>
<p>Use an outside party load balancer(s) ie. an off-site load balancer(s) that would proxy traffic to the site that was live. This seemed like a plausible idea however we didn’t like the fact we were introducing yet another failure point and adding latency due to extra round-trip.</p>
</li>
<li>
<p>Change the DNS TTL to 2 minutes; however that was also insufficient due to different browsers’ behavior. For example IE 6 (perhaps even higher) will cache DNS entries for 30 minutes</p>
</li>
</ol>
<p><a href="http://support.microsoft.com/kb/263558">http://support.microsoft.com/kb/263558</a></p>
<ol>
<li>Use <a href="http://en.wikipedia.org/wiki/Round_robin_DNS">round-robin DNS</a> aka. multiple DNS A records with a “twist”</li>
</ol>
<p>What we did there is put both of our data center’s IPs into the A record for our site ie.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>www.domain.com IN A 1.2.3.4
www.domain.com IN A 9.8.7.6
</code></pre></div></div>
<p>What happens with most browsers is that they will attempt the first IP and if they get a connection refused they will try the next (and the next if you have more than 2). This actually works quite well e.g. even if the browser had been talking to 1.2.3.4, if 1.2.3.4 all of a sudden goes down it will in sub-second time fail over to 9.8.7.6. The “twist” we added was that we only answered on the active colo IP and returned connection refused on the inactive one. If we needed to fail over we’d just activate one colo and deactivate the other. Quick failovers here we come :-).</p>
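<p>One way to make the standby colo answer with a clean connection refused (rather than letting clients hang waiting for a timeout) is a firewall REJECT rule with a TCP reset. This is just an illustration of the idea, not necessarily how we implemented it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># On the inactive colo: refuse port 80 immediately instead of silently dropping packets
iptables -I INPUT -p tcp --dport 80 -j REJECT --reject-with tcp-reset
</code></pre></div></div>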
<p>This all worked great for some time until we started receiving isolated reports that people weren’t able to access our site. Investigating the issue further we discovered that all of the people having connectivity issues were behind a transparent HTTP proxy. In this particular case the transparent proxy would not return connection refused but “page not found” or something similar neutralizing our clever hack :-(.</p>
<p>Obviously if your audience is different and you know your users don’t use proxies you could use this approach; however this doomed it for us.</p>
Monitoring your website performance via 90th percentile response time2010-01-15T13:20:32+00:00http://blog.vuksan.com/2010/01/15/monitoring-your-site-via-90th-percentile-response-time<p>There are numerous ways to monitor the health and performance of your web site. Some of the popular ways are</p>
<ul>
<li>
<p>measure response time of a particular URL on your site. If it exceeds a threshold (which is site dependent) it is time to investigate</p>
</li>
<li>
<p>compare pertinent metrics such as the number of created sessions, http connections, etc.</p>
</li>
<li>
<p>watch CPU utilization/load of the machine</p>
</li>
</ul>
<p>Unfortunately most of these are flawed since they don’t provide you with the most important metric and that is how fast the site is for your customers. The above metrics are not useless and do help paint the picture, but they may give you a false sense of how fast your site is, since the URL you are checking may be behaving quite fast while some other part of the site, due to a newly introduced feature, may be behaving terribly. I have found one of the best metrics to watch is the 90th percentile request response time. Basically, you take every request passing through your web servers, log the time it takes to serve them, sort them from fastest to slowest then take the 90th percentile time. Therefore if your 90th percentile is 1 second it means that 90% of the requests have been served in under a second and 10% in more than a second. You may be asking yourself “so what?”. Here is why:</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/01/response_90th_percentile.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/01/response_90th_percentile.png" alt="" /></a></p>
<p>So for at least a couple of minutes 10% of your visitors/requests were waiting for more than 17 seconds to have their requests served. That can’t be good for business and you may want to investigate the cause.</p>
<p>You could also consolidate response times from different web servers on one graph and you get this.</p>
<p><a href="http://blog.vuksan.com/wp-content/uploads/2010/01/response_90th_percentile1.png"><img src="http://blog.vuksan.com/wp-content/uploads/2010/01/response_90th_percentile1.png" alt="" /></a></p>
<p>It may not look like much but it is pretty clear if an individual web server starts acting up.</p>
<p>How do you get in on the fun? You can look at the steps for adding Apache real-time metrics, which also cover the 90th percentile response time, at this URL</p>
<p><a href="http://vuksan.com/linux/ganglia/#Apache_Traffic_Stats">http://vuksan.com/linux/ganglia/#Apache_Traffic_Stats</a></p>
<p>I want to thank Ben Hartshorne (@<a href="http://twitter.com/maplebed">maplebed</a>) for making me aware of this metric.</p>
Quick way to determine SSL certificate expiration2009-12-01T19:09:58+00:00http://blog.vuksan.com/2009/12/01/quick-way-to-determine-ssl-certificate-expiration<p>If you need a quick way to determine when a certain SSL certificate expires you can utilize the following approaches. In both examples the server I am trying to check is called webserver.domain.com.</p>
<p>If you have Nagios plugins installed you could type</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># /usr/lib/nagios/plugins/check_http -p 443 -S -C 15 webserver.domain.com
CRITICAL - Certificate expired on 11/01/2009 11:23.
</code></pre></div></div>
<p>That’s easy. However, what if you don’t have Nagios plugins? In that case you can do the same with OpenSSL and s_client. Look for the notAfter field.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># echo | openssl s_client -connect webserver.domain.com:443 | openssl x509 -noout -dates
...
notBefore=Nov 1 11:23:30 2008 GMT
notAfter=Nov 1 11:23:30 2009 GMT
</code></pre></div></div>
<p>Easy :-).</p>
Don't let mySQL substitute engines2009-11-05T18:16:46+00:00http://blog.vuksan.com/2009/11/05/dont-let-mysql-substitute-engines<p>Word of warning to all who use mySQL (yes you poor souls). By default mySQL 5.0 and 5.1 will substitute storage engines if the one you requested is not available. It doesn’t happen too often but when it does happen it is quite bad. For instance when setting up a new mySQL database something went wrong during creation of InnoDB logs and thus mySQL decided to DISABLE InnoDB storage. Unfortunately this was not caught and DBs were built that really needed InnoDB storage engine since they required foreign keys and other fun stuff. In their “awesomeness” mySQL developers decided that the default behavior should be to simply substitute (replace) InnoDB with myISAM. There is a warning however no error message is displayed and an import will continue unabated. Thus in my case things worked for a while until oddities were discovered which were traced back to the engine substitution. Unfortunately at that point it is fairly difficult to fix the problems since some of the constraints may be broken.</p>
<p>To avoid such a situation make sure you add following statement to my.cnf</p>
<p>sql_mode="NO_ENGINE_SUBSTITUTION"</p>
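<p>The statement goes in the [mysqld] section of my.cnf, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[mysqld]
sql_mode = "NO_ENGINE_SUBSTITUTION"
</code></pre></div></div>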
<p>To verify what engines are active on mySQL shell prompt type</p>
<p>SHOW ENGINES</p>
Infrastructure redundancy is not cheap2009-10-06T11:43:02+00:00http://blog.vuksan.com/2009/10/06/infrastructure-redundancy-is-not-cheap<p>There was quite a discussion on Twitter about the BitBucket outage which initially appeared to be failure of Amazon EC2/EBS. More about the outage can be found <a href="http://blog.bitbucket.org/2009/10/04/on-our-extended-downtime-amazon-and-whats-coming/">here</a>. Brett Piatt was kind enough to write up his view of the situation</p>
<p><a href="http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-fundamental-design-concept/">http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-fundamental-design-concept/</a></p>
<p>In principle I do agree with his suggestions and his conclusion ie. that availability is a fundamental design concept. I do however disagree that “warm” redundancy is cheap. In my own view and experience redundancy is extremely expensive if you are going to do it right. Redundancy is not just being able to add more hardware, systems and monitoring software and failover policies but a matter of process where you continuously have to make sure that the redundancy works. For instance a successful backup strategy doesn’t consist of simply getting a backup device and then never testing the backups by doing an actual restore. As many organizations have discovered backups do break, media gets corrupted, etc. and you can suffer a devastating blow. So if you want to do redundancy right you have to invest lots and lots of time practicing. For example running fire drills is a useful tool, or doing periodic site failovers ie. run on site A for two weeks, then during low traffic times failover to site B, run for two weeks then back to site A and on and on. That certainly ain’t cheap.</p>
<p>I’d also point out that “warm” redundancy is in lots of instances riskier than “hot” redundancy since you may discover that redundancy doesn’t work when you have to failover whereas in “hot” redundancy issues may crop up much earlier allowing you to stay on top more readily.</p>
<p>That said the discussions over how you are responsible for your own availability remind me of “individual responsibility” (for my international readers this is something that is a hot topic in the United States). Sure you should “own” your redundancy however that may often be impractical or too expensive. Not everyone is blessed with copious resources.</p>
Keeping an eye on binary log growth2009-10-02T02:03:52+00:00http://blog.vuksan.com/2009/10/02/keeping-an-eye-on-binary-log-growth<p>Recently I got a report that some pages on the site were extremely slow. Looking at the web server metrics didn’t show anything new however mySQL DB metrics showed a definite change</p>
<p><img src="http://blog.vuksan.com/wp-content/uploads/2009/10/mysqlcpu.png" alt="MySQL server CPU utilization" /></p>
<p>ie. at the end of Week 38 there is an increase in CPU utilization. Nearly 60% increase. Interestingly enough there was a new software release at the end of Week 38 which pointed to either a bug or a new feature. Luckily I have been collecting mySQL metrics <a href="http://vuksan.com/linux/ganglia/#mySQL_server_stats">using this gmetric script</a>. This led me to these two graphs</p>
<p><img src="http://blog.vuksan.com/wp-content/uploads/2009/10/mysqlupdate.png" alt="mysqlupdate" /></p>
<p><img src="http://blog.vuksan.com/wp-content/uploads/2009/10/mysqlinsert.png" alt="mysqlinsert" /></p>
<p>So nearly double the number of inserts and nearly triple the updates. Using mysqlbinlog I analyzed the update and insert statements and was able to identify the two culprit INSERT and UPDATE statements, then sent them off to developers.</p>
<p>I also observed that had I watched the binary log growth I may have identified this earlier since there were a lot more binary logs for the period since the release. Thus <a href="http://vuksan.com/linux/ganglia/#mySQL_binary_log_growth_rate">mysql average binary log growth rate gmetric</a> was born :-). Now all I need to do is find out what normal growth rate is and if it goes outside of that norm use Nagios to send me a non-urgent alert.</p>
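<p>The gmetric linked above does the real work, but the idea is simple enough to sketch in a few lines of shell; the binlog directory, file naming pattern and sampling interval here are assumptions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#!/bin/sh
BINLOG_DIR=/var/lib/mysql
INTERVAL=300

# Total size of all binary logs in bytes (the "total" line from du -c)
size() { du -cb "$BINLOG_DIR"/mysql-bin.[0-9]* | awk 'END { print $1 }'; }

before=$(size)
sleep "$INTERVAL"
after=$(size)

echo "binary log growth: $(( (after - before) / INTERVAL )) bytes/sec"
</code></pre></div></div>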
Nagios alerts based on Ganglia metrics2009-09-14T18:21:44+00:00http://blog.vuksan.com/2009/09/14/nagios-alerts-based-on-ganglia-metrics<p>Have you ever wanted to alert based on Ganglia metrics? Well you can :-)</p>
<p>You can find the source code for the plugin <a href="http://vuksan.com/linux/ganglia/check_ganglia_metric.phps">here</a>.</p>
<p>Instructions how to set it up are <a href="http://vuksan.com/linux/nagios_scripts.html#check_ganglia_metrics">here</a>.</p>
Software doesn't run itself2009-09-13T23:11:58+00:00http://blog.vuksan.com/2009/09/13/software-doesnt-run-itself<p>Perhaps I should no longer be surprised but I am by the article mentioned in this blog post</p>
<p><a href="http://www.nakedcapitalism.com/2009/09/another-lehman-mess-no-one-can-run-the-software.html">http://www.nakedcapitalism.com/2009/09/another-lehman-mess-no-one-can-run-the-software.html</a></p>
<p>In particular this</p>
<p>Once it went bankrupt, the staff who supported these systems “evaporated”, according to Steven O’Hanlon, president of Numerix, a pricing and valuation company which is working with Lehman Brothers Holding Inc to unwind the derivatives portfolio.</p>
<p>These days computer systems are the blood of your company so allowing critical technical staff to simply “evaporate” is mind boggling. Granted company imploded but still I would think that someone should have figured out going into bankruptcy that they should set aside money to pay for their maintenance.</p>
<p>The ultimate problem, as pointed out in the blog post on Naked Capitalism, is that documentation is usually skimped on since it “doesn’t provide value”. Although I would also add that when people say “code is documented” they don’t usually mean their systems infrastructure is documented. That can sometimes be an even bigger impediment. At a previous job there was a Perl CGI script that most people didn’t know about and even fewer understood. If that script didn’t work our whole load balancing infrastructure would “mysteriously” fail since app servers wouldn’t register themselves with web servers, leading to a full blown outage. It was such an obscure “feature” that you could literally spend weeks chasing other avenues since this was so non-obvious.</p>
<p>Also I would not take comfort in having source code to an application. Lot of customers of startups will write in their contracts that if a startup goes bust they get access to the source code. That may sound nice but it doesn’t mean you will necessarily be able to run it. There are so many “secret” recipes, undocumented workarounds that are often involved in running most complex pieces of software that you should really be cautious.</p>
<p>In closing, if you care that your software keeps running, make sure you keep around at least a couple of folks who have run it.</p>
Simple "web service" for Ganglia metrics2009-09-11T20:58:06+00:00http://blog.vuksan.com/2009/09/11/simple_web_service_for_ganglia_metrics<p>Here is a simple PHP script to allow you to get current Ganglia metrics. You will need Ganglia web installation. Drop this script somewhere. Then invoke it via e.g.</p>
<p>http://mygangliaserver/ganglia/metric.php?server=web1&metric_name=load_one</p>
<p>Where server is the name of the server for which you want metrics and metric_name is the exact name of the metric you are looking for e.g. load_one, disk_free etc. The only thing returned is either an ERROR message or the actual value.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><?php
$GANGLIA_WEB="/var/www/html/ganglia";
include_once "$GANGLIA_WEB/conf.php";
include_once "$GANGLIA_WEB/get_context.php";
# Set up for cluster summary
$context = "cluster";
include_once "$GANGLIA_WEB/functions.php";
include_once "$GANGLIA_WEB/ganglia.php";
include_once "$GANGLIA_WEB/get_ganglia.php";
# Get a list of all hosts
$ganglia_hosts_array = array_keys($metrics);
$found = 0;
# Find a FQDN of a supplied server name.
for ( $i = 0 ; $i < sizeof($ganglia_hosts_array) ; $i++ ) {
if ( strpos( $ganglia_hosts_array[$i], $_GET['server'] ) !== false ) {
$fqdn = $ganglia_hosts_array[$i];
$found = 1;
break;
}
}
if ( $found == 1 ) {
if ( isset($metrics[$fqdn][$_GET['metric_name']]['VAL']) ) {
echo($metrics[$fqdn][$_GET['metric_name']]['VAL']);
} else {
echo("ERROR: Metric value not found");
}
} else {
echo "ERROR: Host not found";
}
?>
</code></pre></div></div>
<p>Nothing fancy. It contains rudimentary error checking so please be gentle :-). Feel free to extend it to satisfy your needs. Also, this is likely not scalable if you have hundreds of hosts and tons of requests.</p>
Broken hostname resolution and PAM don't mix2009-09-09T13:58:53+00:00http://blog.vuksan.com/2009/09/09/broken-hostname-resolution-and-pam-dont-mix<p>I don’t mean PAM the cooking spray but Pluggable Authentication Modules. I was asked to change some DNS settings for a set of hosts ie. move them from one domain to another, e.g. from domain.com to domain.net. At the end of the process the head node all of a sudden started refusing logins with the following error message</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fatal: Access denied for user vvuksan by PAM account configuration
</code></pre></div></div>
<p>It took some hair pulling but after a while I concluded that the head node’s hostname was set to the old name e.g. server5.domain.com which was no longer resolvable. As soon as the hostname was changed ie.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% hostname server5.domain.net
</code></pre></div></div>
<p>Things automagically started working again. Hope this prevents someone from going bald :-).</p>
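<p>One follow-up note: the hostname command only fixes the running system. On RHEL/CentOS boxes of that era you also want to update /etc/sysconfig/network (and /etc/hosts if the name lives there) or the old name will come back after a reboot:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Persist the new hostname across reboots (RHEL/CentOS style)
sed -i 's/^HOSTNAME=.*/HOSTNAME=server5.domain.net/' /etc/sysconfig/network
</code></pre></div></div>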
Howto install a SSL certificate with intermediate certificate on a Cisco load balancer2009-08-27T15:43:08+00:00http://blog.vuksan.com/2009/08/27/howto-install-a-ssl-certificate-with-intermediate-certificate-on-a-cisco-load-balancer<p>This is a common problem across many different platforms. You generate a CSR, get a certificate but forget or don’t realize that besides installing the signed certificate you need to install the CA (Certificate Authority) Intermediate certificate. As a result some of the older browsers may complain about an invalid certificate or Java code will fail with following error message</p>
<p>Exception in thread “main” javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: CA key usage check failed: keyCertSign bit is not set</p>
<p>The solution is to download the intermediate certificate from the CA e.g. Verisign or GoDaddy and include it with the certificate. For instance in Apache you need to include the <strong>SSLCertificateChainFile</strong> directive with the path to the intermediate certificate. On a Cisco load balancer you would need to use the <a href="http://docwiki.cisco.com/wiki/SSL_Termination_on_the_Cisco_Application_Control_Engine_Using_an_Existing_Chained_Certificate_and_Key_in_Routed_Mode_Configuration_Example">following Cisco document</a>. Specifically this directive</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ACE-1/routed(config)# crypto chaingroup intermed-1
ACE-1/routed(config-chaingroup)# cert intermediate.pem
</code></pre></div></div>
<p>The chaingroup needs to be applied to the ssl-proxy service in addition to the already configured certificate and key.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ACE-1/routed(config)# ssl-proxy service proxy-1
ACE-1/routed(config-ssl-proxy)# chaingroup intermed-1
</code></pre></div></div>
<p>If you got your certificate from Verisign you can check whether you installed it properly here</p>
<p><a href="https://knowledge.verisign.com/support/ssl-certificates-support/index?page=content&actp=CROSSLINK&id=ar1130">https://knowledge.verisign.com/support/ssl-certificates-support/index?page=content&actp=CROSSLINK&id=ar1130</a></p>
<p>You can always of course misconfigure things causing lots of time to be wasted. For instance on one occasion a well known managed hosting provider that was in charge of configuring the Cisco load balancer configured the load balancer as follows</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>crypto chaingroup some.domain.com
cert some.domain.com.cert
cert intermediate.cert
ssl-proxy service vip-1.2.3.4-ps
key some.domain-com.key
cert some.domain.com.cert
chaingroup some.domain.com
ssl advanced-options ssl-parameter-map-1
</code></pre></div></div>
<p>This is incorrect as the server certificate SHOULD NOT be included in the intermediate certificate chain. Otherwise the helpful Verisign test applet will complain with the following message.</p>
<p>Two certificates were found with the same common name. The certificate installation checker cannot determine which is the correct certificate for the Web server. Remove the incorrect certificate and then test again.</p>
<p>Most browsers will work correctly however Java code will exhibit errors from the top of the article. Solution for the above problem is this</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>crypto chaingroup verisign-intermediate
cert intermediate.cert
</code></pre></div></div>
<p>We then included that chaingroup in the ssl-proxy service directive. Once that was done the issue went away. Hope this saves someone some debugging time.</p>