Performance case for private clouds
Friday, March 12th, 2010
A couple of weeks ago I read this post on the memcached mailing list. Key quote:
We currently run a cluster of aproximately 40 memcache servers with about 6.5 gb of ram each machine using m1.medium ec2 instances. I was in the process of reducing the number of servers while increasing the memory size for each from 6 to about 30gb. Now i've started noticing that some servers seem to hit certain bandwidth limitations not consistenly though since i have some servers pushing 6mb/sec and some at 4mb having packet los and tcp timeouts.
.....
I've replaced the instances hoping this will give me an instance on a better area or on a less congested switch but i still have the issue on the same server.
This surprised me at first, since my understanding was that as EC2 instances get bigger there are fewer and fewer "neighbors" on the compute node to deal with and, in theory, less chance that they will impact your performance. Yet this person was having more issues with bigger instances than with smaller ones. Thinking about it some more, I realized that the reality may be exactly the opposite, i.e. the bigger the instance, the more likely it is to be used for a "big" workload, e.g. a busy relational database. A few such neighbors on the same hardware could lead to inconsistent node performance. Inconsistent node performance is a bad thing since it makes troubleshooting much harder and also provides a poor end-user experience. Providing slow or substandard performance to a fraction of your visitors may not seem like much, but if you are a retailer it means lost sales.
Another thing to note is that lots of performance problems are subtle. Just the other day we had an issue when upgrading our F5 load balancers from version 9.4.3 to 10.1. The upgrade was mostly uneventful and everything seemed to work; however, after about a day we observed a fall in traffic going to our web caching tier. It looked like this:
We also had another graph from a different source that "corroborated" this behavior. We spent a lot of time trying to identify the problem, since F5 wouldn't believe there was one when the only evidence was a couple of graphs. To cut a long story short, the upgrade "fixed" a behavior where certain objects were served out of F5 memory instead of being passed on to our web caching tier. It was apparently broken in the previous release and we didn't even know it. There are plenty of other cases where things have "broken" and only by observing metrics were we able to determine that there was an actual problem. Inconsistent behavior makes that job extremely difficult, if not impossible, since it becomes much harder to isolate problems.
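As a rough illustration of the kind of metric check that can flag a drop like this, here is a minimal Python sketch. It is not the tooling we used; the function name, the window sizes and the 20% threshold are all assumptions made up for the example, and a real setup would also have to account for normal daily traffic patterns.

```python
from statistics import mean

def traffic_dropped(samples, baseline_len, recent_len, threshold=0.2):
    """Return True if the mean of the most recent samples is more than
    `threshold` (as a fraction) below the mean of the samples before them.

    All parameters are hypothetical knobs chosen for illustration.
    """
    if baseline_len <= 0 or recent_len <= 0:
        raise ValueError("window lengths must be positive")
    if len(samples) < baseline_len + recent_len:
        raise ValueError("not enough samples")

    # Most recent window, and the baseline window immediately preceding it.
    recent = samples[-recent_len:]
    baseline = samples[-(baseline_len + recent_len):-recent_len]

    base = mean(baseline)
    if base == 0:
        return False  # no baseline traffic, so a "drop" is undefined
    return (base - mean(recent)) / base > threshold

# Example: requests/sec to the caching tier, before and after an upgrade.
before_and_after = [100] * 10 + [70] * 5
print(traffic_dropped(before_and_after, baseline_len=10, recent_len=5))
```

A check like this only works when the baseline is stable. If node performance varies a lot from one instance to the next, the baseline itself is noisy and a real regression gets lost in it.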
Getting back to the initial problem, one of the obvious strategies is to keep cycling machines until you get better performance, but as evidenced by the poster that was less than successful. Also, what happens if your performance degrades after you have filled up your 30 GB memcache? What then? You could try to launch another machine, but the new, empty cache may spike the load on your database server. Not a pleasant set of options.
Instead, you could do one of the following:
- Find cloud providers that don't use virtualization but deploy directly onto raw hardware (I have heard that they exist even though, like Bigfoot, they are hard to find). This will eliminate most inconsistent node performance issues. The downside is that it may be more expensive.
- Stick with virtualization but implement a private cloud where you have more insight into the load on the underlying hardware and can control how images are deployed onto host machines. More on this point later.
- Use a hybrid of the first two approaches.
I personally think the hybrid approach may be best, since some workloads are best handled without going through the virtualization layer. As far as virtualization is concerned, it is best to place services strategically based on resource utilization. Network services fit into three broad levels of utilization: high resource utilization services such as relational databases and application servers, medium resource utilization services such as web servers, and low resource utilization services such as DNS, monitoring, memcache, etc. The trouble with public clouds is that you have no insight into, or say over, what types of workloads run alongside yours. If you had deployment control instead, you could pair a relational DB image and a memcache image on the same physical piece of hardware. That would likely work fine. If the performance of one degraded you could take appropriate action, i.e. move the memcache image or look for the root cause of the degradation. Since you have access to the underlying hardware, you can isolate problems, which helps a great deal in getting to the root of the problem. The cons of this approach are increased complexity, cost and additional management overhead.
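To make the placement idea a bit more concrete, here is a minimal Python sketch of pairing heavy and light images on the same host. Everything in it is an assumption for illustration: the tier weights, the per-host capacity of 4 units, the `Host` class and the `place` function are all hypothetical, and a real scheduler would work from measured CPU, memory, disk and network usage rather than three coarse labels.

```python
from dataclasses import dataclass, field

# Hypothetical weights for the three broad utilization levels.
TIER_WEIGHT = {"high": 3, "medium": 2, "low": 1}

@dataclass
class Host:
    name: str
    capacity: int = 4  # assumed budget, in weight units, per physical host
    images: list = field(default_factory=list)  # (image_name, tier) tuples

    def load(self):
        return sum(TIER_WEIGHT[tier] for _, tier in self.images)

def place(images, hosts):
    """Greedy placement: heaviest images first, each onto the
    least-loaded host that still has room for it."""
    for image in sorted(images, key=lambda i: TIER_WEIGHT[i[1]], reverse=True):
        weight = TIER_WEIGHT[image[1]]
        candidates = [h for h in hosts if h.load() + weight <= h.capacity]
        if not candidates:
            raise RuntimeError(f"no host has room for {image[0]}")
        min(candidates, key=Host.load).images.append(image)
    return hosts

images = [("db1", "high"), ("db2", "high"),
          ("memcache1", "low"), ("memcache2", "low")]
for host in place(images, [Host("host-a"), Host("host-b")]):
    print(host.name, [name for name, _ in host.images])
```

With these made-up weights, each host ends up with one database image and one memcache image, which is the pairing described above. On a public cloud you cannot express this kind of constraint at all, because you do not know what else is on the box.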
Even if you choose to adopt the above approach, you could still use public clouds for things like static content storage, image resizing, development and QA systems, etc. For really critical operations I would stick with raw hardware and/or private clouds.
