I wrote this article back in August of 2011 but for whatever reason did not publish it.
So I took a job at a mobile engagement management company (now some seven months ago) as the head of operations (for those of you drinking the Kool-Aid, DevOps).
We track billions of events, summarize the data, and report it along with value-added suggestions to optimize the advertising campaigns of our paying customers. The challenge, of course, is scaling the organization. We hit a brick wall at AWS. AWS was delivering none of the things that were on the bill of sale. It was slooow. It was no longer cheap. It wasn't reliable, and it certainly wasn't easy given the reliability problems.
One reason scaling was difficult is that the organization was sucked into the cloud. It was just like the USS Enterprise fighting off the planet-killing monster in 'The Doomsday Machine'. Once you get into the Doomsday Machine, it is very, very hard to get out. The cloud can be the same way.
Look, up in the sky! It's a bird, it's a plane, it's a cloud thing. You see billboards for it at bus stops (e.g., 2nd & Howard). The Cloud, advertised to leap tall buildings in a single bound, cut your costs, make you toast in the morning: yes, the cloud. That undefined, highly variable cloud thing. What is it? No one can quite tell you, because it is just a marketing term.
Isn’t the cloud supposed to be good? That’s what all the talking heads say on TV. Consultants certainly pitch ‘the cloud.’ Even Microsoft would have you believe that the cloud can fix their software. Everything runs better in the cloud, doesn’t it?
I had an idea: I would measure it. Measuring is good, right? Making observations is fundamental to the scientific process. It yields hard data from which intelligent decisions can be made.
What did I find when I ran iostat in the cloud? I found goose eggs.
bash-3.2$ iostat -nMx 5 5
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d0
    0.0    0.0    0.0    0.0  0.1  0.1    0.0    0.0   0   1 c7d1
    0.0    0.0    0.0    0.0  0.1  0.2    0.0    0.0   1   1 c7d2
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d2
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d1
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7d2
The cloud didn’t seem very forthcoming in sharing its activities.
I knew I was hammering whatever AWS presented to me as a disk. I certainly didn't like seeing goose eggs in my normal reporting tools. Yuck.
The cloud can't be measured with a lot of common tools. Why? It simply does not map well onto low-level reporting tools. Most system-level tools, such as iostat, vmstat, or anything else that looks at raw disk I/O, are simply broken there. The first question that came to mind was: how am I going to measure what is going on with these systems?
Thankfully, two tools did work: fsstat and zpool iostat (DTrace worked too, of course, but that is a whole other discussion). fsstat at least gave me some insight into what was going on in terms of throughput. To measure ZFS throughput with fsstat, you can use this incantation: fsstat zfs 1. That will show you the absolute numbers recorded during the reporting interval (1 second, in this case). So this was the beginning of enlightenment.
Next I decided to run a test. A simple test. A test designed to stress the storage subsystem and at the same time mimic batch jobs. I used pgbench and the mighty for() loop.
The setup was easy. I created a database with one billion rows (scale factor 10,000) and I set pgbench loose on it. It looked similar to this:
createdb pgbench_1b
pgbench -i -s 10000 pgbench_1b

for i in `seq 1 10000`; do
    pgbench -T 58 -c 30 -j 10 -N pgbench_1b
    sleep 2
done | tee ~/58s-pgbench_1b.$$
So I let that run for hours. Don't worry if you don't speak Postgres; my test is only doing inserts. Initially, AWS returned 110 TPS, but when I checked back just four hours later, it was only doing about 60 TPS. My database operation was crushing the AWS write-back cache and exposing their real throughput.
I ran a similar setup on some eval gear sitting on my desk, graciously provided by DataOn Storage (with a $10k security deposit). I ran the same sequence of commands. The initial result was 1100 TPS for the first minute. After that, something strange happened. My setup didn’t get slower over time like AWS. It got faster. At four hours, it was doing 3500 TPS.
The performance calculation was simple: 3500 / 60 is about 58. The setup on my desk was 58x faster than Amazon? Whaaaat? I was paying big money to Amazon each month, and this box, valued at $13,500, was 58x faster than one of their instances? Wow. Amazon is screwing me. Is Amazon screwing you?
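For the skeptical, the arithmetic checks out in any shell. This sketch just replays the round numbers quoted above (steady-state TPS from each run), not fresh measurements:

```shell
#!/bin/sh
# Speedup of the desk box over the AWS instance, using the four-hour
# steady-state numbers from the text: 3500 TPS vs. 60 TPS.
desk_tps=3500
aws_tps=60
# Integer division is fine here; 3500 / 60 truncates to 58.
echo "speedup: $(( desk_tps / aws_tps ))x"
```

Run it and you get "speedup: 58x", the number used in the rest of this post.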
By now you want to know what is under the hood.
1 Intel S5520UR motherboard
2 Intel L5630 40W CPUs
96GB of RAM
1 Newisys NDS-2241 disk shelf with 24 drives configured as 11 mirrors (RAID 10-ish style) and two hot spares
1 Kingston 128GB Value SSD for L2ARC
3 Crucial 128GB M4s for L2ARC
1 Intel X25-M to boot from
1 DDRdrive X1 for a log device
1 LSI 9205-8e SAS controller to talk to the Newisys shelf
1 LSI 9211-8i SAS controller to handle the SSD drives
Running at full speed the setup comes in at about 420 watts. I have had car stereos that used more power than that and they didn’t make me money, but I digress.
Some of you bit monkeys are probably wondering what I think of the Crucial M4. All I can tell you is, for 8k writes, it does 3000 IOPS +/- 2 very, very consistently as reported by filebench. It is probably twice as fast as the Kingston V series at the same price point (normalized for total storage) and much, much more predictable. It is MLC and has a 3-year warranty with no wear-leveling restrictions. So far, I like it. That's all I have on the M4 at this time.
So, why was my setup getting faster while AWS was getting slower? It is all in the cache, baby. ZFS was plumbing my L2ARC with the disk blocks containing the B-tree structures and other related metadata required to execute the inserts. When the test first began, the Seagate disks were doing 180 reads/second/disk and about 30 or so writes/second/disk. At four hours, the reads were down to about 20 reads/second/disk, but the SSDs were turning in 300-700 reads/second each, while writes to the Seagate drives were up in the 220 writes/second range with some variation. Go L2ARC!
The throughput would likely go higher with a larger L2ARC, as by this time my 192GB read cache was full. I suspect another $200 M4 would boost the writes even more.
So my next question was, “What happens if I go to 2 billion rows?” I mean, who of us collects less data?
To answer this, I created a 2BN-row table with pgbench on AWS and set pgbench loose on it. Once again, AWS returned 108 TPS in the first iteration. Then it went as follows:
| Minute | TPS including connections | TPS excluding connections |
After that, performance hovered somewhere around 60-70 TPS. The results were similar to the 1BN-row test but took significantly less time to achieve.
Which would you rather have: one pimp machine for $13.5k, or 58 m1.large instances for about $40k/month?
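A back-of-the-envelope payback sketch using only the figures above (one-time hardware cost vs. the monthly AWS bill; colocation, power, and admin time deliberately ignored to keep it simple):

```shell
#!/bin/sh
# How many days of the AWS bill would cover the one-time cost of the box?
box_cost=13500       # dollars, one-time (the desk machine above)
aws_monthly=40000    # dollars/month for 58 m1.large instances
# Integer math over a 30-day month: 13500 * 30 / 40000 = 10 days.
echo "payback: about $(( box_cost * 30 / aws_monthly )) days of AWS spend"
```

In other words, the box pays for itself in roughly a third of one AWS billing cycle, before you even count the 58x throughput difference.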
I bought a cage full of pimp machines and bailed out of AWS. Next up: read how I transferred and maintained synchronization of terabytes of data between AWS on the East Coast and a data center in the San Francisco Bay Area.