Network performance at AWS, Google, Rackspace and Softlayer
Published in Cloud.
Originally published on GigaOm.
Compute and storage are essentially commodity services, which means that for cloud providers to compete, they have to show real differentiation. This is often achieved with supporting services like Amazon’s DynamoDB and Route 53, or Google’s BigQuery and Prediction API, which complement the core infrastructure offerings.
Performance is also often singled out as a differentiator. Often one of the things that bites production usage, especially in inherently shared cloud environments, is the so-called “noisy neighbor” problem. This can be other guests stealing CPU time, increased network traffic and, particularly problematic for databases, i/o wait.
In this post I’m going to focus on networking performance. This is very important for any serious application because it affects the ability to communicate and replicate data across instances, zones and regions. Responsive applications and disaster recovery, areas where up-to-date database replication is critical, require good, consistent performance.
It’s been suggested that Google has a massive advantage when it comes to networking, due to all the dark fibre it has purchased. Amazon has some enhanced networking options that take advantage of special instance types with OS customizations, and Rackspace’s new Performance instance types also boast up to 10 Gbps networking. So let’s test this.
I spun up the listed instances to test the networking performance between them. This was done using the iperf tool on Linux. One server acts as the client and the other as the server:
iperf -f m -s
iperf -f m -c hostname
The OS was Ubuntu 12.04 (with all latest updates and kernel), except on Google Compute Engine, where it’s not available. There, I used the Debian Backports image.
The client was run for three tests for each type – within zone, between zones and between regions – with the mean average taken as the value reported.
Amazon networking performance
|t1.micro (1 CPU)||c3.8xlarge (32 CPUs)|
|us-east-1 zone-1a <-> us-east-1 zone-1a||135 Mbits/sec||7013 Mbits/sec|
|us-east-1 zone-1a <-> us-east-1 zone-1d||101 Mbits/sec||3395 Mbits/sec|
|us-east-1 zone-1a <-> us-west-1 zone-1a||19 Mbits/sec||210 Mbits/sec|
Amazon’s larger instances, such as the c3.8xlarge tested here, support the enhanced 10 GB networking, however you must use the Amazon Linux AMI (or manually install the drivers) within a VPC. Because of the additional complexity of setting up a VPC, which isn’t necessary on any other provider, I didn’t test this, although it is now the default for new accounts. Even without that enhancement, the performance is very good, nearing the advertised 10 Gbits/sec.
However, the consistency of the performance wasn’t so good. The speeds changed quite dramatically across the three test runs for all instance types, much more than with any other provider.
You can use internal IPs within the same zone (free of charge) and across zones (incurs inter-zone transfer fees), but across regions, you have to go over the public internet using the public IPs, which incurs further networking charges.
Google Compute Engine networking performance
|f1-micro (shared CPU)||n1-highmem-8 (8 CPUs)|
|us-central-1a <-> us-central-1a||692 Mbits/sec||2976 Mbits/sec|
|us-central-1b <-> us-central-1b||905 Mbits/sec||3042 Mbits/sec|
|us-central-1a <-> us-central-1b||531 Mbits/sec||2678 Mbits/sec|
|us-central-1a <-> europe-west-1a||140 Mbits/sec||154 Mbits/sec|
|us-central-1b <-> europe-west-1a||137 Mbits/sec||189 Mbits/sec|
Google doesn’t currently offer an Ubuntu image, so instead I used its backports-debian-7-wheezy-v20140318 image. For the f1-micro instance, I got very inconsistent iperf results for all zone tests. For example, within the same us-central-1a zone, the first run showed 991 Mbits/sec, but the next two showed 855 Mbits/sec and 232 Mbits/sec. Across regions between the US and Europe, the results were much more consistent, as were all the tests for the higher spec n1-highmem-8 server. This suggests the variability was because of the very low spec, shared CPU f1-micro instance type.
I tested more zones here than on other providers because on April 2, Google announced a new networking infrastructure in us-central-1b and europe-west-1a which would later roll out to other zones. There was about a 1.3x improvement in throughput using this new networking and users should also see lower latency and CPU overhead, which are not tested here.
Although 16 CPU instances are available, they’re only offered in limited preview with no SLA, so I tested on the fastest generally available instance type. Since networking is often CPU bound, there may be better performance available when Google releases its other instance types.
Google allows you to use internal IPs globally — within zone, across zone and across regions (i.e., using internal, private transit instead of across the internet). This makes it much easier to deploy across zones and regions, and indeed Google’s Cloud platform was the easiest and quickest to work with in terms of the control panel, speed of spinning up new instances and being able to log in and run the tests in the fastest time.
Rackspace networking performance
|512 MB Standard (1 CPU)||120 GB Performance 2 (32 CPUs)|
|Dallas (DFW) <-> Dallas (DWF)||595 Mbits/sec||5539 Mbits/s|
|Dallas (DFW) <-> North Virginia (IAD)||30 Mbits/sec||534 Mbits/s|
|Dallas (DFW) <-> London (LON)||13 Mbits/sec||88 Mbits/s|
Rackspace does not offer the same kind of zone/region deployments as Amazon or Google so I wasn’t able to run any between-zone tests. Instead I picked the next closest data center. Rackspace offers an optional enhanced virtualization platform called PVHVM. This offers better i/o and networking performance and is available on all instance types, which is what I used for these tests.
Similar to Amazon, you can use internal IPs within the same location at no extra cost but across regions you need to use the public IPs, which incur data charges.
When trying to launch x2 120 GB Performance 2 servers at Rackspace, I hit our account quota (with no other servers on the account) and had to open a support ticket to request a quota increase, which took them about an hour and a half to approve. For some reason, launching servers in the London region also requires a separate account, and logging in and out of multiple control panels soon became annoying.
Softlayer networking performance
|1 CPU, 1 GB RAM, 100 Mbps||8 CPUs, 2 GB RAM, 1 Gbps|
|Dallas 1 <-> Dallas 1||105 Mbits/sec||911 Mbits/s|
|Dallas 1 <-> Dallas 5||105 Mbits/sec||921 Mbits/s|
|Dallas 1 <-> Amsterdam||29 Mbits/sec||61 Mbits/s|
Softlayer only allows you to deploy into multiple data centers at one location: Dallas. All other regions have a single facility. Softlayer also caps out at 1 Gbps on its public cloud instances, although its bare metal servers do have the option of dual 1 Gbps bonded network cards, allowing up to 2 Gbps. You choose the port speed when ordering or when upgrading an existing server. They also list 10Gbit/s networking as available for some bare metal servers.
Similarly to Google, Softlayer’s maximum instance size is 16 cores, but it also offers private CPU options which give you a dedicated core versus sharing the cores with other users. This allows up to eight private cores, for a higher price.
The biggest advantage Softlayer has over every other provider is completely free, private networking between all regions whereas all other provider charge for transfer out of zone. When you have VLAN spanning enabled, you can use the private network across regions, which gives you an entirely private network for your whole account. This makes it very easy to deploy redundant servers across regions and is something we use extensively for replicating MongoDB at Server Density, moving approx 500 Mbits/sec of internal traffic across the US between Softlayer’s Washington and San Jose data centers. Not having to worry about charges is a luxury only available with Softlayer.
Who is fastest?
|Fastest (low spec)||Fastest (high spec)||Slowest (low spec)||Slowest (high spec)|
Amazon’s high spec c3.8xlarge really gives the best performance across all tests, particularly within the same zone and region. It was able to push close to the advertised 10 GB throughput, but the high variability of results may indicate some inconsistency in the real-world performance.
Yet for very low cost, Google’s low spec f1-micro instance type offers excellent networking performance: ten times faster than the terrible performance from the low spec Rackspace server.
Softlayer and Rackspace were generally bad performers overall, but at least Rackspace gets some good inter-zone and inter-region performance and performed well for its higher instance spec. Softlayer is the loser overall here with low performance plus no network-optimized instance types. Only their bare metal servers have the ability to upgrade to 10 Gbits/sec network interfaces.
Mbits/s per CPU?
CPU allocation is also important. Rackspace and Amazon both offer 32 core instances, and we see good performance on those higher spec VMs as a result. Amazon was fastest for its highest spec machine type with Rackspace coming second. The different providers have different instance types, and so it’s difficult to do a direct comparison on the raw throughput figures.
An alternative ranking method is to calculate how much throughput you get per CPU. We’ll use the high spec inter-zone figures and do a simple division of the throughput by the number of CPUs:
|Provider||Throughput per CPU|
The best might not be the best value
If you have no price concerns, then Amazon is clearly the fastest, but it’s not necessarily the best value for money. Google gets better Mbits/sec per CPU performance, and since you pay for more CPUs, it’s a better value. Google also offers the best performance on its lowest spec instance type, but it is quite variable due to the shared CPU. Rackspace was particularly poor when it came to inter-region transfer, and Softlayer isn’t helped by its lack of any kind of network-optimized instance types.
Throughput isn’t the end of the story though. I didn’t look at latency or CPU overhead and these will have an impact on the real world performance. It’s no good getting great throughput if it requires 100 percent of your CPU time!
Google and Softlayer both have an advantage when it comes to operational simplicity because their VLAN spanning-like features mean you have a single private network across zones and regions. You can utilize their private networking anywhere.
Finally, pricing is important, and an oft-forgotten cost are the network transfer fees. This is free within zones for all providers, but only Softlayer has no fees for data transfer across zones and even across regions. This is a big saver.