AWS Hasn’t Been Telling You The Whole Story About Newer Instance Types!


So I’ve been using AWS every day on the job for years now. Every so often they get a new batch of hardware in and start rolling it out to their regions. Currently we are up to generation ‘7’ instances. It’s fairly common to have instances floating around in AWS that have been around for a few years, or in the case of an EKS cluster, node group definitions that have been static for just as long. There’s nothing ‘wrong’ with still having those m5.xlarge class instances in use; after all, they ‘do the job’. Right?

Let’s take a look at what AWS tells us about these newer instance classes, specifically in the realm of CPU. We will focus on the middle-of-the-road ‘m’ class instances, which “provide a balance of compute, memory and networking resources”:

  • M5 class: “Up to 3.1 GHz Intel Xeon Scalable processor (Skylake 8175M or Cascade Lake 8259CL) with new Intel Advanced Vector Extension (AVX-512) instruction set”
  • M6i class: “Up to 3.5 GHz 3rd Generation Intel Xeon Scalable processors (Ice Lake 8375C)” – also “Up to 15% better compute price performance over M5 instances”
  • M7i class: “are powered by 4th Generation Intel Xeon Scalable processors and deliver 15% better price performance than M6i instances” – also “Up to 3.2 GHz 4th Generation Intel Xeon Scalable processor (Sapphire Rapids 8488C)”

Looking at the above it seems that, yes, going from m5 to m6 and m7 we are getting ‘newer’ stuff. But the wording isn’t very precise! “Up to 3.5 GHz” on the m6i… does that mean we “might” get a slower server? What the heck does “15% better price performance” mean? It sounds like “more CPU bang for your buck” to me, but it’s not at all clear just how much “better” the newer instance classes might be. Maybe 10-15% faster? Since the wording always hovers around ‘price performance’, it could simply mean the instances cost a bit less for the amount of CPU you are getting, i.e. move your stuff over and expect the instance to be cheaper, not dramatically faster. It is clear, though, if you read the whole blurb on the AWS instances page, that you also get better networking and other improvements when you move from generation 5 instances to 6 or 7. So why not upgrade?
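If you want to see what AWS itself reports for these families, the EC2 API exposes some of it. Something like the following works (a quick sketch; the query string is just one way to slice the output, and the clock speed field is the sustained speed AWS advertises, not what you will measure yourself):

aws ec2 describe-instance-types \
  --instance-types m5.large m6i.large m7i.large \
  --query 'InstanceTypes[].[InstanceType, ProcessorInfo.SustainedClockSpeedInGhz]' \
  --output table

That gets you the marketing-level numbers, but it still says nothing about how much faster your code will actually run.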


Well… In the real world you have a job to do. Too many JIRA tickets, too little time. That cluster you upgraded last spring is still running great, and you have developers to assist, weird issues to track down, and a broken pipeline. If you are lucky, your team headcount is still what it was when you were hired. If not, you are doing more work than you used to, and only the most important items get your attention. In between dealing with all that and the random outages of your favorite SaaS providers, someone pings you about a critical service in production that is running slower than desired. The devs are on it, but they want your eyeballs too. You have a look.

Cluster looks fine. No issues, but the service is very CPU hungry. “What are we on? Hmm… m5”
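For the record, checking only takes a second. A couple of ways to do it (assuming an EKS cluster that sets the standard instance-type label, and IMDSv2 enabled on the node):

kubectl get nodes -L node.kubernetes.io/instance-type
# or, from inside the instance itself:
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type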

Sure, I know m5 is ‘older’ and probably a bit slower, but it’s just not something I think about that much. It’s not like the TAM tells us to upgrade to the newer instances on our bi-weekly calls. I’m not going to spend the next 3 months digging into the code base looking for something we can optimize, either. I assume the devs are on that already. So if we want to move our service from m5 to, say, m7, I have to assume there will be ‘some’ improvement; I’m just not sure how much.

It occurs to me that I’ve never really tried to benchmark instances in AWS, at least not for basic perf metrics. Digging around I find this:

Geekbench m5.xlarge benchmarks versus Geekbench m6i.xlarge benchmarks

Wait a sec. The m5.xlarge looks to be around 1K on the single-core test, while the m6i sits around 1.5K? That can’t be right. That implies a 50% performance boost. There’s no way AWS would have servers THAT much better and not be telling us all about it. The benchmark on the site above tests a wide range of things, so maybe what we need here is something more specific to our use case.

I’m not an expert on benchmarking Python, but I decided to use the pyperformance benchmark suite for this test. I created an m5.large in Ohio, ran the tests, then switched to m6i and m7i and re-ran the same tests.
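For anyone who wants to reproduce this, the setup on each instance was roughly the following (a sketch; the venv path is arbitrary and any recent Python 3.x should work, I happened to be on 3.9):

python3 -m venv ~/bench
source ~/bench/bin/activate
pip install pyperformance

With that in place, the run on the m5.large looked like this: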

pyperformance run -o py39_m5large.json
...
...
.....................
coverage: Mean +- std dev: 50.7 ms +- 0.3 ms
[16/72] crypto_pyaes...
# /root/venv/cpython3.9-69fb45d284b9-compat-2d3356be745c/bin/python -u /usr/local/lib/python3.9/site-packages/pyperformance/data-files/benchmarks/bm_crypto_pyaes/run_benchmark.py --output /tmp/tmp9au55rjg --inherit-environ PYPERFORMANCE_RUNID
.....................
crypto_pyaes: Mean +- std dev: 117 ms +- 1 ms
[17/72] dask...
# /root/venv/cpython3.9-69fb45d284b9-compat-2d3356be745c/bin/python -u /usr/local/lib/python3.9/site-packages/pyperformance/data-files/benchmarks/bm_dask/run_benchmark.py --output /tmp/tmpkdwkxgv8 --inherit-environ PYPERFORMANCE_RUNID
.....................
...and so on...

Numerous hours later…

pyperformance compare py39_m5large.json py39_m6large.json
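Depending on which pyperformance version you have installed, the compare subcommand may have moved into pyperf; if so, the rough equivalent is:

python3 -m pyperf compare_to py39_m5large.json py39_m6large.json --table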

Overall the m6i instance was roughly 1.4X faster, which is astonishing. Some tests gained more and some less, but most of them landed in the ~1.4X to ~1.5X range over the m5.large (!!)


Now I had to see how the m5 compared to the m7i….

The m7i comes in anywhere from 1.75X to 2.2X faster than the m5 on the same Python test suite. How much faster will depend on which operations and libraries you are using, of course, but overall this is astonishing.

The final factor in my decision is… cost. Let’s see how they compare in us-east-2:

m5.large: $0.096 USD per hour

m7i.large: $0.1008 USD per hour

5%. It costs FIVE PERCENT more to run your Python application on an instance that, based on this testing, will likely handle most CPU-bound work at roughly twice the speed. Mandatory immediate upgrade.
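To put a number on the ‘price performance’ framing, here is the back-of-the-envelope math (assuming the ~2X speedup from the pyperformance runs above holds for your workload):

# hourly cost ratio:     0.1008 / 0.096 ≈ 1.05  (m7i.large costs ~5% more per hour)
# throughput ratio:      ~2.0                   (CPU-bound work finishes in about half the time)
# cost per unit of work: 1.05 / 2.0 ≈ 0.525
awk 'BEGIN { printf "m7i cost per unit of CPU work: %.1f%% of the m5 baseline\n", 100 * (0.1008 / 0.096) / 2.0 }'

In other words, for a CPU-bound service you are not just getting the work done faster, you are paying roughly half as much per unit of work.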


Now, AWS gives us no assurances of this kind of wild performance gain. There could be reasons for that: perhaps the CPUs slow down when more VMs are running full bore on the same hypervisor, or AWS adjusts hardware and settings differently across regions. Do your own due diligence, language and/or library testing, and cost analysis, but at the end of the day, if the benefits are there, make this type of upgrade a priority. The devs will thank you.
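A quick way to start that due diligence without committing to the full multi-hour suite is to time a single operation your service actually performs on each instance class. The json.dumps example below is just a stand-in; swap in your own hot path:

pip install pyperf
python3 -m pyperf timeit -s "import json; doc = {'k': list(range(1000))}" "json.dumps(doc)"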