Benchmarking Graviton2 processors with Apache Spark workloads

Dima Statz
Published in ITNEXT
Dec 28, 2020 · 7 min read

Introduction

Amazon EC2 provides a broad portfolio of compute instances, including many powered by the latest-generation Intel and AMD processors. AWS Graviton2 processors add even more choice. They are custom-built by AWS using 64-bit Arm Neoverse cores to deliver the best price-performance for workloads running on Amazon EC2.

On Oct 16, 2020, Amazon announced that EMR now supports Amazon EC2 M6g instances, providing up to 35% lower cost and up to 15% improved performance for Apache Spark workloads on Graviton2-based instances versus previous-generation instances.

Objectives

Obviously, this is great news for all EMR users, especially for those running really heavy workloads on EMR, where the monthly cost of the underlying EC2 machines is significant. The main goal of this article is to explore the migration process from EMR based on Intel’s M5 instances to EMR based on AWS Graviton2 M6g instances, and to create an M6g vs M5 performance benchmark.

Migration

Migration to EMR based on M6g is straightforward. Amazon EMR supports Amazon EC2 M6g instances on EMR Versions 6.1.0 and 5.31.0 and above in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), EU (Frankfurt), and Asia Pacific (Tokyo).

So if you are running in one of these regions, just create a new EMR cluster, click on Advanced Options, and navigate to the Hardware step.

Cluster Composition

In the “Cluster Nodes and Instances” section, click on the instance type and choose an M6g type.

M6g instance types

Choose M6g for all node types of your cluster and press “Create Cluster”. When the cluster is up, ensure that all running instances are M6g.

All of the above can also be done with a CloudFormation template for EMR.

Cost

I will use the same setup for both EMR clusters: 1 master node (4xlarge), 2 core nodes (4xlarge), 4 on-demand task nodes (8xlarge).

Given this setup, the hourly price of the M6g-based EMR cluster

M6g Cost Table

will be 3 × ($0.616 + $0.154) + 4 × ($1.232 + $0.308) = $2.31 + $6.16 = $8.47 per hour (each term is the EC2 instance price plus the EMR fee).

The same setup of an M5-based EMR cluster

M5 Cost Table

will cost 3 × ($0.768 + $0.192) + 4 × ($1.536 + $0.27) = $2.88 + $7.22 = $10.10 per hour.
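
The arithmetic behind these figures can be double-checked with a few lines of Python. This is only a worked version of the calculation above (prices as quoted for us-west-2; the instance counts match the 3 × 4xlarge + 4 × 8xlarge setup):

```python
# Hourly cost of the benchmark EMR clusters. Each tuple is
# (EC2 instance price, EMR fee) per instance-hour in us-west-2.
def cluster_cost(node_4xl, node_8xl, n_4xl=3, n_8xl=4):
    """Total hourly price: 1 master + 2 core (4xlarge) plus 4 task (8xlarge) nodes."""
    return n_4xl * sum(node_4xl) + n_8xl * sum(node_8xl)

m6g = cluster_cost((0.616, 0.154), (1.232, 0.308))  # -> 8.47
m5 = cluster_cost((0.768, 0.192), (1.536, 0.270))   # -> ~10.10

saving = 1 - m6g / m5  # roughly a 16% cost reduction
print(f"M6g: ${m6g:.2f}/h, M5: ${m5:.2f}/h, saving: {saving:.1%}")
```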

All the above prices are based on the Amazon EMR pricing table for us-west-2. We can see that the M6g-based EMR cluster is cheaper by about 16% than the M5-based one. Saving 16% is great, but it is pretty far from the 35% that was announced. Another open question is the availability of spot instances of both types, and the discount that AWS provides on M6g and M5. At the moment, the saving-over-on-demand on M6g spot instances is up to 54% (us-west-2)

M6g Spot Instance

At the same time, you can get up to a 65% discount on M5 machines

M5 Spot Instance

I will leave the discussion about spot instances out of scope, but be aware that, depending on the AWS region and current availability, the price of M5 spot instances can be lower than the price of M6g spots. In this case, using EMR instance fleets can be a good idea.

Benchmark

Now let’s create a head-to-head performance comparison between EMR(M5) and EMR(M6g) by running the same job on the same input. The input data is 150GB of gzipped CSV files containing 885,743,562 rows. I will minimize the IO time by pre-caching the input DataFrame into Spark memory, and then run heavy transformations on the entire dataset while measuring processing time, CPU, and memory usage.

Source code

For dataset transformations, I will use Apache Spark built-in functions such as sha2, from_unixtime, and regexp_extract, plus the text-to-Parquet writer. All these functions are compute-intensive and therefore well suited for comparing the performance of different processors.

Test 1 (line 10): read, unzip, cache.

EMR(M5) completed this test in 29 minutes, while EMR(M6g) took only 24 minutes, which is about 21% faster. Keep in mind that this test is not a hundred percent reliable as a CPU benchmark, since it involves massive IO (reading 150GB from AWS S3 into the memory of the clusters).
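
The original Spark source is not reproduced here, but the work Test 1 performs per file is essentially “decompress gzip, parse CSV.” A minimal stand-alone sketch of that step in pure Python (stdlib only, with made-up sample data; not the benchmark code itself):

```python
import csv
import gzip
import io

# Build a tiny gzipped CSV in memory, then read it back -- the same
# unzip-and-parse work that Test 1 performs at 150GB scale.
raw = "id,ts,payload\n1,1609113600,abc\n2,1609117200,def\n"
compressed = gzip.compress(raw.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), mode="rt") as fh:
    rows = list(csv.DictReader(fh))

print(rows[0]["payload"])  # -> abc
```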

M6g
M5

Winner: M6g

Test 2 (line 14): calculate a column’s SHA-2 checksum.

Starting from this test, all data is cached in memory and no IO is involved, so all the following tests are good indicators of CPU power. Both clusters completed the checksum calculation in ±4.5 minutes. In fact, M6g was a bit faster, but it was really close. No winner here.
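
Spark’s sha2(col, 256) returns the hex-encoded SHA-256 digest of each value, which is why this test is so CPU-heavy. A per-row sketch of the same computation in plain Python, with hashlib standing in for the Spark built-in:

```python
import hashlib

def sha2_256(value: str) -> str:
    # Same per-row computation as Spark's sha2(col, 256):
    # hex-encoded SHA-256 digest of the UTF-8 bytes.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

digest = sha2_256("some row value")
print(len(digest))  # -> 64 (hex characters)
```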

M6g
M5

Winner: Tie

Test 3 (line 17): convert Unix timestamps to date-time strings.

In this test, M6g performs much better than M5. Converting Unix timestamps takes 2.6 minutes on M6g compared to 3.2 minutes on M5, which is about 19% faster.
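
Spark’s from_unixtime turns an epoch-seconds column into a formatted date-time string (default pattern yyyy-MM-dd HH:mm:ss). A per-row Python equivalent, assuming UTC for reproducibility (Spark itself uses the session time zone):

```python
from datetime import datetime, timezone

def from_unixtime(epoch_seconds: int, fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    # Mirrors Spark's from_unixtime default pattern yyyy-MM-dd HH:mm:ss,
    # interpreting the timestamp as UTC.
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(fmt)

print(from_unixtime(0))      # -> 1970-01-01 00:00:00
print(from_unixtime(86400))  # -> 1970-01-02 00:00:00
```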

M6g
M5

Winner: M6g

Test 4 (line 22): extract a group that matches a regular expression.

M6g takes this test by a knockout: regular-expression extraction is 2.2 times faster on M6g than on M5.
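
Spark’s regexp_extract(col, pattern, idx) returns capture group idx of the first match, or an empty string when nothing matches. A per-row sketch in plain Python using re (the pattern and sample string here are made up for illustration):

```python
import re

def regexp_extract(value: str, pattern: str, idx: int) -> str:
    # Like Spark's regexp_extract: return capture group idx of the
    # first match, or an empty string if the pattern does not match.
    m = re.search(pattern, value)
    return m.group(idx) if m else ""

print(regexp_extract("user_id=42;ts=1609113600", r"user_id=(\d+)", 1))  # -> 42
print(regexp_extract("no ids here", r"user_id=(\d+)", 1))               # -> empty string
```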

Winner: M6g

Test 5 (line 27): create Parquet files.

Just like Test 1, this test is not 100% reliable since it contains both a CPU-intensive task (creating Parquet-formatted data) and an IO-intensive task (writing S3 objects). The M6g-based EMR performs slightly better, finishing this test in 3.3 minutes compared to 3.5 minutes for the M5-based EMR.

Winner: M6g

Resources

35 executors were allocated on each cluster for these tests. Each executor was configured with 4 cores and 14GB of RAM on M5 or 13GB of RAM on M6g. In total: 140 cores, 490GB/455GB of accessible RAM (M5/M6g), and ±50GB of RAM for YARN overhead. During the tests, the average CPU utilization was around 30 percent on both clusters.

M5
M6g

Summary

My conclusion is that M6g does very well in terms of performance compared to M5. During the Apache Spark workload tests, M6g took 4 out of 5 rounds. After removing outliers from the results, M6g provides up to 18% better performance than M5.

In terms of the on-demand price for EMR, M6g provides a cost reduction of ±16%. The cost reduction will be smaller for bigger machines like 12xlarge because, for some reason, the EMR fraction of the cost on M6g machines is much higher than on M5 machines ($0.462 vs $0.27).

M6g
M5

Another issue is the cost of spot instances. The discount over on-demand machines is up to 54% on M6g and up to 65% on M5. So if you are relying heavily on spot instances, the total price can be lower when using M5s.

So, to summarize, the total improvement in the cost-performance ratio was around 30% when running on on-demand instances, and only around 8% when running fully on spot instances.
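
These ratios follow from combining the hourly price gap with the runtime gap: the cost per unit of work is the hourly price multiplied by the relative runtime. A back-of-the-envelope check, assuming M6g finishes the same work about 18% faster and using the prices and spot discounts quoted above (small differences from the 30%/8% figures come from rounding the inputs):

```python
# Cost-performance: price per unit of work = hourly price x relative runtime.
m6g_price, m5_price = 8.47, 10.104  # on-demand $/hour from the cost section
runtime_ratio = 1 - 0.18            # M6g needs ~82% of M5's time

on_demand_gain = 1 - (m6g_price * runtime_ratio) / m5_price
print(f"on-demand cost-performance gain: {on_demand_gain:.0%}")  # -> 31%

# With the maximum spot discounts quoted above (54% on M6g, 65% on M5),
# M6g keeps 46% of its price and M5 keeps 35% of its price.
spot_gain = 1 - (m6g_price * 0.46 * runtime_ratio) / (m5_price * 0.35)
print(f"spot cost-performance gain: {spot_gain:.0%}")
```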

Cost-Performance Ratio

I hope that this benchmark is useful to you. I’d be very interested to hear about the Apache Spark workloads that you are running on Graviton2, especially if you have any interesting findings on the stability, performance, or pricing.
