
I want to submit an EMR step about which I only know the following:

  • It will have to read several files of size X GB from S3
  • I also know the step will need to perform a join among subsets of data from those files.

Is there any logic or formula for computing in advance the amount of memory/disk, the number of vCores, and all the other EMR resources, so that I can be sure the step won't fail once the cluster is up?

In this case, we want to join a dataset of 73.9 GB (2200e6 rows) with one of 5.8 GB (153e6 rows); persistence is not used. Our current EMR cluster comprises the following resources:

  • 2 c5.4xlarge instances (16 vCores, 32 GiB memory, EBS-only storage, 50 GiB EBS)
  • 1 m3.xlarge as the master (8 vCores, 15 GiB memory, 80 GB SSD instance storage, no EBS)

I expect the cluster to perform the join between the two datasets without hitting the "No space left on device" error, which is caused by a lack of free disk space.
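
For reference, the step boils down to something like the sketch below (PySpark is assumed; the S3 paths, the Parquet format and the join key are placeholders, since only the sizes are known):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emr-join-step").getOrCreate()

    # ~73.9 GB, 2200e6 rows
    big = spark.read.parquet("s3://my-bucket/big-dataset/")
    # ~5.8 GB, 153e6 rows
    small = spark.read.parquet("s3://my-bucket/small-dataset/")

    # Join on a placeholder key; nothing is persisted/cached.
    joined = big.join(small, on="join_key")
    joined.write.parquet("s3://my-bucket/joined-output/")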


1 Answer


That's hard to predict without testing, because processing a 10 GB dataset will likely require more than just 10 GB of usable cluster memory, mainly due to overhead. It also depends on how you're processing the data, but if it's just a join, the estimate is less complex.

In any case, the cluster you described doesn't have enough RAM for the datasets you mentioned, so that's already a warning sign that you'll need to allow Spark to spill over to disk to avoid OOM errors (and accept the performance hit that comes with disk I/O).
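
For illustration, these are the kind of knobs you end up adjusting when sizing executors and leaving room for spills; the values below are placeholders that show the shape of the configuration, not a recommendation for your specific cluster:

    from pyspark.sql import SparkSession

    # Illustrative executor sizing for a memory-constrained cluster;
    # the numbers are placeholders, not tuned values.
    spark = (
        SparkSession.builder
        .appName("join-sizing-test")
        .config("spark.executor.memory", "9g")           # heap per executor
        .config("spark.executor.memoryOverhead", "1g")   # off-heap overhead per executor
        .config("spark.executor.cores", "4")
        .config("spark.memory.fraction", "0.6")          # share of heap for execution + storage
        .config("spark.sql.shuffle.partitions", "400")   # more/smaller partitions -> smaller individual spills
        .getOrCreate()
    )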

An incremental way to approach this problem would be to generate some sample datasets - e.g. 3 datasets containing 10%, 20% and 50% of the whole dataset - and process them individually on a large cluster to measure the resources each iteration uses. By "large cluster", in this case, I mean something with usable RAM of roughly 150% of the full dataset size.
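
A minimal sketch of one such iteration, assuming PySpark and Parquet inputs (both assumptions on my part):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-sample-run").getOrCreate()

    # Run once per sample fraction (e.g. 0.1, 0.2, 0.5) and record the memory,
    # disk and shuffle numbers from the Spark UI for each run.
    fraction = 0.1

    big_sample = spark.read.parquet("s3://my-bucket/big-dataset/") \
        .sample(fraction=fraction, seed=42)
    small_sample = spark.read.parquet("s3://my-bucket/small-dataset/") \
        .sample(fraction=fraction, seed=42)

    # Note: sampling both sides shrinks the join output more than linearly;
    # keep that in mind when extrapolating.
    result = big_sample.join(small_sample, on="join_key")
    result.write.mode("overwrite").parquet(
        "s3://my-bucket/join-test-{}pct/".format(int(fraction * 100)))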

From there, it's easier to try and extrapolate the resources needed for 100% of the data. Still, the relationship between dataset size and cluster resources isn't linear - hence the need to estimate and test - so you should provision some extra resources to account for edge cases or the fact that this is simply an estimate.

If iterating like this doesn't fit your workflow, you could simply provision a very large cluster (e.g. RAM > 2x the dataset size) and see how that specific workload runs.

You should probably also test and measure different approaches to joining those datasets, like using RDDs, DataFrames + Spark SQL, etc.
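
For example, the same join expressed with the DataFrame API and with Spark SQL over temp views (paths, table and column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-approaches").getOrCreate()
    big = spark.read.parquet("s3://my-bucket/big-dataset/")
    small = spark.read.parquet("s3://my-bucket/small-dataset/")

    # DataFrame API
    joined_df = big.join(small, on="join_key")

    # Spark SQL over temporary views
    big.createOrReplaceTempView("big")
    small.createOrReplaceTempView("small")
    joined_sql = spark.sql(
        "SELECT b.*, s.* FROM big b JOIN small s ON b.join_key = s.join_key")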

Edit: as far as I know, there is no way to reduce this to a simple, repeatable and exact formula, because there are simply too many variables that depend purely on your workload and how you code it: what you do with the data after the join (write formats), repartitioning, the Spark APIs you use, shuffles, reducer choices, serialization choices, and so on. Like I wrote above, you need to run your code with increasingly larger datasets and analyze how it behaves.

OOM errors can be avoided both by adding more hardware and by optimizing code; which of the two makes sense depends on the situation.
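
One concrete code-side example: if the smaller (5.8 GB) side fits in each executor's memory, hinting a broadcast join lets Spark avoid shuffling the large side at all. Whether it actually fits depends on your executor sizing, so treat this as an option to test rather than a given (paths and the join key below are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join-test").getOrCreate()
    big = spark.read.parquet("s3://my-bucket/big-dataset/")
    small = spark.read.parquet("s3://my-bucket/small-dataset/")

    # Hint Spark to broadcast the small side so the big side is not shuffled.
    # Only viable if the small dataset fits comfortably in executor memory.
    joined = big.join(broadcast(small), on="join_key")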

As stated on Spark's website:

How much memory you will need will depend on your application. To determine how much your application uses for a certain dataset size, load part of your dataset in a Spark RDD and use the Storage tab of Spark’s monitoring UI (http://<driver-node>:4040) to see its size in memory. Note that memory usage is greatly affected by storage level and serialization format – see the tuning guide for tips on how to reduce it.
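
In practice that measurement can be as simple as the sketch below: persist a slice of the data, force it to materialize with an action, then read its in-memory size off the Storage tab of the UI (paths and the sample fraction are placeholders):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("measure-in-memory-size").getOrCreate()

    # Load a slice of the dataset and cache it in memory only.
    sample = spark.read.parquet("s3://my-bucket/big-dataset/") \
        .sample(fraction=0.05, seed=42)
    sample.persist(StorageLevel.MEMORY_ONLY)

    # An action is needed to actually materialize the cache.
    sample.count()

    # Now check the "Storage" tab of the Spark UI (http://<driver-node>:4040)
    # for the cached size, and extrapolate from the sample fraction.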

  • The fact is that I was wondering if there was a formula that could work as an approximation (always as a lower bound) for computing the amount of resources needed, given X GB of data and X expensive actions (such as joins), before submitting any step. However, your advice is very useful and will save me a lot of time optimizing the cluster. – Lluc Jul 30 '19 at 07:39
  • Well, you stated that you 'expect the cluster to perform a join between the two datasets and not to obtain the error "No space left on device"'; my response describes a way to measure and estimate how much memory is needed for a specific workload, that can certainly be applied to your case. AFAIK, there is no way to reduce this to a simple repeatable formula because there are simply too many variables that depend purely on your workload and your code, like what you're doing with the data after the join (write formats), repartitionings, different APIs, shuffles, serialization choices, etc, etc. – jmng Jul 30 '19 at 07:56
  • As stated on Spark's website: "How much memory you will need will depend on your application. To determine how much your application uses for a certain dataset size, load part of your dataset in a Spark RDD and use the Storage tab of Spark's monitoring UI to see its size in memory." From https://spark.apache.org/docs/latest/hardware-provisioning.html – jmng Jul 30 '19 at 07:56