In which scenario should one prefer to create Spark cluster on EC2 machines instead of using Elastic Map Reduce?

Question

Between processing real-time data using a Spark cluster on EC2 machines and using Elastic MapReduce (EMR), some of the differences are:

With Elastic MapReduce, one does not have to manage the infrastructure and cluster, whereas with a Spark cluster on EC2 machines one has to create the cluster and manage it.
With a Spark cluster on EC2, one has more control over the cluster than with Elastic MapReduce, which is a PaaS component.

I went through the below related link:

Hadoop on EC2 vs Elastic Map Reduce

I understand that going with Elastic MapReduce gives the advantage of not having to manage the infrastructure and cluster. What I want to know is when one should prefer the other option, that is, creating a Spark cluster on EC2 machines instead of using Elastic MapReduce. Thanks.

A.B · Answer 1 · 2020-10-22T17:44:38.767

You and the answer you linked have pretty much summed up the advantages and disadvantages of both, but I would like to mention a few things.

Someone mentioned in a comment on the answer you linked (and it is in fact a common impression) that EMR adds cost on top of the EC2 nodes (the underlying master/compute nodes of Spark) and provides nothing more than the cluster itself, which isn't the case.

What Elastic MapReduce focuses on is the elasticity and scalability part, meaning it provides scalability for your jobs, where scalability is not just the number of nodes in the cluster but also other aspects such as:

Dynamically resizing the cluster while jobs are running.
Reduced and optimized spin-up time, efficient resubmission of steps, and options such as automatic termination on step completion.
Configuration, management, and update time. As a small example, the release version automatically handles the Spark/Hadoop/other application versions, giving you an easy way to upgrade, which you would have to do manually on EC2.
Ecosystem availability. The EMR ecosystem is growing. This may not matter when you start, but when your requirements grow (for example, when you begin integrating other systems such as stream processing with Flink), it is much easier to simply select Flink, Pig, Hive, and many more at launch time.
There are already libraries in the AWS SDK, such as boto3 for Python, that help you submit steps, poll for completion, and so on, which is very helpful when you need to scale. EMR also integrates with orchestration frameworks like Airflow, where you can sense the cluster state, resubmit steps, and spin up the cluster with one command within the pipeline.
Expanding on the previous point, EMR Notebooks give you a quick, interactive way to submit Spark jobs from a Jupyter notebook and see the results and job progress immediately, which can boost your productivity.
This point is the most important in my experience: sometimes scaling up a job with more nodes saves you more money than a long-running job on a small number of nodes, because the cost of the added nodes can be lower than the normalized hours you would spend on EC2 or on a small EMR cluster. To share my experience, we had a job that used to run for 3 days. We started running it on a bigger EMR cluster, which reduced it to 6-8 hours, and the cost stayed about the same and was in fact a bit less.
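
The arithmetic behind the last point can be sketched roughly as follows. This is only an illustration: the node counts and the price per node-hour are hypothetical, and only the runtimes (about 3 days versus 6-8 hours) come from the experience described above. It also assumes every node is billed at one flat hourly rate, which real pricing may not match.

```python
def job_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Total cost of one run when every node is billed per hour at the same rate."""
    return nodes * hours * price_per_node_hour


def break_even_nodes(small_nodes: int, small_hours: float, big_hours: float) -> float:
    """Largest cluster size that costs no more than the small run (same node price)."""
    return small_nodes * small_hours / big_hours


# Hypothetical figures, chosen only for illustration.
PRICE = 0.25       # assumed price per node-hour
SMALL_NODES = 4    # assumed size of the original cluster
BIG_NODES = 40     # assumed size of the scaled-up cluster

small_run = job_cost(SMALL_NODES, 72, PRICE)   # 3 days = 72 hours
big_run = job_cost(BIG_NODES, 7, PRICE)        # midpoint of 6-8 hours
limit = break_even_nodes(SMALL_NODES, 72, 7)

print(f"small cluster: {small_run:.2f}")
print(f"big cluster:   {big_run:.2f}")
print(f"break-even size: {limit:.1f} nodes")
```

Under these assumed numbers, a cluster ten times larger finishes in roughly a tenth of the time and costs slightly less, because the bill depends on node-hours rather than on node count alone. Whether a real job scales this well depends on the workload.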

Thanks. Your answer is more about the advantages EMR gives over the other option. My question is the reverse: when should one prefer the other option, that is, creating a Spark cluster on EC2 machines instead of using Elastic MapReduce? — Saurabh Rana, Oct 23 '20 at 06:58
I think you consider advantages and disadvantages when choosing between two options. How can preference be based on something different? — A.B, Oct 24 '20 at 21:56
Yes, that is correct, and I can see the advantages you have mentioned for going with EMR. My question is about the other case, that is, the advantages of going with a Spark cluster on EC2 instead of EMR. — Saurabh Rana, Oct 26 '20 at 14:15
