Autoscaling EMR: is it required? Should I just use EC2? Should I just use Qubole?

Question

In order to reduce the time for provisioning, we've decided to keep up a dedicated EMR cluster with 5 instances (we expect to need about 5). In case we need more, we think we'll need to implement some sort of autoscaling.

I'm not familiar with EMR at all. Does it support autoscaling? I found this in the docs: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-manage-resize.html

Is that the correct place to look for autoscaling, or am I misunderstanding what they mean by "resize"? I've read that one benefit of EMR is "on demand processing", and I think it splits the load between EC2 instances without you specifying how many instances. This gives me the impression that it scales the EC2 instances on its own, meaning we don't need to autoscale ourselves. Am I misunderstanding what "on demand processing" means?

If the resizing link I provided is appropriate for what I'm trying to do, does anyone have experience with determining when to resize? The doc describes how to resize but not, for example, how to set up an alarm that tells you when to resize. I've used their regular autoscaling service, which lets you resize based on certain conditions, but I'm not seeing that here.

I'm still unsure whether autoscaling EMR is a bad idea. Is it too involved (since there are entire companies like Qubole that provide this), or maybe not very useful because EMR already uses whatever computing power it needs? I don't know very much about what EMR actually provides, so maybe that's why I'm confused.

score 7 · Answer 1 · answered Dec 12 '14 at 13:28

The page you linked showed ways of either manually or programmatically increasing the nodes in your cluster. I couldn't find anything else about autoscaling for EMR.

Unless we're missing some facts, you’d still have to come up with your own scaling algorithm and process. If you’re taking into account factors such as your job backlog, the units of time you’re paying for, the use of less-expensive “spot” instances, multiple clusters, etc., this is probably not a trivial exercise.

In addition to growing your cluster, there is also downsizing to consider. EMR allows this (manually or programmatically) for task nodes, but the documentation states that it is not supported for core nodes. You'd have to terminate the core node directly through other AWS functionality and risk losing data. If your workloads increase and decrease over time, core node downsizing would be valuable for keeping your costs lower.
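As a rough illustration of the "programmatic" resizing mentioned above, here is a minimal sketch that changes the size of a cluster's task instance group. It assumes `emr` behaves like `boto3.client("emr")` (using its `list_instance_groups` and `modify_instance_groups` calls); the function name `resize_task_group` and the cluster ID in the usage note are made up for this example, and it only handles the first task group it finds.

```python
def resize_task_group(emr, cluster_id, target_count):
    """Resize the first TASK instance group of a cluster to target_count nodes.

    `emr` is assumed to behave like boto3.client("emr").
    Returns the resized instance group ID, or None if nothing was changed.
    """
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]

    # Only task nodes are resized here; core nodes hold HDFS data.
    task_groups = [g for g in groups if g["InstanceGroupType"] == "TASK"]
    if not task_groups:
        return None

    group = task_groups[0]
    if group["RequestedInstanceCount"] == target_count:
        return None  # already at (or heading to) the requested size

    emr.modify_instance_groups(
        InstanceGroups=[
            {"InstanceGroupId": group["Id"], "InstanceCount": target_count}
        ]
    )
    return group["Id"]
```

With real credentials this would be called along the lines of `resize_task_group(boto3.client("emr"), "j-EXAMPLE", 4)`, where `"j-EXAMPLE"` stands in for your own cluster ID. Deciding when to call it, and with what target, is still up to you.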

Qubole automatically takes care of all of these things out of the box. You run your jobs from the UI or API and it starts, sizes or resizes the cluster. When you're finished, it downsizes or terminates the cluster. It also allows you to have a minimum number of nodes constantly running at one time. I've also heard that the startup time for Qubole nodes is significantly faster than EMR.

Hope this helps you.

I can confirm this. Although it appears that EMR is moving in the direction of offering intelligent autoscaling, Qubole seems to have a bit of a head start with this. Their UI (or API) provides you configuration points to set boundaries on the cluster minimum and maximum sizes, and cost boundaries as well. You can test it pretty quickly with a trial account (https://api.qubole.com/users/sign_up), just sign in, configure your AWS tokens, and if you need sample data, look for it at: s3://paid-qubole/default-datasets/ -- Probably would take less than an hour to set up your test. — agentv, Aug 03 '15 at 00:46

score 1 · Answer 2 · answered Nov 19 '16 at 01:36

As of late 2016, AWS does not support autoscaling out of the box as part of EMR. However, the EMR API provides all the necessary ingredients to 1) collect monitoring data, and 2) programmatically scale the cluster up and down.

Basically, there are two main options to implement autoscaling for EMR clusters:

1. Autoscaling Loop: A process that is running on a server and continuously monitors the cluster for its current load. Performance metrics (memory, CPU, I/O, etc.) can be collected at regular intervals and stored in a database. Autoscaling rules are evaluated against the performance metrics, and the cluster's task nodes are scaled up or down if required.
2. Event-Based Autoscaling: Using CloudWatch metrics (e.g., metrics for EMR, or metrics for EC2), you can programmatically define triggers that are fired under certain conditions (for instance, add nodes if average CPUUtilization of all nodes exceeds 80%).
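To make the rule-evaluation step of an autoscaling loop concrete, here is a minimal sketch of a single scaling decision. The function name, the thresholds (80% and 30%), the step size, and the node limits are all illustrative assumptions, not values recommended by AWS or used by any particular tool.

```python
def decide_task_nodes(current, avg_cpu, min_nodes=0, max_nodes=10,
                      high=80.0, low=30.0, step=1):
    """Return the desired number of task nodes for one loop iteration.

    current: number of task nodes currently requested
    avg_cpu: average CPU utilization (percent) across the cluster
    All thresholds and limits are hypothetical defaults.
    """
    if avg_cpu > high:
        target = current + step      # overloaded: add nodes
    elif avg_cpu < low:
        target = current - step      # underused: remove nodes
    else:
        target = current             # within the comfortable band

    # Never leave the configured bounds.
    return max(min_nodes, min(max_nodes, target))
```

A loop process would collect the metrics, call a function like this at each interval, and resize the task nodes only when the returned value differs from the current count. Keeping a gap between the `high` and `low` thresholds helps avoid flapping between sizes.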

Both options have their pros and cons. The advantage of option 2 (event-based) is that it is a server-less approach: you do not need to run your own server. The downside is that CloudWatch metrics are collected in batches (typically five-minute intervals), so the data may be slightly delayed or less precise. Also, the event-based approach may not provide the tools you need to inspect the current and historical state of your cluster scaling. Option 1 (the autoscaling loop), on the other hand, does require a server, but in return gives you more control to customize the logic of your scaling rules. It also lets you keep searchable records of the history of the scaling decisions.

You could take a look at Themis, an EMR autoscaling framework developed at Atlassian. Themis implements the autoscaling loop as discussed in option 1 above. Current features include proactive as well as reactive autoscaling and a Web UI, and the tool is very easy to configure.
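For the event-based approach, a trigger could be defined as a CloudWatch alarm. The sketch below only builds the keyword arguments for CloudWatch's `put_metric_alarm` call. It assumes the EMR metric `YARNMemoryAvailablePercentage` in the `AWS/ElasticMapReduce` namespace with the `JobFlowId` dimension; the alarm name, the 15% threshold, and the SNS topic ARN are hypothetical, and something still has to react to the alarm and resize the cluster.

```python
def low_memory_alarm(cluster_id, sns_topic_arn, threshold_percent=15.0):
    """Build put_metric_alarm arguments for a 'cluster is running low on
    YARN memory' trigger. Threshold and naming are illustrative only."""
    return {
        "AlarmName": f"emr-{cluster_id}-low-yarn-memory",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "YARNMemoryAvailablePercentage",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,              # five-minute metric interval, in seconds
        "EvaluationPeriods": 2,     # require two consecutive breaches
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notified when the alarm fires
    }
```

With boto3 this would be registered as `boto3.client("cloudwatch").put_metric_alarm(**low_memory_alarm(cluster_id, topic_arn))`. Requiring two evaluation periods means the alarm fires roughly ten minutes after the condition starts, which illustrates the delay mentioned above.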
