Autoscaling a running Hadoop cluster setup on AWS EC2

Question

My goal is to understand how I can auto-scale a Hadoop cluster on AWS EC2. I am exploring AWS offerings from an elastic-scaling perspective, both for Hadoop as a service (EMR) and for Hadoop on EC2.

For EMR, I gathered that performance metrics can be monitored with CloudWatch and the user can be alerted once a metric crosses a set threshold; the cluster can then be scaled up or down depending on its utilization. This approach would require some custom implementation to automate the steps. (Correct me if I am missing anything here.)
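To make the "custom implementation" part concrete, the decision step I have in mind might look roughly like the sketch below. It is only an illustration: `ScalingPolicy`, `desired_node_count`, and all threshold values are hypothetical names and numbers of my own, the utilization figure is assumed to come from whatever CloudWatch metric is being watched, and the code makes no AWS calls at all.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All defaults below are made-up illustrative values, not AWS recommendations.
    scale_out_above: float = 0.80  # add nodes when utilization exceeds this
    scale_in_below: float = 0.30   # remove nodes when utilization drops below this
    step: int = 2                  # nodes added or removed per decision
    min_nodes: int = 2             # never shrink below this many nodes
    max_nodes: int = 20            # never grow beyond this many nodes


def desired_node_count(current_nodes: int, utilization: float,
                       policy: ScalingPolicy) -> int:
    """Return the target node count for a utilization reading in [0, 1]."""
    if utilization > policy.scale_out_above:
        target = current_nodes + policy.step
    elif utilization < policy.scale_in_below:
        target = current_nodes - policy.step
    else:
        target = current_nodes
    # Clamp the result to the configured bounds.
    return max(policy.min_nodes, min(policy.max_nodes, target))


if __name__ == "__main__":
    policy = ScalingPolicy()
    for load in (0.95, 0.50, 0.10):
        print(load, desired_node_count(4, load, policy))
```

The resulting count would still have to be applied to the cluster by a separate resize step, which is the part I am unsure how to automate.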

For Hadoop on EC2, I came across the Auto Scaling option, which can add or remove instances according to configured scaling policies. But I am not clear on how a newly added node would get bootstrapped into the cluster automatically. How would YARN know that it can spawn a new container on this newly added node? Does Auto Scaling work for a master-slave kind of setup as well, or is it limited to web applications?

There is also Qubole, which offers services to manage Hadoop on AWS. Should that be used for automatically managing the scaling of the cluster?

Scaling a running Hadoop cluster is an art in itself (mainly downsizing). Forget about doing it on HDFS nodes (EMR doesn't allow it), and for YARN-only nodes you will need to manage the containers running on the machines being shut down. EMR doc on resizing: http://docs.aws.amazon.com//ElasticMapReduce/latest/ManagementGuide/emr-manage-resize.html — Kazaag, Sep 16 '16 at 10:04
A related SO discussion: [Autoscaling EMR- is it required? Should I just use EC2? Should I just use Qubole?](https://stackoverflow.com/q/26747528/320399) — blong, Aug 04 '17 at 17:06
