3

I am new to AWS. I have created an EMR cluster with an Auto Scaling policy through the AWS console. I have also created a data pipeline that can use this cluster to perform its activities.

I am also able to create an EMR cluster dynamically through Data Pipeline, but when I do, I am not able to assign an Auto Scaling rule to the cluster. Is there a way to configure the Auto Scaling role and the other required settings for an EMR cluster through Data Pipeline?

John Rotenstein
Bharani

1 Answer

0

It is not possible to have AWS Data Pipeline launch an Amazon EMR cluster with Auto Scaling.

Nor is it really necessary.

AWS Data Pipeline launches an Amazon EMR cluster to perform some work, such as transforming data or moving data between systems. Once such a task is complete, the cluster is terminated. This is known as a transient cluster.

This is a very different use case from a long-running Amazon EMR cluster that accepts ad-hoc jobs throughout the day and can take advantage of Auto Scaling to add or remove capacity based on demand.

Thus, there really isn't a need to add Auto Scaling to an EMR cluster launched by Data Pipeline. Instead, specify the capacity up-front and it will be used for the duration of the job.
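As a rough illustration of specifying capacity up-front, the sketch below builds an `EmrCluster` object for a Data Pipeline definition in Python. The helper name `build_emr_cluster_object`, the object id, the instance type, the release label, and the timeout are all assumptions chosen for the example, not values from the question; only the field names (`coreInstanceCount`, `coreInstanceType`, `masterInstanceType`, `releaseLabel`, `terminateAfter`) are meant to follow the Data Pipeline `EmrCluster` object.

```python
import json


def build_emr_cluster_object(core_instance_count, instance_type="m5.xlarge"):
    """Return an EmrCluster pipeline object with a fixed number of core nodes.

    The id, instance type, release label, and timeout are placeholder
    assumptions for this sketch; adjust them for a real pipeline.
    """
    if core_instance_count < 1:
        raise ValueError("core_instance_count must be at least 1")
    return {
        "id": "TransientEmrCluster",        # hypothetical object id
        "name": "TransientEmrCluster",
        "type": "EmrCluster",
        "releaseLabel": "emr-5.13.0",       # assumed release label
        "masterInstanceType": instance_type,
        "coreInstanceType": instance_type,
        # Capacity is fixed here instead of being managed by Auto Scaling.
        "coreInstanceCount": str(core_instance_count),
        # The transient cluster is shut down after this period at the latest.
        "terminateAfter": "2 Hours",
    }


# A pipeline definition is a JSON document with a list of objects.
definition = {"objects": [build_emr_cluster_object(4)]}
print(json.dumps(definition, indent=2))
```

Because the node count is just a parameter here, a different value can be passed for each run if the expected input size changes.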

John Rotenstein
  • 2
    Agreed. But assume I am transforming some data on a weekly basis and the data size changes every week, so I am not sure how many nodes my cluster needs for good performance. With Auto Scaling, I could scale the cluster automatically based on a number of parameters. – Bharani Aug 01 '17 at 10:16