I need to back up 6 DynamoDB tables every couple of hours. I've created 6 pipelines from templates, and they ran great, except that they created 6 or more virtual machines which mostly stayed up. That's not an economy I can afford.

Does anyone have experience optimizing this kind of scenario?

Radek
  • You would need to use the third option suggested by Rohit below: a single pipeline with multiple activities running on the same EMR cluster. You can then control the size of the cluster to adjust throughput. – panther Jun 24 '15 at 01:41

2 Answers

Some solutions that come to mind are:

One: To ensure that EC2 resources are terminated, you can set the terminateAfter property on the Ec2Resource definition. The semantics of terminateAfter are discussed here: How does AWS Data Pipeline run an EC2 instance?
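
For illustration, a minimal Ec2Resource definition with terminateAfter might look like this (the id, instance type, and role names are placeholders; adjust them to your account):

    {
      "id": "BackupEc2Resource",
      "type": "Ec2Resource",
      "instanceType": "m1.small",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "terminateAfter": "1 Hour"
    }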

Two: This thread on the AWS forum discusses how an existing EC2 instance may be used by Data Pipeline.
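
A rough sketch of that approach (my reading of the mechanism, not taken from the thread itself): you install Task Runner on the long-lived instance, start it with a --workerGroup value, and give the activity a workerGroup field instead of a runsOn resource. The group name and command below are illustrative:

    {
      "id": "BackupOnExistingInstance",
      "type": "ShellCommandActivity",
      "workerGroup": "dynamodb-backup-workers",
      "command": "/home/ec2-user/backup-table.sh"
    }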

Three: Using the backup pipeline template always creates a single pipeline with a single Activity that reads from a single source and writes to a single destination. You can view the JSON source of the pipeline in the AWS console and write a similar pipeline with multiple Activity instances, one for each table you want to back up. Since the pipeline definition will have only one EMR resource, that one resource will do the work of all the activities, as sketched below.
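
A condensed sketch of such a combined definition for two tables (the step strings are placeholders; copy the real export step from your template's JSON once per table, and repeat the activity for each of your 6 tables):

    {
      "objects": [
        {
          "id": "SharedEmrCluster",
          "type": "EmrCluster",
          "masterInstanceType": "m1.small",
          "coreInstanceCount": "1",
          "terminateAfter": "2 Hours"
        },
        {
          "id": "BackupTable1",
          "type": "EmrActivity",
          "runsOn": { "ref": "SharedEmrCluster" },
          "step": "<export step from the template, pointed at table 1>"
        },
        {
          "id": "BackupTable2",
          "type": "EmrActivity",
          "runsOn": { "ref": "SharedEmrCluster" },
          "step": "<export step from the template, pointed at table 2>"
        }
      ]
    }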


You can set the maxActiveInstances field on the Ec2Resource object.

maxActiveInstances: The maximum number of concurrent active instances of a component. For activities, setting this to 1 runs instances in strict chronological order. A value greater than 1 allows different instances of the activity to run concurrently and requires you to ensure your activity can tolerate concurrent execution.

See this: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html
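
A minimal sketch of a resource capped at one concurrent instance (the id and role names are placeholders, assuming the default Data Pipeline roles):

    {
      "id": "BackupEc2Resource",
      "type": "Ec2Resource",
      "maxActiveInstances": "1",
      "terminateAfter": "1 Hour",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    }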

Aravind. R
