Passing variables between EC2 instances in multi-step AWS data pipeline

Question

I have a pipeline setup wherein I have 3 main stages:

1) take input from a zipped file, unzip this file in s3. run some basic verification on each file to guarantee its integrity, move to step 2

2) kick off 2 simultaneous processing tasks on separate EC2 instances (parallelization of this step saves us a lot of time, so we need it for efficiency sake). Each EC2 instance will run data processing steps on some of the files in s3 which were unzipped in step 1, the files required are different for each instance.

3) after the 2 simultaneous processes are both done, spin up another EC2 instance to do the final data processing. Once this is done, run a cleanup job to remove the unzipped files from s3, leaving only the original zip file in its place.

So, one of the problems we're running into is that we have 4 EC2 instances that run this pipeline process, but there are some global parameters we would like for each EC2 instance to have access to. If we were running on a single instance, we could of course use shell variables to accomplish this task, but really need the separate instances for efficiency. Currently our best idea is to store a flat file in the s3 bucket which has access to these global variables and just read them on initialization and write back to them if they need to change. This is gross and there seems like there should be a better way, however we can't figure one out yet. I saw there's a way to set parameters which can be accessed at any part of the pipeline, but it looks like you can only set this on a per pipeline level, not on the granularity of each run of the pipeline. Does anyone have any resources that could help here? Much appreciated.

Bump. Anybody have any experience with this? – Jason Tham Jul 19 '16 at 18:34 — Jason Tham, Jul 19 '16 at 18:34

score 0 · Accepted Answer · answered Aug 08 '16 at 14:59

We were able to solve this by using DynamoDB to keep track of variables/state. The pipeline itself doesn't have any mechanism to do this, other than parameter values, which unfortunately only work per pipeline, not per job. You'll need to setup a DynamoDB instance and then use the pipeline job ID to keep track of state, connecting via the CLI tools or some SDK.

Passing variables between EC2 instances in multi-step AWS data pipeline

1 Answers1