0

My team at work is currently looking for a replacement for a rather expensive ETL tool that, at this point, we are using as a glorified scheduler. Any of the integrations offered by the ETL tool we have improved upon with our own Python code, so I really just need its scheduling ability. One option we are looking at is Data Pipeline, which I am currently piloting.

My problem is this: imagine we have two datasets to load - products and sales. Each of these datasets requires a number of steps to load (get source data, call a Python script to transform, load to Redshift). However, product needs to be loaded before sales runs, as we need product cost, etc., to calculate margin. Is it possible to have a "master" pipeline in Data Pipeline that calls products first, waits for its successful completion, and then calls sales? If so, how? I'm open to other product suggestions as well if Data Pipeline is not well-suited to this type of workflow. Appreciate the help.

jpavs
  • 648
  • 5
  • 17
  • Your question asks about defining a dependency between different pipelines. This is not directly possible today (it is indirectly possible with the use of preconditions). But you can have what you want in a single pipeline, i.e. what you call a "master" pipeline in your problem description. – panther Apr 18 '15 at 00:33
  • @panther I have many more datasets to load, so this would get pretty messy I think. It also makes rerunning certain datasets problematic - a failure in the middle of the pipeline would require a complete rerun – jpavs Apr 22 '15 at 19:50
  • In that case, until data pipeline has a better solution, you can have marker files in S3 that are used to trigger other pipelines. For single pipeline model, you can use [cascade failure and rerun](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-manage-cascade-failandrerun.html) – panther Apr 22 '15 at 20:57

2 Answers

1

I think I can relate to this use case. Data Pipeline does not do this kind of dependency management on its own; however, it can be simulated using file preconditions.

In this approach, your child pipelines depend on a file being present (as a precondition) before starting. A master pipeline would create trigger files based on some logic executed in its activities. A child pipeline may in turn create other trigger files that start a subsequent pipeline downstream.
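
As a rough illustration (the bucket, key, and object ids below are made up), the master pipeline's transform script could drop a marker object in S3 once product finishes, and the child pipeline would gate on that key with an S3KeyExists precondition:

```python
import boto3

# Hypothetical bucket and key; replace with your own naming scheme.
TRIGGER_BUCKET = "my-etl-triggers"
TRIGGER_KEY = "product/2015-04-18/_SUCCESS"

def publish_trigger():
    """Called at the end of the master pipeline's product load to signal
    that downstream (child) pipelines may start."""
    s3 = boto3.client("s3")
    s3.put_object(Bucket=TRIGGER_BUCKET, Key=TRIGGER_KEY, Body=b"")

# The child pipeline then declares a precondition on that key in its
# definition (console-export JSON style), e.g.:
#
#   { "id": "ProductLoaded",
#     "type": "S3KeyExists",
#     "s3Key": "s3://my-etl-triggers/product/2015-04-18/_SUCCESS" }
#
# and the child's first activity references it via
#   "precondition": { "ref": "ProductLoaded" }
```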

Another solution is to use the Simple Workflow (SWF) service. It has the features you are looking for, but it would need custom coding using the Flow SDK.
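
The Flow SDK is Java-based, so it is not a drop-in for a Python shop. Purely as a hedged sketch of the SWF side (not the Flow SDK), here is how you might kick off a workflow execution from Python with boto3; the domain and workflow type names are made up and must already be registered, and the actual product-before-sales ordering would live in the decider code you write:

```python
import boto3

swf = boto3.client("swf")

# Hypothetical domain and workflow type (register_domain /
# register_workflow_type must have been called beforehand). Your decider
# schedules the 'load_product' activity first and 'load_sales' only after
# it completes.
swf.start_workflow_execution(
    domain="etl",
    workflowId="nightly-load-2015-04-18",
    workflowType={"name": "ProductThenSalesLoad", "version": "1.0"},
    taskList={"name": "etl-deciders"},
    executionStartToCloseTimeout="86400",  # seconds, passed as a string
    taskStartToCloseTimeout="600",
    childPolicy="TERMINATE",
)
```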

user1452132
  • 1,758
  • 11
  • 21
1

This is a basic use case of Data Pipeline and should definitely be possible. You can use the graphical pipeline editor to create this pipeline. Breaking down the problem:

There are two datasets:

  1. Product
  2. Sales

Steps to load these datasets:

  1. Get source data: Say from S3. For this, use S3DataNode
  2. Call a Python script to transform: Use ShellCommandActivity with staging. Data Pipeline does data staging implicitly for S3DataNodes attached to a ShellCommandActivity; the staged data is exposed through the special environment variables ${INPUT1_STAGING_DIR} and ${OUTPUT1_STAGING_DIR} (see the staging documentation, and the definition sketch below this list).
  3. Load the output to Redshift: Use a RedshiftCopyActivity writing to a RedshiftDataNode, which is backed by a RedshiftDatabase object.

You will need to add the above components for each dataset you need to work with (product and sales in this case). For easy management, you can run these activities on a single EC2 instance (an Ec2Resource in the pipeline definition); a sketch follows.
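
As a sketch only (all ids, paths, table and cluster names below are placeholders, and the Default object with its Schedule is omitted), the product chain in the console-export JSON style might look roughly like this, emitted here from Python so you can tweak it and dump it to a definition file:

```python
import json

# Rough pipeline-definition sketch for the 'product' dataset.
product_objects = [
    {"id": "EtlEc2", "type": "Ec2Resource",
     "instanceType": "m1.small", "terminateAfter": "2 Hours"},

    {"id": "ProductSource", "type": "S3DataNode",
     "directoryPath": "s3://my-bucket/raw/product/"},

    {"id": "ProductStaged", "type": "S3DataNode",
     "directoryPath": "s3://my-bucket/clean/product/"},

    # 'stage: true' makes Data Pipeline copy the input S3 data into
    # ${INPUT1_STAGING_DIR} and ship ${OUTPUT1_STAGING_DIR} back to S3.
    {"id": "TransformProduct", "type": "ShellCommandActivity",
     "stage": "true",
     "input": {"ref": "ProductSource"},
     "output": {"ref": "ProductStaged"},
     "runsOn": {"ref": "EtlEc2"},
     "command": "python transform_product.py ${INPUT1_STAGING_DIR} ${OUTPUT1_STAGING_DIR}"},

    {"id": "MyRedshift", "type": "RedshiftDatabase",
     "clusterId": "my-cluster", "databaseName": "dw",
     "username": "loader", "*password": "..."},

    {"id": "ProductTable", "type": "RedshiftDataNode",
     "database": {"ref": "MyRedshift"}, "tableName": "product"},

    {"id": "LoadProduct", "type": "RedshiftCopyActivity",
     "input": {"ref": "ProductStaged"},
     "output": {"ref": "ProductTable"},
     "insertMode": "TRUNCATE",
     "runsOn": {"ref": "EtlEc2"}},
]

print(json.dumps({"objects": product_objects}, indent=2))
```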

Condition: 'product' needs to be loaded before 'sales' runs

  • Add a dependsOn relationship: add this field on the ShellCommandActivity of sales so that it refers to the ShellCommandActivity of product. See the dependsOn field in the documentation. It says: 'One or more references to other Activities that must reach the FINISHED state before this activity will start'.
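
Continuing the hypothetical ids from the earlier sketch, the sales transform carries the dependency; the only addition relative to the product chain is the dependsOn reference:

```python
# Fragment of the sales chain (SalesSource / SalesStaged mirror the
# product S3DataNodes; EtlEc2 is reused from the sketch above).
transform_sales = {
    "id": "TransformSales", "type": "ShellCommandActivity",
    "stage": "true",
    "input": {"ref": "SalesSource"},
    "output": {"ref": "SalesStaged"},
    "runsOn": {"ref": "EtlEc2"},
    # Wait for the product transform to FINISH; pointing this at
    # "LoadProduct" instead would also wait for the Redshift load.
    "dependsOn": {"ref": "TransformProduct"},
    "command": "python transform_sales.py ${INPUT1_STAGING_DIR} ${OUTPUT1_STAGING_DIR}",
}
```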

Tip: In most cases, you would not want the next day's execution to start while the previous day's execution is still active, i.e. RUNNING. To avoid such a scenario, use the 'maxActiveInstances' field and set it to '1'.
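
A fragment showing the field (hypothetical, same activity id as above):

```python
# Cap concurrent runs at one, so a still-RUNNING previous execution blocks
# the next scheduled one instead of overlapping with it.
transform_sales_capped = {
    "id": "TransformSales",
    "maxActiveInstances": "1",
    # ... remaining fields as in the earlier fragment ...
}
```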

panther
  • 767
  • 5
  • 21