
Use case to execute:

  1. An Excel file with multiple tabs is uploaded to Cloud Storage
  2. A Cloud Function trigger calls a Cloud Data Fusion pipeline
  3. The pipeline reads the file, uses Wrangler to parse the individual sheets, and writes each sheet to a separate table

Though I am using params to define the sheet name in Wrangler (as is the usage for reusable pipelines), I am unable to figure out how to either iterate the run for each tab, or use a Cloud Function to spawn 20 calls to the same pipeline with each tab name as the parameter. I do not think running parallel paths in the same pipeline is going to help, or that it is a good solution. Any help appreciated.
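For concreteness, here is a minimal sketch of the fan-out idea (one pipeline start per tab) from a GCS-triggered Cloud Function. The CDAP endpoint, the pipeline name, and the sheetname runtime argument are placeholders, not the actual configuration:

```python
# Minimal sketch: a GCS-triggered Cloud Function that starts one Data Fusion
# run per Excel tab via the CDAP REST API. CDAP_ENDPOINT, PIPELINE_NAME, and
# the "sheetname" runtime argument are placeholders and must match the
# instance, pipeline, and Wrangler macro actually in use.
import io

import google.auth
import google.auth.transport.requests
import requests
from google.cloud import storage
from openpyxl import load_workbook

CDAP_ENDPOINT = "https://INSTANCE-PROJECT-dot-REGION.datafusion.googleusercontent.com/api"
PIPELINE_NAME = "excel_to_bq"  # placeholder pipeline name

def start_per_sheet(event, context):
    """Triggered by a google.storage.object.finalize event."""
    bucket, name = event["bucket"], event["name"]
    blob = storage.Client().bucket(bucket).blob(name)
    workbook = load_workbook(io.BytesIO(blob.download_as_bytes()), read_only=True)

    # Obtain an OAuth token for the Data Fusion (CDAP) REST API.
    creds, _ = google.auth.default()
    creds.refresh(google.auth.transport.requests.Request())
    headers = {"Authorization": f"Bearer {creds.token}"}

    start_url = (f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/{PIPELINE_NAME}"
                 "/workflows/DataPipelineWorkflow/start")
    for sheet in workbook.sheetnames:
        # One pipeline run per tab; runtime arguments go in the JSON body.
        resp = requests.post(start_url, headers=headers,
                             json={"sheetname": sheet,
                                   "input.path": f"gs://{bucket}/{name}"})
        resp.raise_for_status()
```

Each POST returns as soon as the run is accepted, so all tabs are requested back to back; the open question is whether Data Fusion will actually execute those runs concurrently.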

RaptorX
  • If your Data Fusion instance is public, you can use the REST API to programmatically start pipelines: https://cloud.google.com/data-fusion/docs/reference/cdap-reference#start_a_batch_pipeline – Dennis Li Nov 15 '21 at 21:54
  • I am using the API for the call; the trouble is that the API could end up initiating the same pipeline with 15 parameters, one for each tab, at the same time. That is not going to work in Data Fusion. – RaptorX Nov 16 '21 at 15:24
  • Is there a particular reason you don't want to run the pipelines in parallel? If you need them to run serially, you can consider making a pipeline per tab and hooking them together using triggers: https://cloud.google.com/data-fusion/docs/how-to/using-triggers – Dennis Li Nov 17 '21 at 01:51
  • I guess I misunderstood the concept of parallel execution; my pipelines were not running due to config issues. Your first input suffices to solve the issue, Dennis. Also thanks for the trigger-related feedback, as I will be using the concept in a different implementation. – RaptorX Nov 17 '21 at 12:44

1 Answer


To run parallel pipelines, we can iterate through the API calls using Cloud Functions. Though I am not sure how many concurrent runs of a single pipeline CDF allows, I was able to run three parallel runs of the same pipeline.
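For illustration, here is a rough sketch of that iteration, with a placeholder endpoint, pipeline name, and tab names: it fires several starts of the same pipeline, then lists the run records via the CDAP runs endpoint to confirm the runs overlap.

```python
# Rough sketch with placeholder endpoint, pipeline, and tab names: fire
# several runs of one pipeline, then list run records to check that more
# than one run is RUNNING at the same time.
import google.auth
import google.auth.transport.requests
import requests

CDAP_ENDPOINT = "https://INSTANCE-PROJECT-dot-REGION.datafusion.googleusercontent.com/api"
WORKFLOW_URL = (f"{CDAP_ENDPOINT}/v3/namespaces/default/apps/excel_to_bq"
                "/workflows/DataPipelineWorkflow")

creds, _ = google.auth.default()
creds.refresh(google.auth.transport.requests.Request())
headers = {"Authorization": f"Bearer {creds.token}"}

# Each start call returns once the run is accepted, so the runs overlap.
for sheet in ["Sheet1", "Sheet2", "Sheet3"]:  # placeholder tab names
    requests.post(f"{WORKFLOW_URL}/start", headers=headers,
                  json={"sheetname": sheet}).raise_for_status()

# The runs endpoint lists run records; several RUNNING entries mean the
# pipeline runs are executing in parallel.
for run in requests.get(f"{WORKFLOW_URL}/runs", headers=headers).json():
    print(run["runid"], run["status"])
```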

RaptorX