
I have two Data Fusion pipelines that are independent of each other. I want to run both of them in a single pipeline job, with each running independently of the other. Is this achievable? Can anyone help me with how to do this?

code tutorial
  • What would you like to achieve with your pipelines? – aga Feb 28 '20 at 14:01
  • One pipeline that has two workflows inside it which are not connected to each other @muscat – code tutorial Feb 29 '20 at 18:54
  • Can you explain the use case you are trying to solve? Why does it have to be within a single pipeline even though it is actually two pipelines? – Edwin Elia Mar 03 '20 at 22:57
  • @EdwinElia, Flow 1 (GCS to BQ), Flow 2 (GCS to HTTP). Logically these two flows are independent of each other. I want these two flows inside a single Data Fusion pipeline – code tutorial Mar 04 '20 at 13:51
  • Why does it need to be in a single Data Fusion pipeline? – Edwin Elia Mar 06 '20 at 00:27
  • @EdwinElia, that will logically give the entire workflow of the use case I am working on. Is this achievable? – code tutorial Mar 06 '20 at 07:05
  • There is currently no way to create a single pipeline containing two independent pipelines that don't relate to each other or interconnect at some point. We have not encountered a use case where this is absolutely necessary, as it is the same as creating two different pipelines. If you provide us the use case and why it is necessary, we can perhaps suggest an alternative or create a feature request. – Edwin Elia Mar 08 '20 at 02:51
  • @EdwinElia, the example use case is as follows. Consider a flow 1 (GCS - BQ) and flow 2 (GCS - HTTP). Both are independent of each other and do not interconnect at any point. The two flows need to be executed in parallel, not sequentially. If there are two independent pipelines, then each pipeline has to spin up its own compute resources, which is an overhead for execution. If these two flows come under the same job, then the same compute resources will be used and both will run in parallel. – code tutorial Mar 09 '20 at 14:34
  • You can get around spinning up separate clusters for multiple pipelines by pre-provisioning the Dataproc cluster, then using the Remote Hadoop compute profile to submit the jobs to the existing cluster. To create a new compute profile, go to System Admin > Configuration > System Compute Profiles > Create New Profile > Remote Hadoop Provisioner (see the sketch after this thread). – Edwin Elia Mar 11 '20 at 00:13
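Following Edwin Elia's suggestion, here is a minimal sketch of pre-provisioning a shared Dataproc cluster with the google-cloud-dataproc Python client. The project ID, region, cluster name, and machine shapes are assumptions, not values from this thread; once the cluster exists, point a Remote Hadoop compute profile at it as described above.

```python
# Minimal sketch: pre-provision a Dataproc cluster that multiple Data Fusion
# pipelines can share via a Remote Hadoop compute profile. The project ID,
# region, cluster name, and machine shapes below are illustrative assumptions.
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"        # assumption: your GCP project
REGION = "us-central1"           # assumption: your region
CLUSTER_NAME = "fusion-shared"   # assumption: any valid cluster name

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": CLUSTER_NAME,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; block until it completes.
operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
operation.result()
print(f"Cluster {CLUSTER_NAME} is ready; point a Remote Hadoop compute profile "
      "at it under System Admin > Configuration > System Compute Profiles.")
```

Jobs submitted through that compute profile then share the one cluster instead of each pipeline provisioning its own, which addresses the overhead concern raised above.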

1 Answer


If you're just looking to start multiple pipelines at the same time then, as far as I know, you would have to start them individually. What is the use case you're trying to achieve here?

Perhaps scheduling a pipeline would be useful?
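To make "start them individually" concrete, here is a minimal sketch that launches two deployed pipelines back to back through the CDAP REST API that backs Data Fusion, so both runs proceed in parallel. The instance endpoint, access token, and pipeline names (gcs-to-bq, gcs-to-http) are placeholders, not values from this thread.

```python
# Minimal sketch: start two deployed Data Fusion pipelines so they run in
# parallel, using the CDAP REST API that backs Data Fusion. The endpoint,
# access token, and pipeline names are placeholder assumptions.
import requests

CDAP_ENDPOINT = "https://<instance-api-endpoint>/api"  # assumption: your instance URL
AUTH_TOKEN = "<oauth2-access-token>"                   # assumption: a valid GCP access token
NAMESPACE = "default"

def start_pipeline(app_name: str) -> None:
    # A deployed batch pipeline exposes a workflow named DataPipelineWorkflow;
    # POSTing to its /start endpoint launches a run without waiting on it.
    url = (f"{CDAP_ENDPOINT}/v3/namespaces/{NAMESPACE}"
           f"/apps/{app_name}/workflows/DataPipelineWorkflow/start")
    resp = requests.post(url, headers={"Authorization": f"Bearer {AUTH_TOKEN}"})
    resp.raise_for_status()

# Flow 1 (GCS to BQ) and Flow 2 (GCS to HTTP): both calls return immediately,
# so the two runs execute concurrently. With a shared Remote Hadoop compute
# profile they also land on the same pre-provisioned cluster.
for pipeline in ("gcs-to-bq", "gcs-to-http"):
    start_pipeline(pipeline)
```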

itsanudeep
  • Flow 1 (GCS to BQ), Flow 2 (GCS to HTTP). Logically these two flows are independent of each other. I want these two flows inside a single Data Fusion pipeline. – code tutorial Mar 02 '20 at 20:30