How can I speed up a GCP Data Fusion data pipeline?

Question

About 300 TB of data is being transferred to BigQuery using Google Cloud Data Fusion (Developer edition).

Currently it takes 34 minutes to process approximately 16 GB. At that rate, it would take about 10 days to process 6 TB of data.
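To make the extrapolation explicit, here is a minimal sketch that scales the one measured sample (16 GB in 34 minutes) to larger volumes. It assumes throughput stays constant as the data grows, which is an idealization, and that 1 TB = 1024 GB. The function name `estimate_days` is only illustrative.

```python
GB_PER_TB = 1024  # binary units (assumption); use 1000 for decimal terabytes


def estimate_days(total_gb: float,
                  sample_gb: float = 16.0,
                  sample_minutes: float = 34.0) -> float:
    """Extrapolate run time in days from one measured sample.

    Assumes constant throughput, i.e. run time grows linearly with volume.
    """
    minutes = total_gb / sample_gb * sample_minutes
    return minutes / (60 * 24)


if __name__ == "__main__":
    print(f"throughput: {16.0 / 34.0:.2f} GB/min")
    print(f"6 TB:   {estimate_days(6 * GB_PER_TB):.1f} days")
    print(f"300 TB: {estimate_days(300 * GB_PER_TB):.1f} days")
```

With these inputs the estimate is roughly 9 days for 6 TB and roughly 450 days for the full 300 TB, so the cluster would need to be far larger, or the pipeline far more parallel, to finish in a practical time.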

What settings can be modified in datafusion to quickly perform ETL operations in the data pipeline?

Thank you for reading.

score 1 · Accepted Answer · answered Oct 19 '20 at 10:02

You can change the compute profile settings, which specify how and where a pipeline is executed. For example, a profile includes the type of cloud provider, the service to use on that provider (such as Dataproc), resources (memory and CPU), image, minimum and maximum node count, and other values.

Learn more about profiles on the CDAP documentation site.

One option is to create a new compute profile with a higher limit on worker memory, or to override worker memory for a single run of the pipeline:

Click on System Admin in the top right and then click on the Configuration tab
Click System Compute profiles
Click on create new profile
Choose Cloud Dataproc
Leave the Project ID and Service account key blank
Enter the required configuration of worker node
Click on Save

Once the new compute profile is created, attach it to the pipeline: click Configure in the pipeline detail view, choose the newly created compute profile, and click Save.

Additionally, please check the autoscaling option in Data Fusion.
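For the per-run override mentioned above, a rough sketch of how the runtime arguments might be assembled is shown below. It assumes the CDAP convention of `system.profile.name` for selecting a profile and `system.profile.properties.<property>` for overriding one of its values. The property names used here (`workerNumNodes`, `workerCPUs`, `workerMemoryMB`) and the profile name `bigger-workers` are assumptions for illustration, so check them against the provisioner documentation for your Data Fusion version before relying on them.

```python
import json


def profile_override_args(profile_name: str,
                          scope: str = "SYSTEM",
                          **properties) -> dict:
    """Build runtime arguments that pick a compute profile and override
    some of its properties for a single pipeline run.

    The key prefixes follow the CDAP runtime-argument convention as
    assumed here; verify them for your version.
    """
    args = {"system.profile.name": f"{scope}:{profile_name}"}
    for key, value in properties.items():
        # Runtime arguments are passed as strings.
        args[f"system.profile.properties.{key}"] = str(value)
    return args


if __name__ == "__main__":
    # "bigger-workers" is a hypothetical profile name; the property names
    # below are assumed, not confirmed by the original answer.
    runtime_args = profile_override_args(
        "bigger-workers",
        workerNumNodes=10,
        workerCPUs=8,
        workerMemoryMB=30720,
    )
    print(json.dumps(runtime_args, indent=2))
```

The resulting key/value pairs can be entered as runtime arguments when starting the pipeline, which leaves the saved profile unchanged for other runs.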

answered Oct 19 '20 at 10:02

aga

This method does not seem to be supported in the Developer edition of Data Fusion. Is it an Enterprise-only feature? – Quack Oct 20 '20 at 02:32
It's available in the enterprise edition. – aga Oct 26 '20 at 08:29
@lnes I checked. I have a question: do you know the relationship between the master node and the worker nodes of Dataproc? Are the CPU and memory of the VM instances used by an individual pipeline related to the worker nodes? – Quack Oct 26 '20 at 08:46
A master node maintains knowledge about the distributed file system and schedules resource allocation. Worker nodes store the actual data and provide the processing power to run the jobs. – aga Oct 29 '20 at 11:12
