Hello fellow developers,
I have recently started learning GCP and am working on a POC that requires a pipeline to schedule Dataproc jobs written in PySpark. Currently, I have a Jupyter notebook on my Dataproc cluster that reads data from GCS and writes it to BigQuery. It works fine in Jupyter, but I want to run that notebook as part of a pipeline.
On Azure, we can schedule pipeline runs using Azure Data Factory. Could you please point me to the GCP tool that would achieve similar results?
My goal is to schedule runs of multiple Dataproc jobs.