
What is the best way to implement a CI/CD build process for Apache Beam/Dataflow classic templates & pipelines in Python? I have only found tutorials for this with Java that include Artifact Registry + Cloud Build, but rarely any in-depth tutorials for Python. I'd like to understand the "best-practice" way to develop pipelines in a GitHub repo and then have a CI/CD pipeline that automates staging the template and kicking off the job.

This Medium post was one of the more helpful high-level walkthroughs, but it doesn't dive deep into getting all the tools to work together: https://medium.com/posh-engineering/how-to-deploy-your-apache-beam-pipeline-in-google-cloud-dataflow-3b9fe431c7bb

pc_dev

2 Answers


I run a Beam/Dataflow pipeline with a CI/CD pipeline in GitLab. These are the steps my CI/CD pipeline follows:

In my .gitlab-ci.yml file I pull a google/cloud-sdk Docker image, which provides an environment with Python 3.8 and the essential gcloud tools.
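A minimal sketch of what that part of the .gitlab-ci.yml can look like (the image tag, the stage names, and the GCP_SERVICE_ACCOUNT_KEY / GCP_PROJECT CI variables are placeholders, not my exact setup):

    # .gitlab-ci.yml -- base image and gcloud authentication (sketch)
    image: google/cloud-sdk:latest        # comes with Python and the gcloud CLI

    stages:
      - test
      - build-template
      - run

    before_script:
      # GCP_SERVICE_ACCOUNT_KEY and GCP_PROJECT are assumed CI/CD variables
      - echo "$GCP_SERVICE_ACCOUNT_KEY" > /tmp/key.json
      - gcloud auth activate-service-account --key-file=/tmp/key.json
      - gcloud config set project "$GCP_PROJECT"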

After that I run the unit tests and the integration tests of my pipeline. Once those succeed, I build a Flex Template (in your case you would build a classic template) with the gcloud builds submit command.
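Roughly, the test and build jobs can then look like this (pytest, the test layout, the bucket and image names, and the gcloud dataflow flex-template build step are assumptions on top of what I described; for a classic template you would instead stage the template from the Python SDK):

    # Sketch: run tests, then build the pipeline image and the Flex Template spec
    test:
      stage: test
      script:
        - pip install -r requirements.txt
        - pytest tests/unit tests/integration          # assumed test layout

    build-template:
      stage: build-template
      script:
        # Build the pipeline container image with Cloud Build
        - gcloud builds submit --tag "gcr.io/$GCP_PROJECT/my-beam-pipeline:$CI_COMMIT_SHORT_SHA" .
        # Generate the Flex Template spec file in GCS from that image
        - >
          gcloud dataflow flex-template build "gs://my-bucket/templates/my-pipeline.json"
          --image "gcr.io/$GCP_PROJECT/my-beam-pipeline:$CI_COMMIT_SHORT_SHA"
          --sdk-language PYTHON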

Also, if you want to automatically kick off the job after all this, you have 2 options:

  • Either run the pipeline from the command line inside the Docker container of your CI pipeline
  • Or, since you already created a template for your pipeline, trigger it with an HTTP request, for example (see the sketch below)
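For the second option with a classic template, the launch is a single HTTP call to the Dataflow templates.launch endpoint. A sketch (GCP_REGION, the bucket, the template path, and the pipeline parameters are made up for the example):

    # Sketch: manual job that launches the staged template over HTTP
    run-job-via-api:
      stage: run
      when: manual
      script:
        - >
          curl -X POST
          -H "Authorization: Bearer $(gcloud auth print-access-token)"
          -H "Content-Type: application/json"
          -d '{"jobName": "my-pipeline-run", "parameters": {"input": "gs://my-bucket/input/file.json"}}'
          "https://dataflow.googleapis.com/v1b3/projects/$GCP_PROJECT/locations/$GCP_REGION/templates:launch?gcsPath=gs://my-bucket/templates/my-template"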
Idhem

Yes, for me the Medium post actually covers most of it and helped me build my CI pipeline as well.

These are the stages that I have:

  • Infra - Terraform for the pre-requisite GCP infra
  • Build - pip install -r requirements.txt and anything else.
  • Test - Unit, integration, end-to-end. I will implement performance tests with a sample of prod data later on.
  • Security Checks - Secrets scanning, SAST
  • SonarQube for SCA
  • Deploy Template and Metadata (both Manual) to PoC, other environments and Prod. I use standard (classic) templates.
  • Run Job (Manual) - actions to run the job using the DirectRunner for quick testing, and another to run it with the Dataflow runner using gcloud dataflow jobs run ${JOB_NAME}.... There is a sketch of these last two stages after this list.
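A sketch of those last two stages (the module name my_pipeline, the bucket names, and the GCP_PROJECT / GCP_REGION variables are placeholders; a classic template is staged by running the pipeline with --template_location):

    # Sketch: manual deploy and run stages for a classic template
    deploy-template:
      stage: deploy
      image: python:3.10
      when: manual
      script:
        - pip install -r requirements.txt
        # Running the pipeline with --template_location stages the classic
        # template to GCS instead of launching a job
        - >
          python -m my_pipeline
          --runner DataflowRunner
          --project "$GCP_PROJECT"
          --region "$GCP_REGION"
          --staging_location "gs://my-bucket/staging"
          --temp_location "gs://my-bucket/temp"
          --template_location "gs://my-bucket/templates/my_pipeline"

    run-job:
      stage: run
      image: google/cloud-sdk:alpine
      when: manual
      script:
        - >
          gcloud dataflow jobs run "my-pipeline-$CI_COMMIT_SHORT_SHA"
          --gcs-location "gs://my-bucket/templates/my_pipeline"
          --region "$GCP_REGION"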

For most steps I used the python:3.10 image as the default (I ran into issues installing the apache-beam dependency on Python 3.11), and the google/cloud-sdk:alpine image for the gcloud steps.
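In .gitlab-ci.yml terms that just means setting a default image and overriding it on the gcloud-only jobs, as in the sketch above:

    # Sketch: make python:3.10 the default image for every job
    default:
      image: python:3.10      # apache-beam installed cleanly here; it did not on 3.11

    # Jobs that only call gcloud override it per job, e.g.:
    # run-job:
    #   image: google/cloud-sdk:alpine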

There are other things to consider, such as an action to stop a Dataflow job and to roll back to a previously working Dataflow template (you need to keep multiple template versions in GCS).
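A sketch of what those actions could look like (JOB_NAME, ROLLBACK_VERSION, GCP_REGION, and the versioned template paths are assumed CI/CD variables and conventions; rollback here simply means launching a job from an older template path kept in GCS):

    # Sketch: manual actions to drain a running job and to launch an older template version
    stop-job:
      stage: run
      image: google/cloud-sdk:alpine
      when: manual
      script:
        # Look up the active job by name, then drain it (use `cancel` for an immediate stop)
        - >
          JOB_ID=$(gcloud dataflow jobs list --region "$GCP_REGION" --status active
          --filter "name=${JOB_NAME}" --format "value(id)")
        - gcloud dataflow jobs drain "$JOB_ID" --region "$GCP_REGION"

    rollback-job:
      stage: run
      image: google/cloud-sdk:alpine
      when: manual
      script:
        # Launch from a previously uploaded template version, e.g. .../templates/v1/my_pipeline
        - >
          gcloud dataflow jobs run "${JOB_NAME}-rollback"
          --gcs-location "gs://my-bucket/templates/${ROLLBACK_VERSION}/my_pipeline"
          --region "$GCP_REGION"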

Hope this helps.

kuboraam