
What is the best way to implement a CI/CD build process for Apache Beam/Dataflow classic templates & pipelines in Python? I have only found tutorials for this with Java that include Artifact Registry + Cloud Build, but rarely any in-depth tutorials for Python. I'd like to understand the "best-practice" way to develop pipelines in a GitHub repo and then have a CI/CD pipeline that automates staging the template and kicking off the job.

This Medium post was one of the more helpful high-level walkthroughs, but it doesn't dive deep into getting all the tools to work together: https://medium.com/posh-engineering/how-to-deploy-your-apache-beam-pipeline-in-google-cloud-dataflow-3b9fe431c7bb

pc_dev

2 Answers


I run a Beam/Dataflow pipeline with a CI/CD pipeline in GitLab. These are the steps my CI/CD pipeline follows:

In my .gitlab-ci.yml file I pull a google/cloud-sdk Docker image, which provides an environment with Python 3.8 and the essential gcloud tools.
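A minimal sketch of what that part of the .gitlab-ci.yml can look like (the image tag, the stage names, and the GCP_SERVICE_ACCOUNT_KEY / GCP_PROJECT CI variables are placeholders, not my exact setup):

    # .gitlab-ci.yml -- base image and gcloud authentication (sketch)
    image: google/cloud-sdk:latest        # comes with Python and the gcloud CLI

    stages:
      - test
      - build-template
      - run

    before_script:
      # GCP_SERVICE_ACCOUNT_KEY and GCP_PROJECT are assumed CI/CD variables
      - echo "$GCP_SERVICE_ACCOUNT_KEY" > /tmp/key.json
      - gcloud auth activate-service-account --key-file=/tmp/key.json
      - gcloud config set project "$GCP_PROJECT"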

After that I run the unit tests and the integration tests of my pipeline. Once those succeed, I build a Flex Template (in your case you would build a classic template) with the gcloud builds submit command.
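Roughly, the test and build jobs can then look like this (pytest, the test layout, the bucket and image names, and the gcloud dataflow flex-template build step are assumptions on top of what I described; for a classic template you would instead stage the template from the Python SDK):

    # Sketch: run tests, then build the pipeline image and the Flex Template spec
    test:
      stage: test
      script:
        - pip install -r requirements.txt
        - pytest tests/unit tests/integration          # assumed test layout

    build-template:
      stage: build-template
      script:
        # Build the pipeline container image with Cloud Build
        - gcloud builds submit --tag "gcr.io/$GCP_PROJECT/my-beam-pipeline:$CI_COMMIT_SHORT_SHA" .
        # Generate the Flex Template spec file in GCS from that image
        - >
          gcloud dataflow flex-template build "gs://my-bucket/templates/my-pipeline.json"
          --image "gcr.io/$GCP_PROJECT/my-beam-pipeline:$CI_COMMIT_SHORT_SHA"
          --sdk-language PYTHON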

Also, if you want to automatically kick off the job after all this, you have 2 options:

  • Either run the pipeline from the command line inside the Docker container of your CI pipeline
  • Or, since you already created a template for your pipeline, trigger it with an HTTP request, for example (see the sketch below)
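For the second option with a classic template, the launch is a single HTTP call to the Dataflow templates.launch endpoint. A sketch (GCP_REGION, the bucket, the template path, and the pipeline parameters are made up for the example):

    # Sketch: manual job that launches the staged template over HTTP
    run-job-via-api:
      stage: run
      when: manual
      script:
        - >
          curl -X POST
          -H "Authorization: Bearer $(gcloud auth print-access-token)"
          -H "Content-Type: application/json"
          -d '{"jobName": "my-pipeline-run", "parameters": {"input": "gs://my-bucket/input/file.json"}}'
          "https://dataflow.googleapis.com/v1b3/projects/$GCP_PROJECT/locations/$GCP_REGION/templates:launch?gcsPath=gs://my-bucket/templates/my-template"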
Idhem

Yes, for me the Medium post actually covers most of it and helped me build my CI pipeline as well.

These are the stages that I have:

  • Infra - Terraform for the pre-requisite GCP infra
  • Build - pip install -r requirements.txt and anything else.
  • Test - Unit, integration, end-to-end. I will implement performance tests with a sample of prod data later on.
  • Security Checks - Secrets scanning, SAST
  • SonarQube for SCA
  • Deploy Template and Metadata (both Manual) to PoC, other environments and Prod. I use standard (classic) templates.
  • Run Job (Manual) - actions to run the job using the DirectRunner for quick testing, and another to run it with the Dataflow runner using gcloud dataflow jobs run ${JOB_NAME}.... There is a sketch of these last two stages after this list.
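A sketch of those last two stages (the module name my_pipeline, the bucket names, and the GCP_PROJECT / GCP_REGION variables are placeholders; a classic template is staged by running the pipeline with --template_location):

    # Sketch: manual deploy and run stages for a classic template
    deploy-template:
      stage: deploy
      image: python:3.10
      when: manual
      script:
        - pip install -r requirements.txt
        # Running the pipeline with --template_location stages the classic
        # template to GCS instead of launching a job
        - >
          python -m my_pipeline
          --runner DataflowRunner
          --project "$GCP_PROJECT"
          --region "$GCP_REGION"
          --staging_location "gs://my-bucket/staging"
          --temp_location "gs://my-bucket/temp"
          --template_location "gs://my-bucket/templates/my_pipeline"

    run-job:
      stage: run
      image: google/cloud-sdk:alpine
      when: manual
      script:
        - >
          gcloud dataflow jobs run "my-pipeline-$CI_COMMIT_SHORT_SHA"
          --gcs-location "gs://my-bucket/templates/my_pipeline"
          --region "$GCP_REGION"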

For most steps I used the python:3.10 image as the default (I ran into issues installing the apache-beam dependency on Python 3.11), and the google/cloud-sdk:alpine image for the gcloud steps.
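In .gitlab-ci.yml terms that just means setting a default image and overriding it on the gcloud-only jobs, as in the sketch above:

    # Sketch: make python:3.10 the default image for every job
    default:
      image: python:3.10      # apache-beam installed cleanly here; it did not on 3.11

    # Jobs that only call gcloud override it per job, e.g.:
    # run-job:
    #   image: google/cloud-sdk:alpine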

There are other things to consider, such as an action to stop a Dataflow job and to roll back to a previously working Dataflow template (you need to keep multiple template versions in GCS).
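A sketch of what those actions could look like (JOB_NAME, ROLLBACK_VERSION, GCP_REGION, and the versioned template paths are assumed CI/CD variables and conventions; rollback here simply means launching a job from an older template path kept in GCS):

    # Sketch: manual actions to drain a running job and to launch an older template version
    stop-job:
      stage: run
      image: google/cloud-sdk:alpine
      when: manual
      script:
        # Look up the active job by name, then drain it (use `cancel` for an immediate stop)
        - >
          JOB_ID=$(gcloud dataflow jobs list --region "$GCP_REGION" --status active
          --filter "name=${JOB_NAME}" --format "value(id)")
        - gcloud dataflow jobs drain "$JOB_ID" --region "$GCP_REGION"

    rollback-job:
      stage: run
      image: google/cloud-sdk:alpine
      when: manual
      script:
        # Launch from a previously uploaded template version, e.g. .../templates/v1/my_pipeline
        - >
          gcloud dataflow jobs run "${JOB_NAME}-rollback"
          --gcs-location "gs://my-bucket/templates/${ROLLBACK_VERSION}/my_pipeline"
          --region "$GCP_REGION"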

Hope this helps.

kuboraam