
We have a few jobs in GCP to execute sequentially and in parallel.

I'd like to open a discussion to find the best and most cost-effective option. Examples of such jobs: Dataflow, Cloud Functions, etc. The options we are considering:

Cloud Composer (Airflow)

GCP Workflows

asked by Rakesh Sabbani; edited by Mazlum Tosun

1 Answer


I think it depends on several criteria; each solution has pros and cons.

This answer is my opinion and is based on my experience.

Cloud Composer

Pros

  • Fully managed solution with a monitoring tool
  • Based on Python
  • Airflow is open source
  • Airflow has a big community
  • Many operators exist to interact easily with all the GCP services (existing operators for BigQuery, Dataflow, Cloud Run, Cloud Functions...); see the sketch after this section
  • Operators and Python code are easily testable with unit tests
  • Cloud Composer is very fast
  • GKE Autopilot with Composer 2 is more cost effective than Composer 1
  • Better handling of machine sizing and environment size with Composer 2 (no need to destroy and recreate the cluster)
  • Appropriate if you have many DAGs and data pipelines
  • With many DAGs and good sizing, the cost can be reasonable and controlled
  • The structured logs produced by Airflow make it very easy to apply alerting policies
  • Well suited to complex DAGs
  • A failed DAG can be retried from a specific step
  • Complete UI with the Airflow webserver
  • Logs are more readable than with Workflows

Cons

  • A cluster to manage
  • Even if Composer 2 uses GKE in Autopilot mode, it's usually more expensive than a fully serverless solution like Cloud Workflows
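To make the Python side concrete, here is a minimal sketch of an Airflow DAG matching the question's sequential + parallel need: a Dataflow job followed by two Cloud Functions invoked in parallel. The operator classes are real ones from the apache-airflow-providers-google package, but the project, region, template path and function names are placeholder assumptions:

```python
# Minimal sketch (not a production setup): one Dataflow job, then two
# Cloud Functions triggered in parallel. Project, region, bucket, template
# and function names below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.operators.functions import (
    CloudFunctionInvokeFunctionOperator,
)

PROJECT_ID = "my-project"  # placeholder
REGION = "europe-west1"    # placeholder

with DAG(
    dag_id="sequential_and_parallel_jobs",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Sequential step: launch a Dataflow job from a public template.
    dataflow_job = DataflowTemplatedJobStartOperator(
        task_id="dataflow_job",
        job_name="my-dataflow-job",
        template="gs://dataflow-templates/latest/Word_Count",
        project_id=PROJECT_ID,
        location=REGION,
        parameters={
            "inputFile": "gs://my-bucket/input.txt",
            "output": "gs://my-bucket/output",
        },
    )

    # Parallel fan-out: invoke two Cloud Functions once Dataflow is done.
    function_a = CloudFunctionInvokeFunctionOperator(
        task_id="function_a",
        function_id="function-a",        # placeholder function name
        input_data={"data": "trigger"},
        project_id=PROJECT_ID,
        location=REGION,
    )
    function_b = CloudFunctionInvokeFunctionOperator(
        task_id="function_b",
        function_id="function-b",        # placeholder function name
        input_data={"data": "trigger"},
        project_id=PROJECT_ID,
        location=REGION,
    )

    # Dependencies: the sequential step first, then both functions in parallel.
    dataflow_job >> [function_a, function_b]
```

And since testability is listed as a pro above, a small pytest-style check of the DAG structure could look like this (assuming the DAG above lives in a dags/ folder):

```python
# Hypothetical unit test for the DAG sketched above. DagBag parses the
# files exactly as the scheduler would, so import errors surface here.
from airflow.models import DagBag


def test_dag_structure():
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert dag_bag.import_errors == {}

    dag = dag_bag.get_dag("sequential_and_parallel_jobs")
    assert dag is not None
    # The Dataflow step must fan out to both Cloud Function tasks.
    assert dag.get_task("dataflow_job").downstream_task_ids == {
        "function_a",
        "function_b",
    }
```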

Cloud Workflows

Pros

  • Fully serverless solution
  • Cost effective
  • Parallel steps were added not long ago
  • Well suited if you don't have a large number of DAGs
  • Uses the Google Cloud APIs directly to interact with services; see the sketch after this section

Cons

  • The YAML code of a Workflow is verbose
  • Testability is more difficult than with Python code in Airflow
  • In my opinion, it is less suited to complex DAGs and pipelines
  • With complex DAGs, the YAML code is less maintainable than the equivalent Python code
  • Retrying from a specific step is not possible
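For comparison, here is a minimal sketch of a roughly equivalent Cloud Workflows definition in its native YAML: the Dataflow template launch as a sequential step, then a parallel block fanning out to two Cloud Functions over HTTP. The Dataflow connector and the parallel syntax are real Workflows features; the project, region and function URLs are placeholder assumptions, and polling the Dataflow job until completion is omitted:

```yaml
# Minimal sketch (placeholder project/region/URLs): one sequential Dataflow
# step followed by a parallel fan-out to two Cloud Functions.
main:
  steps:
    - run_dataflow_job:
        # Dataflow connector: launches the template; waiting for the job
        # to finish would need an extra polling loop, omitted here.
        call: googleapis.dataflow.v1b3.projects.locations.templates.launch
        args:
          projectId: my-project
          location: europe-west1
          gcsPath: gs://dataflow-templates/latest/Word_Count
          body:
            jobName: my-dataflow-job
            parameters:
              inputFile: gs://my-bucket/input.txt
              output: gs://my-bucket/output
    - fan_out:
        parallel:
          branches:
            - branch_a:
                steps:
                  - invoke_function_a:
                      call: http.post
                      args:
                        url: https://europe-west1-my-project.cloudfunctions.net/function-a
                        auth:
                          type: OIDC
            - branch_b:
                steps:
                  - invoke_function_b:
                      call: http.post
                      args:
                        url: https://europe-west1-my-project.cloudfunctions.net/function-b
                        auth:
                          type: OIDC
```

Both sketches describe the same topology; even in this small example you can see the verbosity and extra nesting mentioned in the cons above.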
answered by Mazlum Tosun
  • Workflow is the best :p :D :D :D – guillaume blaquiere Jan 11 '23 at 21:05
  • Testability and resume-on-error are the weaknesses of Workflows. Readability is also difficult at the beginning (and multi-file is not yet supported) – guillaume blaquiere Jan 11 '23 at 21:06
  • However, you missed a major con of Composer: you can write Python code. Yes, it's a con! So many data engineers mix the orchestration functions with the business-process functions, and you end up with DAGs entangled with data processing, which is a nightmare in the end (in addition to poor scalability and performance) – guillaume blaquiere Jan 11 '23 at 21:08
  • Thanks for your comments Guillaume, I take the points. I see Python as a real advantage for managing orchestration :) As for bad practices by data engineers, unfortunately we see them in many parts of software (backend, frontend, data...). I prefer to use a tool that makes my life easier and gives the team better maintainability, then train them on best practices. – Mazlum Tosun Jan 11 '23 at 23:12
  • It also depends on the use case: with a small number of DAGs, Workflows should be better and cheaper. For a data team with a large number of DAGs and sometimes complex orchestration logic, Airflow should be simpler and more maintainable – Mazlum Tosun Jan 11 '23 at 23:13
  • I also often see too much logic in the Workflows YAML (dynamic code, loops...). YAML is good for configuration, but when it contains logic we lose readability and maintainability. – Mazlum Tosun Jan 11 '23 at 23:22
  • However, the serverless aspect is very nice, cheap and practical. If I had a framework to code the orchestration logic, I would definitely choose Workflows for its lightweight and serverless aspects. – Mazlum Tosun Jan 11 '23 at 23:24