
This is going to be a fairly general question. I have a pipeline that I would like to execute in real time. The pipeline can have sudden and unpredictable load changes, so scalability (both up and down) is important. The pipeline stages can be packaged as Docker containers, though they don't necessarily start that way.

I see three ways to build said pipeline on AWS: 1) I can write an Airflow DAG and use AWS Managed Workflows for Apache Airflow (MWAA). 2) I can write an AWS Lambda pipeline with AWS Step Functions. 3) I can write a Kubeflow pipeline on top of AWS EKS.

These three options have different ramifications in terms of cost and scalability, I would presume. E.g. scaling a Kubernetes cluster in AWS EKS will be a lot slower than scaling Lambda functions, assuming I don't hit the Lambda service quotas. Can someone comment on the scalability of AWS-managed Airflow (MWAA)? Does it scale faster than EKS? How does it compare to AWS Lambda?

bumpbump

1 Answer

Why not use Airflow to orchestrate the entire pipeline? Airflow can certainly invoke a Step Function using the StepFunctionStartExecutionOperator or by writing a custom Python function to do the same with the PythonOperator.
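
For illustration, here's a minimal sketch of that first approach. It assumes the Amazon provider package (`apache-airflow-providers-amazon`) is installed; the exact import path can vary between provider versions, and the state machine ARN, input, and connection id below are placeholders:

```python
# Minimal sketch: an Airflow DAG that starts a Step Functions execution.
# The state machine ARN, input payload, and connection id are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.step_function import (
    StepFunctionStartExecutionOperator,
)

with DAG(
    dag_id="invoke_step_function",
    start_date=datetime(2021, 7, 1),
    schedule_interval=None,  # triggered on demand, not on a schedule
    catchup=False,
) as dag:
    start_execution = StepFunctionStartExecutionOperator(
        task_id="start_step_function",
        state_machine_arn="arn:aws:states:us-east-1:123456789012:stateMachine:my-pipeline",
        state_machine_input={"payload": "example"},
        aws_conn_id="aws_default",
    )
```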

Seems like this solution would be the best of both worlds: true data orchestration, monitoring, and alerting in Airflow (while keeping a fairly light Airflow instance since it's pure orchestration) with the scalability and responsiveness in AWS Lambda.

I've used this method for a very similar use case in the past and it worked like a charm. Plus, if you need to scale this pipeline to integrate with other services and systems in the future, Airflow gives you that flexibility because it's an orchestrator and is system- and provider-agnostic.

Josh Fell
  • If I am going to invoke a Step Function inside of an Airflow node, why don't I just use Step Functions to orchestrate the whole thing? Why the additional layer of orchestration? – bumpbump Jul 01 '21 at 03:13
  • Since this is a general question, I'll say the classic answer: "It depends." If this pipeline will ever be more than just executing Lambda functions, Airflow provides the integrations with a myriad of other systems and providers, as well as pipeline monitoring and alerting in one place. You could run Docker containers with the [DockerOperator](https://registry.astronomer.io/providers/docker/modules/dockeroperator) (see the sketch after these comments) or even chain Step Functions, Lambdas, etc. without having to manage those dependencies in the services themselves. – Josh Fell Jul 01 '21 at 12:34
  • If the pipeline will never be more than Lambdas, seems like you answered your own question :) – Josh Fell Jul 01 '21 at 12:35
  • I am more curious about the scalability. In particular, how quickly can this scale up? I am a bit familiar with Step Functions, which imposes scaling limits even on top of the underlying Lambda infrastructure. Is the scalability of Airflow purely limited to the underlying executor? – bumpbump Jul 01 '21 at 17:44
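
For completeness, here's a minimal sketch of running a containerized pipeline stage with the DockerOperator mentioned in the comments above. It assumes `apache-airflow-providers-docker` is installed and a Docker daemon is reachable at the default local socket; the image name and command are placeholders:

```python
# Minimal sketch: running one containerized pipeline stage with the DockerOperator.
# The image name and command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="containerized_stage",
    start_date=datetime(2021, 7, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_stage = DockerOperator(
        task_id="run_pipeline_stage",
        image="my-registry/pipeline-stage:latest",  # placeholder image
        command="python run_stage.py",              # placeholder entrypoint command
        docker_url="unix://var/run/docker.sock",    # local Docker daemon socket
        network_mode="bridge",
    )
```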