
How can I stream data between tasks in a workflow with the help of a data pipeline orchestration tool like Prefect, Dagster or Airflow?

I am looking for a good data pipeline orchestration tool. I think I have a fairly decent overview now of what Apache Airflow is capable of. One thing I am missing in Airflow is the possibility to stream data between tasks.

I have an existing Python pipeline which extracts, transforms and loads data and uses Unix pipes in between. In bash syntax: `extract | transform | load`, meaning all three processes/tasks run in parallel.
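
For illustration, a minimal sketch of that setup with Python's `subprocess` module (the script names `extract.py`, `transform.py` and `load.py` are placeholders):

import subprocess

# Sketch only: chain three placeholder scripts with OS pipes so that all of them
# run concurrently, equivalent to `extract | transform | load` in bash.
extract = subprocess.Popen(["python", "extract.py"], stdout=subprocess.PIPE)
transform = subprocess.Popen(
    ["python", "transform.py"], stdin=extract.stdout, stdout=subprocess.PIPE
)
load = subprocess.Popen(["python", "load.py"], stdin=transform.stdout)

# Close our copies of the upstream pipe ends so SIGPIPE propagates correctly.
extract.stdout.close()
transform.stdout.close()
load.wait()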

I am aware that I could use Airflow with two intermediary storage targets in between and then start pulling the data as soon as it is available. My understanding is that I would have to create 3 distinct DAGs for this or keep everything in a single task where I would have to parallelize the processes manually. I could be wrong but that seems like a bad architecture for this solution. It should be possible to represent this workflow in a single abstraction and let the orchestration tool take care of parallelization.

I am also aware that using pipes might not work for all executors since they might reside on different nodes. However for this solution it would be fine to restrict the workflow to a single node or use an alternative way of streaming the data as long as it stays simple.

ELT would be another approach, but I don't like it much because it is way more sensible to remove sensitive data before it reaches the destination, not after. Plus the transform step in between allows me to reduce the amount of data I have to transfer and store considerably and also reduces the complexity of maintaining a temporary schema in the destination database :) Somehow the current shift to ELT does not appeal to me much.

phobic
  • You can use Dagster and configure an `io_manager`, which could be a cloud bucket or any other storage you have; the outputs then become available to the other ops/assets. You can go with the asset approach or the op approach. – Kay Mar 08 '23 at 04:19
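
For reference, a minimal sketch of the asset approach mentioned in the comment, assuming a recent Dagster version (1.3+); `FilesystemIOManager` stands in for whatever storage you configure, and the asset names are made up:

from dagster import Definitions, FilesystemIOManager, asset

@asset
def raw_records():  # extract
    return [1, 2, 3]

@asset
def cleaned_records(raw_records):  # transform; the input is loaded via the io_manager
    return [r + 1 for r in raw_records]

defs = Definitions(
    assets=[raw_records, cleaned_records],
    # Each asset's output is persisted by the io_manager and handed to downstream assets.
    resources={"io_manager": FilesystemIOManager(base_dir="/tmp/dagster_storage")},
)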

2 Answers


With Prefect, you could just use Python and in your flow, feed the result of one task to another, like this:

from prefect import flow, task
import httpx  

@task
def fetch_cat_fact():  # extract
    return httpx.get("https://catfact.ninja/fact?max_length=140").json()["fact"]

@task
def formatting(fact: str):  # transform
    return fact.title()

@task
def write_fact(fact: str):  # load
    with open("fact.txt", "w+") as f:
        f.write(fact)

@flow
def pipe():  # main function
    fact = fetch_cat_fact()
    formatted_fact = formatting(fact)
    write_fact(formatted_fact)


if __name__ == "__main__":
    pipe()

If you wanted to add concurrency or parallelism with Dask or Ray, you could do so. Here's an example of parallelism with Dask, from a tutorial in the docs:

from prefect import flow, task
from prefect_dask.task_runners import DaskTaskRunner

@task
def say_hello(name):
    print(f"hello {name}")

@task
def say_goodbye(name):
    print(f"goodbye {name}")

@flow(task_runner=DaskTaskRunner())
def greetings(names):
    for name in names:
        say_hello.submit(name)
        say_goodbye.submit(name)

if __name__ == "__main__":
    greetings(["arthur", "trillian", "ford", "marvin"])

Example based on the question in the comments:

from prefect import flow, task
from prefect_dask.task_runners import DaskTaskRunner

@task
def extract(num: int) -> int:
    return num

@task
def transform(num: int) -> int:
    return num + 1

@task(log_prints=True)
def prnt(num: int):
    print(num)

@flow(task_runner=DaskTaskRunner())
def adder(nums):
    for num in nums:
        val = extract(num)  # sequential b/c not submitted to task runner
        res = transform.submit(val)  # parallel b/c submitted to Dask task runner
        prnt.submit(res)  # parallel b/c submitted to Dask task runner


if __name__ == "__main__":
    adder([1, 2, 3])
jeffhale
  • Cool. Could you clarify a few points? In the second example, can I `say_goodbye.submit(name)` to kick off a third parallel task in the workflow? And what is a practical way to stream information between the tasks? I'm not sure how dask operates, could I still use pipes or do the tasks run on different nodes? Or is there some kind of interprocess queue I could use? – phobic Mar 06 '23 at 06:25
  • I meant in the second example, inside of the `say_hello` function can I write `say_goodbye.submit(name)` to kick off a third parallel task in the DAG (like extract -> transform -> load)? – phobic Mar 06 '23 at 06:33
  • If you want to run a few `say_goodbye` tasks in parallel with dask, just use a for loop in the `greetings` flow. Just regular Python works to stream the data - just return the result of one function and feed it into the next. You can learn about waiting for future states to resolve in Prefect here: https://docs.prefect.io/concepts/task-runners/#using-results-from-submitted-tasks You can check out more about prefect-dask here: https://prefecthq.github.io/prefect-dask/usage_guide/ – jeffhale Mar 06 '23 at 16:04
  • Sorry, maybe I misunderstand. Could you rewrite your second example to have an extract task that reads values from a list [1,2,3] and streams the values one at a time to a parallel transform task which adds 1 to each value and forwards that to another task load that prints the value? Each task should ideally run as a single process that runs in parallel with the other tasks. From what I understand, it looks like you would spawn 3 transform processes, one for each value instead of a single process that processes the stream of values one at a time. – phobic Mar 07 '23 at 07:14
  • You can't call a task from another task, but you can mix sequential execution and parallel execution as you choose. Will add more code to answer that is what I think you're asking for. – jeffhale Mar 07 '23 at 16:20
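
To make the point in the comments about waiting on submitted tasks concrete, here is a minimal sketch using Prefect futures: `.submit()` returns a future and `.result()` blocks until the task finishes. The `double` task is made up:

from prefect import flow, task

@task
def double(x: int) -> int:
    return 2 * x

@flow
def waiting_example():
    future = double.submit(21)   # schedule on the configured task runner
    value = future.result()      # block until the task finishes, then get its return value
    print(value)                 # 42

if __name__ == "__main__":
    waiting_example()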

Airflow can pretty much handle Extract/Transform/Load in a single DAG. But before coming to that, let us try to understand your use case.

  1. It seems you want to pass/stream data that is being processed between different tasks. But how large is your dataset? If the data being processed is of large volume, you would need a tool/system built to process large datasets, like Spark, EMR or Snowflake. If your dataset is of small size currently, at what pace will it grow?
  2. One of the reasons for the shift towards ELT from traditional ETL tools is that you want to keep the processing as close to the data as possible to save costs. As an example, imagine reading source data from a table into Informatica, processing/transforming it in Informatica, and then loading it back to the database. Compare this to using a data orchestrator to run the transformation via SQL in Snowflake/Redshift/BQ, using their compute power, and writing the result back to the same DB.

Data orchestration allows you to segregate the "common" extract and load patterns and focus on the gist of the data pipelines. By allowing you to transform/process the data closer to where it is stored, it helps you save money and gives you the flexibility to integrate heterogeneous systems. As outlined in #1 above, even if you are processing data using Spark or any other external system, Airflow can communicate with it. It gives you a holistic view of how your data tools interact with each other, and makes it easier to debug, troubleshoot and expand your reach in your data landscape.

Airflow can allow you to pass data between tasks using XCom. If you have larger datasets, you can use an XCom backend to support that. See here for reference. An easy way to do ETL is by using the astro-sdk in Airflow; see an example here and the Getting Started guide here.
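
As a rough illustration of passing data between tasks via XCom with the TaskFlow API (a minimal sketch, assuming Airflow 2.4+; the task names and values are made up):

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2023, 1, 1), catchup=False)
def etl_example():
    @task
    def extract():
        return [1, 2, 3]                 # the return value is pushed to XCom

    @task
    def transform(values):
        return [v + 1 for v in values]   # the argument is pulled from XCom automatically

    @task
    def load(values):
        print(values)

    load(transform(extract()))

etl_example()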

manmeet
  • Thanks for your answer! I would be interested in how to get this to work with Airflow. 1. Several GB worth of data. Processing it with a simple script on my compute node is fine. No need to make it more complicated for this use case. 2. Thanks for the illustration, I got it. But in my use case ELT has the opposite effect as described in my post. Thanks for the references! I still have to read them, but just in case even if I can use XCoM for large data sets it does not support streaming between tasks that run in parallel inside the same DAG, or? – phobic Mar 06 '23 at 07:37
  • @phobic One clarifying question: when you say "streaming between tasks", do you mean actual data streaming, or just passing the data between tasks? – manmeet Mar 08 '23 at 07:34
  • Streaming similar to a bash pipe. Like `extract_big_data | transform_line_by_line | load_line_by_line` where `extract_big_data` extracts and streams a big file line by line to the transform process and the transform step sends data line by line to the load process. There should ideally be 3 processes and all processes run in parallel. – phobic Mar 08 '23 at 08:15
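
For what this last comment describes, one manual option hinted at in the question (keeping everything in a single task and parallelizing yourself) would be plain multiprocessing queues; a rough sketch, independent of any orchestrator, with made-up data:

from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of the stream

def extract_big_data(out_q: Queue):
    for line in ["a", "b", "c"]:          # stand-in for reading a large file line by line
        out_q.put(line)
    out_q.put(SENTINEL)

def transform_line_by_line(in_q: Queue, out_q: Queue):
    while (line := in_q.get()) is not SENTINEL:
        out_q.put(line.upper())           # transform one line at a time
    out_q.put(SENTINEL)

def load_line_by_line(in_q: Queue):
    while (line := in_q.get()) is not SENTINEL:
        print(line)                       # stand-in for writing to the destination

if __name__ == "__main__":
    q1, q2 = Queue(), Queue()
    # Three long-lived processes connected by queues, analogous to a bash pipe.
    procs = [
        Process(target=extract_big_data, args=(q1,)),
        Process(target=transform_line_by_line, args=(q1, q2)),
        Process(target=load_line_by_line, args=(q2,)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()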