How can I stream data between tasks in a workflow using a data pipeline orchestration tool like Prefect, Dagster or Airflow?
I am looking for a good data pipeline orchestration tool. I think I now have a fairly decent overview of what Apache Airflow is capable of. One thing I am missing in Airflow is the ability to stream data between tasks.
I have an existing Python pipeline that extracts, transforms and loads data and uses Unix pipes in between. In bash syntax: `extract | transform | load`, meaning all three processes/tasks run in parallel.
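To make the shape of the pipeline concrete, here is roughly what the middle stage looks like (extract and load follow the same stdin/stdout pattern; the transformation itself is only a placeholder):

```python
# transform stage: streams records from stdin to stdout, so the shell can
# wire the stages together as `extract | transform | load`.
import sys

def transform(line: str) -> str:
    # placeholder logic; the real step drops sensitive fields, reshapes records, etc.
    return line.upper()

for line in sys.stdin:
    sys.stdout.write(transform(line))
```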
I am aware that I could use Airflow with two intermediary storage targets in between and then start pulling the data as soon as it is available. My understanding is that I would have to create 3 distinct DAGs for this, or keep everything in a single task where I would have to parallelize the processes manually (roughly as sketched below). I could be wrong, but that seems like a bad architecture for this solution. It should be possible to represent this workflow in a single abstraction and let the orchestration tool take care of the parallelization.
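For reference, the single-task workaround would look roughly like this minimal sketch (assuming Airflow 2.x; the script names and IDs are placeholders). Airflow only sees one opaque task, so retries, monitoring and the parallelization of the individual stages are entirely on me:

```python
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_pipeline():
    # wire the three stages together with OS pipes, exactly like the shell would
    extract = subprocess.Popen(["python", "extract.py"], stdout=subprocess.PIPE)
    transform = subprocess.Popen(["python", "transform.py"],
                                 stdin=extract.stdout, stdout=subprocess.PIPE)
    extract.stdout.close()  # let extract receive SIGPIPE if transform exits early
    load = subprocess.Popen(["python", "load.py"], stdin=transform.stdout)
    transform.stdout.close()
    for proc in (extract, transform, load):
        if proc.wait() != 0:
            raise RuntimeError(f"pipeline stage failed with code {proc.returncode}")

with DAG("stream_etl", start_date=datetime(2023, 1, 1), schedule=None, catchup=False):
    PythonOperator(task_id="extract_transform_load", python_callable=run_pipeline)
```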
I am also aware that using pipes might not work for all executors, since the tasks might reside on different nodes. However, for this solution it would be fine to restrict the workflow to a single node or to use an alternative way of streaming the data, as long as it stays simple.
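One simple alternative that stays on a single node would be to run the three stages as separate processes connected by bounded queues instead of OS pipes; everything below is only an illustrative sketch with placeholder stage implementations:

```python
from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of the stream

def extract(out_q: Queue):
    for i in range(10):                    # stand-in for reading from the source
        out_q.put({"id": i, "secret": "x"})
    out_q.put(SENTINEL)

def transform(in_q: Queue, out_q: Queue):
    while (record := in_q.get()) is not SENTINEL:
        record.pop("secret", None)         # e.g. drop sensitive fields early
        out_q.put(record)
    out_q.put(SENTINEL)

def load(in_q: Queue):
    while (record := in_q.get()) is not SENTINEL:
        print("loading", record)           # stand-in for writing to the destination

if __name__ == "__main__":
    q1, q2 = Queue(maxsize=1000), Queue(maxsize=1000)
    stages = [Process(target=extract, args=(q1,)),
              Process(target=transform, args=(q1, q2)),
              Process(target=load, args=(q2,))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()
```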
ELT would be another approach, but I don't like it much, because it is far more sensible to remove sensitive data before it reaches the destination, not after. Plus, the transform step in between lets me considerably reduce the amount of data I have to transfer and store, and it also spares me the complexity of maintaining a temporary schema in the destination database :) Somehow the current shift to ELT does not appeal to me much.