
I am currently evaluating different design pattern options for our pipelines. The Kedro framework seems like a good option (it allows for a modular design, provides visualization tooling, etc.).

The pipelines should be built from many modules that either write their output to a file or pipe it to the next module (depending on a condition). In the second case (pipe to the next module) Kedro falls short, as it reads the whole output into memory and then forwards it to the next step (or is there a way to get Unix-style piping?). I am working with big data, so this is a deal-breaker for me. How is this workflow different from a usual Unix pipe? Unix pipes read a fixed-size buffer at a time and forward it right away (I guess the data gets moved to disk rather than kept in memory?). I would appreciate a pointer to another framework that allows such functionality (I also don't mind implementing the design pattern from scratch).

EDIT: I have nodes that mainly depend on external binaries, and therefore I would like to achieve Unix-like piping (see the sketch below for the behavior I mean).
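
To be concrete, the following rough sketch (using sort and uniq purely as placeholder binaries) streams the intermediate output through an OS pipe buffer rather than materializing it in Python's memory:

import subprocess

# Chain two external binaries through an OS-level pipe; only a small kernel
# buffer is held in memory, never the full intermediate output.
producer = subprocess.Popen(["sort", "big_input.txt"], stdout=subprocess.PIPE)
with open("result.txt", "wb") as sink:
    consumer = subprocess.Popen(["uniq", "-c"], stdin=producer.stdout, stdout=sink)
    producer.stdout.close()  # Let the producer see SIGPIPE if the consumer exits early.
    consumer.wait()
producer.wait()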

Jumpman

2 Answers


Kedro-Accelerator is a Kedro plugin that brings some Unix pipe semantics to Kedro. Specifically, TeePlugin allows for passing data between nodes in memory (as MemoryDataSets) while writing outputs to disk/files in the background.
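
Independent of the plugin, the in-memory hand-off is simply Kedro's default behavior for any dataset that is not registered in the data catalog. A minimal sketch (the node functions and dataset names below are made up for illustration):

from kedro.pipeline import Pipeline, node

def clean(df):  # placeholder node function
    return df.dropna()

def summarize(df):  # placeholder node function
    return df.describe()

pipeline = Pipeline([
    node(clean, inputs="raw_data", outputs="clean_data", name="clean"),
    # "clean_data" is not registered in the catalog, so Kedro backs it with a
    # MemoryDataSet and hands the object to the next node without touching disk.
    node(summarize, inputs="clean_data", outputs="summary", name="summarize"),
])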

Buffering would be delegated to the underlying framework once you're using MemoryDataSets. For instance, for DataFrame objects, the default copy mode is assignment, so the behavior is analogous to running statements in sequence without any loading/saving:

from kedro.extras.datasets.pandas import CSVDataSet

node1_in = CSVDataSet(filepath="data.csv").load()  # Read data from a CSVDataSet as input to the first node.
node1_out = node1_in.dropna()  # The first node performs some operations on the input before returning.
node2_in = node1_out  # If the output of the first node/input to the second node is a MemoryDataSet, no data is passed, just references.
...

For the implementation details (as of Kedro 0.17.0), see https://github.com/quantumblacklabs/kedro/blob/0.17.0/kedro/io/memory_data_set.py#L105-L130.

deepyaman
  • It would still require me to keep e.g. 60 GB of data in memory, which is above my limit if I am running multiple instances. – Jumpman Feb 23 '21 at 09:00
  • If you are working with big data, maybe it makes sense to consider a computing framework like Dask or Spark? pandas works in memory, so this sort of challenge comes from using pandas as the backend. If you want to stick with pandas, you can learn more about potential mitigations at https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html. Note that Kedro has far better support/documentation for Spark than for Dask at this time, although a number of people have used Dask and it's not very difficult to make it work. It may be easier to migrate from pandas to Dask (results could vary); see the sketch below. – deepyaman Feb 24 '21 at 10:28
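
To make the Dask suggestion concrete, here is a minimal out-of-core sketch (the file paths are placeholders, not from the question). Dask reads and processes the CSV data partition by partition instead of loading all of it at once:

import dask.dataframe as dd

# Lazily point at a (potentially huge) set of CSV files; nothing is read yet.
df = dd.read_csv("big_input/*.csv")

# Transformations are also lazy and only build up a task graph.
cleaned = df.dropna()

# Writing triggers the computation partition by partition, so only a handful
# of partitions are held in memory at any point in time.
cleaned.to_csv("cleaned_output/part-*.csv", index=False)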

Kedro is a nice framework, but it is mainly suited to building batch pipelines. If you're looking for Unix-pipe-like behavior, you should look at stream processing frameworks like Spark Streaming.
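
As a rough illustration of the streaming approach, here is a minimal Spark Structured Streaming sketch (paths and schema are placeholders): new files landing in an input directory are processed incrementally in micro-batches rather than being loaded into memory in one go.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Streaming file sources require an explicit schema.
schema = StructType([
    StructField("id", StringType()),
    StructField("value", DoubleType()),
])

# Pick up new CSV files as they arrive in the input directory.
stream = spark.readStream.schema(schema).option("header", True).csv("incoming/")

cleaned = stream.dropna()

# Each micro-batch is written out as Parquet; the full dataset is never
# collected in driver memory.
query = (
    cleaned.writeStream
    .format("parquet")
    .option("path", "output/")
    .option("checkpointLocation", "checkpoints/")
    .start()
)
query.awaitTermination()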

If you want more options, check out the Awesome Streaming list for many other stream processing frameworks.

Sergiy Sokolenko