
How can I set up the pipeline so it can run either from memory or from a file? I think the features are there, but I am not sure how to write the pipeline to do this.

My use case is:

  1. normal pipeline, from step 1 to step 10
  2. run from step 2 to step 10

Imagine that at step 1 I write a dataframe to a CSV, and step 2 needs to read from it. If I am running from step 1, I would want to pass that dataframe in memory (to save the read time). But if I start running from step 2, I will need to read it from the CSV.

What is the best practice to do so with Kedro?

https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipelines.html#pipeline-with-circular-dependencies

mediumnok

2 Answers


I can think of 2 ways, depending on your use case.

a) You could use separate configuration environments for this. When running the full pipeline, you use an environment, say regular, where the dataset in question has no catalog entry (so it is turned into a MemoryDataSet), while in a separate dev environment you add an entry to catalog.yml that saves it as a CSV. It does mean you'd have to run dev from node 1 once in order to generate the CSV used by subsequent runs.
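For illustration, a sketch of what the dev environment's catalog entry could look like; the dataset name intermediate_df and the file path are assumptions, and conf/regular/catalog.yml would simply have no entry for it, so Kedro treats it as a MemoryDataSet:

# conf/dev/catalog.yml  (dataset name and path are made up)
intermediate_df:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/intermediate_df.csv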

kedro run --env regular
kedro run --env dev
kedro run -e dev --from-nodes node2

Relevant docs: https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/02_configuration.html#additional-configuration-environments

b) Another way to do it, if you always want the first node to write to CSV, is to have node1 return two outputs (the same data), one saved as a pandas.CSVDataSet and one kept as a MemoryDataSet, and to define two pipelines: complete, where the second node reads from memory, and partial, which omits node1 and has node2 read from the CSV dataset.
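A minimal sketch of that setup, assuming made-up node functions and dataset names (make_dataframe, process, step1_data_csv, step1_data_mem); step1_data_csv would get a pandas.CSVDataSet entry in catalog.yml while step1_data_mem gets none, and where exactly you register the pipelines depends on your Kedro version (e.g. pipeline_registry.py in newer releases):

import pandas as pd
from kedro.pipeline import Pipeline, node

def make_dataframe():
    df = pd.DataFrame({"x": [1, 2, 3]})
    # Return the same data twice: one copy is persisted via the CSV catalog
    # entry, the other stays in memory for the downstream node.
    return df, df

def process(df):
    return df.assign(y=df["x"] * 2)

def register_pipelines():
    step1 = node(make_dataframe, inputs=None,
                 outputs=["step1_data_csv", "step1_data_mem"], name="node1")
    node2_from_memory = node(process, inputs="step1_data_mem",
                             outputs="step2_output", name="node2")
    node2_from_csv = node(process, inputs="step1_data_csv",
                          outputs="step2_output", name="node2")

    complete = Pipeline([step1, node2_from_memory])
    # "partial" skips node1 and reads the CSV written on a previous run.
    partial = Pipeline([node2_from_csv])
    return {"complete": complete, "partial": partial, "__default__": complete}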

kedro run --pipeline complete
kedro run --pipeline partial

https://kedro.readthedocs.io/en/stable/06_nodes_and_pipelines/02_pipelines.html#running-a-pipeline-by-name


In addition to the options suggested by @Lorena Balan, you can use a CachedDataSet. Your catalog entry would look similar to this:

my_cached_dataset:
  type: CachedDataSet
  dataset:
    type: pandas.CSVDataSet
    filepath: path/to/file

CachedDataSet saves the data using the regular underlying dataset and also populates its internal cache, so a subsequent load pulls the data from that in-memory cache. If the cache is empty (your scenario 2), CachedDataSet on load pulls the data from the underlying [CSV] file.
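A minimal sketch of that behaviour using the dataset API directly (assuming Kedro ~0.16/0.17 import paths and a made-up file path):

import pandas as pd
from kedro.io import CachedDataSet
from kedro.extras.datasets.pandas import CSVDataSet

cached = CachedDataSet(dataset=CSVDataSet(filepath="data/02_intermediate/step1.csv"))

df = pd.DataFrame({"x": [1, 2, 3]})
cached.save(df)           # writes the CSV and also fills the in-memory cache
df_again = cached.load()  # served from the cache, no file read

# A fresh instance has an empty cache (your scenario 2),
# so load() falls back to reading the CSV file.
fresh = CachedDataSet(dataset=CSVDataSet(filepath="data/02_intermediate/step1.csv"))
df_from_disk = fresh.load()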

Dmitry Deryabin
  • CachedDataSet seems more suitable, since I do not know the partial pipeline in advance (for example, I may fail at step 4 due to a network error and need to rerun from there). Are there any drawbacks? I would like to use it for every dataset. – mediumnok Aug 19 '20 at 13:30
  • The only apparent drawback I see is the memory footprint of your pipeline, since all the datasets will be cached in memory during the run. – Dmitry Deryabin Aug 20 '20 at 15:15