I have a Kedro pipeline which generates a file that is used again in the next run of that same pipeline. However, when the pipeline runs for the first time, that file does not exist yet, even though the first-run case is handled in a node of the pipeline, and Kedro throws a missing-file error at that point. Is there a way to handle this through Kedro? Maybe a catalog parameter such as missing=True or optional=True could be added, so that Kedro safely ignores the missing file?

My current workaround is to create an empty file up front and then check in my node whether the loaded dataframe is empty.
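For context, the workaround looks roughly like this inside the node. This is only a sketch: the dataset and function names are illustrative, and it assumes the pre-seeded file loads as an empty dataframe rather than raising an error.

```python
import pandas as pd


def merge_with_previous(previous_output: pd.DataFrame, new_data: pd.DataFrame) -> pd.DataFrame:
    # "previous_output" is loaded from a file that was pre-seeded as empty,
    # so on the very first run it arrives as an empty dataframe instead of
    # triggering a missing-file error in the catalog.
    if previous_output.empty:
        return new_data  # nothing from a previous run yet
    # On subsequent runs, combine the previous run's output with the new data.
    return pd.concat([previous_output, new_data], ignore_index=True)
```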

1 Answer


I don't think this is possible.

I tried to propose a workaround using hooks to inject a custom MissingDataSet on the fly, but this approach didn't work: https://github.com/kedro-org/kedro/issues/2690#issuecomment-1607746840

Apparently DataCatalog is not a singleton, so this is not straightforward.
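For illustration, the idea from the linked issue looks roughly like this: a hook that, before the run, replaces the catalog entry with an in-memory default when the backing file does not exist yet. The dataset name and default value are placeholders, the class is spelled MemoryDataSet in older Kedro releases, and as noted above this did not work reliably because the catalog instance the hook receives is not necessarily the one used later.

```python
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog, MemoryDataset


class OptionalInputHook:
    """Swap an input whose file is missing for an in-memory default."""

    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog) -> None:
        # "previous_run_output" is a placeholder dataset name.
        # If the file behind it does not exist yet (the very first run),
        # replace the entry with an in-memory dataset holding a default,
        # so loading it does not fail.
        if "previous_run_output" in catalog.list() and not catalog.exists("previous_run_output"):
            catalog.add("previous_run_output", MemoryDataset(data=None), replace=True)
```

The hook would be registered in the project's settings.py via the HOOKS tuple, but because of the catalog-copying behaviour mentioned above, the swapped-in dataset may not be the one the runner actually uses.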

astrojuanlu
  • Oh man, this would be a very helpful feature if it were possible. Thanks! – Nandha Kumar Jun 26 '23 at 16:44
  • I'm not sure, but maybe an incremental dataset could be a possible way (see the sketch below). You can define an empty folder as an incremental dataset in the data catalog and it will not fail on the first run. The challenge is to create a partition the first time and keep it unaltered for the next runs. – SprigganCG Jul 21 '23 at 10:31
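A rough sketch of what the node side of that suggestion could look like, assuming the catalog entry is an incremental/partitioned dataset pointing at a folder (all names are illustrative): the node receives a mapping of partition ids to loader functions, which is simply empty on the first run instead of raising a missing-file error.

```python
from typing import Callable, Dict

import pandas as pd


def load_previous_state(partitions: Dict[str, Callable[[], pd.DataFrame]]) -> pd.DataFrame:
    # With an incremental/partitioned dataset, the node receives a mapping of
    # partition id -> loader function rather than the data itself.
    if not partitions:
        # First run: the folder has no partitions yet, so fall back to a default.
        return pd.DataFrame()
    frames = [load() for load in partitions.values()]
    return pd.concat(frames, ignore_index=True)
```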