
I am trying to get a list of datasets from a catalog file I have created, pass them in as the inputs of a single node that combines them, and ultimately run the pipeline on Airflow using the kedro-airflow plugin.

This works on the CLI with kedro run, but it seems to fail in Airflow and I am not sure why:

# my_pipeline/pipeline.py
from kedro.config import ConfigLoader
from kedro.pipeline import Pipeline, node


def create_pipeline(**kwargs):
    conf_loader = ConfigLoader(['conf/base'])
    conf_catalog = conf_loader.get('catalog-a*')

    # every dataset declared in the matched catalog files becomes a node input
    datasets = list(conf_catalog.keys())
    return Pipeline([
        node(
            func=combine_data,
            inputs=datasets,
            outputs="combined_data",
            name="combined_data",
        ),
        # ...other nodes
    ])

The error I am getting on Airflow looks something like this: Broken DAG: Given configuration path either does not exist or is not a valid directory: 'conf/base'

This is definitely a Kedro ConfigLoader error, but I can't figure out why it only occurs when running the pipeline via Airflow. From what I have been reading, mixing in the code API like this is not advised. Is this the right way to pass in a list of datasets?

Edit

My catalog is basically a list of SQL query datasets:

dataset_1:
  type: pandas.SQLQueryDataSet
  sql: select * from my_table where created_at >= '2018-12-21 16:00:00' and partner_id=1
  credentials: staging_sql

dataset_2:
  type: pandas.SQLQueryDataSet
  sql: select * from my_table where created_at >= '2019-08-15 11:55:00' and partner_id=2
  credentials: staging_sql

1 Answer


I think it fails because kedro run executes from the project's root directory, where it can find conf/base, whereas the create_pipeline function lives under the my_pipeline directory and Airflow imports it from a different working directory, so Kedro's ConfigLoader cannot resolve the relative path.
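If you want to keep loading the catalog config inside create_pipeline, one workaround is to build the conf path from the file's own location instead of relying on the current working directory. A minimal sketch, assuming the usual src/<package>/pipelines/my_pipeline/pipeline.py layout (the number of .parents to climb is an assumption, adjust it to your project):

from pathlib import Path
from kedro.config import ConfigLoader

# Resolve conf/base relative to this file so the path stays valid no matter
# which directory Airflow imports the DAG from.
PROJECT_ROOT = Path(__file__).resolve().parents[4]  # adjust to your layout
conf_loader = ConfigLoader([str(PROJECT_ROOT / "conf" / "base")])
conf_catalog = conf_loader.get("catalog-a*")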

Another way I've done this in the past is to pass in the catalog as catalog: DataCatalog, like this:

def create_pipeline(catalog: DataCatalog = None, **kwargs) -> Pipeline:

Then you can iterate over it, or do:

datasets = catalog.datasets
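
Putting that together, a rough sketch of what this answer suggests (it assumes your project wiring actually calls create_pipeline with the project's DataCatalog, which the default Kedro template does not do out of the box, and it reuses combine_data from the question):

from kedro.io import DataCatalog
from kedro.pipeline import Pipeline, node


def create_pipeline(catalog: DataCatalog = None, **kwargs) -> Pipeline:
    # DataCatalog.list() returns the names of all registered datasets;
    # keep only the SQL query datasets that should be combined.
    dataset_names = [name for name in catalog.list() if name.startswith("dataset_")]
    return Pipeline([
        node(
            func=combine_data,
            inputs=dataset_names,
            outputs="combined_data",
            name="combined_data",
        ),
    ])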

– mayurc
  • I appreciate you taking the time. After testing a couple of options, i.e. adding the datasets programmatically instead, and overloading the _get_catalog function in the context to pass my datasets to the pipeline args (didn't work out, will need to test a bit more), I am leaning towards using the database to combine the datasets and read it all in later. Perhaps have an SQL file with the queries instead. It seems to be the recommended way of dealing with this kind of case. – Metrd Sep 29 '20 at 17:16
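
For reference, the database-side combination mentioned in that comment could be as small as a single catalog entry whose query does the combining. A sketch reusing the table, timestamps, and credentials from the question; the union all is only an assumption about how the per-partner results should be merged:

# one query that combines the per-partner selects on the database side
combined_data:
  type: pandas.SQLQueryDataSet
  sql: >
    select * from my_table where created_at >= '2018-12-21 16:00:00' and partner_id=1
    union all
    select * from my_table where created_at >= '2019-08-15 11:55:00' and partner_id=2
  credentials: staging_sql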