Parametrize input datasets in kedro

Question

I'm trying to move my project into a kedro pipeline but I'm struggling with the following step:

My prediction pipeline is run by a scheduler, which supplies all the necessary parameters (dates, country codes, etc.). Until now I had a CLI that took input parameters like this:

python predict --date 2022-01-03 --country UK

The code would then read the input dataset for a given date and for a given country, so the query would be something like:

SELECT *
FROM input_data_{country}
WHERE date = {date}

and this would be formatted using the input variables passed in the CLI.
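A minimal sketch of that formatting step, assuming an `argparse`-based CLI. The argument names come from the command above; `parse_args`, `build_query` and `QUERY_TEMPLATE` are hypothetical names used only for illustration:

```python
import argparse

# Query template copied from the question; placeholders are filled per run.
QUERY_TEMPLATE = """SELECT *
FROM input_data_{country}
WHERE date = {date}"""


def parse_args(argv):
    """Parse the --date and --country values supplied by the scheduler."""
    parser = argparse.ArgumentParser(prog="predict")
    parser.add_argument("--date", required=True)
    parser.add_argument("--country", required=True)
    return parser.parse_args(argv)


def build_query(args):
    """Fill the template with the parsed CLI values."""
    return QUERY_TEMPLATE.format(country=args.country, date=args.date)


# Equivalent to: python predict --date 2022-01-03 --country UK
example_args = parse_args(["--date", "2022-01-03", "--country", "UK"])
example_query = build_query(example_args)
print(example_query)
```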

Important note: the code has to run on any arbitrary date passed by the scheduler, not only on "today".

How would I parametrize Kedro's data catalog using CLI arguments?

I tried the examples in the Kedro documentation, but they seem mainly geared towards filling templates from config files when reading data. The key issue I'm struggling with is passing CLI arguments to the data catalog, and I haven't found a working solution. I looked into PartitionedDataSet, but I don't see an option to use CLI arguments as inputs there.

score 0 · Answer 1 · answered Feb 10 '23 at 10:07

I found the answer, here it is if anyone has a similar problem.

The key is to use a TemplatedConfigLoader class and put template variables into catalog.yml.

So with my example of country code:

SELECT *
FROM input_data_${country}

will get variables from globals.yml:

country: "UK"

but only if settings.py is set up as follows, so that the config loader uses the variables from the global config:

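from kedro.config import TemplatedConfigLoader


class MyTemplatedConfigLoader(TemplatedConfigLoader):
    """TemplatedConfigLoader that also substitutes values passed via --params."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Let runtime parameters override the values loaded from globals.yml.
        if self.runtime_params:
            self._config_mapping.update(self.runtime_params)


# Assigned after the class definition so the name exists when settings.py is imported.
CONFIG_LOADER_CLASS = MyTemplatedConfigLoader
# Keyword arguments to pass to the `CONFIG_LOADER_CLASS` constructor.
CONFIG_LOADER_ARGS = {
    "globals_pattern": "*globals.yml",
}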

Now the variables can be overridden from the terminal like this:

kedro run --pipeline=predict --params country:US
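To illustrate the override order only, here is a plain-Python sketch that does not use Kedro at all. It uses the standard library's `string.Template` as a stand-in for Kedro's own substitution logic, and `globals_config`, `runtime_params` and `render` are hypothetical names:

```python
from string import Template

# Hypothetical stand-ins for the values in globals.yml and the values
# passed on the command line via --params.
globals_config = {"country": "UK"}
runtime_params = {"country": "US"}

# A catalog query containing a ${country} placeholder.
sql_template = "SELECT * FROM input_data_${country}"


def render(template, globals_cfg, params):
    """Substitute placeholders, letting runtime params override globals."""
    mapping = {**globals_cfg, **params}
    return Template(template).substitute(mapping)


# Without --params the value from globals.yml is used.
default_query = render(sql_template, globals_config, {})
# With --params country:US the runtime value wins.
overridden_query = render(sql_template, globals_config, runtime_params)
print(default_query)
print(overridden_query)
```

The subclass above follows the same idea: the globals are loaded first, then updated with whatever was passed in `--params`.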
