
I would like to call an API to enrich an existing dataset.

The existing dataset is a CSVDataSet configured in the catalog.
Now I would like to create a node that enriches the CSVDataSet with data from the API, which I have to call for every row in the CSV file. The result should then be saved into a database (SQLTableDataSet). My approach is to create an APIDataSet entry in the catalog and provide it as an input for the node, alongside the CSVDataSet.
The issue here is that the APIDataSet is static (in general, the DataSets seem to be very static). I need to call the load function at runtime, within the node, for every entry in the CSV file.
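Roughly what I have in mind (the dataset, column, and function names here are made up):

    import pandas as pd


    def enrich_rows(rows: pd.DataFrame, api_data: pd.DataFrame) -> pd.DataFrame:
        # what I actually need is one API lookup per row, at runtime,
        # but api_data has already been loaded once by the APIDataSet
        # before the node even runs
        return rows.merge(api_data, on="id", how="left")

The node would get the CSVDataSet and the APIDataSet as inputs, and its output would be saved via the SQLTableDataSet.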

I didn't find a way to do this. Is it just a bad approach? Do I have to call the API within the node instead of creating an APIDataSet?

ndueck

2 Answers


So typically, we don't like our nodes having knowledge of IO configuration. The belief is that functionally pure Python functions are easier to test, maintain and build.

Typically, the way we would keep this distinction is for you to subclass our APIDataSet, our CSVDataSet, or both, and then add your custom logic there.
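Something like this rough, untested sketch, where the api_url argument, the "id" column and the JSON response shape are placeholders you would adapt (import path is the kedro.extras one, adjust for your version):

    import pandas as pd
    import requests

    from kedro.extras.datasets.pandas import CSVDataSet


    class APIEnrichedCSVDataSet(CSVDataSet):
        """Reads the CSV as usual, then enriches every row with an API response."""

        def __init__(self, filepath: str, api_url: str, **kwargs):
            super().__init__(filepath=filepath, **kwargs)
            self._api_url = api_url  # e.g. "https://example.com/items/{id}" (placeholder)

        def _load(self) -> pd.DataFrame:
            df = super()._load()
            # one request per row; the "id" column and the JSON shape are assumptions
            responses = [
                requests.get(self._api_url.format(id=row_id), timeout=10).json()
                for row_id in df["id"]
            ]
            return df.join(pd.DataFrame(responses, index=df.index))

You would then point the catalog entry's type at this class and pass api_url as an extra argument, so all of the IO configuration stays in the catalog.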

datajoely
  • In my case the data from the APIDataSet depends on the data from the CSVDataSet. Do you have a suggestion how to implement/design this? – ndueck Aug 22 '22 at 09:09
  • I would subclass CSVDataSet and then make an API call with requests.get() – datajoely Aug 22 '22 at 14:29
  • So you would process the data when creating the dataset, if I'm not mistaken. This would hide data processing in the subclass. Applying this approach to the spaceflights example in the Kedro docs, for instance, I would subclass AbstractDataSet, which would give me a DataSet with companies already merged with shuttles. Imagine you could get the shuttles data only through an API, and could not fetch all shuttles at once. – ndueck Aug 22 '22 at 14:51
  • I'd prefer something like a custom, lazy-loaded DataSet that cannot be iterated at first, but loads the data (shuttle data) once it is provided with data (e.g. shuttle_ids), as sketched after these comments. Not sure if this is possible with pandas DataFrames. Just started getting into Data Engineering/Science. – ndueck Aug 22 '22 at 14:56
  • Yes, you would have to hide the processing logic in the dataset class and configure it through the catalog definition. In Kedro it's a hard rule that nodes shouldn't have knowledge of IO. – datajoely Aug 22 '22 at 15:30
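For that lazy-loading idea, one possible sketch (untested; the class name, URL and shuttle_ids parameter are all made up) is a dataset whose load() returns a function, similar in spirit to how PartitionedDataSet hands back loaders, so the node decides which IDs to fetch:

    from typing import Any, Callable, Dict, List

    import pandas as pd
    import requests

    from kedro.io import AbstractDataSet


    class LazyShuttleAPIDataSet(AbstractDataSet):
        """Loads a function rather than data; the node calls it with the IDs it needs."""

        def __init__(self, url: str):
            self._url = url  # e.g. "https://example.com/shuttles/{id}" (placeholder)

        def _load(self) -> Callable[[List[Any]], pd.DataFrame]:
            def fetch(shuttle_ids: List[Any]) -> pd.DataFrame:
                # one request per ID; the JSON response shape is an assumption
                return pd.DataFrame(
                    [requests.get(self._url.format(id=i), timeout=10).json()
                     for i in shuttle_ids]
                )

            return fetch

        def _save(self, data: Any) -> None:
            raise NotImplementedError("This dataset is read-only")

        def _describe(self) -> Dict[str, Any]:
            return {"url": self._url}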

I have done this in my GDALRasterDataSet implementation. The idea is that if you need to enrich a dataset on the go, you can overload the load() method in a custom dataset and pass additional parameters there.

You can see an implementation here and an example of usage here.

The only extra thing you need to do is rewrite the load() method to accept kwargs (line 143) and write your own _load method that enriches your dataset. Everything else is boilerplate.
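In outline the pattern looks roughly like this (a simplified sketch, not the actual GDAL code; the class name, url and ids parameter are placeholders):

    from typing import Any, Dict, List, Optional

    import pandas as pd
    import requests

    from kedro.io import AbstractDataSet


    class RuntimeAPIDataSet(AbstractDataSet):
        """A dataset whose load() accepts runtime parameters used for enrichment."""

        def __init__(self, url: str):
            self._url = url  # e.g. "https://example.com/items/{id}" (placeholder)

        def load(self, **kwargs) -> pd.DataFrame:
            # unlike the default AbstractDataSet.load(), this forwards kwargs to _load
            return self._load(**kwargs)

        def _load(self, ids: Optional[List[Any]] = None) -> pd.DataFrame:
            # fetch one record per requested ID; the JSON shape is an assumption
            return pd.DataFrame(
                [requests.get(self._url.format(id=i), timeout=10).json()
                 for i in (ids or [])]
            )

        def _save(self, data: Any) -> None:
            raise NotImplementedError("This dataset is read-only")

        def _describe(self) -> Dict[str, Any]:
            return {"url": self._url}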

Barros
  • @datajoely can you take a look at this solution? Does it fit into the Kedro rules, or at least not contradict them? – ndueck Aug 23 '22 at 21:06