
I have a kedro node which returns a list of pandas dataframes. In another node I `do_something()` to those dataframes. For example:

def first_node():
    """Returns a list of dataframes."""
    return list_item

def do_something(data):
    """Perform the same action on every list element."""
    data_ls = []
    for item in data:
        # do thing
        data_ls.append(item)
    return data_ls

However, instead of performing all the actions in the same node, I want to create a separate node for each list element returned by first_node(). I am struggling with this because a kedro pipeline expects inputs and outputs to be declared explicitly. Is it possible to achieve this with kedro?
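For context, here is the kind of explicit wiring I mean. If the number of items is fixed, `first_node` could return a dict instead of a list, so each dataframe can be mapped to its own named output (the names and the doubling transform below are hypothetical placeholders):

```python
import pandas as pd

def first_node():
    """Return each dataframe under its own name so that each one
    can be mapped to a separate, explicitly named output."""
    df_a = pd.DataFrame({"x": [1, 2]})
    df_b = pd.DataFrame({"x": [3, 4]})
    return {"df_a": df_a, "df_b": df_b}

def do_something(df):
    """Process a single dataframe; one node instance per item."""
    return df.assign(x=df["x"] * 2)
```

Each item could then feed its own node, e.g. `node(do_something, "df_a", "df_a_processed")`, but as far as I can tell this only works when the item count is known up front.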

Thanks a lot!

  • Does `list_item` (output of `first_node`) always contain the same number of underlying dataframes? – swimmer Jul 05 '21 at 09:51
  • Not necessarily, no. However, for now as a starting point, I could even start with a fixed number of `list_item` – GreenTemple Jul 05 '21 at 09:56
  • 3
    In the Kedro philosophy, this notion of "fanning-out" is kind of discouraged because of the additional mental model you now need to keep in mind for your pipeline (you are now introducing some logic into your pipelines) and the philosophy is to output a bunch of dataframes and then `do_something` to each of them within a single node (possible using threads/multiprocessing within that node) – Zain Patel Jul 05 '21 at 17:43
  • @ZainPatel this is not always good enough, specially with bigger datasets when it's much better idea to split the data processing into separate nodes for better resources management then if trying to do it in a single node... – robertzp Feb 18 '22 at 14:55
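The single-node approach suggested in the comments can be sketched with a thread pool inside `do_something` (the doubling transform here is only a placeholder):

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def do_something(data):
    """Process every dataframe in a single node, but in parallel
    using threads, as suggested in the comments above."""
    def transform(df):
        # placeholder for the real per-dataframe work
        return df.assign(x=df["x"] * 2)

    with ThreadPoolExecutor() as pool:
        return list(pool.map(transform, data))
```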

1 Answer


You can define the output of the first node to be a directory and save each element into that directory in the first node.
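In the catalog this might look like a partitioned dataset entry (a sketch only; the dataset name, path, and class names are assumptions and depend on your Kedro version):

```yaml
# conf/base/catalog.yml — names and paths are hypothetical
split_dataframes:
  type: PartitionedDataSet
  path: data/02_intermediate/split
  dataset: pandas.CSVDataSet
```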

The second node can take the directory as an input, where you define it in the catalog, and then loop over the items and perform a certain function on each item.
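A sketch of the two node functions under that approach, assuming the directory is registered as a `PartitionedDataSet` in the catalog (partition names and the `+1` transform are hypothetical). When saving, the node returns a dict of `{partition_name: dataframe}`; when loading, Kedro passes a dict of `{partition_name: load_callable}`:

```python
import pandas as pd

def split_node():
    """Return a dict of {partition_name: dataframe}; with a
    partitioned output, each value is saved as one file in the
    configured directory."""
    frames = [pd.DataFrame({"x": [i]}) for i in range(3)]
    return {f"part_{i}": df for i, df in enumerate(frames)}

def process_partitions(partitions):
    """With a partitioned input, each value is a callable that
    loads one partition; load and process them one at a time."""
    result = {}
    for name, load in partitions.items():
        df = load() if callable(load) else load
        result[name] = df.assign(x=df["x"] + 1)
    return result
```

Because each partition is loaded lazily inside the loop, only one dataframe needs to be in memory at a time, which helps with the resource concerns raised in the comments.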