I've built a Kedro node that lazily loads an input partitioned dataset and lazily saves two partitioned datasets as outputs (following recommendations from the Kedro community: a lambda with a default-argument capture inside a dict comprehension, so that only the current partition is processed in memory).
What I did works in the sense that the outputs are correctly generated lazily. The problem is that the computation is done twice. The function `_normalize_plate` used in my node returns the two kinds of data I want to store separately (two pandas DataFrames), but the only way I found to capture both outputs in two different partitioned datasets is to pick either the first output or the second in two distinct dict comprehensions in the node's return statement:
from typing import Any, Callable, Dict, Tuple

def normalize_plates(
    partitioned_standardized_profiles: Dict[str, Callable[[], Any]],
    df_reference_plate,
    df_descriptors,
    model,
) -> Tuple[Dict[str, Callable[[], Any]], Dict[str, Callable[[], Any]]]:
    return {
        partition_key: (
            lambda partition_load_func=partition_load_func: _normalize_plate(
                partition_load_func(), df_reference_plate, df_descriptors, model
            )[0]
        )
        for partition_key, partition_load_func in sorted(
            partitioned_standardized_profiles.items()
        )
    }, {
        partition_key: (
            lambda partition_load_func=partition_load_func: _normalize_plate(
                partition_load_func(), df_reference_plate, df_descriptors, model
            )[1]
        )
        for partition_key, partition_load_func in sorted(
            partitioned_standardized_profiles.items()
        )
    }
This is probably quite an unclean way to do it, and I'm wondering whether it's possible to build this kind of structure with a single loop, so that `_normalize_plate` runs only once per partition.