
I've built a Kedro node that lazily loads an input partitioned dataset and lazily saves two partitioned datasets as output (following recommendations from the Kedro community: wrapping each partition in a lambda/callable inside a dict comprehension, so that only the current partition is processed in memory).

What I did works, in the sense that the output is correctly generated lazily. The problem is that the computation is done twice. The function `_normalize_plate` used in my node returns the two kinds of data I want to store separately (two pandas DataFrames), but the only way I found to capture both outputs in two different partitioned datasets is to pick out either the first output or the second in two distinct dict comprehensions in the return statement of the node:

from typing import Any, Callable, Dict, Tuple


def normalize_plates(
    partitioned_standardized_profiles: Dict[str, Callable[[], Any]],
    df_reference_plate,
    df_descriptors,
    model,
) -> Tuple[Dict[str, Callable[[], Any]], Dict[str, Callable[[], Any]]]:
    # Two dict comprehensions over the same partitions: the first keeps
    # output [0] of _normalize_plate, the second keeps output [1], so
    # each partition's computation ends up being triggered twice.
    return {
        partition_key: (
            lambda partition_load_func=partition_load_func: _normalize_plate(
                partition_load_func(), df_reference_plate, df_descriptors, model
            )[0]
        )
        for partition_key, partition_load_func in sorted(
            partitioned_standardized_profiles.items()
        )
    }, {
        partition_key: (
            lambda partition_load_func=partition_load_func: _normalize_plate(
                partition_load_func(), df_reference_plate, df_descriptors, model
            )[1]
        )
        for partition_key, partition_load_func in sorted(
            partitioned_standardized_profiles.items()
        )
    }

This is probably quite an unclean way to do it, and I'm wondering whether it's possible to build this kind of structure with only one for loop, and hence run the underlying `_normalize_plate` computation only once per partition?

SprigganCG

1 Answer


You may be able to achieve what you're looking for with caching:

import functools
import time


# Mock partitioned data
def load0():
    time.sleep(3)
    return 0


def load1():
    time.sleep(3)
    return 1


def load2():
    time.sleep(3)
    return 2


partitioned_standardized_profiles = {0: load0, 1: load1, 2: load2}


# Enable caching for `_normalize_plate`
@functools.lru_cache
def _normalize_plate(partition_load_func, other):
    data = partition_load_func()
    return data + other, data * other


# Construct return values
return_values = {
    partition_key: (
        lambda partition_load_func=partition_load_func: _normalize_plate(
            partition_load_func, 3
        )[0]
    )
    for partition_key, partition_load_func in partitioned_standardized_profiles.items()
}, {
    partition_key: (
        lambda partition_load_func=partition_load_func: _normalize_plate(
            partition_load_func, 3
        )[1]
    )
    for partition_key, partition_load_func in partitioned_standardized_profiles.items()
}


# Test
start = time.time()
data = return_values[0][0]()
end = time.time()
print(f"[0][0]: {data} ({round(end - start, 1)} seconds)")

start = time.time()
data = return_values[1][0]()
end = time.time()
print(f"[1][0]: {data} ({round(end - start, 1)} seconds)")

start = time.time()
data = return_values[0][1]()
end = time.time()
print(f"[0][1]: {data} ({round(end - start, 1)} seconds)")

start = time.time()
data = return_values[0][2]()
end = time.time()
print(f"[0][2]: {data} ({round(end - start, 1)} seconds)")


start = time.time()
data = return_values[1][1]()
end = time.time()
print(f"[1][1]: {data} ({round(end - start, 1)} seconds)")

start = time.time()
data = return_values[1][2]()
end = time.time()
print(f"[1][2]: {data} ({round(end - start, 1)} seconds)")

When run, prints:

[0][0]: 3 (3.0 seconds)
[1][0]: 0 (0.0 seconds)
[0][1]: 4 (3.0 seconds)
[0][2]: 5 (3.0 seconds)
[1][1]: 3 (0.0 seconds)
[1][2]: 6 (0.0 seconds)

That being said, if it's critical, I would recommend using a framework that's designed for more complex dependencies like these, such as Dask Delayed. (You can also use Dask with Kedro, so it's not an either-or situation.)
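
For illustration, here is a minimal sketch of the same toy example with Dask Delayed, reusing the mock loaders and `_normalize_plate` defined above (the `lru_cache` decorator becomes unnecessary here); this shows the idea only and is not Kedro-specific code:

import dask
from dask import delayed

# Build both output dicts in a single loop. Indexing a Delayed object
# creates a new task that shares its upstream computation, so the loader
# and _normalize_plate run only once per partition when both graphs are
# computed together.
first_outputs, second_outputs = {}, {}
for key, load_func in partitioned_standardized_profiles.items():
    result = delayed(_normalize_plate)(load_func, 3)
    first_outputs[key] = result[0]
    second_outputs[key] = result[1]

# A single compute() call evaluates both dicts and deduplicates the
# shared tasks.
first_data, second_data = dask.compute(first_outputs, second_outputs)

Unlike `lru_cache`, nothing here is keyed on the loaded data itself, so unhashable values such as DataFrames are not a problem; the deduplication happens at the task-graph level.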

deepyaman
  • @SprigganCG I've updated my answer to reflect my view that it is doable, upon further thought, although I'm not sure how well it will work in a non-toy setting, where the cache values may be large (seeing as they're dataframes). Please feel free to provide feedback. :) – deepyaman Jul 11 '22 at 11:47
  • Hi @deepyaman, I quickly tested applying your method, but I get a `TypeError: unhashable type: 'DataFrame'`. – SprigganCG Jul 18 '22 at 13:03
  • I looked for ways to make the DataFrame cacheable, but adding more complexity for such a use case isn't really convenient. Maybe I should try a Dask-oriented approach as you suggest... – SprigganCG Jul 18 '22 at 16:27
  • Ah, sorry, I didn't think about `lru_cache` requiring it to be hashable; agree it doesn't make sense to add complexity, and better to use something like Dask. (And sorry for the late follow-up!) – deepyaman Oct 07 '22 at 11:18
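
For anyone hitting the same `TypeError`: one way to sidestep the hashability requirement is to cache a zero-argument closure per partition, so `lru_cache` never sees the DataFrame. A rough sketch, where `_make_cached_normalizer` and `_normalize_plate_impl` are hypothetical helpers standing in for the real normalization:

import functools


def _make_cached_normalizer(load_func, other):
    # compute() takes no arguments, so lru_cache caches on an empty key
    # and never has to hash the (unhashable) DataFrame.
    @functools.lru_cache(maxsize=1)
    def compute():
        return _normalize_plate_impl(load_func(), other)

    return compute


def _normalize_plate_impl(data, other):
    # Hypothetical undecorated normalization returning both outputs.
    return data + other, data * other


cached = {
    key: _make_cached_normalizer(load_func, 3)
    for key, load_func in partitioned_standardized_profiles.items()
}
return_values = (
    {key: (lambda c=c: c()[0]) for key, c in cached.items()},
    {key: (lambda c=c: c()[1]) for key, c in cached.items()},
)

Each partition's load and normalization then run at most once, whichever of the two outputs is requested first, at the cost of keeping that partition's result in memory for the lifetime of the closure.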