I have many delayed dicts returned from a dask delayed function, and I would like to aggregate them into a `summary_dict` as shown below. However, calling `.items()` on a delayed object doesn't work.

@dask.delayed
def get_dict(date):
    return {
        'a': {'date': date},
        'b': {'date': date}
    }

summary_dict = {}
dates = [d1, d2, d3, ...]
for date in dates:
    date_dict = get_dict(date)
    # the following doesn't work because date_dict is a delayed object
    for key, val in date_dict.items():
        summary_dict.setdefault(key, []).append(val)

I can make it work with the following, but it is quite ugly because I have to hardcode the keys ahead of time.

import operator

hardcoded_keys = ['a', 'b']
# one delayed getter per key, so each lookup stays lazy
get_dict_items = {key: dask.delayed(operator.itemgetter(key)) for key in hardcoded_keys}

summary_dict = {}
dates = [d1, d2, d3, ...]
for date in dates:
    date_dict = get_dict(date)
    for key in hardcoded_keys:
        val = get_dict_items[key](date_dict)
        summary_dict.setdefault(key, []).append(val)

Is there a better way to achieve this?

abisko
  • `keys` is not defined in your second code block. – joseville Nov 19 '21 at 16:03
  • Also, you have `get_dict_items` and `get_dict_item`. Should both be `get_dict_items`? – joseville Nov 19 '21 at 16:03
  • it seems like you're using delayed just to store data, but then you're loading and looping through it all in series. I'm having a hard time understanding your example just because I don't understand how delayed is actually helping you here. Could you explain a bit about how you're seeing dask fit into your workflow? – Michael Delgado Nov 19 '21 at 22:07
  • I might be misreading the code, but does `summary_dict` contain just two keys ('a' and 'b')? – SultanOrazbayev Nov 22 '21 at 04:56
  • @joseville sorry `keys` should be `hardcoded_keys`, and `get_dict_item` should be `get_dict_items`. I just updated my question – abisko Nov 22 '21 at 14:27
  • @MichaelDelgado thanks for raising this. my actual `get_dict` function is a lot more complicated. by iterating dates, I'm parallelizing all the `get_dict` functions. – abisko Nov 22 '21 at 14:29
  • @SultanOrazbayev `summary_dict` could contain more keys, but simplified it in my example – abisko Nov 22 '21 at 14:30

1 Answer

@abisko, the only thing I would add to your solution is that Delayed objects support `__getitem__`, so instead of using `operator.itemgetter` you should be able to index the delayed dict directly:

from dask import delayed
from datetime import datetime


@delayed
def get_dict(date):
    return {
        'a': {'date': date},
        'b': {'date': date}
    }


summary_dict = {}
dates = [datetime.now(), datetime(2020, 1, 1), datetime(1999, 4, 23)]
date_dicts = [get_dict(d) for d in dates]
for date_dict in date_dicts:
    for key in ["a", "b"]:
        summary_dict.setdefault(key, []).append(date_dict[key])
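
If the keys are not known up front, one possible alternative is to move the aggregation into its own delayed function, so that `.items()` runs on the concrete dicts at compute time rather than on Delayed objects. This is only a minimal sketch: `merge_dicts` and the string dates are hypothetical names and values chosen for illustration, and it assumes the per-date dicts are small enough to merge in a single task.

```python
from collections import defaultdict

from dask import delayed


@delayed
def get_dict(date):
    return {
        'a': {'date': date},
        'b': {'date': date}
    }


@delayed
def merge_dicts(dicts):
    # hypothetical helper: by the time this runs, dask has replaced each
    # Delayed in the list with the plain dict it produced
    summary = defaultdict(list)
    for d in dicts:
        for key, val in d.items():
            summary[key].append(val)
    return dict(summary)


# placeholder dates, assumed for the example
dates = ["2021-11-19", "2021-11-20", "2021-11-21"]

# the get_dict calls remain independent tasks and can run in parallel
summary = merge_dicts([get_dict(d) for d in dates])
result = summary.compute()
```

The trade-off is that `result` is a single merged dict produced by one task, whereas the loop above leaves each value in `summary_dict` as its own Delayed object.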

It may also be worth noting that `dask.delayed` accepts an `nout` argument, which lets you iterate over (or unpack) a delayed result that is a tuple of known length, but unfortunately this does not extend to dictionaries.
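
To illustrate the `nout` behaviour, here is a small sketch under the assumption that the function can return a tuple instead of a dict. `get_pair` is a hypothetical tuple-returning variant of `get_dict`, and the dates are placeholders.

```python
import dask
from dask import delayed


def get_pair(date):
    # hypothetical variant of get_dict that returns a 2-tuple
    return {'date': date}, {'date': date}


# nout=2 tells dask the result has two elements, which makes it unpackable
get_pair_delayed = delayed(get_pair, nout=2)

summary_dict = {'a': [], 'b': []}
for date in ["2020-01-01", "1999-04-23"]:
    a, b = get_pair_delayed(date)  # each of a and b is itself a Delayed
    summary_dict['a'].append(a)
    summary_dict['b'].append(b)

# dask.compute traverses the dict and its lists and returns a 1-tuple
(result,) = dask.compute(summary_dict)
```

Note that the mapping from tuple position to key still has to be fixed in advance, so this does not remove the need to know the keys.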

scj13