I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators, once evaluated, are too large to fit in memory.
import dask.bag as db
from dask import delayed

# Wrap each generator in a Delayed object, then build a Bag from those partitions
delayed_array = [delayed(generator) for generator in list_of_generators]
my_bag = db.from_delayed(delayed_array)
NB: list_of_generators is exactly that - the generators haven't been consumed (yet).
My problem is that when creating delayed_array the generators are consumed and RAM is exhausted. Is there a way to get these long sequences into the Bag without first consuming them, or at least to consume them in chunks so RAM use is kept low?
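To illustrate what I mean by "consuming them in chunks", here is a stdlib-only sketch (the helper `chunks_from_generator` is my own invention, not a dask API) that slices a generator into fixed-size lists so only one chunk is in memory at a time:

```python
import itertools

def chunks_from_generator(gen, chunk_size):
    """Lazily slice a generator into lists of at most chunk_size items.

    Only a single chunk is materialised in memory at any moment; the
    underlying generator is advanced just enough to fill each chunk.
    """
    while True:
        chunk = list(itertools.islice(gen, chunk_size))
        if not chunk:
            return
        yield chunk

# Each chunk could then become its own partition / delayed task.
print(list(chunks_from_generator(iter(range(5)), 2)))  # → [[0, 1], [2, 3], [4]]
```

The idea would be to hand each chunk to dask as its own partition rather than wrapping the whole generator at once.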
NNB: I could write the generators to disk and then load the files into the Bag - but I thought I might be able to use dask to get around this?
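For completeness, the disk fallback I mention could look roughly like this sketch (`spill_generator_to_jsonl` and the file layout are my own invention; `db.read_text` is dask's standard loader for line-delimited text files). Each generator is consumed one item at a time, so RAM stays flat regardless of its length:

```python
import json
import os
import tempfile

def spill_generator_to_jsonl(gen, path):
    """Consume a generator item by item, writing line-delimited JSON.

    Only one item is ever held in memory, however long the generator is.
    """
    with open(path, "w") as f:
        for item in gen:
            f.write(json.dumps(item) + "\n")

# Demo with small stand-in generators; in practice these would be the
# too-large-for-memory generators from list_of_generators.
directory = tempfile.mkdtemp()
paths = []
for i, gen in enumerate([iter(range(3)), iter(range(3, 6))]):
    path = os.path.join(directory, f"part-{i}.jsonl")
    spill_generator_to_jsonl(gen, path)
    paths.append(path)

# The files can then be loaded lazily, one partition per file:
#   import dask.bag as db
#   my_bag = db.read_text(paths).map(json.loads)
```

This works, but it pays the cost of a full serialise/deserialise round trip through disk, which is exactly what I was hoping dask could avoid.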