
I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators (when evaluated) are too large for memory.

import dask.bag as db
from dask import delayed
delayed_array = [delayed(generator) for generator in list_of_generators]
my_bag = db.from_delayed(delayed_array)

NB list_of_generators is exactly that - the generators haven't been consumed (yet).

My problem is that, when creating delayed_array, the generators are consumed and RAM is exhausted. Is there a way to get these long sequences into the Bag without first consuming them, or at least consuming them in chunks so RAM use is kept low?

NNB I could write the generators to disk, and then load the files into the Bag - but I thought I might be able to use dask to get around this?
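
For reference, a minimal sketch of that fallback (hypothetical file paths, newline-delimited text items; db.read_text loads the files lazily, one partition per file) would be:

import dask.bag as db

def dump(generator, path):
    # Stream one item per line to disk, never holding the whole generator in RAM.
    with open(path, "w") as f:
        for item in generator:
            f.write(f"{item}\n")

# After dumping each generator to its own file (hypothetical glob pattern):
my_bag = db.read_text("generator-output-*.txt")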

danodonovan
  • I'm pretty sure that `from_delayed` expects each piece to be small enough to fit into memory, so the only ways around this would be (a) chunk up the generators with `islice` so you have 10x as many generators, each 1/10th the size (and then reshape things if needed after construction), or (b) wrap each generator in something that builds up an array (or something else compact) iteratively so you can feed those arrays to dask instead of the generators, or (c) use disk instead of memory as you suggested. None is exactly trivial, so… hopefully someone has a better solution. (A sketch of option (a) follows these comments.) – abarnert Jun 14 '18 at 16:41
  • Of course (c) is really just a special case of (b), since the numpy and pandas functions to read files all do it by reading the file iteratively and constructing an array/series/dataframe without first constructing a list to convert from, but if you don't want to use the disk, you have to write that iterative building part yourself (whether by feeding fromiter, or looping and adding rows explicitly, or whatever). But, one more possibility: if you have enough memory to hold each compact binary file in memory, instead of writing to disk, you can write to a `BytesIO` and then read that. – abarnert Jun 14 '18 at 16:44
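
A minimal sketch of option (a), assuming the data can be re-created from a factory function (make_generator, chunk and the sizes below are hypothetical names, not anything from the question); note that islice still skips over the earlier items, so each later chunk re-iterates part of a fresh generator:

import itertools
import dask
import dask.bag as db

def make_generator():
    # Hypothetical stand-in for one of the original generators.
    return (i * i for i in range(1_000_000))

def chunk(factory, start, size):
    # Re-create the generator and materialise only items [start, start + size),
    # so each partition holds at most `size` items in memory at once.
    return list(itertools.islice(factory(), start, start + size))

# Ten delayed partitions of 100,000 items each instead of one huge generator.
partitions = [dask.delayed(chunk)(make_generator, start, 100_000)
              for start in range(0, 1_000_000, 100_000)]
my_bag = db.from_delayed(partitions)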

1 Answer


A decent subset of Dask.bag can work with large iterators. Your solution is almost perfect, but you'll need to provide a function that creates your generators when called rather than the generators themselves.

In [1]: import dask.bag as db

In [2]: import dask

In [3]: b = db.from_delayed([dask.delayed(range)(i) for i in [100000000] * 5])

In [4]: b
Out[4]: dask.bag<bag-fro..., npartitions=5>

In [5]: b.take(5)
Out[5]: (0, 1, 2, 3, 4)

In [6]: b.sum()
Out[6]: <dask.bag.core.Item at 0x7f852d8737b8>

In [7]: b.sum().compute()
Out[7]: 24999999750000000
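
Applied to the question's setup, the same pattern means keeping generator functions (or function/argument pairs) rather than already-created generators, and delaying the call itself; make_generator and args below are hypothetical stand-ins, not the asker's actual code:

import dask
import dask.bag as db

def make_generator(n):
    # Hypothetical factory: nothing is materialised here; items are produced
    # lazily when Dask iterates over the partition.
    for i in range(n):
        yield i * i

args = [10_000_000] * 5  # placeholder arguments, one per partition

# Delay the *call* that builds each generator, not an already-built generator.
my_bag = db.from_delayed([dask.delayed(make_generator)(n) for n in args])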

However, there are certainly ways that this can bite you. Some slightly more complex dask bag operations do need to make partitions concrete, which could blow out RAM.

MRocklin
  • (Sorry for the delay in accepting.) Yes, this worked for me - unfortunately my generators were of significantly different lengths (and this didn't work too well), but that's not the question I asked! – danodonovan Jul 09 '18 at 11:34