
I am using Python 2.7 with dask dataframes.

I have a df that is too big for memory but fits on disk beautifully.

I group by an index and then need to iterate over the groups; I found here how to do it.

When I try to use the suggested code:

for value in drx["col"].unique():
    print value

I get an error:

File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 1709, in getitem raise NotImplementedError() NotImplementedError

Assuming that it's simply not implemented, I found that the way to iterate over the series returned by unique() is this.

But when I try to use it like so:

data = table["col"].unique()
it = data.iteritems()
for val in it:
    print 1

My memory explodes, as if all the values of the column are kept in memory for as long as I use the iterator.

How can I use the iterator values without saving all of them into memory?

thebeancounter
    I don't think you're going to be able to avoid materializing all the items you've seen thus far into some sort of data-structure if you want to iterate over only *unique* values. – juanpa.arrivillaga Nov 05 '17 at 08:21
  • @juanpa.arrivillaga - this operation requires a shuffle, but dask knows how to shuffle on disk, and after the shuffle is done there is no reason to keep all the results in memory; you can easily dump them to disk and then iterate... – thebeancounter Nov 05 '17 at 08:30
  • What requires a shuffle? – juanpa.arrivillaga Nov 05 '17 at 08:33
  • @juanpa.arrivillaga getting the list of unique values requires iterating over all the items, which requires a shuffle; dask knows how to do this on disk, which means you don't really need to store the entire data set in memory at any stage of the operation. – thebeancounter Nov 05 '17 at 10:29

1 Answer


If all of the unique values fit into memory, then call compute beforehand:

for item in df[col].unique().compute():
    ...
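
For reference, here is a minimal, self-contained sketch of this first approach. The CSV path and the column name "col" are hypothetical stand-ins for the asker's actual data:

import dask.dataframe as dd

# Hypothetical source: too big for memory, fits on disk
df = dd.read_csv("data-*.csv")

# compute() materializes only the deduplicated values (not the whole
# column) as an in-memory pandas object, which is then cheap to iterate
for item in df["col"].unique().compute():
    print item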

Otherwise, I recommend writing to disk with something like parquet and then iterating over that:

df[col].unique(split_out=10).to_parquet(...)
s = dd.read_parquet(...)
for item in s.iteritems():
    ...
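
A rough end-to-end sketch of this second approach, with hypothetical paths and a placeholder split_out value. Note that to_parquet writes DataFrames rather than Series, so the sketch converts the unique values with to_frame() first:

import dask.dataframe as dd

df = dd.read_csv("data-*.csv")  # hypothetical source

# split_out spreads the unique values across 10 partitions so that no
# single partition has to hold all of them at once
uniques = df["col"].unique(split_out=10).to_frame()
uniques.to_parquet("uniques.parquet")

# Read the result back lazily; iteritems() computes one partition at a
# time instead of pulling the whole series into memory
s = dd.read_parquet("uniques.parquet")["col"]
for idx, val in s.iteritems():
    print val
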
MRocklin