
Is there a way of getting unique rows of a dask array that is larger than the available memory? Ideally, without converting it to a dask DataFrame?

I currently use this approach:

import dask.array as da
import dask.dataframe as dd

dx = da.random.random((10000, 10000), chunks=(1000, 1000))
ddf = dd.from_dask_array(dx)
ddf = ddf.drop_duplicates()
dx = ddf.to_dask_array(lengths=True)

which works for bigger data sets than np.unique(dx, axis=0) does, but eventually also runs out of memory.

I'm using Python 3.6 (but can upgrade), Dask 0.20 and Ubuntu 18.04 LTS.

Edgar H

1 Answer


You can always just use numpy.unique:

import dask.array as da
import numpy as np

dx = da.random.random((10000, 10000), chunks=(1000, 1000))
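# Note: this runs on a single node and needs the whole array in memory,
# so it only helps while the data still fits there.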
dx = np.unique(dx, axis=0)

This may still leave you with memory issues when you use it on data sets larger than your RAM, since the whole calculation runs on a single node. There is a dask.array.unique function, but it doesn't support the axis keyword yet. This means that it flattens the array and returns the unique individual values, not the unique rows. The sorting functions that would allow for any kind of hand-rolled parallelized version don't seem to be implemented in dask.array either.
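
A minimal sketch of that flattening behaviour, on a small toy array chosen just for illustration:

import dask.array as da
import numpy as np

x = np.array([[0, 1],
              [0, 1],
              [2, 3]])
dx = da.from_array(x, chunks=2)

# No axis keyword: the array is flattened, so this yields the unique
# scalar values rather than the unique rows.
print(da.unique(dx).compute())   # -> [0 1 2 3]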

My recommendation would be to just suck it up for now and convert to dask.dataframe. This approach ensures that you get the correct output, even if it isn't the fastest conceivable implementation.
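
For completeness, that path is essentially what you already have. If your installed Dask supports the split_out argument to drop_duplicates (an assumption worth verifying against your version), it keeps the deduplicated result spread over several output partitions instead of collapsing everything into one:

import dask.array as da
import dask.dataframe as dd

dx = da.random.random((10000, 10000), chunks=(1000, 1000))
ddf = dd.from_dask_array(dx)

# split_out keeps the output in multiple partitions, which helps when the
# deduplicated data is itself too large for a single worker's memory.
ddf = ddf.drop_duplicates(split_out=ddf.npartitions)

dx = ddf.to_dask_array(lengths=True)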

Edit

I initially thought there might be a simple hack that could be used to implement the axis parameter for dask.array.unique. However, the blob-type trick that numpy.unique uses to implement its own axis keyword (viewing each row as a single opaque value) turns out not to carry over easily to Dask arrays, owing to the presence of chunks.
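
For reference, a rough sketch of that trick on a plain NumPy array (the real numpy.unique internals differ in detail, but the idea is the same). It relies on each row being one contiguous block of memory, which no longer holds once the rows are split across chunks:

import numpy as np

x = np.array([[0, 1],
              [0, 1],
              [2, 3]])

# View each contiguous row as one opaque "void" scalar, deduplicate the
# scalars, then view the survivors back as rows.
row_dtype = np.dtype((np.void, x.dtype.itemsize * x.shape[1]))
as_voids = np.ascontiguousarray(x).view(row_dtype)
unique_rows = np.unique(as_voids).view(x.dtype).reshape(-1, x.shape[1])
print(unique_rows)   # -> [[0 1]
                     #     [2 3]]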

So no clever workaround for now. Just use dask.dataframe.

tel