2

What is the most efficient way to create a dask.array from a dask Series of lists? The series consists of 5 million lists of 300 elements each. It is currently divided into 500 partitions. Currently I am trying:

import numpy as np
import dask.array
from dask import delayed

pt = [delayed(np.array)(y)
      for y in
      [delayed(list)(x)
       for x in series.to_delayed()]]
da = delayed(dask.array.concatenate)(pt, axis=1)
da = dask.array.from_delayed(da, (series.size.compute(), 300), dtype=float)

The idea is to convert each partition into a numpy array and stitch those together into a dask.array. This code is taking forever to run, though. A numpy array can be built from this data quite quickly sequentially, as long as there is enough RAM.
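For reference, the sequential version I am comparing against is roughly this (a minimal sketch; it assumes the whole series fits in RAM):

import numpy as np

# pull everything into a single pandas Series, then stack the lists in one go
pandas_series = series.compute()         # needs RAM for all 5 million rows
arr = np.array(pandas_series.tolist())   # shape (5_000_000, 300)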

ascripter
Daniel Mahler

2 Answers

2

I think that you are on the right track using dask.delayed. However, calling list on the series is probably not ideal. I would create a function that converts one of your pandas series into a numpy array, and then wrap that function with delayed.

import numpy as np
import dask.array as da
from dask import delayed

def convert_series_to_array(pandas_series):  # make this as fast as you can
    ...
    return numpy_array

L = dask_series.to_delayed()                 # one delayed pandas Series per partition
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=...) for x in L]
x = da.concatenate(arrays, axis=0)
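One note: with shape=(np.nan, 300) the per-partition row counts are unknown. If a later step needs concrete chunk sizes, newer dask versions can resolve them by scanning the data once:

x = x.compute_chunk_sizes()   # fills in the unknown (nan) row counts per block
print(x.chunks)               # now shows concrete chunk sizes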

Also, regarding this line:

da = delayed(dask.array.concatenate)(pt, axis=1)

You should never call delayed on a dask function. They are already lazy.
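For example, da.concatenate already builds a lazy graph on its own, so no delayed wrapper is needed (a minimal sketch using the arrays list from above):

import dask.array as da

# concatenate is lazy: it only constructs the task graph
x = da.concatenate(arrays, axis=0)

# work happens only when you explicitly compute (or store) the result
result = x.compute()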

MRocklin
0

Looking at this with some dummy data, and building on @MRocklin's answer (molded more toward my specific use case): let's say that your vectors are actually lists of ints instead of floats, and each list is stored as a string. We take the series, transform it, and store it in a zarr array file.

import ast

import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from dask import delayed

# create dummy data: 1000 rows, each holding a stringified list of 300 ints
vectors = [ np.random.randint(low=0, high=100, size=300).tolist() for _ in range(1000) ]
df = pd.DataFrame()
df['vector'] = vectors
df['vector'] = df['vector'].map(lambda x: f"{x}")
df['foo'] = 'bar'
ddf = dd.from_pandas( df, npartitions=100 )

# transform each series partition back into a numpy array
def convert_series_to_array( series ):  # make this as fast as you can
    series_ = [ast.literal_eval( i ) for i in series]
    return np.stack(series_, axis=0)

L = ddf['vector'].to_delayed()
L = [delayed(convert_series_to_array)(x) for x in L]
arrays = [da.from_delayed(x, shape=(np.nan, 300), dtype=np.int64) for x in L]
x = da.concatenate(arrays, axis=0)

# store the result in a zarr array (resolve the unknown chunk sizes first)
x.compute_chunk_sizes().to_zarr( '/home/user/Documents/toy_dataset.zarr', overwrite=True )
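If you want to sanity-check the stored result, you can lazily open the store again (a small sketch, assuming the store was written to /home/user/Documents/toy_dataset.zarr as above):

import dask.array as da

# open the zarr store lazily; nothing is read until .compute()
y = da.from_zarr('/home/user/Documents/toy_dataset.zarr')
print(y.shape, y.dtype)    # (1000, 300) int64 for the dummy data
print(y[:2].compute())     # materialize just the first two rows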
scottlittle