I have a dask dataframe object but would like to have a dask array. How do I accomplish this?
There are three ways to do this.
- Use the aptly named .to_dask_array() method
- Use the .values attribute or the .to_records() method, like with Pandas
- Use map_partitions to call any function that converts a pandas dataframe into a numpy array on all of the partitions
Here is an example doing all three.
>>> import dask
>>> df = dask.datasets.timeseries()
>>> df
Dask DataFrame Structure:
                   id    name        x        y
npartitions=30
2000-01-01      int64  object  float64  float64
2000-01-02        ...     ...      ...      ...
...               ...     ...      ...      ...
2000-01-30        ...     ...      ...      ...
2000-01-31        ...     ...      ...      ...
Dask Name: make-timeseries, 30 tasks
>>> import numpy as np
>>> df.map_partitions(np.asarray)
dask.array<asarray, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.to_dask_array()
dask.array<array, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.values
dask.array<values, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.to_records() # note that this returns a record array
dask.array<to_records, shape=(nan,), dtype=(numpy.record, [('timestamp', 'O'), ('id', '<i8'), ('name', 'O'), ('x', '<f8'), ('y', '<f8')]), chunksize=(nan,)>
>>> dask.__version__
0.19.0
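A small side note, sketched under the assumption that you are working with the same timeseries dataframe as above: because the frame mixes object and float columns, the resulting dask array gets dtype=object. Selecting only the numeric columns first gives a proper float64 array, and .compute() materializes it as an in-memory NumPy array.

>>> arr = df[["x", "y"]].to_dask_array()  # numeric columns only, so the array dtype is float64 rather than object
>>> result = arr.compute()                # evaluates the lazy graph and returns a concrete NumPy array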
Note that because Dask dataframes don't track the number of rows in each partition, the resulting arrays won't have this information either (note the NaN values in the shape and chunksize above).
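If you do need concrete chunk sizes, newer Dask versions (not necessarily the 0.19.0 shown above; treat the availability as version-dependent) accept a lengths=True keyword that computes each partition's length up front, at the cost of an extra pass over the data:

>>> arr = df.to_dask_array(lengths=True)  # eagerly computes each partition's length
>>> arr.chunks                            # chunk sizes are now concrete integers instead of nan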

MRocklin