I have a dask dataframe object but would like to have a dask array. How do I accomplish this?
There are three ways to do this.
- Use the aptly named .to_dask_array() method
- Use the .values attribute or the .to_records() method, like with Pandas
- Use map_partitions to call any function that converts a pandas dataframe into a numpy array on all of the partitions
Here is an example doing all three.
>>> import dask
>>> df = dask.datasets.timeseries()
>>> df
Dask DataFrame Structure:
                   id    name        x        y
npartitions=30
2000-01-01      int64  object  float64  float64
2000-01-02        ...     ...      ...      ...
...               ...     ...      ...      ...
2000-01-30        ...     ...      ...      ...
2000-01-31        ...     ...      ...      ...
Dask Name: make-timeseries, 30 tasks
>>> import numpy as np
>>> df.map_partitions(np.asarray)
dask.array<asarray, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.to_dask_array()
dask.array<array, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.values
dask.array<values, shape=(nan, 4), dtype=object, chunksize=(nan, 4)>
>>> df.to_records() # note that this returns a record array
dask.array<to_records, shape=(nan,), dtype=(numpy.record, [('timestamp', 'O'), ('id', '<i8'), ('name', 'O'), ('x', '<f8'), ('y', '<f8')]), chunksize=(nan,)>
>>> dask.__version__
0.19.0
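A small side note, sketched under the assumption that you are working with the same timeseries dataframe as above: because the frame mixes object and float columns, the resulting dask array gets dtype=object. Selecting only the numeric columns first gives a proper float64 array, and .compute() materializes it as an in-memory NumPy array.

>>> arr = df[["x", "y"]].to_dask_array()  # numeric columns only, so the array dtype is float64 rather than object
>>> result = arr.compute()                # evaluates the lazy graph and returns a concrete NumPy array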
Note that because Dask dataframes don't track the number of rows in each partition, the resulting arrays won't have this information either (note the NaN values in the shape and chunksize above).
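If you do need concrete chunk sizes, newer Dask versions (not necessarily the 0.19.0 shown above; treat the availability as version-dependent) accept a lengths=True keyword that computes each partition's length up front, at the cost of an extra pass over the data:

>>> arr = df.to_dask_array(lengths=True)  # eagerly computes each partition's length
>>> arr.chunks                            # chunk sizes are now concrete integers instead of nan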

MRocklin