
Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe, where m << M and the m columns are arbitrary?

from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd

# Build a small dask dataframe from the iris data
d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)

What I would like to do:

in_memory = ddf.iloc[:,2:4].compute()

What I have been able to do:

ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()

map_partitions works, but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.

Zelazny7
  • Is it useful to you to simply get the columns (`cols = list(ddf.columns[2:4])`) and index by them (`ddf[cols]`)? – mdurant May 24 '17 at 20:04
  • That does it! And it's a lot faster than what I attempted. Please add as an answer and I will accept. – Zelazny7 May 24 '17 at 20:10

1 Answer


Although iloc is not implemented for dask dataframes, you can achieve the same indexing easily enough as follows:

# Select the desired columns by label, then pull them into memory
cols = list(ddf.columns[2:4])
in_memory = ddf[cols].compute()

This has the additional benefit that dask immediately knows the dtypes of the selected columns and needs to do no further work. For the map_partitions variant, dask at the very least needs to check the data types produced, since the function you call is completely arbitrary.
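If you do need positional slicing via map_partitions, you can spare dask that type-checking step by passing the `meta` argument (an empty frame describing the expected output). A minimal sketch, assuming the default integer column labels from the question's iris example:

import pandas as pd
import dask.dataframe as dd
from sklearn.datasets import load_iris

d = load_iris()
ddf = dd.from_pandas(pd.DataFrame(d.data), chunksize=100)

# Label-based selection: dtypes are known without touching any data
cols = list(ddf.columns[2:4])
print(ddf[cols].dtypes)  # float64 for both columns, known up front

# map_partitions with an explicit meta (an empty frame with the
# expected columns and dtypes), so dask skips its sample-based inference
meta = pd.DataFrame({2: pd.Series(dtype="f8"), 3: pd.Series(dtype="f8")})
sliced = ddf.map_partitions(lambda part: part.iloc[:, 2:4], meta=meta)
in_memory = sliced.compute()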

mdurant