
Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe, where m << M and the m columns are arbitrary?

from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd

# Build a small dask dataframe from the iris data
d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)

What I would like to do:

in_memory = ddf.iloc[:,2:4].compute()

What I have been able to do:

ddf.map_partitions(lambda x: x.iloc[:,2:4]).compute()

map_partitions works, but it was quite slow on a file that wasn't very large. I hope I am missing something very obvious.

Zelazny7
  • Is it useful to you to simply get the columns (`cols = list(ddf.columns[2:4])`) and index by them (`ddf[cols]`)? – mdurant May 24 '17 at 20:04
  • That does it! And it's a lot faster than what I attempted. Please add as an answer and I will accept. – Zelazny7 May 24 '17 at 20:10

1 Answer


Although iloc is not implemented for dask dataframes, you can achieve the same indexing easily enough as follows:

# Select the desired columns by label, then pull them into memory
cols = list(ddf.columns[2:4])
in_memory = ddf[cols].compute()

This has the additional benefit that dask immediately knows the dtypes of the selected columns and needs to do no further work. For the map_partitions variant, dask at the very least needs to check the data types produced, since the function you call is completely arbitrary.
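If you do need positional slicing via map_partitions, you can spare dask that type-checking step by passing the `meta` argument (an empty frame describing the expected output). A minimal sketch, assuming the default integer column labels from the question's iris example:

import pandas as pd
import dask.dataframe as dd
from sklearn.datasets import load_iris

d = load_iris()
ddf = dd.from_pandas(pd.DataFrame(d.data), chunksize=100)

# Label-based selection: dtypes are known without touching any data
cols = list(ddf.columns[2:4])
print(ddf[cols].dtypes)  # float64 for both columns, known up front

# map_partitions with an explicit meta (an empty frame with the
# expected columns and dtypes), so dask skips its sample-based inference
meta = pd.DataFrame({2: pd.Series(dtype="f8"), 3: pd.Series(dtype="f8")})
sliced = ddf.map_partitions(lambda part: part.iloc[:, 2:4], meta=meta)
in_memory = sliced.compute()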

mdurant