I want to select rows from a dask dataframe based on a list of indices. How can I do that?

Example: Let's say I have the following dask dataframe:

import pandas as pd
import dask.dataframe

dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dask.dataframe.from_pandas(pdf, npartitions = 2)

Furthermore, I have a list of indices that I am interested in, e.g.:

indices_i_want_to_select = ['x1','x3', 'y6']

From this, I would like to generate a dask dataframe containing only the rows specified in indices_i_want_to_select.

Arco Bast
  • `loc` on lists is not yet supported. See https://github.com/dask/dask/issues/1298 – MRocklin Jul 12 '16 at 02:47
  • Thank you for this information. I do not insist on using loc, just any possible way to generate a dask dataframe based on a list of indices would be nice. Currently, I'm a bit stuck. – Arco Bast Jul 12 '16 at 13:15
  • You should be able to hack something up with `map_partitions` – MRocklin Jul 16 '16 at 14:38

2 Answers

Edit: dask now supports loc on lists:

ddf_selected = ddf.loc[indices_i_want_to_select]

The following should still work, but is not necessary anymore:

import pandas as pd
import dask.dataframe as dd

#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions = 2)

#list of indices I want to select
l = ['i1', 4, 5]

#generate new dask dataframe containing only the specified indices
#meta is omitted so that dask infers the output schema (ddf.dtypes is a Series, not a DataFrame template)
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)])
Arco Bast
  • For me this example returns an empty Series. This is the same problem I encountered in my own code. Am I missing something? – Tom Hemmes Nov 14 '17 at 14:08
  • Aha, the difference is whether the filter is applied to the index or to a regular column. To filter on a regular column, simply use: `ddf_selected = ddf[ddf.B.isin(l)]` – Tom Hemmes Nov 14 '17 at 14:49

With dask version '1.2.0', the code above results in an error because of the mixed index type. In any case, loc can be used instead:

import pandas as pd
import dask.dataframe as dd

#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions = 2)

#list of indices I want to select
l = ['i1', '4', '5']

#generate new dask dataframe containing only the specified indices
# ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
ddf_selected = ddf.loc[l]
ddf_selected.head()
skibee