How can I select data from a dask dataframe by a list of indices?

Question

I want to select rows from a dask dataframe based on a list of indices. How can I do that?

Example: Let's say, I have the following dask dataframe.

import pandas as pd
import dask.dataframe as dd

dict_ = {'A':[1,2,3,4,5,6,7], 'B':[2,3,4,5,6,7,8], 'index':['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dd.from_pandas(pdf, npartitions=2)

Furthermore, I have a list of indices that I am interested in, e.g.

indices_i_want_to_select = ['x1','x3', 'y6']

From this, I would like to generate a dask dataframe containing only the rows specified in `indices_i_want_to_select`.

`loc` on lists is not yet supported. See https://github.com/dask/dask/issues/1298 — MRocklin, Jul 12 '16 at 02:47
Thank you for this information. I do not insist on using loc, just any possible way to generate a dask dataframe based on a list of indices would be nice. Currently, I'm a bit stuck. — Arco Bast, Jul 12 '16 at 13:15
You should be able to hack something up with `map_partitions` — MRocklin, Jul 16 '16 at 14:38

score 12 · Arco Bast · Accepted Answer · 2020-02-09T18:48:36.803

Edit: dask now supports loc on lists:

ddf_selected = ddf.loc[indices_i_want_to_select]

The following should still work, but is not necessary anymore:

import pandas as pd
import dask.dataframe as dd

# generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions=2)

# list of indices I want to select
l = ['i1', 4, 5]

# generate new dask dataframe containing only the specified indices;
# meta describes the output columns as a {name: dtype} dict
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta=ddf.dtypes.to_dict())

edited Feb 09 '20 at 18:48

answered Nov 07 '16 at 23:04


For me this example returns an empty Series. This is the same problem I encountered in my own code. Am I missing something? – Tom Hemmes Nov 14 '17 at 14:08

Aha, the difference is in applying the method to index or other column value. In case of other column value, simply use: `ddf_selected = ddf[ddf.B.isin(l)]` – Tom Hemmes Nov 14 '17 at 14:49

score 2 · Answer 2 · answered Jun 20 '19 at 07:59

With dask version 1.2.0, the example from the accepted answer raises an error because of the mixed index types (strings and integers). In any case, `loc` is now an option:

import pandas as pd
import dask.dataframe as dd

# generate example dataframe (all index labels are strings)
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions=2)

# list of indices I want to select
l = ['i1', '4', '5']

# generate new dask dataframe containing only the specified indices
# (replaces the map_partitions call from the accepted answer)
ddf_selected = ddf.loc[l]
ddf_selected.head()
