Dask: subset (or drop) rows from Dataframe by index

Question

I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.)

Both `cache.drop([overlap_list])` and `diff = cache[should_keep_bool_array]` either raise a `NotImplementedError` or otherwise don't work.

What is the best way to do this?

The functionality of operations on the index in Dask is fairly limited. For instance, [`Index.difference`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.difference.html#pandas.Index.difference) would be the straightforward implementation, but it's also not implemented. - Brad Solomon, Nov 20 '17 at 18:12

score 0 · Accepted Answer · answered Nov 21 '17 at 19:01

I'm not sure this is the "best" way, but here's how I ended up doing it:

1. Create a Pandas DataFrame whose index is the series of index keys I want to keep (e.g., `pd.DataFrame(index=overlap_list)`).
2. Inner join the Dask DataFrame with it on the index.

terry87

score -2 · Answer 2 · answered Dec 31 '17 at 13:15

Another possibility is:

df_index = df.reset_index()
df_index = df_index.drop_duplicates()

skibee
