3

I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.)

Both `cache.drop([overlap_list])` and `diff = cache[should_keep_bool_array]` either raise a `NotImplementedError` or otherwise don't work.

What is the best way to do this?

terry87
  • Dask's support for operations on the index is fairly limited. For instance, [`Index.difference`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.difference.html#pandas.Index.difference) would be the straightforward implementation, but it's also not implemented. – Brad Solomon Nov 20 '17 at 18:12

2 Answers

0

I'm not sure this is the "best" way, but here's how I ended up doing it:

  1. Create a Pandas DataFrame whose index is the series of index keys I want to keep (e.g., `pd.DataFrame(index=overlap_list)`).
  2. Inner join the Dask DataFrame with it.
terry87
-2

Another possibility is:

df_index = df.reset_index()
df_index = df_index.drop_duplicates()
skibee