NotImplementedError is thrown when I use isin with Dask data frames

Question

Let's say I have two dask data frames:

import dask.dataframe as dd
import pandas as pd

dd_1 = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3], 'b': [6, 7, 8]}), npartitions=1)

dd_2 = dd.from_pandas(pd.DataFrame({'a': [1, 2, 5], 'b': [3, 7, 1]}), npartitions=1)

Now I want to filter the first one, keeping only the rows whose value in column 'a' also appears in column 'a' of the second one:

dd_1[dd_1.a.isin(dd_2.a)]

When I try to do this, the following error is thrown:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-38-850f035e0842> in <module>
----> 1 dd_1[dd_1.a.isin(dd_2.a)]

/usr/local/lib/python3.7/site-packages/dask/dataframe/core.py in isin(self, values)
   2113     @derived_from(pd.Series)
   2114     def isin(self, values):
-> 2115         return elemwise(M.isin, self, list(values))
   2116 
   2117     @insert_meta_param_description(pad=12)

/usr/local/lib/python3.7/site-packages/dask/dataframe/core.py in __getitem__(self, key)
   2045             graph = HighLevelGraph.from_collections(name, dsk, dependencies=[self, key])
   2046             return Series(graph, name, self._meta, self.divisions)
-> 2047         raise NotImplementedError()
   2048 
   2049     @derived_from(pd.DataFrame)

NotImplementedError:

Any suggestions?

Can you state more clearly what you mean by "filter the first one using the values of the column in the second one"? — Jeremy McGibbon, Mar 19 '19 at 20:19
Exactly what you see in the example I provided. I want to keep the rows of dd_1 whose values of the 'a' column are in the values of the 'a' column in dd_2. :) — amarchin, Mar 19 '19 at 21:37

score 2 · Answer 1 · edited Jan 05 '20 at 13:46

With the latest version of dask (2.9.1 at the time of writing), my workaround was to convert the second series (dd_2.a in your case) to a pandas Series before passing it to isin.

answered Jan 05 '20 at 12:13 by BinderNet · edited Jan 05 '20 at 13:46 by Hamed Baziyad