Let's say I have two dask data frames:

import dask.dataframe as dd
import pandas as pd

dd_1 = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3], 'b': [6, 7, 8]}), npartitions=1)

dd_2 = dd.from_pandas(pd.DataFrame({'a': [1, 2, 5], 'b': [3, 7, 1]}), npartitions=1)

Now I want to filter the first one, keeping only the rows whose value in column 'a' also appears in column 'a' of the second one:

dd_1[dd_1.a.isin(dd_2.a)]

When I try this, the following error is raised:

NotImplementedError                       Traceback (most recent call last)
<ipython-input-38-850f035e0842> in <module>
----> 1 dd_1[dd_1.a.isin(dd_2.a)]

/usr/local/lib/python3.7/site-packages/dask/dataframe/core.py in isin(self, values)
   2113     @derived_from(pd.Series)
   2114     def isin(self, values):
-> 2115         return elemwise(M.isin, self, list(values))
   2116 
   2117     @insert_meta_param_description(pad=12)

/usr/local/lib/python3.7/site-packages/dask/dataframe/core.py in __getitem__(self, key)
   2045             graph = HighLevelGraph.from_collections(name, dsk, dependencies=[self, key])
   2046             return Series(graph, name, self._meta, self.divisions)
-> 2047         raise NotImplementedError()
   2048 
   2049     @derived_from(pd.DataFrame)

NotImplementedError: 

Any suggestions?

amarchin
  • Can you state more clearly what you mean by "filter the first one using the values of the column in the second one"? – Jeremy McGibbon Mar 19 '19 at 20:19
  • Exactly what you see in the example I provided. I want to keep the rows of dd_1 whose values of the 'a' column are in the values of the 'a' column in dd_2. :) – amarchin Mar 19 '19 at 21:37

1 Answer

With dask 2.9.1 (the latest version at the time of writing), my workaround was to convert the second series (dd_2.a in your case) to pandas before passing it to isin.

Hamed Baziyad
BinderNet