I need to find duplicates in a column of a dask DataFrame. In pandas there is a duplicated() method for this, but dask does not support it.
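For reference, here is what I would do in pandas (keep=False marks every occurrence of a repeated value, not just the later ones):

import pandas

df = pandas.DataFrame({'col': ['a', 'b', 'c', 'a']})
print(df[df.duplicated('col', keep=False)])
#   col
# 0   a
# 3   a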
Q: What is the best way of getting all duplicated values in dask?
My Idea: set the column I'm checking as the index, then drop_duplicates, and then join back to the original frame.

Is there any better solution?
For example:

import pandas
import dask.dataframe

df = pandas.DataFrame(
    [
        ['a'],
        ['b'],
        ['c'],
        ['a'],
    ],
    columns=['col'],
)
df_test = dask.dataframe.from_pandas(df, npartitions=2)
# Expected: a dataframe containing the value 'a', as it appears twice
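For what it's worth, here is a minimal sketch of a count-and-join variant of my idea: count occurrences per value with groupby, then merge the counts back and filter. The names counts, joined, and dupes are mine, and this may not be the exact join step intended above.

import pandas
import dask.dataframe

df = pandas.DataFrame([['a'], ['b'], ['c'], ['a']], columns=['col'])
df_test = dask.dataframe.from_pandas(df, npartitions=2)

# Count how many times each value occurs across all partitions.
counts = df_test.groupby('col').size().to_frame('n')

# Join the counts back onto the frame, keep values seen more than once,
# and deduplicate so each duplicated value appears once in the result.
joined = df_test.merge(counts, left_on='col', right_index=True)
dupes = joined[joined['n'] > 1][['col']].drop_duplicates()

print(dupes.compute())
#   col
# 0   a    (row index may vary)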