Dask - Find duplicate values

Question

I need to find duplicates in a column in a dask DataFrame.

pandas has a duplicated() method for this, but dask does not support it.

Q: What is the best way to get all duplicated values in dask?

My idea: set the column I'm checking as the index, then call drop_duplicates, and then join.

Is there any better solution?

For example:

import dask.dataframe
import pandas

df = pandas.DataFrame(
    [
        ['a'],
        ['b'],
        ['c'],
        ['a']
    ],
    columns=['col']
)
df_test = dask.dataframe.from_pandas(df, npartitions=2)
# Expected: a DataFrame containing the value 'a', since it appears twice
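
For reference, the pandas behaviour mentioned above can be sketched as follows. This is only an illustration on the same sample data, using plain pandas; the variable names (mask_later, mask_all, dup_values) are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c", "a"]})

# duplicated() marks every occurrence after the first one as True
mask_later = df["col"].duplicated()
# keep=False marks all occurrences of a repeated value as True
mask_all = df["col"].duplicated(keep=False)

# Distinct values that appear more than once
dup_values = df.loc[mask_later, "col"].unique().tolist()
print(dup_values)    # ['a']
print(df[mask_all])  # both rows holding 'a'
```

The goal is to get the equivalent of dup_values (or the rows selected by mask_all) from a dask DataFrame.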

score 2 · Answer 1 · answered Oct 08 '20 at 13:05

I came up with the following solution:

import dask.dataframe as dd
import pandas

if __name__ == '__main__':
    df = pandas.DataFrame(
        [
            ['a'],
            ['b'],
            ['c'],
            ['a']
        ],
        columns=["col-a"]
    )
    ddf = dd.from_pandas(df, npartitions=2)

    # Apparently the code below fails if the dask DataFrame is empty
    if ddf.index.size.compute() != 0:
        # Setting the index repartitions the data, so all duplicates of a value end up in one partition
        indexed_df = ddf.set_index('col-a', drop=False)
        # Mark duplicated values within each partition; dask DataFrame does not support duplicated().
        dups = indexed_df.map_partitions(lambda d: d.duplicated())
        # Select the rows flagged as duplicates in the previous step.
        duplicates = indexed_df[dups].compute().index.tolist()
        print(duplicates)  # Prints: ['a']

Can this be further improved?

score 0 · Answer 2 · answered Mar 11 '21 at 03:57

import dask.dataframe as dd
import pandas

if __name__ == '__main__':
    df = pandas.DataFrame(
        [
            ['a'],
            ['b'],
            ['b'],
            ['c'],
            ['a'],
            ['a']
        ],
        columns=["col-a"]
    )
    ddf = dd.from_pandas(df, npartitions=2)
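    # Count how often each value occurs; the result has one row per distinct value
    counts = ddf['col-a'].value_counts().compute()
    # Values that occur more than once
    duplicated_values = counts[counts > 1].index.tolist()
    # Keep every row whose value is among the duplicated ones
    ddf = ddf[ddf["col-a"].isin(duplicated_values)]
    print(ddf.compute())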
