
pandas.DataFrame.duplicated is great for finding duplicate rows across specified columns within a dataframe.

However, my dataset is larger than what fits in memory (and even larger than what would fit after extending memory within reasonable budget limits).

This is fine for most of the analyses I have to run, since I can loop over my dataset (CSV and DBF files), load each file into memory on its own, and process everything in sequence. For duplicate analysis, however, this only finds duplicates within single files, not across the whole dataset.

Is there any algorithm or approach for finding duplicates across multiple dataframes while not having to load them all into memory at the same time?

    How about hashing the values of rows and looking for duplicate hash values? – AndreyF Feb 16 '17 at 15:22
  • I don't understand how "GETTING" duplicate rows, as this question asks, is the same as "DROPPING" duplicate rows, as the question this one was marked a duplicate of asks – robertspierre Apr 19 '19 at 08:56

2 Answers


You can hash the values of the "key" columns and maintain a set of the hash codes you have already encountered:

import hashlib
import pandas as pd

hash_set = set()  # holds the hash codes of all rows seen so far, across files

def is_duplicate(row):
    m = hashlib.md5()
    for c in ["column1", "column2", "column3"]:
        m.update(str(row[c]).encode())  # hash bytes, not raw values
        m.update(b"|")  # separator, so adjacent columns cannot blur together
    hash_code = m.digest()
    if hash_code in hash_set:
        return 1
    hash_set.add(hash_code)
    return 0

for df_path in [df1_path, df2_path, df3_path]:  # iterate over the dataframes one by one
    df = pd.read_csv(df_path)  # load the dataframe
    df["duplicate"] = df.apply(is_duplicate, axis=1)
    unique_df = df[df["duplicate"] == 0].drop(columns="duplicate")  # "globally" unique rows
    # YOUR CODE...
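
If a single file is itself too big to load at once, the same idea extends to pandas' chunked CSV reading. A minimal sketch, reusing the is_duplicate function above (the chunk size is an arbitrary assumption):

for df_path in [df1_path, df2_path, df3_path]:
    # stream each file in chunks so no file ever has to be loaded in full;
    # is_duplicate keeps its global hash_set across chunks and files
    for chunk in pd.read_csv(df_path, chunksize=100_000):
        chunk["duplicate"] = chunk.apply(is_duplicate, axis=1)
        unique_chunk = chunk[chunk["duplicate"] == 0].drop(columns="duplicate")
        # process or write out unique_chunk here

The only state that must stay in memory is hash_set itself: one 16-byte MD5 digest per unique row, which is normally far smaller than the data.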
AndreyF

I would suggest two things. The first is to load the data frames into an RDBMS if possible; then you can find duplicates by grouping on the key columns.
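
As an illustration of the RDBMS route, here is a minimal sketch using Python's built-in sqlite3 through pandas; the table name, key column names, chunk size, and file paths are all made-up assumptions:

import sqlite3
import pandas as pd

con = sqlite3.connect("dedup.db")  # on-disk database, so little RAM is needed

# load every file into one table, in chunks that fit in memory
for path in ["part1.csv", "part2.csv"]:  # hypothetical file paths
    for chunk in pd.read_csv(path, chunksize=100_000):
        chunk.to_sql("records", con, if_exists="append", index=False)

# any combination of key-column values occurring more than once is a duplicate
dupes = pd.read_sql_query(
    "SELECT column1, column2, column3, COUNT(*) AS n "
    "FROM records GROUP BY column1, column2, column3 "
    "HAVING COUNT(*) > 1",
    con,
)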

The second is to extract only the key columns from the big files and compare those with each other.

Try to sort the rows by the key columns, so that you can detect a duplicate by comparing each row only with the next one.
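
A minimal sketch of this extract-and-sort idea, assuming the key columns alone fit in memory (column names and file paths are illustrative):

import pandas as pd

key_cols = ["column1", "column2", "column3"]  # assumed key columns

# pull only the key columns from each (hypothetical) file
keys = pd.concat(
    [pd.read_csv(p, usecols=key_cols) for p in ["part1.csv", "part2.csv"]],
    ignore_index=True,
)

keys = keys.sort_values(key_cols)
# after sorting, a duplicate is a row identical to the one directly above it
dup_mask = (keys == keys.shift()).all(axis=1)
duplicate_keys = keys[dup_mask]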

Hope that helps.

JeyJey
  • sorting is essentially a "find duplicates" problem in itself; this is a good suggestion only if your data is already sorted – Carlos Nov 09 '20 at 16:47