
I want to check for fuzzy duplicates in a column of a dataframe using fuzzywuzzy. To do this, I iterate over the rows with two nested for loops:

from fuzzywuzzy import fuzz

for i in df['col']:
    for j in df['col']:
        # compare every value against every other value
        ratio = fuzz.ratio(i, j)
        if ratio > 90:
            print("row duplicates")

The problem is that my dataframe contains 600,000 rows, so this code has O(n²) complexity. Is there a more efficient way to do this?

Shrmn

1 Answer


For your use case I would recommend using RapidFuzz (I am the author). In particular, the function process.cdist should allow you to implement this very efficiently:

import numpy as np
from rapidfuzz import fuzz, process

scores = process.cdist(df['col'], df['col'],
                       scorer=fuzz.ratio, dtype=np.uint8,
                       score_cutoff=90, workers=-1)
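
Scores below score_cutoff are returned as 0, so every nonzero off-diagonal entry marks a pair with a ratio of at least 90. As a sketch (assuming df['col'] holds strings as in your question), the matching pairs could then be read off the matrix like this:

for a, b in np.argwhere(scores > 0):
    if a != b:  # skip the trivial self-matches on the diagonal
        print("row duplicates:", a, b)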

Note that this creates a matrix of size len(df['col']) * len(df['col']), which would be way too large when working with 600,000 elements (around 335 GB). To reduce the memory usage you can compare the strings in multiple smaller steps:

process.cdist(df['col'][0:10000], df['col'][0:10000],
    scorer=fuzz.ratio, dtype=np.uint8, score_cutoff=90, workers=-1)
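
Looping this over all pairs of blocks might look like the following sketch (the 10,000 block size is an arbitrary trade-off between memory use and per-call overhead; each 10,000 × 10,000 uint8 block takes about 100 MB):

chunk = 10_000
strings = df['col'].tolist()

for i in range(0, len(strings), chunk):
    for j in range(0, len(strings), chunk):
        block = process.cdist(strings[i:i + chunk], strings[j:j + chunk],
                              scorer=fuzz.ratio, dtype=np.uint8,
                              score_cutoff=90, workers=-1)
        # add the block offsets to recover positions in the full column
        for a, b in np.argwhere(block > 0):
            if i + a != j + b:
                print("row duplicates:", i + a, j + b)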

Note that since you're comparing the column with itself, the result matrix is symmetric: fuzz.ratio(df['col'][1], df['col'][0]) == fuzz.ratio(df['col'][0], df['col'][1]). This means you can skip roughly half of the comparisons, as sketched below.
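
Building on the block loop above (with strings and chunk as defined there), one way to exploit this symmetry is to only compute blocks on or above the diagonal and keep each pair once; again, this is just a sketch:

for i in range(0, len(strings), chunk):
    # start the inner loop at i so only the upper triangle is computed
    for j in range(i, len(strings), chunk):
        block = process.cdist(strings[i:i + chunk], strings[j:j + chunk],
                              scorer=fuzz.ratio, dtype=np.uint8,
                              score_cutoff=90, workers=-1)
        for a, b in np.argwhere(block > 0):
            # keep each pair once and drop self-matches on the diagonal
            if i + a < j + b:
                print("row duplicates:", i + a, j + b)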

maxbachmann