I want to check for fuzzy duplicates in a column of a dataframe using fuzzywuzzy. Right now I iterate over the rows one by one using two nested for loops:
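from fuzzywuzzy import fuzz

# Compare every value in the column with every value in the same column.
for i in df['col']:
    for j in df['col']:
        ratio = fuzz.ratio(i, j)
        if ratio > 90:
            print("row duplicates")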
Except that my dataframe contains 600,000 rows, and this code has a complexity of O(n²).
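To give a sense of scale, here is a rough back-of-the-envelope sketch of how many comparisons the nested loops perform. The row count is the one stated above; the throughput of one million ratio calls per second is purely an assumed figure for illustration, not a measurement.

```python
# Rough cost estimate for the two nested loops over the column.
n = 600_000  # number of rows, as stated in the question

ordered_pairs = n * n               # comparisons the nested loops actually make
unordered_pairs = n * (n - 1) // 2  # distinct pairs, ignoring order and self-matches

# Hypothetical throughput, assumed only to convert the count into a duration.
assumed_ratios_per_second = 1_000_000
hours = ordered_pairs / assumed_ratios_per_second / 3600

print(f"{ordered_pairs:,} ordered comparisons")
print(f"{unordered_pairs:,} unordered pairs")
print(f"about {hours:.0f} hours at the assumed rate")
```

Even comparing each unordered pair only once would halve the work without changing the quadratic growth.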
Is there a lighter way of doing this?