
I am in the process of deduplicating (fuzzy matching with string similarity algorithms) a database table with 25 million plus rows. pandas_dedupe has been working perfectly for this on smaller sets of data, even up to 5 million rows. Beyond 5 million rows the process takes too long and crashes, even after letting it run for 8+ hours. Any ideas on how to optimize this? Ideally I would have access to a PySpark environment, but unfortunately that is not the case.

import pandas as pd
import pandas_dedupe

df = pd.read_sql("select row1, row2, row3")

df_final = pandas_dedupe.dedupe_dataframe(df, ['row1','row2','row3'])

df_final.to_csv('deduplicationOUT.csv')
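
For reference, the kind of workaround I have been sketching is to block the data on a cheap key first and only run fuzzy comparisons inside each block, so no comparison ever spans all 25 million rows. The snippet below is only a rough sketch of that idea and is not tested at this scale: it swaps pandas_dedupe for plain rapidfuzz ratio comparisons, and the blocking rule (first three characters of row1) and the 90 similarity threshold are arbitrary placeholders.

import pandas as pd
from rapidfuzz import fuzz

def dedupe_block(block, threshold=90):
    # Keep the first row of each fuzzy-duplicate cluster within one block.
    kept_rows = []
    kept_keys = []
    for _, row in block.iterrows():
        key = f"{row['row1']} {row['row2']} {row['row3']}"
        if any(fuzz.ratio(key, k) >= threshold for k in kept_keys):
            continue  # close enough to an already-kept row -> treat as duplicate
        kept_keys.append(key)
        kept_rows.append(row)
    return pd.DataFrame(kept_rows)

# Placeholder blocking key: first three characters of row1, lower-cased.
df['block'] = df['row1'].astype(str).str[:3].str.lower()

df_final = (
    df.groupby('block', group_keys=False)
      .apply(dedupe_block)
      .drop(columns='block')
)

With reasonably even blocks this keeps the pairwise comparisons quadratic only within each block rather than across the whole table, but it is a much cruder clustering than what dedupe's learned model does, so I would rather optimize the pandas_dedupe approach if possible.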

codingInMyBasement
  • It might be that you are running out of memory with a dataset this large. May I ask what the size of your dataset and RAM is? – Nik Jul 23 '21 at 21:36
  • Maybe worth a read [Is there any faster alternative to col.drop_duplicates()?](https://stackoverflow.com/questions/54196959/is-there-any-faster-alternative-to-col-drop-duplicates) – MDR Jul 23 '21 at 21:44

0 Answers