I am in the process of deduplicating (fuzzy matching with string-similarity algorithms) a database table with 25+ million rows. pandas_dedupe has worked perfectly for this on smaller sets, even up to about 5 million rows. Beyond that, the process takes too long and crashes, even after running for 8+ hours. Any ideas on how to optimize this? Ideally I would have access to a PySpark environment, but unfortunately that is not the case. Here is a simplified version of what I am running (the connection string and table name below are placeholders):
import pandas as pd
import pandas_dedupe
from sqlalchemy import create_engine

# Connection string and table name are placeholders; the real ones are omitted here.
engine = create_engine("postgresql://user:password@host/dbname")
df = pd.read_sql("select row1, row2, row3 from my_table", engine)
# Fuzzy-match / deduplicate on the three columns (interactive labeling happens on the first run).
df_final = pandas_dedupe.dedupe_dataframe(df, ['row1', 'row2', 'row3'])
df_final.to_csv('deduplicationOUT.csv')
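For context on why the jump from 5 million to 25 million rows hurts so much: the number of candidate pairs grows roughly quadratically with the row count. The sketch below is just back-of-the-envelope arithmetic on my part (dedupe-style libraries use blocking to prune pairs, so this is an upper bound, not what pandas_dedupe actually evaluates):

# Naive candidate-pair count: n * (n - 1) / 2 pairs for n rows.
# Upper bound only; blocking prunes most of these, but it shows the quadratic growth.
def naive_pair_count(n_rows: int) -> int:
    return n_rows * (n_rows - 1) // 2

for n in (1_000_000, 5_000_000, 25_000_000):
    print(f"{n:>12,} rows -> {naive_pair_count(n):.3e} naive pairs")

That works out to roughly 5.0e+11 pairs at 1M rows, 1.25e+13 at 5M, and 3.1e+14 at 25M, so 5x the rows is about 25x the naive work.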