I have this code that functions properly and produces the result I am looking for:
```python
from thefuzz import fuzz
import pandas as pd

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')

# Pairwise matrix of partial_ratio scores: cell [i, j] compares text j to text i
df_compare = pd.DataFrame(
    df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())

# Zero out the diagonal and upper triangle so each pair is counted only once
for i in df_compare.index:
    for j in df_compare.columns[i:]:
        df_compare.iloc[i, j] = 0

# Keep only rows whose best match against any earlier row is below 75
df[df_compare.max(axis=1) < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)
print('Done did')
```
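As an aside, I realise the double loop that zeroes out the matrix could itself be vectorised; here is a small numpy sketch that I believe is equivalent (though the dominant cost is still the fuzzy matching itself, not this loop):

```python
import numpy as np

# Assumed equivalent to the double loop above: build a boolean mask over
# the diagonal and upper triangle, then zero those cells in one step
mask = np.triu(np.ones(df_compare.shape, dtype=bool))
df_compare = df_compare.mask(mask, 0)
```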
However, since this compares every row against every other row, the number of string comparisons grows quadratically, so the script is very slow and only works on relatively small CSV files with 5,000-7,000 rows. Anything larger (over 12 MB) runs for days before failing with a memory-related error. I attempted running it with modin on 32 cores and 32 GB of memory, but it did not change anything and I ended up with the same result:
```python
import glob

from thefuzz import fuzz
import modin.pandas as pd

files = glob.glob('/folder/folder/2013/*.csv')
for file in files:
    df = pd.read_csv(file, dtype=str, lineterminator='\n')
    df_compare = pd.DataFrame(
        df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())
    for i in df_compare.index:
        for j in df_compare.columns[i:]:
            df_compare.iloc[i, j] = 0
    df[df_compare.max(axis=1) < 75].to_csv(f'{file[:-4]}-done.csv', index=False)
    print(f'{file} has been done')
```
It works on smaller files when run as a separate job, but there are too many files to process them all separately. Is there a way to optimise this code, or some other possible solution?
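One direction I have been looking at, but have not yet tested at scale, is rapidfuzz (which, as I understand it, thefuzz wraps internally): its process.cdist computes the whole similarity matrix in C++ with multithreading. A minimal sketch of what I mean, assuming the same 75 cutoff:

```python
import numpy as np
import pandas as pd
from rapidfuzz import fuzz, process

df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')
texts = df['text'].tolist()

# Whole pairwise matrix in one multithreaded call; uint8 keeps it at n*n bytes
scores = process.cdist(texts, texts, scorer=fuzz.partial_ratio,
                       dtype=np.uint8, workers=-1)

# Same effect as my double loop: ignore the diagonal and upper triangle
scores[np.triu_indices_from(scores)] = 0
df[scores.max(axis=1) < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)
```

This still materialises the full n x n matrix, though, so I am not sure it would avoid the memory error on the larger files.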
The data is a collection of tweets, and only one column is being compared (out of around 30 columns). It looks like this:
| ID     | Text                              |
|--------|-----------------------------------|
| 11213  | I am going to the cinema          |
| 23213  | Black is my favourite colour      |
| 35455  | I am going to the cinema with you |
| 421323 | My friends think I am a good guy. |
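In case it helps, I also sketched a chunked variant of the rapidfuzz idea above, so the full n x n matrix is never held in memory at once; max_prior_similarity is my own helper name, and I have only reasoned it through on paper:

```python
import numpy as np
from rapidfuzz import fuzz, process

def max_prior_similarity(texts, block=1000):
    """Highest partial_ratio of each text against any *earlier* text,
    computed one block of rows at a time (block x n memory, not n x n)."""
    n = len(texts)
    best = np.zeros(n, dtype=np.uint8)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Rows start..stop compared against every text, multithreaded
        scores = process.cdist(texts[start:stop], texts,
                               scorer=fuzz.partial_ratio,
                               dtype=np.uint8, workers=-1)
        for i in range(start, stop):
            if i:  # only texts before position i count, as in my loop above
                best[i] = scores[i - start, :i].max()
    return best

# df[max_prior_similarity(df['text'].tolist()) < 75] would then mirror my filter
```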