I am trying to find the most efficient way of finding the closest match to a particular string.
Here's some code I found online that works for me, but the data I am working on is huge, and the strings can be up to 200 characters long. The strings also contain special characters (for instance, Chinese characters), float values, and other characters like '/' and ':'. As a result, the code takes a long time to process the data, and I was wondering if there is a better way to handle it.
import pandas as pd
from fuzzywuzzy import process

df = pd.read_excel('aaa.xlsx', sheet_name='test')
df = df.applymap(str)  # convert the entire worksheet to strings

# for each string in ColumnB, find the single closest match in ColumnA
similarity = df.assign(Output=[process.extract(i, df['ColumnA'], limit=1) for i in df['ColumnB']])
similarity.to_excel('bbb.xlsx')
Context: I have around 900 rows of strings (ColumnA), and I am comparing them one by one against 4,000 rows of strings (ColumnB). Would be great if someone could please help. Thanks!
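For reference, here is a minimal self-contained sketch of the lookup I'm doing, written with only the standard library's difflib so it runs without my Excel file. The sample strings are made up stand-ins for my real columns, and I don't know whether SequenceMatcher is faster or slower than fuzzywuzzy on the full data:

```python
import difflib

# Made-up sample strings standing in for the real columns
# (the real file has ~900 strings in ColumnA and ~4000 in ColumnB,
# including Chinese characters, floats, '/', ':' etc.)
column_a = ['apple iphone 12', 'samsung galaxy s21', 'google pixel 5']
column_b = ['iphone 12 apple', 'galaxy s21 / 128GB', '3.14 pixel five']

def best_match(query, candidates):
    """Return (candidate, similarity) for the closest candidate to query."""
    scores = [(c, difflib.SequenceMatcher(None, query, c).ratio())
              for c in candidates]
    return max(scores, key=lambda pair: pair[1])

# mirrors process.extract(i, df['ColumnA'], limit=1) for each i in ColumnB
for s in column_b:
    match, score = best_match(s, column_a)
    print(f'{s!r} -> {match!r} ({score:.2f})')
```

This reproduces the per-row behaviour of process.extract(..., limit=1) with plain Python objects, which at least makes it easy to time the matching step separately from the Excel I/O.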