
I am trying to find the most efficient way of finding the closest match to a particular string.

Here's some code I found online that works for me, but the data I am working with is huge and the strings can be up to 200 characters long. The strings also contain special characters (for instance, Chinese characters), float values, and other characters like '/' and ':'. As a result, the code takes a long time to process the data, and I was wondering whether there is a better way to handle it.

import pandas as pd
from fuzzywuzzy import process

df = pd.read_excel('aaa.xlsx', sheet_name='test')
df = df.applymap(str)  # convert the entire worksheet to strings

# For each string in ColumnB, keep the single best fuzzy match from ColumnA.
similarity = df.assign(Output=[process.extract(i, df['ColumnA'], limit=1) for i in df['ColumnB']])
similarity.to_excel('bbb.xlsx')

Context: I have around 900 rows of strings (ColumnA) and I am comparing them one-by-one with 4,000 rows of strings (ColumnB), so roughly 3.6 million pairwise comparisons. Would be great if someone could please help. Thanks!

silentwraith
  • There's a comparator called TF-IDF. scikit-learn implements a version of it at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. You could use TF-IDF to find the most similar strings between columns A and B (a minimal sketch follows below). – rajah9 Jul 04 '20 at 13:53
  • One other thing you can try, if you would like to keep the way fuzzywuzzy matches strings, is my library RapidFuzz: https://github.com/maxbachmann/rapidfuzz. It matches strings similarly to FuzzyWuzzy but is quite a bit faster (second sketch below). – maxbachmann Jul 06 '20 at 06:25
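
Following up on rajah9's TF-IDF suggestion, here is a minimal sketch. It assumes the same aaa.xlsx layout and column names as in the question; the character n-gram range (2, 3) is a tuning guess, not a recommendation. Character n-grams give edit-distance-style fuzziness and handle Chinese characters, floats, '/' and ':' without any tokenization issues:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_excel('aaa.xlsx', sheet_name='test')  # layout assumed from the question
col_a = df['ColumnA'].dropna().astype(str)  # ~900 candidate strings
col_b = df['ColumnB'].dropna().astype(str)  # ~4,000 query strings

# Fit one vocabulary over both columns so their vectors share the same space.
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
vectorizer.fit(pd.concat([col_a, col_b]))

a_vecs = vectorizer.transform(col_a)
b_vecs = vectorizer.transform(col_b)

# One vectorized (4000 x 900) similarity matrix instead of millions of
# per-pair scorer calls.
sim = cosine_similarity(b_vecs, a_vecs)
best = sim.argmax(axis=1)  # index of the best ColumnA match per ColumnB row

out = pd.DataFrame({
    'ColumnB': col_b.values,
    'BestMatch': col_a.iloc[best].values,
    'Score': sim.max(axis=1),
})
out.to_excel('bbb.xlsx', index=False)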
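
And a sketch of the RapidFuzz route from the second comment, keeping the question's extract-the-single-best-match shape. fuzz.WRatio mirrors the default scorer used by fuzzywuzzy's process functions; the file layout is again assumed from the question:

import pandas as pd
from rapidfuzz import process, fuzz

df = pd.read_excel('aaa.xlsx', sheet_name='test')  # layout assumed from the question
col_a = df['ColumnA'].dropna().astype(str).tolist()  # choices
col_b = df['ColumnB'].dropna().astype(str)           # queries

# extractOne returns a (match, score, index) tuple for each query string.
results = [process.extractOne(s, col_a, scorer=fuzz.WRatio) for s in col_b]

out = pd.DataFrame({
    'ColumnB': col_b.values,
    'BestMatch': [r[0] for r in results],
    'Score': [r[1] for r in results],
})
out.to_excel('bbb.xlsx', index=False)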

0 Answers