I have a CSV file with ~20,000 words and I'd like to group the words by similarity. For this task I am using the fantastic fuzzywuzzy package, which seems to work really well and achieves exactly what I am looking for with a small dataset (~100 words).
The words are actually brand names. This is a sample output from the small dataset I just mentioned, where similar brands are grouped by name:
[
('asos-design', 'asos'),
('m-and-s', 'm-and-s-collection'),
('polo-ralph-lauren', 'ralph-lauren'),
('hugo-boss', 'boss'),
('yves-saint-laurent', 'saint-laurent')
]
Now, my problem is that running my current code on the full dataset is really slow, and I don't know how to improve the performance or how to do this without two nested for loops.
This is my code:
import csv
from fuzzywuzzy import fuzz

THRESHOLD = 90

possible_matches = []

with open('words.csv', encoding='utf-8') as csvfile:
    words = []
    reader = csv.reader(csvfile)
    for row in reader:
        word, x, y, *rest = row
        words.append(word)
    for i in range(len(words) - 1):
        for j in range(i + 1, len(words)):
            if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD:
                possible_matches.append((words[i], words[j]))
        print(i)

print(possible_matches)
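For scale: the two nested loops score every unique pair of words, so the work grows quadratically with the word count. A quick sketch of the pair counts (pure arithmetic, no fuzzywuzzy needed) shows why ~20,000 words is so much slower than ~100:

```python
from math import comb

# Each unique pair is scored once: n * (n - 1) / 2 calls to token_set_ratio
for n in (100, 20_000):
    pairs = comb(n, 2)
    print(f"{n:>6} words -> {pairs:,} comparisons")
# 100 words  ->         4,950 comparisons
# 20,000 words -> 199,990,000 comparisons
```

Going from 100 to 20,000 words multiplies the number of `fuzz.token_set_ratio` calls by roughly 40,000x, which is why the same code that felt instant on the small dataset crawls on the full one.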
How can I improve the performance?