
I have a CSV file with ~20,000 words and I'd like to group the words by similarity. For this task, I am using the fantastic fuzzywuzzy package, which seems to work really well and achieves exactly what I am looking for with a small dataset (~100 words).

The words are actually brand names. This is a sample output from the small dataset I just mentioned, where I get the similar brands grouped by name:

[
    ('asos-design', 'asos'), 
    ('m-and-s', 'm-and-s-collection'), 
    ('polo-ralph-lauren', 'ralph-lauren'), 
    ('hugo-boss', 'boss'), 
    ('yves-saint-laurent', 'saint-laurent')
]

My problem is that running my current code on the full dataset is really slow, and I don't know how to improve the performance or how to do it without using two for loops.

This is my code:

import csv
from fuzzywuzzy import fuzz

THRESHOLD = 90

possible_matches = []


with open('words.csv', encoding='utf-8') as csvfile:
    words = []
    reader = csv.reader(csvfile)
    for row in reader:
        word, x, y, *rest = row
        words.append(word)

    for i in range(len(words)-1):
        for j in range(i+1, len(words)): 
            if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD:
                possible_matches.append((words[i], words[j]))

        print(i)
    print(possible_matches)

How can I improve the performance?

Dalvtor
  • For starters, eliminating the need for `append()` would boost performance significantly. For instance, your first loop could simply be `words = [row[0] for row in csv.reader(csvfile)]` instead; it would be much quicker. – r.ook Feb 14 '19 at 15:44
  • Do you need the pairs of words in exactly that order in `possible_matches`, or is the order of pairs inside `possible_matches` irrelevant to you? For example, would you accept `[('a','b'),('c','d')]` in place of `[('c','d'),('a','b')]`? – Daweo Feb 14 '19 at 16:04
  • The order is irrelevant; I just want the pairs. – Dalvtor Feb 14 '19 at 16:09
  • Can you show some examples of words that should be considered equal, maybe with your "small" dataset? Maybe you could [stem](https://en.wikipedia.org/wiki/Stemming) the words instead and create a dictionary `{stem: [list, of, words, with, that, stem], ...}`? That would be O(n) instead of O(n²). – tobias_k Feb 14 '19 at 16:37
  • Sure, I will update my question – Dalvtor Feb 14 '19 at 17:12

2 Answers


For 20,000 words (or brands), any approach that compares each word to every other word, i.e. has quadratic complexity O(n²), may be too slow. For 20,000 it may still be barely acceptable, but for any larger data set it will quickly break down.

Instead, you could try to extract some "feature" from your words and group them accordingly. My first idea was to use a stemmer, but since your words are names rather than real words, this will not work. I don't know how representative your sample data is, but you could try to group the words according to their components separated by `-`, then get the unique non-trivial groups, and you are done.

from collections import defaultdict

words = ['asos-design', 'asos', 'm-and-s', 'm-and-s-collection',
         'polo-ralph-lauren', 'ralph-lauren', 'hugo-boss', 'boss',
         'yves-saint-laurent', 'saint-laurent']

parts = defaultdict(list)
for word in words:
    for part in word.split("-"):
        parts[part].append(word)

result = set(tuple(group) for group in parts.values() if len(group) > 1)

Result:

{('asos-design', 'asos'),
 ('hugo-boss', 'boss'),
 ('m-and-s', 'm-and-s-collection'),
 ('polo-ralph-lauren', 'ralph-lauren'),
 ('yves-saint-laurent', 'saint-laurent')}

You might also want to filter out some stop words first, like `and`, or keep those together with the words around them. This will probably still yield some false positives, e.g. with words like `polo` or `collection` that may appear with several different brands, but I assume the same is true for fuzzywuzzy or similar. A bit of post-processing and manual filtering of the groups may be in order.
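
For illustration, a minimal sketch of such a stop-word filter applied to the grouping above (the STOP_WORDS set is just an assumed example, not anything prescribed; tune it to your data). With the sample words above it yields the same groups, since the remaining parts still line up:

from collections import defaultdict

# Assumed example stop words; adjust to your data.
STOP_WORDS = {"and", "the", "of"}

parts = defaultdict(list)
for word in words:
    for part in word.split("-"):
        if part not in STOP_WORDS:  # skip filler parts like "and"
            parts[part].append(word)

result = set(tuple(group) for group in parts.values() if len(group) > 1)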

tobias_k

Try using list comprehensions instead; they are faster than the list.append() method:

with open('words.csv', encoding='utf-8') as csvfile:
    words = [row[0] for row in csv.reader(csvfile)]

    possible_matches = [
        (words[i], words[j])
        for i in range(len(words) - 1)
        for j in range(i + 1, len(words))
        if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD
    ]

    print(possible_matches)

Unfortunately, with this approach you can't do a print(i) on each iteration, but assuming you only needed the print(i) for debugging, it won't affect your final result.
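
If you still want some progress output, one possible workaround (my own sketch, not part of the original answer; it reuses words, fuzz and THRESHOLD from the snippet above) is to route the outer index through a helper that prints it and always returns True, so it never filters anything:

def report_progress(i):
    print(i)         # show the current outer index
    return True      # always truthy, so nothing is filtered out

possible_matches = [
    (words[i], words[j])
    for i in range(len(words) - 1) if report_progress(i)
    for j in range(i + 1, len(words))
    if fuzz.token_set_ratio(words[i], words[j]) >= THRESHOLD
]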

Converting a loop into a list comprehension is straightforward. Suppose you have a loop like this:

for i in iterable_1:
    lst.append(something)

The list comprehension becomes:

lst = [something for i in iterable_1]

For nested loops and conditions, just follow the same logic:

for i in iterable_1:
    for j in iterable_2:
        ...
        if some_condition:
            lst.append(something)

# becomes

lst = [something for i in iterable_1 for j in iterable_2 ... if some_condition]

# Or if you have an else clause:

for i in iterable_1:
    ...
    if some_condition:
        lst.append(something)
    else:
        lst.append(something_else)

# becomes

lst = [something if some_condition else something_else for i in iterable_1 for j in iterable_2 ...]
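
As a concrete (made-up) illustration of that pattern, here is a small nested loop with a condition and its comprehension equivalent:

# Plain loop version
pairs = []
for i in range(3):
    for j in range(3):
        if i < j:
            pairs.append((i, j))

# Equivalent list comprehension
pairs = [(i, j) for i in range(3) for j in range(3) if i < j]

# Both produce [(0, 1), (0, 2), (1, 2)]
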
r.ook
  • Will this really make a significant difference, given that OP's dataset has 20,000 words and this is still O(n²)? – tobias_k Feb 14 '19 at 16:36
  • Note: list comprehensions, while making code succinct, are a pain to debug, and don't even get me started on maintainability... – Srikar Appalaraju Feb 14 '19 at 16:42
  • @tobias_k It is at least *some* improvement to remove the slow process of `append`, but without actual data I can't tell what the actual performance would be like. I profess I'm not knowledgeable enough to solve *O(n²)*, as it seems OP needs to compare each word against its following words, unless the requirement is actually different. – r.ook Feb 14 '19 at 17:03
  • @SrikarAppalaraju Agreed, but in cases where performance is important, like OP's, it might be a worthwhile trade-off. Otherwise we'd need to understand fully what OP is looking for; perhaps there is an `itertools` function that can help instead. – r.ook Feb 14 '19 at 17:05
  • I just tried this: using a very simple list of plain numbers, without any computation, a list comprehension is about 40% faster (for very long lists, less so for shorter lists). However, I still very much doubt that the overhead of calling `append` has any measurable effect when compared to the actual fuzzy matching, which is entirely unaffected by the list comprehension. – tobias_k Feb 15 '19 at 14:19
  • I'm not familiar with the `fuzzywuzzy` module at all, so I didn't address that part and just chimed in as I saw the `append` was an immediate improvement (nominal perhaps, but with that much data it *should* make a difference), and at the time it wasn't exactly clear what type of matching was going on. For what it's worth, I did give a +1 to your answer as I think it addresses OP's question better. – r.ook Feb 15 '19 at 14:32
  • Sorry if I am so insistent. Normally I am a big fan of list comprehensions. But again, the `append` part is just a tiny fraction of OP's code. We don't know how many of those 20,000^2 pairs are actually matches, but let's be generous and say there are 10^7 matches. Creating such a list with append takes 635ms on my system; using a list comprehension takes 370ms. That means that, all else being equal, using the list comprehension will improve OP's "really slow" code by ~260ms, which is not much. – tobias_k Feb 15 '19 at 15:42
  • I do appreciate a healthy discussion, it promotes better thinking, but I'm not sure what more I can contribute here. I'm not insisting that my answer solves OP's problem, just that I believe it *will* help the speed even if it might end up nominal. I'm a hobbyist so algorithms like these are a bit outside of my comfort zone, just trying to help where I can. If you felt my answer truly doesn't contribute to OP's question at all (or actually making the issue worse), then I'd happily delete the answer. As far as addressing the fuzzy matches, I do believe you have a better approach. – r.ook Feb 15 '19 at 15:56