4

I have a very large dictionary that stores a large number of English sentences and their Spanish translations. Given a random English sentence, I intend to use Python's fuzzywuzzy library to find its closest match in the dictionary. My code:

from fuzzywuzzy import process

sentencePairs = {'How are you?': '¿Cómo estás?', 'Good morning!': '¡Buenos días!'}
query = 'How old are you?'
match = process.extractOne(query, sentencePairs.keys())[0]
print(match, sentencePairs[match], sep='\n')

In a real-life scenario, the sentencePairs dictionary would be very large, with at least one million items, so it takes a long time to get the result with fuzzywuzzy, even with python-Levenshtein installed for speedup. Is there a better way to achieve better performance? My goal is to get the result in less than a few seconds, or even in real time.

wbzy00

2 Answers

6

Ways to improve the performance

Fuzzy Matching using the Levenshtein Distance will never be super fast, but there are a couple of things in your code you can optimise:

  1. When passing a string and a list to process.extractOne, it preprocesses these strings by lowercasing them, removing non-alphanumeric characters and trimming whitespace. Since you're reusing the same English:Spanish mapping each time, you should do this preprocessing once ahead of time.

  2. Even when using python-Levenshtein, FuzzyWuzzy is not really optimised in a lot of places. You should replace it with RapidFuzz, which implements the same algorithms with a similar interface but is mostly implemented in C++ and comes with some additional algorithmic improvements, making it a lot faster.

  3. Internally, process.extractOne uses fuzz.WRatio by default to compare the strings, which is a combination of multiple string matching algorithms. Selecting a faster algorithm by passing e.g. scorer=fuzz.ratio to process.extractOne improves the performance. However, keep in mind that this changes the way your strings are compared, so depending on your data you might not want to do this.
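Point 1 can be sketched as follows. Note that `default_process_approx` here is a hand-rolled approximation of what RapidFuzz's `utils.default_process` does (replace non-alphanumeric characters with whitespace, lowercase, trim), not the library function itself; the idea is to process the English keys once and keep a mapping back to the original pairs:

```python
import re

def default_process_approx(s):
    # rough approximation of utils.default_process:
    # replace non-alphanumeric characters with spaces, lowercase, trim
    return re.sub(r'[\W_]', ' ', s).lower().strip()

sentencePairs = {'How are you?': '¿Cómo estás?', 'Good morning!': '¡Buenos días!'}

# preprocess every English key once, keeping a link back to the original pair
processed = {default_process_approx(en): (en, es)
             for en, es in sentencePairs.items()}
```

The preprocessed keys can then be passed to the matcher on every query without paying the preprocessing cost again.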

Implementation making use of 1 and 2

from rapidfuzz import process, utils

# English sentences are already lowercased
# and without special characters like question marks
sentencePairs = {'how are you': '¿Cómo estás?', 'good morning': '¡Buenos días!'}
query = 'How old are you?'
match, _ = process.extractOne(
    utils.default_process(query),
    sentencePairs.keys(),
    processor=None)
print(match, sentencePairs[match], sep='\n')

Implementation making use of 1, 2 and 3

from rapidfuzz import process, utils, fuzz

# English sentences are already lowercased
# and without special characters like question marks
sentencePairs = {'how are you': '¿Cómo estás?', 'good morning': '¡Buenos días!'}
query = 'How old are you?'
match, _ = process.extractOne(
    utils.default_process(query),
    sentencePairs.keys(),
    processor=None,
    scorer=fuzz.ratio)
print(match, sentencePairs[match], sep='\n')

Benchmarks

To provide some time comparisons, I generated a million sentences:

import string
import random

random.seed(18)
sentencePairs = {
    ''.join(random.choice(string.ascii_lowercase + string.digits)
            for _ in range(15)): "spanish text"
    for _ in range(1000000)
}
query = 'How old are you?'
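The runtimes can be measured with a pattern like the following (a stdlib-only sketch; `match_fn` is a placeholder standing in for any of the implementations above):

```python
import time

def time_it(match_fn, query):
    # run the matcher once and measure wall-clock time
    start = time.perf_counter()
    result = match_fn(query)
    elapsed = time.perf_counter() - start
    return result, elapsed

# trivial stand-in matcher just to show the pattern
result, elapsed = time_it(str.lower, 'How old are you?')
```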

The following table shows how long the different solutions take on my computer:

| Implementation                           | Runtime        |
|------------------------------------------|----------------|
| Your current implementation              | 18.98 seconds  |
| Implementation making use of 1 and 2     | 1.4 seconds    |
| Implementation making use of 1, 2 and 3  | 0.4 seconds    |
maxbachmann
  • I don't find `fuzz.ratio` or `fuzz.WRatio` in your code. Do you mean using them to loop over all `sentencePairs` keys instead of `process.extractOne()`, and that it takes 0.4 seconds when using `fuzz.ratio`? – wbzy00 Sep 15 '20 at 02:24
  • process.extractOne supports a scorer parameter to change the string matching algorithm it uses. I updated the answer to make this clearer. – maxbachmann Sep 15 '20 at 05:47
  • Regarding the `utils.default_process` and `processor`, can I just use `match, _ = process.extractOne(query, sentencePairs.keys(), scorer=fuzz.ratio)` instead of `match, _ = process.extractOne(utils.default_process(query), sentencePairs.keys(), processor=None, scorer=fuzz.ratio)`? There seems to be no difference between these two lines of code, and the execution times are basically the same based on my testing. I also checked https://github.com/maxbachmann/rapidfuzz/blob/master/docs/usage/process.md, still can't find any difference. – wbzy00 Sep 15 '20 at 12:20
  • 1
    utils.default_process is preprocessing your string (lowercases it, removes non-alphanumeric characters like e.g. question marks, and trims whitespace). So when you do not need/want this you can just leave it out (e.g. the string " How old are you?" is transformed to "how old are you") – maxbachmann Sep 15 '20 at 13:18
2

There could be a better solution, but off the top of my head I can think of partitioning.

You can create 26 different dictionaries, one for each letter of the English alphabet, and load each with all the keys that start with the corresponding letter, e.g. adict, bdict, ... zdict. So hdict would contain the key-value pairs whose keys start with h, like key = "how are you?".

In this way, you only need to query the dictionary that matches the query's starting letter.
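A minimal sketch of this partitioning, assuming lowercase English keys (a `defaultdict` stands in for the 26 named dictionaries):

```python
from collections import defaultdict

sentencePairs = {'how are you?': '¿Cómo estás?', 'good morning!': '¡Buenos días!'}

# bucket the pairs by the first letter of the English key
partitions = defaultdict(dict)
for en, es in sentencePairs.items():
    partitions[en[0]][en] = es

# a query then only needs to search the bucket for its first letter
query = 'how old are you?'
candidates = partitions[query[0].lower()]
```

The fuzzy matcher would then run over `candidates` instead of the full dictionary, at the cost of missing matches whose first character differs.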

  • 1
    With this solution, you lose the "fuzzy" search on the first character. It's still a valid option, though. – Pac0 Sep 14 '20 at 15:27