Ways to improve the performance
Fuzzy matching using the Levenshtein distance will never be super fast, but there are a few things in your code you can optimise:
1. When you pass a string and a list to `process.extractOne`, it preprocesses these strings by lowercasing them, removing non-alphanumeric characters and trimming whitespace. Since you're reusing the same English:Spanish mapping each time, you should do this preprocessing once ahead of time (see the sketch after this list).
2. Even with python-Levenshtein installed, FuzzyWuzzy is not well optimised in a lot of places. You should replace it with RapidFuzz, which implements the same algorithms with a similar interface, but is mostly implemented in C++ and comes with some additional algorithmic improvements that make it a lot faster.
3. Internally `process.extractOne` uses `fuzz.WRatio` to compare the strings by default, which is a combination of multiple string matching algorithms. Selecting a faster algorithm by passing e.g. `scorer=fuzz.ratio` to `process.extractOne` therefore improves the performance. However, keep in mind that this changes the way your strings are compared, so depending on your data you might not want to do this.
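For point 1, a minimal sketch of what the one-time preprocessing could look like. The unprocessed `rawPairs` mapping is made up for illustration; `utils.default_process` is the preprocessing function RapidFuzz applies by default:

    from rapidfuzz import utils

    # hypothetical, still unprocessed English:Spanish mapping
    rawPairs = {'How are you?': '¿Cómo estás?', 'Good morning!': '¡Buenos días!'}

    # apply the default preprocessing (lowercase, strip non-alphanumeric
    # characters, trim whitespace) once, ahead of time
    sentencePairs = {utils.default_process(en): es for en, es in rawPairs.items()}
    # -> {'how are you': '¿Cómo estás?', 'good morning': '¡Buenos días!'}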
Implementation making use of 1 and 2
    from rapidfuzz import process, utils

    # English sentences are already lower cased
    # and without special characters like question marks
    sentencePairs = {'how are you': '¿Cómo estás?', 'good morning': '¡Buenos días!'}
    query = 'How old are you?'

    # extractOne returns the best match and its score
    # (newer RapidFuzz versions additionally return an index)
    match, *_ = process.extractOne(
        utils.default_process(query),
        sentencePairs.keys(),
        processor=None)
    print(match, sentencePairs[match], sep='\n')
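Passing `processor=None` matters here: the dictionary keys are assumed to be preprocessed already and the query is processed once by hand with `utils.default_process`, so letting `extractOne` preprocess both sides again would just repeat work on every call.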
Implementation making use of 1, 2 and 3
    from rapidfuzz import process, utils, fuzz

    # English sentences are already lower cased
    # and without special characters like question marks
    sentencePairs = {'how are you': '¿Cómo estás?', 'good morning': '¡Buenos días!'}
    query = 'How old are you?'

    # fuzz.ratio is faster than the default fuzz.WRatio,
    # but compares the strings in a different way
    match, *_ = process.extractOne(
        utils.default_process(query),
        sentencePairs.keys(),
        processor=None,
        scorer=fuzz.ratio)
    print(match, sentencePairs[match], sep='\n')
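One further knob that is not used in the benchmarks below: `process.extractOne` also accepts a `score_cutoff` parameter. Matches scoring below the cutoff are discarded (`extractOne` then returns `None`), and the cutoff additionally allows RapidFuzz to abort hopeless comparisons early. A sketch, reusing `sentencePairs` and `query` from the snippet above; the cutoff value 70 is arbitrary:

    from rapidfuzz import process, utils, fuzz

    result = process.extractOne(
        utils.default_process(query),
        sentencePairs.keys(),
        processor=None,
        scorer=fuzz.ratio,
        score_cutoff=70)  # ignore matches scoring below 70

    if result is not None:
        match = result[0]
        print(match, sentencePairs[match], sep='\n')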
Benchmarks
To provide some timing comparisons, I generated a million sentences:
    import string
    import random

    random.seed(18)

    # one million random 15 character strings as English "sentences"
    sentencePairs = {
        ''.join(random.choice(string.ascii_lowercase + string.digits)
                for _ in range(15)): "spanish text"
        for _ in range(1000000)
    }
    query = 'How old are you?'
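For reference, a minimal sketch of how such timings can be taken; it simply brackets a single query against the million keys with `time.perf_counter`, using the fastest variant from above:

    import time
    from rapidfuzz import process, utils, fuzz

    start = time.perf_counter()
    match, *_ = process.extractOne(
        utils.default_process(query),
        sentencePairs.keys(),
        processor=None,
        scorer=fuzz.ratio)
    print(f'{time.perf_counter() - start:.2f} seconds for one query')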
The following table shows how long the different solutions take on my computer:
| Implementation | Runtime |
|------------------------------------------|----------------|
| Your current implementation | 18.98 seconds |
| Implementation making use of 1 and 2 | 1.4 seconds |
| Implementation making use of 1, 2 and 3 | 0.4 seconds |