
I'm currently using the get_close_matches function from difflib to iterate through a list of 15,000 strings, finding the closest match for each against another list of approximately 15,000 strings:

import difflib

a = ['blah', 'pie', 'apple' ...]
b = ['jimbo', 'zomg', 'pie' ...]

for value in a:
    difflib.get_close_matches(value, b, n=1, cutoff=0.85)

It takes 0.58 seconds per value, which means the loop will take about 8,714 seconds, or 145 minutes, to finish. Is there another library/method that might be faster, or a way to improve the speed of this method? I've already tried converting both lists to lower case, but it only resulted in a slight speed increase.

– ChrisArmstrong

6 Answers


fuzzyset indexes strings by their bigrams and trigrams, so it finds approximate matches in O(log(N)) versus O(N) for difflib. For my fuzzyset of 1M+ words and word pairs it can compute the index in about 20 seconds and find the closest match in less than 100 ms.
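
For illustration, a minimal sketch of how this could look for the question's two lists (the list contents and the 0.85 cutoff come from the question; the rest is just the basic fuzzyset API, which the maintained fork fuzzyset2 keeps unchanged):

import fuzzyset

a = ['blah', 'pie', 'apple']   # stand-ins for the real 15,000-element lists
b = ['jimbo', 'zomg', 'pie']

fs = fuzzyset.FuzzySet(b)      # builds the bigram/trigram index once, up front

for value in a:
    # .get() returns a list of (score, match) tuples, or None if nothing matches
    result = fs.get(value)
    if result and result[0][0] >= 0.85:
        score, match = result[0]
        print(value, '->', match, score)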

– hobs
  • Hi @hobs, thanks for pointing this out! `fuzzyset` is a great package but the documentation is very thin. How do you know that the performance is in `O(log(N))`? Can you point me to some papers related to the algo? – ℕʘʘḆḽḘ Jul 27 '16 at 11:23
  • @ℕʘʘḆḽḘ the docs page on pypi is pretty awesome now. They even show how they break the string up to create the trigram reverse index. Lookups on a properly implemented reverse index are never slower than `O(log(N))` -- but N is the # of trigrams, not strings, in this case. – hobs Jan 16 '19 at 17:19

RapidFuzz is a very fast library for fuzzy string matching. It has the same API as the well-known fuzzywuzzy, but is many times faster and MIT licensed.
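
As a hedged sketch, the question's loop rewritten with RapidFuzz's process module might look like this (note that fuzz.ratio scores on a 0-100 scale, so 85 stands in for difflib's 0.85 cutoff; the lists are placeholders):

from rapidfuzz import process, fuzz

a = ['blah', 'pie', 'apple']   # placeholders for the real lists
b = ['jimbo', 'zomg', 'pie']

for value in a:
    # extractOne returns (match, score, index), or None below the cutoff
    result = process.extractOne(value, b, scorer=fuzz.ratio, score_cutoff=85)
    if result is not None:
        match, score, _ = result
        print(value, '->', match, score)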

– Alexey Trofimov

Perhaps you can build an index of the trigrams (three consecutive letters) that appear in each list. Only check strings in a against strings in b that share a trigram.
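
A rough sketch of that idea in plain Python, using difflib only for the final scoring pass (the helper names here are illustrative, not from any library):

import difflib
from collections import defaultdict

def trigrams(s):
    # all three-letter windows of the lowercased string
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

a = ['blah', 'pie', 'apple']   # placeholders for the real lists
b = ['jimbo', 'zomg', 'pie']

# index b: trigram -> set of strings in b containing it
index = defaultdict(set)
for s in b:
    for t in trigrams(s):
        index[t].add(s)

for value in a:
    # only score the candidates that share at least one trigram with value
    candidates = set()
    for t in trigrams(value):
        candidates |= index.get(t, set())
    best = difflib.get_close_matches(value, candidates, n=1, cutoff=0.85)
    if best:
        print(value, '->', best[0])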

You might want to look at the BLAST bioinformatics tool; it does approximate sequence alignments against a sequence database.

– tmyklebu

Benchmarks in 2022

tl;dr: RapidFuzz was fastest.

Test: pick the best string match from 1,000,000 elements. Tested on my old i7 notebook with 32 GB RAM.

Best to worst:

  1. RapidFuzz (drop-in replacement for TheFuzz): ~20ms
  2. fuzzyset2: ~320ms
  3. TheFuzz (ex fuzzywuzzy): ~7s
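
The answer doesn't include the benchmark code; its general shape, with a synthetic million-element list standing in for the real data, might look roughly like this:

import time
from rapidfuzz import process, fuzz

choices = [f'item-{i}' for i in range(1_000_000)]   # synthetic stand-in data

start = time.perf_counter()
best = process.extractOne('item-123456', choices, scorer=fuzz.ratio)
elapsed_ms = (time.perf_counter() - start) * 1000
print(best, f'{elapsed_ms:.0f} ms')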
– do-me

Try this

https://code.google.com/p/pylevenshtein/

The Levenshtein Python C extension module contains functions for fast computation of:

- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity

It supports both normal and Unicode strings.
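
A minimal sketch of those functions in use (the sample strings are arbitrary; the module is importable as Levenshtein once the C extension is installed):

import Levenshtein

# edit distance and a 0-1 similarity ratio between two strings
print(Levenshtein.distance('apple', 'appel'))   # 2
print(Levenshtein.ratio('apple', 'appel'))      # 0.8

# approximate "average" of a sequence of strings
print(Levenshtein.median(['apple', 'appel', 'aple']))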

– Igglyboo

I tried a few methods for fuzzy matching. The best one was cosine similarity, with the threshold set to suit your needs (I kept an 80% fuzzy-match cutoff).
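
The answer doesn't show code; one common way to set this up is TF-IDF over character trigrams with scikit-learn, so the sketch below rests on that assumption (the lists are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a = ['blah', 'pie', 'apple']   # placeholders for the real lists
b = ['jimbo', 'zomg', 'pie']

# vectorize both lists into one shared character-trigram vocabulary
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
vectorizer.fit(a + b)
sims = cosine_similarity(vectorizer.transform(a), vectorizer.transform(b))

# sims[i, j] is the cosine similarity between a[i] and b[j]
for i, value in enumerate(a):
    j = sims[i].argmax()
    if sims[i, j] >= 0.8:   # the answer's 80% threshold
        print(value, '->', b[j], sims[i, j])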

– Shalini Baranwal