Finding best matching string from a list of strings

Question

I have been trying to find the closest match to a string in a list of strings.

I've used the `difflib` module (https://docs.python.org/3/library/difflib.html),

but the results are not always as expected.

Example:

import difflib

words_list = ['sprite','coke','lemon sparkling water']

difflib.get_close_matches('watter',words_list)

result:

[]

and I want the result to be:

['lemon sparkling water']

If the list were:

words_list = ['sprite','coke','lemon sparkling water','water']

the query would work.

How can I make it work without "water" being the first word in the string?

thanks

You might have to define what you mean by "closest match". `difflib` probably doesn't consider that to be a "close match" because most of the string doesn't match. For your particular example you could do `[w for w in words_list if 'water' in w]`, but that wouldn't work if the exact word "water" wasn't in the string. — Samwise, Jun 15 '22 at 18:37
Maybe you could try something like `[s for s in words_list if difflib.get_close_matches('water', s.split())]`? — Samwise, Jun 15 '22 at 18:40
Maybe a better approach is to create bigrams? Or something similar with the cosine similarity? You need vector embeddings of your strings in these cases, though. — Robert, Jun 15 '22 at 18:49
You could instead make your own function for closest match by the number of characters that match divided by total characters present. — Raunak Jain, Jun 15 '22 at 18:56
Why does `words_list` contain sentences consisting of multiple words? — ddejohn, Jun 15 '22 at 19:59
Just bad naming. Call it a products list; I'm searching for the most similar product. — tumir, Jun 15 '22 at 20:14
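The word-level idea from the comments can be sketched roughly as follows. This is only an illustration, not something posted in the thread: `best_word_score` is a hypothetical helper name, and splitting on whitespace is an assumption about what counts as a word. Each product is scored by its best-matching single word, so a long product name is not penalized for the words that do not match.

```python
import difflib

def best_word_score(query, candidate):
    """Highest SequenceMatcher ratio between the query and any single word of the candidate."""
    return max(
        (difflib.SequenceMatcher(None, query, word).ratio() for word in candidate.split()),
        default=0.0,  # a candidate with no words at all scores zero
    )

products = ['sprite', 'coke', 'lemon sparkling water']

# Pick the product whose best single word is most similar to the query.
best = max(products, key=lambda p: best_word_score('watter', p))
print(best)
```

The trade-off is that a product now matches as soon as one of its words matches, which may or may not be what "most similar product" should mean for a given catalogue.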

score 3 · Answer 1 · answered Jun 15 '22 at 18:57

Per the docs, you can set the cutoff value to lower the standards for comparison:

import difflib

words_list = ['sprite','coke','lemon sparkling water']
print(difflib.get_close_matches('watter',words_list,cutoff=.35))

Output:

['lemon sparkling water']
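To see why the default call returns nothing, it can help to print the similarity score for each candidate. The sketch below assumes that `get_close_matches` ranks candidates by `SequenceMatcher.ratio()` and drops anything below `cutoff` (documented default 0.6); the argument order used here is meant to mirror what the library does internally, which is an implementation detail rather than a documented guarantee.

```python
import difflib

words_list = ['sprite', 'coke', 'lemon sparkling water']
query = 'watter'

# Score every candidate against the query (candidate first, query second).
scores = {w: difflib.SequenceMatcher(None, w, query).ratio() for w in words_list}

# Print the candidates from most to least similar.
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{score:.3f}  {word}')
```

With this list, the score for 'lemon sparkling water' falls between 0.35 and 0.6, which is why it only appears once the cutoff is lowered.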

Mark Tolonen

Extra arguments `, 1, 0` seem closer to "closest match" to me. – Kelly Bundy Jun 15 '22 at 18:59
I think the cutoff is a bit problematic. It solves this exact problem, but if I query 'watef' the result (with cutoff 0.35) will be ['sprite']. – tumir Jun 15 '22 at 19:01
@KellyBundy Depends if you want `goobledegook` to return `coke`. – Mark Tolonen Jun 15 '22 at 19:03
@tumir It's an algorithm. The `difflib` algorithm scores `sprite` higher. Use `n=len(words_list), cutoff=0` to get a ranked list of all the words according to the algorithm. – Mark Tolonen Jun 15 '22 at 19:05
Well, if `coke` is closest, then it's closest :-) – Kelly Bundy Jun 15 '22 at 19:08
I understand it's an algorithm and those are the results. I've been wondering if someone could suggest different methods, like Levenshtein distance or fuzzy matching. – tumir Jun 15 '22 at 19:21
@tumir That wasn't your question and library recommendations are off-topic for SO. – Mark Tolonen Jun 15 '22 at 19:49
I wasn't looking for library recommendations; I was looking for algorithms and methods to address the problem. – tumir Jun 15 '22 at 20:11
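For the Levenshtein distance mentioned in the comments, a minimal pure-Python sketch might look like this. It is not from the thread: `levenshtein` and `closest_product` are hypothetical names, and comparing the query against each word of a product (rather than the whole string) is an assumption chosen so that 'watef' can reach 'water' inside a longer product name.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]

def closest_product(query, products):
    """Hypothetical helper: the product whose nearest word has the smallest edit distance."""
    return min(
        products,
        key=lambda p: min((levenshtein(query, w) for w in p.split()), default=len(query)),
    )

products = ['sprite', 'coke', 'lemon sparkling water']
print(closest_product('watef', products))
```

Unlike a ratio with a cutoff, this always returns some product, so a maximum acceptable distance would still be needed to reject queries that match nothing well.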

score 0 · Answer 2 · answered Jun 15 '22 at 19:52

Use `difflib.SequenceMatcher.ratio` to build the `key` argument for the `max` function. To facilitate this, make a subclass of `difflib.SequenceMatcher` with a `__call__()` method.

import difflib

class SM(difflib.SequenceMatcher):
    def __init__(self,a):
        super().__init__(a=a)
    def __call__(self,b):
        self.set_seq2(b)
        return self.ratio()

Subclass instances are made with the known string. A second string, to be matched, must be passed when calling an instance. Because the instance is callable it can be used as max's key argument.

words_list = ['sprite','coke','lemon sparkling water']

water = SM('water')
best = max(words_list, key=water)

Caveat - you have to accept the result of the difflib measure of the sequences’ similarity.
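If subclassing feels heavy, roughly the same effect can be had with a small factory function that returns the key. This is an alternative sketch, not the answer's code; `similarity_to` is a hypothetical name, and it builds a fresh `SequenceMatcher` per comparison, which is simpler but does no caching.

```python
import difflib

words_list = ['sprite', 'coke', 'lemon sparkling water']

def similarity_to(query):
    """Return a key function that scores a candidate against the query."""
    return lambda candidate: difflib.SequenceMatcher(None, query, candidate).ratio()

# max() calls the key once per candidate and keeps the highest-scoring one.
best = max(words_list, key=similarity_to('water'))
print(best)
```

The same caveat applies: the winner is whatever the `difflib` ratio says is most similar, and with short queries the margin between candidates can be small.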
