I've been trying to find the closest match to a string in a list of strings.

I've used the `difflib` module (https://docs.python.org/3/library/difflib.html),

but the results are not always what I expect.

Example:

import difflib

words_list = ['sprite','coke','lemon sparkling water']

difflib.get_close_matches('watter',words_list)

result:

[]

and I want the result to be:

['lemon sparkling water']

If the list were:

words_list = ['sprite','coke','lemon sparkling water','water']

the query would work.

How can I make it work without "water" being the first word in the string?

thanks

tumir
  • You might have to define what you mean by "closest match". `difflib` probably doesn't consider that to be a "close match" because most of the string doesn't match. For your particular example you could do `[w for w in words_list if 'water' in w]`, but that wouldn't work if the exact word "water" wasn't in the string. – Samwise Jun 15 '22 at 18:37
  • Maybe you could try something like `[s for s in words_list if difflib.get_close_matches('water', s.split())]`? – Samwise Jun 15 '22 at 18:40
  • Maybe a better approach is to create bigrams? Or something similar with the cosine similarity? You need vector embeddings of your strings in these cases, though. – Robert Jun 15 '22 at 18:49
  • You could instead make your own function for closest match by the number of characters that match divided by total characters present. – Raunak Jain Jun 15 '22 at 18:56
  • Why does `words_list` contain sentences consisting of multiple words? – ddejohn Jun 15 '22 at 19:59
  • Just bad naming. Call it a products list; I'm searching for the most similar product. – tumir Jun 15 '22 at 20:14
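
A minimal sketch of the word-by-word idea raised in the comments: score each product by its best-matching word instead of by the whole string. The helper name `best_word_score` is made up for illustration, and splitting on whitespace is an assumption about what counts as a word.

```python
import difflib

def best_word_score(query, candidate):
    """Return the highest SequenceMatcher ratio between query and any word in candidate."""
    # Fall back to the whole string if splitting yields no words (e.g. empty input).
    words = candidate.split() or [candidate]
    return max(difflib.SequenceMatcher(None, word, query).ratio() for word in words)

products = ['sprite', 'coke', 'lemon sparkling water']

# Pick the product whose best word is most similar to the misspelled query.
best = max(products, key=lambda p: best_word_score('watter', p))
print(best)  # lemon sparkling water
```

Here "water" is compared with "watter" on its own, so the long product name no longer dilutes the score. The trade-off is that multi-word queries would need extra handling.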

2 Answers

Per the docs, you can pass a lower `cutoff` (the default is 0.6) to relax the similarity threshold:

import difflib

words_list = ['sprite','coke','lemon sparkling water']
print(difflib.get_close_matches('watter',words_list,cutoff=.35))

Output:

['lemon sparkling water']
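
To see why 0.35 works here, the ratios can be printed directly. This is a small illustrative sketch rather than part of the original answer; it passes the candidate as the first sequence and the query as the second, which is the order CPython's `get_close_matches` uses internally.

```python
import difflib

words_list = ['sprite', 'coke', 'lemon sparkling water']

# Similarity of each candidate to the query, as a float in [0, 1].
scores = {
    w: difflib.SequenceMatcher(None, w, 'watter').ratio()
    for w in words_list
}
for w, score in scores.items():
    print(f'{w!r}: {score:.3f}')
# 'sprite': 0.333
# 'coke': 0.200
# 'lemon sparkling water': 0.370
```

Only the last entry clears 0.35, and none reaches the default of 0.6, which is why the original call returned an empty list. The margin over 'sprite' is small, so a cutoff tuned this way may let unrelated strings through on other data.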
Mark Tolonen

Use `difflib.SequenceMatcher.ratio` as the `key` for the `max` function. To make that possible, subclass `difflib.SequenceMatcher` and give it a `__call__()` method.

import difflib

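class SM(difflib.SequenceMatcher):
    """Callable matcher: compares a fixed string against each string it is called with."""
    def __init__(self, a):
        super().__init__(a=a)  # the known string becomes sequence 1
    def __call__(self, b):
        self.set_seq2(b)  # the candidate becomes sequence 2
        return self.ratio()  # similarity as a float in [0, 1]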

Subclass instances are made with the known string. A second string, to be matched, must be passed when calling an instance. Because the instance is callable it can be used as max's key argument.

words_list = ['sprite','coke','lemon sparkling water']

water = SM('water')
best = max(words_list, key=water)

Caveat: `max` always returns an item, so you have to accept whatever difflib's similarity measure ranks highest.
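
The same ranking can be done without a subclass. The sketch below is one possible variant, not the answer's code: `closest` and `min_ratio` are hypothetical names, and the threshold is an addition so that a weak best match can be rejected. It sets the query as the second sequence once, since the `difflib` docs note that `SequenceMatcher` caches information about the second sequence.

```python
import difflib

def closest(query, candidates, min_ratio=0.0):
    """Return (best_candidate, ratio), or (None, 0.0) if nothing reaches min_ratio."""
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(query)  # analysed once, reused for every candidate
    best, best_ratio = None, 0.0
    for candidate in candidates:
        matcher.set_seq1(candidate)
        ratio = matcher.ratio()
        # Keep the candidate only if it passes the threshold and beats the current best.
        if ratio >= min_ratio and ratio > best_ratio:
            best, best_ratio = candidate, ratio
    return best, best_ratio

words_list = ['sprite', 'coke', 'lemon sparkling water']

print(closest('water', words_list))                 # best match with its ratio
print(closest('water', words_list, min_ratio=0.6))  # (None, 0.0)
```

With no threshold, 'lemon sparkling water' wins, but only narrowly over 'sprite' (roughly 0.38 versus 0.36), which illustrates the caveat above.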

wwii