
I have two lists. The first list contains values from the database; the second list contains a set of words from a file that came from the user.

list_with_name_in_BD = [
    'LOPEZ TRANSPORT', 
    'LOPEZ TRANSPORTATION', 
    'Lopez Transport', 
    'napora inc', 
    'Narain Transport Ltd'
]

The total number of items in this list is about 300,000.

list_with_words_in_file = [
    'Description', 'llc', 'Load', 'Rate', 'Amount', '$1950.00', 
    'Please', 'BBE', 'note', 'Lopez transport', 'LLC'
]

The total number of elements in this list is about 1,000.

My task is to find the most similar (or completely identical) elements between the two lists. As you can see, there are no identical elements between these two lists (although there may be in general), but there are similar ones. So, as output, I would like to get a list of the most similar (or identical) elements between the two lists. That is, the result should be as follows.

output = [
    'Lopez Transport', 
    'LOPEZ TRANSPORT', 
    'LOPEZ TRANSPORTATION'
]

I will be grateful for any help and advice you can give me.

Paul
  • This is not an easy task. Your example there can be solved by making everything lower case and looking for exact matches. That's not hard, although you'll still make 300,000,000 comparisons. But if there are spelling issues, you'll need the Levenshtein distance. There are Python packages for that. – Tim Roberts Aug 21 '23 at 20:47
  • @TimRoberts Yes, the task is difficult, which is why I turned to the community. I have already spent a lot of time trying to solve this problem, but so far to no avail. Yes, there are spelling problems, but there are times when a clear match can be found in the file from the user (but this is rare) – Paul Aug 21 '23 at 20:51
  • @TimRoberts But even if I use Levenshtein distance, I still have to go through two lists and these are still the same 300,000,000 comparisons, right? – Paul Aug 21 '23 at 21:00
  • Exactly. There's no way to avoid that. Thinking a bit outside the box, if the phrases are only one or two words, you could use a Soundex algorithm to convert the words to soundex tokens. You could then look those up directly in a dictionary. Soundex converts words to tokens so that common misspellings of a word result in the same token. – Tim Roberts Aug 21 '23 at 21:27
  • How similar are the most similar ones usually? – Kelly Bundy Aug 21 '23 at 21:29
  • @KellyBundy For example_1: max transport llc and mx transport llc and Max transport llc; For example_2: MYC EXPRESS CORP and MYC EXPRESS and myc express – Paul Aug 21 '23 at 21:54
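
The Levenshtein-style comparison discussed in the comments can be sketched with the standard library alone; `difflib.SequenceMatcher.ratio()` is not exactly Levenshtein distance, but it serves the same purpose of scoring near-matches. Note this does nothing to reduce the number of pairwise comparisons; it only illustrates the scoring step:

```python
# Sketch of the per-pair similarity scoring mentioned in the comments,
# using only the standard library.
from difflib import SequenceMatcher

def similarity(a, b):
    # case-insensitive similarity score in [0, 1]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Lopez transport", "LOPEZ TRANSPORT"))       # 1.0 after lowercasing
print(similarity("LOPEZ TRANSPORTATION", "Lopez transport"))  # high, but below 1.0
```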

2 Answers


rapidfuzz might be useful here. `process.extract` in particular can take a preprocessing function (in your case `str.lower`), a cutoff score, and a scorer as arguments.

See if the following brings some improvement over nested loops:

from rapidfuzz import process

list_with_name_in_BD = ["LOPEZ TRANSPORT", "LOPEZ TRANSPORTATION", "Lopez Transport", "napora inc", "Narain Transport Ltd"]
list_with_words_in_file = ['Description', 'Load', 'Rate', 'Amount', '$1950.00', 'Please', 'note', 'Lopez transport']

res = []
for e in list_with_words_in_file:
    # extract returns (match, score, index) tuples for every candidate
    # scoring at least score_cutoff, best matches first
    r = process.extract(e, list_with_name_in_BD, processor=str.lower, score_cutoff=80)
    res += r

print(res)

Output:

[('LOPEZ TRANSPORT', 100.0, 0), ('Lopez Transport', 100.0, 2), ('LOPEZ TRANSPORTATION', 85.71428571428572, 1)]
Tranbi
  • This is the right answer, I'll be curious to see what the performance is on his larger list of names. – Tim Roberts Aug 21 '23 at 21:27
  • @Tranbi Thank you for paying attention to my question. Give me time, I'll check it out and get back to you with feedback. – Paul Aug 21 '23 at 22:01
  • @Tranbi I am very grateful to you. I looked at the problem from a slightly different angle. But unfortunately, the code execution time is 55 seconds, which is, as you understand, a very long time. – Paul Aug 21 '23 at 22:18
  • @Tranbi i will try to increase score_cutoff and check the results – Paul Aug 21 '23 at 22:19
  • @Paul I can't tell if 55s is long, have you compared it to other methods? Why is it so critical that your program runs faster? How/When do you run it? Give us some context. If some entries are repeating, you might consider caching... – Tranbi Aug 22 '23 at 02:24
  • @Tranbi No, I did not compare with other solutions, since all my solutions were based on iterating over two lists, which is irrational and very long in this case. So far I've increased the score_cutoff to 95 and the execution time is about 10s, but it's still long. I start the program's countdown time at the moment the data is collected into the list from the database, but most of the program is occupied by the loop (in the code you presented). Yes, in the list_with_words_in_file list, some entries may be repeated, but I remove duplicates by turning it into a set – Paul Aug 22 '23 at 08:41
  • Try `processor=rapidfuzz.utils.default_process, scorer=rapidfuzz.fuzz.ratio`. Also, you can make this use multiprocessing. – ken Aug 22 '23 at 10:49
  • @ken Yes thank you. scorer=rapidfuzz.fuzz.ratio slightly improved the running time of the algorithm – Paul Aug 24 '23 at 10:37

Here's a way to do the search with soundex. Note that I have included a simple soundex conversion here from Is there a soundex function for python?, but there are several Python modules that can do it (soundex, fuzzy, jellyfish).

This has only one loop; the other loop becomes the "in" operator searching for a match in the set.

list_with_name_in_BD = [
    'LOPEZ TRANSPORT', 
    'LOPEZ TRANSPORTATION', 
    'Lopez Transport', 
    'napora inc', 
    'Narain Transport Ltd'
]

list_with_words_in_file = [
    'Description', 'Load', 'Rate', 'Amount', '$1950.00', 
    'Please', 'note', 'Lopez transport'
]

import unicodedata
def soundex(s):

    if not s:
        return ""

    s = unicodedata.normalize("NFKD", s)
    s = s.upper()

    replacements = (
        ("BFPV", "1"),
        ("CGJKQSXZ", "2"),
        ("DT", "3"),
        ("L", "4"),
        ("MN", "5"),
        ("R", "6"),
    )
    result = [s[0]]
    count = 1

    # find the would-be replacement for the first character
    for lset, sub in replacements:
        if s[0] in lset:
            last = sub
            break
    else:
        last = None

    for letter in s[1:]:
        for lset, sub in replacements:
            if letter in lset:
                if sub != last:
                    result.append(sub)
                    count += 1
                last = sub
                break
        else:
            if letter != "H" and letter != "W":
                # leave last alone if middle letter is H or W
                last = None
        if count == 4:
            break

    result += "0" * (4 - count)
    return "".join(result)

def soundexphrase(s):
    words = s.split()
    return ''.join(soundex(w) for w in words)

# Encode the names to search for:
keywords = set(map(soundexphrase, list_with_name_in_BD))
print(keywords)

for possible in list_with_words_in_file:
    s = soundexphrase(possible)
    print(f"{possible} encoded as {s} found: {s in keywords}")

Output:

{'N650T652L300', 'L120T652', 'N160I520'}
Description encoded as D261 found: False
Load encoded as L300 found: False
Rate encoded as R300 found: False
Amount encoded as A553 found: False
$1950.00 encoded as $000 found: False
Please encoded as P420 found: False
note encoded as N300 found: False
Lopez transport encoded as L120T652 found: True
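
If you want the matching candidate strings back rather than just True/False, the set can be replaced with a dict mapping each key to the original DB names. A sketch of that pattern, using a simple lowercase key as a stand-in for the soundexphrase function above (any normalizing or phonetic key function fits the same shape):

```python
from collections import defaultdict

def phrase_key(s):
    # stand-in for soundexphrase(); swap in the real phonetic key here
    return s.lower()

names_db = ['LOPEZ TRANSPORT', 'LOPEZ TRANSPORTATION', 'Lopez Transport']

# one pass over the big list builds the index
index = defaultdict(list)
for name in names_db:
    index[phrase_key(name)].append(name)

# each file word is then a single dict lookup
for word in ['Lopez transport', 'note']:
    matches = index.get(phrase_key(word), [])
    print(word, "->", matches)
```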
Tim Roberts
  • Thanks for the provided answer. Yes, this algorithm is very fast, but very inaccurate: it fails to find some words from list_with_words_in_file in list_with_name_in_BD. Also, the words in the output list are in a chaotic order, and I would like them ordered from the best match down. – Paul Aug 24 '23 at 21:33
  • The words in the output list are not chaotic, they are in exactly the same order as the input list. This algorithm does not do "match quality". It either matches or it doesn't. If it doesn't fit your need, then it doesn't fit your need. It was just a speedy alternative. – Tim Roberts Aug 24 '23 at 21:42
  • In any case, thank you; at least I learned an alternative approach. – Paul Aug 25 '23 at 07:05