
I have two lists:

The first list comes from the database and contains the names of various companies (they can be written in uppercase, lowercase, or a mix):

list_from_DB = ["Reebok", "MAZDA", "PATROL", "AsbEngland-bank", "Mazda INCC", "HIGHWAY lcc", "mazda", "015 Amazon", ......]

There are about 400,000 items in this list.


The second list I get by parsing text from the user (it can contain absolutely any words, numbers, signs, etc.):

list_from_user = ['ts', '$243', 'mazda', 'OFFICERS', 'SIGNATURE', 'Date:07/20/2022', 'Wilson', 'Bank', .......]

There are about 1000 items in this list.


What I need to do is find which items from list_from_user are in list_from_DB and display them in order of greatest similarity. As the example below shows, items in the two lists may be identical, or they may differ in spelling.

Output

["mazda", "MAZDA", "Mazda INCC", "AsbEngland-bank"]


What I have tried: yes, I know about fuzzy string matching libraries; I use rapidfuzz.

res = []
for e in list_from_user:
    # extract_iter lazily yields (match, score, index) tuples above the cutoff
    matches = rapidfuzz.process.extract_iter(
        e,
        list_from_DB,
        processor=str.lower,
        scorer=rapidfuzz.fuzz.ratio,
        score_cutoff=95,
    )
    res += matches

Yes, this works, but it is very slow — about 30 seconds — since the loop must perform 1000 * 400,000 = 400,000,000 comparisons.

Therefore, the question is: can this problem be solved without comparing every pair, in some other way? (I am not against full enumeration, as long as it fits in the time budget.)

My time target is 3 seconds max.
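For the items that match exactly or differ only in case, enumeration can be avoided entirely with a dictionary built once from the DB list; only the remaining tokens would still need rapidfuzz. A minimal sketch using the sample lists above (`by_lower` and `fuzzy_leftovers` are illustrative names, not part of any library):

```python
from collections import defaultdict

list_from_DB = ["Reebok", "MAZDA", "PATROL", "AsbEngland-bank",
                "Mazda INCC", "HIGHWAY lcc", "mazda", "015 Amazon"]
list_from_user = ["ts", "$243", "mazda", "OFFICERS", "SIGNATURE",
                  "Date:07/20/2022", "Wilson", "Bank"]

# Build once: lowercased name -> every original spelling in the DB list.
by_lower = defaultdict(list)
for name in list_from_DB:
    by_lower[name.lower()].append(name)

# O(1) per user token instead of scanning 400,000 DB entries.
exact_hits = []
fuzzy_leftovers = []
for token in list_from_user:
    hits = by_lower.get(token.lower())
    if hits:
        exact_hits.extend(hits)
    else:
        fuzzy_leftovers.append(token)  # only these still need fuzzy matching

print(exact_hits)  # -> ['MAZDA', 'mazda']
```

This handles the first two tiers (identical, then case-insensitive) in negligible time; the fuzzy tier is what the rest of the discussion is about.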

  • What is your db backend? E.g. postgres has a trigram similarity function. Then you could hand over the fuzzy search to the db which provides the results ordered by similarity. – user2390182 Aug 24 '23 at 13:44
  • 1
    What exactly is *your* criteria for similar strings? The question shows using a specific library with specific settings, but since you don’t want to use that - what are the criteria for an alternative? – MisterMiyagi Aug 24 '23 at 13:47
  • Another thing to explore is the [burkhard-keller-tree](https://en.wikipedia.org/wiki/BK-tree). You can create one from the db data and reuse it for every query term by the user. – user2390182 Aug 24 '23 at 13:48
  • @user2390182 Yes, we are using postgres. And I know about trigrams. But I can't understand your train of thought. It will be even longer if I transfer each item from list_from_user to the database to determine trigrams. Maybe you mean something else, please explain – Paul Aug 24 '23 at 13:49
  • Yeah, 1000 db hits may as well be slow... Note the BK-tree suggestion: https://www.geeksforgeeks.org/bk-tree-introduction-implementation/ This data structure needs to be built only once and is designed to look up similar items efficiently. – user2390182 Aug 24 '23 at 13:53
  • @MisterMiyagi Let me give you one more example. In the list list_from_DB = [ 'LOPEZ TRANSPORT', 'LOPEZ TRANSPORTATION', 'Lopez Transport', 'napora inc', 'Narain Transport Ltd' ] and in the list list_from_user = [ 'Description', 'llc', 'Load', 'Rate', 'Amount', '$1950.00', 'Please', 'BBE', 'note', 'Lopez transport', 'LLC' ] similar will be the following elements [ 'Lopez Transport', 'LOPEZ TRANSPORT', 'LOPEZ TRANSPORTATION' ] – Paul Aug 24 '23 at 13:55
  • What's the issue with the soundex solution that Tim gave you? – Kelly Bundy Aug 24 '23 at 13:57
  • @user2390182 Yes, thanks, I'll take a look at your recommendations and get back to you with feedback. If you would like to help me with the code, I would also be grateful. – Paul Aug 24 '23 at 13:57
  • Examples are meaningless without a description of why they apply. Are you looking for case insensitive substring matches? – MisterMiyagi Aug 24 '23 at 13:59
  • @MisterMiyagi Yes: first, completely identical options should be found; then the same options case-insensitively; then options where the user's words contain errors (for example, list_from_user = ["Volswagen"] and list_from_DB = ["Volkswagen", "Mazda"]) – Paul Aug 24 '23 at 14:06
  • Why then, for example, is "Narain Transport Ltd" not similar to "Lopez transport" seeing how they are both superstrings of "Transport"? In other words, what do you consider a negligible error? – MisterMiyagi Aug 24 '23 at 14:15
  • @KellyBundy Because with that option I cannot rank the found items from most similar to least similar. Also, not all items are matched correctly; for example, some that are in both the database and the user list are not found at all. Yes, that option is fast, but it is not suitable for my particular task. – Paul Aug 24 '23 at 14:16
  • @MisterMiyagi Yes, it is similar, but with less weight than the others – Paul Aug 24 '23 at 14:18
  • And how do you define the weight? – MisterMiyagi Aug 24 '23 at 14:18
  • @MisterMiyagi By how many characters to replace/delete – Paul Aug 24 '23 at 14:20
  • So what you are actually looking for is an algorithm for matching by edit distance with a threshold and sub-quadratic scaling? Would this be single pass for both collections, or would one (say, from the DB) be matched many times and thus amortise building an index or similar? – MisterMiyagi Aug 24 '23 at 14:23
  • @MisterMiyagi If from my explanations you have drawn such a conclusion, then probably yes. It's just that I'm having a hard time understanding the terminology from your last post. – Paul Aug 24 '23 at 14:25
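The BK-tree suggestion from the comments can be sketched as follows. This is a minimal illustration, not a tuned implementation: `BKTree` and `levenshtein` are hand-rolled here rather than taken from a library. The tree is built once from the DB names and then reused for every user token; the triangle inequality lets each query skip most of the tree.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it).lower(), {})  # node = (word, {distance: child})
        for w in it:
            self.add(w)

    def add(self, word):
        word = word.lower()
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return  # already present
            if d in children:
                node, children = children[d]
            else:
                children[d] = (word, {})
                return

    def query(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist."""
        word = word.lower()
        results, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= max_dist:
                results.append((d, node))
            # Triangle inequality: only subtrees whose edge distance is
            # within max_dist of d can contain a match.
            for dist, child in children.items():
                if abs(dist - d) <= max_dist:
                    stack.append(child)
        return sorted(results)


# Build once, query many times.
tree = BKTree(["Volkswagen", "Mazda", "Reebok"])
print(tree.query("volswagen", 2))  # -> [(1, 'volkswagen')]
```

Whether this beats the brute-force scan depends on the distance threshold: BK-trees prune well for small edit distances, but degrade toward a full scan as the threshold grows.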

1 Answer


If the result of your current approach meets your needs, then `cdist` is a better choice.

The code below shows how to collect everything we can get out of `cdist`.

import numpy as np
import rapidfuzz


def find_fuzzy(list_from_user, list_from_DB, score_cutoff: int):
    score_matrix = rapidfuzz.process.cdist(
        list_from_user,
        list_from_DB,
        processor=str.lower,
        scorer=rapidfuzz.fuzz.ratio,
        dtype=np.uint8,  # Output the score as uint8, which is faster.
        workers=-1,  # Use multithreading. -1 means use all cores.
        score_cutoff=score_cutoff,
    )

    results = []
    user_indices, db_indices = np.nonzero(score_matrix)
    for user_index_of_match, db_index_of_match in zip(user_indices, db_indices):
        results.append(
            {
                "user_index_of_match": user_index_of_match,
                "db_index_of_match": db_index_of_match,
                "user_item_of_match": list_from_user[user_index_of_match],
                "db_item_of_match": list_from_DB[db_index_of_match],
                "score_of_match": score_matrix[user_index_of_match, db_index_of_match],
            }
        )
    return results
  • Your answer is incredible. This is a cool approach to solving this problem, many thanks – Paul Aug 27 '23 at 12:59
  • While studying the code and the various possibilities, I had the following idea: is it possible to get score_matrix without the zeros in the first place? That would avoid the np.nonzero(score_matrix) call and improve performance further. I see in the documentation there is a scorer_kwargs parameter - perhaps it can be used somehow? – Paul Aug 27 '23 at 14:39
  • @Paul Unfortunately, it is not possible. The `scorer_kwargs` are passed to the scorer (in the above code, the `rapidfuzz.fuzz.ratio` function) and do not change the behavior of the `cdist` itself. A deeper look, [this](https://github.com/maxbachmann/RapidFuzz/blob/v3.2.0/src/rapidfuzz/process_cpp.hpp#L494) is the cdist implementation. The matrix is allocated at the very first step of the function, so there is no way to get around it. – ken Aug 27 '23 at 15:43