I have two lists:
The first list comes from the database and contains the names of various companies (they can be written in uppercase, lowercase, or a mix of both):
list_from_DB = ["Reebok", "MAZDA", "PATROL", "AsbEngland-bank", "Mazda INCC", "HIGHWAY lcc", "mazda", "015 Amazon", ......]
There are about 400,000 items in this list.
The second list I get by parsing text from the user (it can contain absolutely anything: words, numbers, symbols, etc.):
list_from_user = ['ts', '$243', 'mazda', 'OFFICERS', 'SIGNATURE', 'Date:07/20/2022', 'Wilson', 'Bank', .......]
There are about 1000 items in this list.
What I need to do is find which items from list_from_user appear in list_from_DB and display the matches ordered from most to least similar. As you can see below, the matching items in the two lists may be identical, or they may differ in spelling.
Expected output:
["mazda", "MAZDA", "Mazda INCC", "AsbEngland-bank"]
What I have tried: I know about fuzzy string matching libraries, and I am using rapidfuzz:
import rapidfuzz

res = []
for e in list_from_user:
    # extract_iter lazily yields (match, score, index) tuples that reach score_cutoff
    r = rapidfuzz.process.extract_iter(e, list_from_DB, processor=str.lower, scorer=rapidfuzz.fuzz.ratio, score_cutoff=95)
    res += r
This works, but it is very slow: it takes about 30 seconds, because the loop has to perform 1,000 * 400,000 = 400,000,000 comparisons.
So my question is: can this problem be solved without comparing every pair, in some other way? (I am not against the brute-force approach, as long as it fits within the time budget.)
My time target is 3 seconds max.
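To make the idea of "not enumerating all options" concrete, here is a minimal sketch of one possible pre-filter. It is not measured on the real data. It assumes the score is a normalized Indel similarity, 100 * 2 * LCS / (len1 + len2), which is how I understand rapidfuzz.fuzz.ratio to be defined (worth verifying against the rapidfuzz documentation). Under that assumption, two strings whose lengths differ too much can never reach the cutoff, so they do not need to be scored at all. The helper names build_index, length_window and candidates are hypothetical, and the sample lists are the truncated ones from above.

```python
from collections import defaultdict


def build_index(db_names):
    """Group original spellings by lowercased name, then lowercased names by length."""
    by_lower = defaultdict(list)    # e.g. "mazda" -> ["MAZDA", "mazda"]
    for name in db_names:
        by_lower[name.lower()].append(name)
    by_length = defaultdict(list)   # e.g. 5 -> ["mazda"]
    for key in by_lower:
        by_length[len(key)].append(key)
    return by_lower, by_length


def length_window(n, cutoff):
    """Range of lengths m that can still reach `cutoff` against a string of length n.

    Assumes score = 100 * 2 * LCS / (n + m) with LCS <= min(n, m),
    and an integer cutoff with 0 < cutoff <= 100.
    """
    lo = -(-cutoff * n // (200 - cutoff))   # ceil(cutoff * n / (200 - cutoff))
    hi = n * (200 - cutoff) // cutoff       # floor(n * (200 - cutoff) / cutoff)
    return lo, hi


def candidates(token, by_length, cutoff=95):
    """Yield only the lowercased DB names whose length makes the cutoff reachable."""
    lo, hi = length_window(len(token.lower()), cutoff)
    for m in range(lo, hi + 1):
        # .get avoids creating empty buckets in the defaultdict
        yield from by_length.get(m, ())


# Sample data copied from the (truncated) lists in the question.
list_from_DB = ["Reebok", "MAZDA", "PATROL", "AsbEngland-bank",
                "Mazda INCC", "HIGHWAY lcc", "mazda", "015 Amazon"]
list_from_user = ['ts', '$243', 'mazda', 'OFFICERS', 'SIGNATURE',
                  'Date:07/20/2022', 'Wilson', 'Bank']

by_lower, by_length = build_index(list_from_DB)
unique_tokens = {t.lower() for t in list_from_user}   # score each distinct token once

brute_force = len(list_from_user) * len(list_from_DB)
filtered = sum(len(list(candidates(t, by_length))) for t in unique_tokens)
print(brute_force, filtered)
```

Only the surviving candidates would then be passed to the fuzzy scorer, and each lowercased hit can be mapped back to its original spellings through by_lower. The window is only valid for the scorer and cutoff assumed here; a partial-match scorer or a lower cutoff would need a different (wider) filter.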