
I have two lists: ref_list and inp_list. How can I use FuzzyWuzzy to match each name in inp_list against ref_list?

import pandas as pd

inp_list = pd.DataFrame(['ADAMS SEBASTIAN', 'HAIMBILI SEUN', 'MUTESI JOHN',
                         'SHEETEKELA MATT', 'MUTESI JOHN KUTALIKA',
                         'ADAMS SEBASTIAN HAUSIKU', 'PETERS WILSON',
                         'PETERS MARIO', 'SHEETEKELA  MATT NICKY'],
                        columns=['Names'])

ref_list = pd.DataFrame(['ADAMS SEBASTIAN HAUSIKU', 'HAIMBILI MIKE',
                         'HAIMBILI SEUN', 'MUTESI JOHN KUTALIKA',
                         'PETERS WILSON MARIO', 'SHEETEKELA  MATT NICKY MBILI'],
                        columns=['Names'])

After some research, I modified some code I found on the internet. The problem is that it works well on small samples, but my inp_list and ref_list are 29k and 18k entries long respectively, and it takes more than a day to run.

Below is the code. First, a helper function is defined:

from fuzzywuzzy import fuzz

def match_term(term, ref_names, min_score=0):
    # -1 score in case there are no matches
    max_score = -1

    # return an empty string for no match
    max_name = ''

    # iterate over all names in the reference list
    for term2 in ref_names:
        # find the fuzzy match score
        score = fuzz.token_sort_ratio(term, term2)

        # keep the best match that is above the threshold
        if score > min_score and score > max_score:
            max_name = term2
            max_score = score
    return (max_name, max_score)


# list of dicts for easy dataframe creation
dict_list = []

# iterating over the sales file
for name in inp_list['Names']:
    # use the helper above to find the best match, with a chosen threshold (94 here)
    match = match_term(name, ref_list['Names'], 94)

    # new dict for storing this row's data
    dict_ = {'passenger_name': name,
             'match_name': match[0],
             'score': match[1]}

    dict_list.append(dict_)
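The "easy dataframe creation" the comment above refers to would then be a single call (results is just an illustrative name):

results = pd.DataFrame(dict_list)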

Where can this code be improved so that it runs faster, and perhaps avoids re-evaluating names that have already been assessed?

  • Have you looked at https://pypi.org/project/fuzzywuzzy/ – Jay M Jul 08 '20 at 09:15
  • Yes, I did. In fact, that is where they explain the basic usage of fuzzywuzzy. The emphasis there is on the usage of the four ratios and how a single string or name can be matched to a list of strings or names. – John Mutesi Jul 08 '20 at 09:27

2 Answers


You can try to vectorize the operations instead of evaluating the scores in a loop.

Make a df where the first column, ref, holds the reference names and the second column, inp, holds each name in inp_list. Then call df.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1). Finally you'll get the best matching name in ref_list, with its score, for each name in inp_list.
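One caveat: the choices passed to process.extractOne must be a collection of candidate names; if row['ref'] is a single string, extractOne iterates over its characters, which would explain results like (A, 60) in the comments below. A minimal sketch of the apply-based idea, assuming the 'Names' columns from the question and the token_sort_ratio scorer the question uses:

from fuzzywuzzy import fuzz, process
import pandas as pd

# candidates as a plain list so extractOne treats them as separate names
ref_names = ref_list['Names'].tolist()

# best (match, score) pair from ref_names for every input name
best = inp_list['Names'].apply(
    lambda name: process.extractOne(name, ref_names,
                                    scorer=fuzz.token_sort_ratio))

# split the (match, score) tuples into two columns
inp_list[['match_name', 'score']] = pd.DataFrame(best.tolist(),
                                                 index=inp_list.index)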

  • I am not too sure how to move forward `df = pd.DataFrame({'ref': pd.Series(ref_list), 'inp': pd.Series(inp_list)})` `df.apply(process.extractOne(row['inp'], row['ref']))` **NameError: name 'row' is not defined** – John Mutesi Jul 08 '20 at 10:40
  • Sorry there is a typo, please try again. – 黄煜楠 Jul 08 '20 at 11:11
  • I think that we are getting closer. I have changed to the revised code and below are the first 5 items I am getting. Let me carefully go through it again; maybe there is something I am missing `0 (A, 60)` `1 (A, 60)` `2 (A, 60)` `3 (A, 60)` `4 (A, 60)` – John Mutesi Jul 08 '20 at 13:28

The measures you are using are computationally demanding for that many pairs of strings. As an alternative to fuzzywuzzy, you could try a library called string-grouper, which exploits a faster TF-IDF method and the cosine similarity measure to find similar words. As an example:

import random, string, time
import pandas as pd
from string_grouper import match_strings

# build two lists of 5000 random 6-letter strings each
alphabet = list(string.ascii_lowercase)
from_r, to_r = 0, len(alphabet) - 1

random_strings_1 = ["".join(alphabet[random.randint(from_r, to_r)]
                            for i in range(6)) for j in range(5000)]
random_strings_2 = ["".join(alphabet[random.randint(from_r, to_r)]
                            for i in range(6)) for j in range(5000)]

series_1 = pd.Series(random_strings_1)
series_2 = pd.Series(random_strings_2)

# time the matching of all pairs above the similarity threshold
t_1 = time.time()
matches = match_strings(series_1, series_2, min_similarity=0.6)
t_2 = time.time()
print(t_2 - t_1)
print(matches)

It takes less than one second to do 25,000,000 comparisons! For a more thorough test of the library, look here: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html where it is claimed that

"Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop".

To tune the matching further, look at the **kwargs arguments you can pass to the match_strings function above.
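For instance (a sketch; the parameter values here are illustrative, not recommendations):

# raise the similarity threshold and use 3-character n-grams
matches = match_strings(series_1, series_2,
                        min_similarity=0.8,
                        ngram_size=3)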
