
I have two lists of names, and I would like to find the closest match between them; if no "close enough" match is found, I would like to match a name to itself.

My current approach is to create a dataframe with all the possible combinations and use .apply or list comprehensions to iterate over it, calculating a similarity ratio via difflib.SequenceMatcher (imported as sm).

The problem is that I have a few thousand names in both lists, which results in intractable run times.

My matching criteria would ideally be an sm ratio of >= 0.85, with the first name being found in the second name as a whole word. If these criteria are not met, the name should be matched to itself.

The last step I would like to implement is to replace the original series with these matched names.
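Stated as code, the rule is roughly this (a sketch of the intent only, not a fast implementation; best_or_self is just an illustrative name, and candidates stands for the second list of names):

import re
from difflib import SequenceMatcher as sm

def best_or_self(name, candidates, threshold=0.85):
    # Return the closest candidate meeting both criteria, else the name itself.
    clean = str(name).lower().strip()
    best, best_ratio = name, threshold
    for candidate in candidates:
        cand = str(candidate).lower().strip()
        ratio = sm(None, clean, cand).ratio()
        # Criteria: ratio at or above the threshold AND the first name
        # contained in the second as a whole word.
        if ratio >= best_ratio and re.search(r'\b' + re.escape(clean) + r'\b', cand):
            best, best_ratio = candidate, ratio
    return best  # no close-enough match: the name maps to itself

# Replacing the original series would then be something like:
# db['Company'] = db['Company'].apply(lambda n: best_or_self(n, candidate_list))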

Here is the code for my current approach; please let me know if it is unclear and how I can help clarify:

import pandas as pd
from difflib import SequenceMatcher as sm

stop_words = [
             'pharmaceuticals',
             'pharmaceutical',
             'pharma',
             'therapeutic',
             'biopharma',
             'therapeutics',
             'international',
             'biotechnology',
             'vaccines',
             '&',
             'co.',
             'co',
             'biotherapeutics',
             'biotherapeutic',
             'life science',
             'life sciences',
             'laboratories',
             'biologics',
             ]

temp_db_companies = db['Company']

def approximate_match(s1, s2):
    # Return "ratio|candidate" so the score and the matched name travel together.
    return str(sm(None, str(s1).lower().strip(), str(s2).lower().strip()).ratio()) + '|' + str(s2).strip()


def replace_val_w_best_match(df, col_initial, series_final, stop_words):
    # Drop stop words from the initial names, then stitch the remaining
    # tokens back into single, whitespace-normalised strings.
    init_names = df[col_initial].str.lower().str.split(" ", expand=True).replace(stop_words, "").fillna('')

    init_names = (pd.Series(['' for n in range(len(init_names))])
                  .str.cat([init_names[col] for col in init_names.columns], sep=" ")
                  .str.replace('  ', ' ')
                  .reset_index())

    # Build one row per (initial name, comparator) pair via melt; this is the
    # quadratic step that blows up with a few thousand names on each side.
    matching_df = pd.DataFrame(columns=list(init_names.columns) + list(series_final), data=init_names)

    matching_df = pd.melt(matching_df,
                          id_vars=['index', 0],
                          value_vars=list(series_final),
                          var_name='Comparators',
                          value_name='Best match')

#    matching = matching_df.apply(lambda row: approximate_match(row[0], row['Comparators']), axis=1)

    # Compare the melted pairs elementwise with zip; nesting the loops here
    # would square the (already quadratic) number of comparisons. Note the
    # .ratio() call: sm(...) alone returns a SequenceMatcher object.
    names = list(matching_df[0])
    comparators = list(matching_df['Comparators'])

    ratio = [sm(None, name1, name2).ratio() for name1, name2 in zip(names, comparators)]
    match = comparators

    print(ratio[:5])
    print(match[:5])
wingsoficarus116

1 Answer


What you're probably looking for is the Levenshtein distance algorithm: it computes the minimum number of single-character edits (insertions, deletions, and substitutions) needed to transform one string into the other. For example, turning "kitten" into "sitting" takes three edits.

Check out this library: https://github.com/ztane/python-Levenshtein/

The Levenshtein library ships a StringMatcher.py module with a SequenceMatcher-like class designed to help with exactly this problem.
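For instance (a minimal sketch; it assumes the package installs as the Levenshtein module -- verify against the version you end up with):

import Levenshtein  # pip install python-Levenshtein

# Edit distance: minimum number of single-character edits.
print(Levenshtein.distance('acme pharma', 'acme pharmaceuticals'))  # 9

# Normalised similarity in [0, 1], comparable in spirit to
# SequenceMatcher.ratio() but implemented in C.
print(Levenshtein.ratio('acme pharma', 'acme pharmaceuticals'))     # ~0.71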

This library offers similar functionality: https://github.com/gfairchild/pyxDamerauLevenshtein
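Its interface is a pair of plain functions (names as documented in that project's README; a sketch, assuming a current version):

from pyxdameraulevenshtein import (  # pip install pyxDamerauLevenshtein
    damerau_levenshtein_distance,
    normalized_damerau_levenshtein_distance,
)

print(damerau_levenshtein_distance('smtih', 'smith'))             # 1 (one transposition)
print(normalized_damerau_levenshtein_distance('smtih', 'smith'))  # 0.2 -- a distance, so 0.0 means identical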

Greg Jennings
  • Is this more efficient than SequenceMatcher? My issue is more the run time than the ability to match approximate strings. – wingsoficarus116 Jan 16 '17 at 03:03
  • It's hard to say without seeing the timings of your existing implementation, but if performance is the most critical thing I'd recommend benchmarking the second implementation. It's written in Cython, so it should be extremely fast. The benchmarks on the pyxDamerauLevenshtein library show it running about 500k comparisons in 2 mins. – Greg Jennings Jan 16 '17 at 03:15
  • Looks like it is an order of magnitude faster, but I need to install MS Visual C++ first? The regular pip install command isn't working. Sorry if this is a mundane question, but would love to implement your recommendation. – wingsoficarus116 Jan 16 '17 at 15:06
  • No worries :) If you're working with python's numerical stack, I'd strongly recommend you consider switching to conda as your package manager (https://www.continuum.io/downloads). It handles all of the numerical library dependencies for you. You'll need to recreate your environment in conda, and from there, you can just `conda install pyxDamerauLevenshtein` – Greg Jennings Jan 16 '17 at 17:37
  • Thanks, pip gave me a link to the C++ info, got it to work! Huge time saver. – wingsoficarus116 Jan 16 '17 at 20:14
  • Great. One other thing you might look at is numba. When you're looping over a collection of items in python there's a lot of call overhead. numba can help you minimize this greatly for numerical operations. – Greg Jennings Jan 16 '17 at 21:16