3

Is there any way to speed up the fuzzy string match using fuzzywuzzy in pandas.


I have a dataframe as extra_names which has names that I want to run fuzzy matches for with another dataframe as names_df.

>> extra_names.head()

     not_matching
0 Vij Sales
1 Crom Electronics 
2 REL Digital
3 Bajaj Elec
4 Reliance Digi

>> len(extra_names)
6500

>> names_df.head()

         names   types
0 Vijay Sales        1
1 Croma Electronics  1
2 Reliance Digital   2
3 Bajaj Electronics  2
4 Pai Electricals    2

>> len(names_df)
250

As of now, I'm running the logic using the following code, but its taking forever to complete.

choices = names_df['names'].unique().tolist()

def fuzzy_match(row):
    best_match = process.extractOne(row, choices)
    return best_match[0], best_match[1] if best_match else '',''

%%timeit
extra_names['best_match'], extra_names['match%'] = extra_names['not_matching'].apply(fuzzy_match)

As I'm posting this question, the query is still running. Is there any way to speed up this fuzzy string matching process?

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
Aman Singh
  • 1,111
  • 3
  • 17
  • 31

1 Answers1

8

Let's try difflib:

import difflib
from functools import partial

f = partial(
    difflib.get_close_matches, possibilities=names_df['names'].tolist(), n=1)

matches = extra_names['not_matching'].map(f).str[0].fillna('')
scores = [
    difflib.SequenceMatcher(None, x, y).ratio() 
    for x, y in zip(matches, extra_names['not_matching'])
]

extra_names.assign(best=matches, score=scores)

       not_matching               best     score
0         Vij Sales        Vijay Sales  0.900000
1  Crom Electronics  Croma Electronics  0.969697
2       REL Digital   Reliance Digital  0.666667
3        Bajaj Elec  Bajaj Electronics  0.740741
4     Reliance Digi   Reliance Digital  0.896552
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Thanks, I ran it through your code-logic. But got the error as `TypeError: object of type 'float' has no len()` – Aman Singh Jun 10 '19 at 07:01
  • @AmanSingh never mind... just realised you might be dealing with NaNs... I have edited with a better fix. Sorry for the confusion. – cs95 Jun 10 '19 at 07:05
  • 1
    Thanks @cs95 that fillna(' ') at the end removed the error. Thanks for showing me this quicker approach to perform fuzzy string matching. Fuzzywuzzy may be a great library but it takes ages. – Aman Singh Jun 10 '19 at 07:12
  • what is the use of partial and when to use @cs95 – Pyd Jun 10 '19 at 10:13
  • 2
    @pyd the use of partial is to avoid lambda functions. – cs95 Jun 10 '19 at 22:35
  • Thanks! @cs95 Much more faster than fuzzy. However, is it possible to retrieve the column 'types' from name_df? – A2N15 Jan 30 '20 at 09:09
  • @Annis15 are you asking about df.dtypes? – cs95 Jan 30 '20 at 14:30
  • @cs95 I am asking about name_df.types = [1, 1, 1, 2, ..]. Is there a better way to add this column to our final df (extra_names.assign(best=matches, score=scores)) than merging? – A2N15 Jan 31 '20 at 07:41