
I am trying to do string matching and bring back the matched ID using fuzzywuzzy in Python. My dataset is huge: dataset1 has 1.8 million records and dataset2 has 1.6 million records.

What I have tried so far:

First I tried the recordlinkage package in Python. Unfortunately it ran out of memory while building the multi-index, so I moved to AWS with more machine power and built it successfully. However, when I ran the comparison it took forever, which I accept is due to the sheer number of comparisons.

Then I tried string matching with fuzzywuzzy, parallelising the process with the dask package, and executed it on sample data. It works fine, but I know the process will still take a long time because the search space is wide. I am looking for a way to add blocking or indexing to this piece of code.

test = pd.DataFrame({'Address1':['123 Cheese Way','234 Cookie Place','345 Pizza Drive','456 Pretzel Junction'],'city':['X','U','X','U']}) 
test2 = pd.DataFrame({'Address1':['123 chese wy','234 kookie Pl','345 Pizzza DR','456 Pretzel Junktion'],'city':['X','U','Z','Y'] , 'ID' : ['1','3','4','8']}) 

Here, I am trying to look up test.Address1 in test2.Address1 and bring back its ID.

import pandas as pd
import dask.dataframe as dd
from fuzzywuzzy import fuzz

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    # score every candidate address against the query string
    slave_df['score'] = slave_df.Address1.apply(lambda x: fuzzy_score(x, orig_string))
    # return the ID corresponding to the highest score (.loc replaces the deprecated .ix)
    return slave_df.loc[slave_df.score.idxmax(), 'ID']

dmaster = dd.from_pandas(test, npartitions=24)
dmaster = dmaster.assign(ID_there=dmaster.Address1.apply(lambda x: helper(x, test2),
                                                         meta=('ID_there', 'object')))
dmaster.compute(scheduler='processes')  # older dask: get=dask.multiprocessing.get

This works fine, but I am not sure how to apply indexing to it so that the search space is limited to the same city.

Let's say I create an index on the city field, subset test2 based on the city of the original string, and pass that subset to the helper function:

# sort the dataframe
test2.sort_values(by=['city'], inplace=True)
# set the index to be this and don't drop
test2.set_index(keys=['city'], drop=False, inplace=True)

I am not sure how to wire that up, though; a rough sketch of what I am after is below. Please advise. Thanks in advance.
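Something along these lines is what I mean (an untested sketch of my own idea; helper is the function defined above):

# untested sketch: limit the search space to candidates from the same city,
# using the city index built above; reset_index keeps idxmax unambiguous
test['ID_there'] = test.apply(
    lambda row: helper(row.Address1,
                       test2.loc[[row.city]].reset_index(drop=True)),
    axis=1)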

ds_user
  • Have you resolved this? – Bob Haffner Jun 25 '17 at 17:02
  • Not yet. Still working on it. – ds_user Jun 25 '17 at 19:06
  • Hey @ds_user. I've been googling this problem for weeks and have the exact same issue as you. I have about 100k addresses in `df1`, and 2 million in `df2` that also have an `ID` that I need to extract. This ID is actually a zipcode that I will append to addresses in `df1`. Did you find a solution for this? I'm also using fuzzy matching but it is taking an extremely long time. Any help would be appreciated. – SCool Aug 21 '19 at 15:34

2 Answers


I prefer using fuzzywuzzy.process.extractOne, which compares a string to an iterable of strings.

import pandas as pd
from fuzzywuzzy import process

def extract_one(col, other):
    # materialise dask objects so extractOne sees a concrete Series
    other = other.compute() if hasattr(other, 'compute') else other
    # when other is a Series, extractOne returns a (match, score, index) 3-tuple
    return pd.DataFrame([process.extractOne(x, other) for x in col],
                        columns=['Address1', 'score', 'idx'],
                        index=col.index)

extract_one(test.Address1, test2.Address1)

               Address1  score  idx
0          123 chese wy     92    0
1         234 kookie Pl     83    1
2         345 Pizzza DR     86    2
3  456 Pretzel Junktion     95    3

The idx is the index label, within the other passed to extract_one, of the closest match. I would recommend having a meaningful index, to make joining the results easier later on.
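For example, since test2 here has a default RangeIndex, the IDs the question asks for can be pulled back through idx (a usage sketch, not part of the original answer):

matches = extract_one(test.Address1, test2.Address1)
# idx holds labels from test2's index, so the IDs can be looked up directly
matches['ID'] = test2.loc[matches.idx, 'ID'].values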

For your second question, about filtering to cities, I would use a groupby and apply

gr1 = test.groupby('city')
gr2 = test2.groupby('city')

gr1.apply(lambda x: extract_one(x.Address1,
                                gr2.get_group(x.name).Address1))

               Address1  score  idx
0          123 chese wy     92    0
1         234 kookie Pl     83    1
2         345 Pizzza DR     86    2
3  456 Pretzel Junktion     95    3

The only difference with dask is the need to specify a meta for the apply:

ddf1 = dd.from_pandas(test, 2)
ddf2 = dd.from_pandas(test2, 2)

dgr1 = ddf1.groupby('city')
dgr2 = ddf2.groupby('city')

meta = pd.DataFrame(columns=['Address1', 'score', 'idx'])
dgr1.apply(lambda x: extract_one(x.Address1,
                                 dgr2.get_group(x.name).Address1),
           meta=meta).compute()

             Address1  score  idx
city                             
U    0  234 kookie Pl     83    1
     1  234 kookie Pl     28    1
X    0   123 chese wy     92    0
     1   123 chese wy     28    0

Here's a notebook: https://gist.github.com/a932b3591346b898d6816a5efc2bc5ad

I'm curious to hear how the performance is. I'm assuming the actual string comparison done in fuzzywuzzy will take the bulk of the time, but I'd love to hear how much overhead is spent in pandas and dask. Make sure you have the C extensions for computing the Levenshtein distance.
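(If it is not installed already, pip install python-Levenshtein provides that C extension; without it, fuzzywuzzy falls back to a slow pure-Python SequenceMatcher and prints a warning saying so.)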

TomAugspurger
  • Thanks for your response. I will try it today. How do I parallelise the process? Both the methods? – ds_user Jun 21 '17 at 18:44
  • 2
    Everything should be parallelized. You may want to check http://dask.pydata.org/en/latest/dataframe-groupby.html, you shouldn't have to spill to disk so it may be OK. You'll also want to make sure that the Levenshtein distance extension releases the GIL for it's computation (I'm not sure if it does) – TomAugspurger Jun 21 '17 at 19:10
  • And another error is - AssertionError: 3 columns passed, passed data had 2 columns - on this line - columns=['Address1', 'score', 'idx']) . I think the fuzzywuzzy.process.extractOne does not return idx as part of the tuple it returns. Can you please help on this. @TomAugspurger – ds_user Jun 21 '17 at 22:38
  • Also, by doing this, I will lose the index of test.Address1, so it will make things even more complicated. – ds_user Jun 22 '17 at 05:39
  • > I think the fuzzywuzzy.process.extractOne does not return idx as part of the tuple it returns It depends on what's passed. When `other` is dict-like (which includes a Series), extractOne returns a 3-tuple. When it's just array-like it's a 2-tuple. What's `type(other)` that's being passed to `extractOne` for you? – TomAugspurger Jun 22 '17 at 14:15
  • Yes. That is causing issues for me. :( – ds_user Jun 22 '17 at 14:16
  • I just edited `extract_one` to keep `col.index` around. I haven't tested it, so let me know if that broke things. Does my example notebook work for you, or do you get the AssertionError? – TomAugspurger Jun 22 '17 at 14:19
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147379/discussion-between-ds-user-and-tomaugspurger). – ds_user Jun 22 '17 at 14:21
  • Hey my `type(other)` is a series. Even if I convert that to a dict, the key is the index of the record within the group not the actual index of the record in test2. @TomAugspurger – ds_user Jun 22 '17 at 20:23

I ran into the same problem once. The whole process takes forever, and even with multiprocessing it is not going to be really fast. The main cause of the slow speed is the fuzzy matching itself, because the pairwise scoring is very tedious and requires a lot of time.

Alternatively, and more efficient in my opinion, would be to use an embedding, a.k.a. bag of words, and apply an ML method on it. The fact that you work with numerical vectors makes the whole process way faster!
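A minimal sketch of that idea using character n-gram TF-IDF and a nearest-neighbour lookup (scikit-learn is my assumption here; the frame names and parameters are illustrative, not from the question):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

test = pd.DataFrame({'Address1': ['123 Cheese Way', '234 Cookie Place']})
test2 = pd.DataFrame({'Address1': ['123 chese wy', '234 kookie Pl'],
                      'ID': ['1', '3']})

# character n-grams tolerate typos, much like fuzzy matching does
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
tfidf2 = vec.fit_transform(test2.Address1)  # vectorise the lookup table once
tfidf1 = vec.transform(test.Address1)       # vectorise the queries

# one nearest neighbour in cosine space per query address
nn = NearestNeighbors(n_neighbors=1, metric='cosine').fit(tfidf2)
_, idx = nn.kneighbors(tfidf1)
test['ID_there'] = test2.ID.values[idx.ravel()]

Indexing the 1.6 million candidate vectors once and querying them is far cheaper than scoring 1.8M x 1.6M fuzzy pairs.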

George Sotiropoulos