1

I have 2 dataframes - call them df1 and df2.

The first looks like:

df1

Name          G ID         Type         Source       Year
Washington    WTG1         Normal       Lin          2002
Washington    1            Normal       Lin          2001
Washington    4            Normal       BOX
Eddie         GT2          Normal       Lin
Mann          1            New          BOX          2018
Mann          2            Normal       BOX
Mann          SI-01        Old          Lin          2017
Mann          GGh          Old          Lin
.
.

The second looks like:

df2

Name          Unit ID       Year
Washington    WTG-1         2002
Washington    1             2001
Washington    4             2003
Eddie         GT02          2010
Mann          1             2018
Mann          2             2001
Mann          SI1           2017
Mann          JO            2000                       
.
.

As you can see df1 has some of the Year values that df2 has but it's random as to when it has it or not. Also, some values like GGh in Mann of df2 has no match at all that's obvious.

The G ID and Unit ID are what I'm most interested in merging with. I need a way to make some conditions after merging on the Years (those are pretty accurate) where the ID values can be used to merge.

The condition would be something like Levenshtein Distance:

if i in df1['G ID'] does not match j in df2['Unit ID'] -> Calculate LD -> if distance is close enough then join

I know this is very pseudo code but I'm not sure how to proceed. Is there a way to do this?

HelloToEarth
  • 2,027
  • 3
  • 22
  • 48
  • CHeck with https://stackoverflow.com/questions/49680416/most-likely-word-based-on-max-levenshtien-distance/49680515#49680515 – BENY Jan 13 '19 at 21:44
  • https://stackoverflow.com/questions/41795640/merge-pandas-data-frames-based-on-conditions – Samer Ayoub Jan 13 '19 at 22:22

1 Answers1

2

You can try with the .get_close_matches() method from difflib like this:

import difflib

# make a key column to merge based on close matches
df2['Fuzzy_Key'] = df2.Unit_ID.map(lambda x: difflib.get_close_matches(x, df1.G_ID))

# since the values in our Fuzzy_Key column are lists, we have to convert them to strings
df2['Fuzzy_Key'] = df2.Fuzzy_Key.apply(lambda x: ''.join(map(str, x)))

Output

    Name        Unit_ID Year    Fuzzy_Key
0   Washington  WTG-1   2002    WTG1
1   Washington  1       2001    11
2   Washington  4       2003    4
3   Eddie       GT02    2010    GT2
4   Mann        1       2018    11
5   Mann        2       2001    2
6   Mann        SI1     2017    SI-01
7   Mann        JO      2000    

After that, we can merge on the new created Fuzzy_Key

pd.merge(df2, df1[['Type', 'Source', 'Year', 'G_ID']], 
         how='left', 
         left_on=['Year', 'Fuzzy_Key'], 
         right_on=['Year', 'G_ID'])

Output

    Name        Unit_ID Year    Fuzzy_Key   Type    Source  G_ID
0   Washington  WTG-1   2002    WTG1        Normal  Lin     WTG1
1   Washington  1       2001    11          NaN     NaN     NaN
2   Washington  4       2003    4           NaN     NaN     NaN
3   Eddie       GT02    2010    GT2         NaN     NaN     NaN
4   Mann        1       2018    11          NaN     NaN     NaN
5   Mann        2       2001    2           NaN     NaN     NaN
6   Mann        SI1     2017    SI-01       Old     Lin     SI-01
7   Mann        JO      2000                NaN     NaN     NaN

Important note
It did not match on the keys (1 & 2001) and (1 & 2018) because at the creation of Fuzzy_Key column it close matched to 11 and not 1 as you can see in the first output. I could not figure out why this happend, else row 1 and 4 would have matches as well.

Hope this is a start for your problem. Good luck!

Erfan
  • 40,971
  • 8
  • 66
  • 78