I have 2 dataframes - call them df1
and df2
.
The first looks like:
df1
Name G ID Type Source Year
Washington WTG1 Normal Lin 2002
Washington 1 Normal Lin 2001
Washington 4 Normal BOX
Eddie GT2 Normal Lin
Mann 1 New BOX 2018
Mann 2 Normal BOX
Mann SI-01 Old Lin 2017
Mann GGh Old Lin
.
.
The second looks like:
df2
Name Unit ID Year
Washington WTG-1 2002
Washington 1 2001
Washington 4 2003
Eddie GT02 2010
Mann 1 2018
Mann 2 2001
Mann SI1 2017
Mann JO 2000
.
.
As you can see df1
has some of the Year
values that df2
has but it's random as to when it has it or not. Also, some values like GGh
in Mann
of df2
has no match at all that's obvious.
The G ID
and Unit ID
are what I'm most interested in merging with. I need a way to make some conditions after merging on the Years (those are pretty accurate) where the ID
values can be used to merge.
The condition would be something like Levenshtein Distance:
if i in df1['G ID']
does not match j in df2['Unit ID']
-> Calculate LD
-> if distance is close enough
then join
I know this is very pseudo code but I'm not sure how to proceed. Is there a way to do this?