I've 2 dataframes:
Dataframe 1:
path hierarchy
0 path3 path1/path2/path3
1 path2 path1/path2
2 path6 path1/path2/path4/path5/path6
DataFrame 2:
path hierarcy unique_id
0 path2 path1/<random_string>/path2 1
1 Path3 <random_string>/Path_2/path3 2
2 path2 path1/Path2 3
3 pAth6 path1/path2/path4/path6 4
4 path6 path1/path2/path4/path5/path6 5
Expected output:
path hierarchy unique_id
0 path3 path1/path2/path3 2
1 path2 path1/path2 3
2 path6 path1/path2/path4/path5/path6 5
Now, I wanna fill in the unique_id's from the 2nd dataframe to the 1st one. But, there are some issues -
- I can't directly do
merge(left)
the 2 dataframes based on thepath
column as the correct path depends on thehierarchy
column as well. - The strings in the hierarchy column are
not
exactly the same. There are some differences for examplePath_2
andpath2
. The same is also valid for the path column.
I tried fuzzy
matching but it's the hierarchy
column that's creating the issue. I want to start the matching from the right side of the hierarchy column and then move towards the left applying fuzzy matching on each level.
I'm not sure if that's a good approach or not.
Basically, the requirement is to identify the most perfect match based on both path/hierarchy columns both.
Dictionaries to create the dataframe:
df1 = pd.DataFrame({'path': {0: 'path3', 1: 'path2', 2: 'path6'},
'hierarchy': {0: 'path1/path2/path3',
1: 'path1/path2',
2: 'path1/path2/path4/path5/path6'}})
df2 = pd.DataFrame({'path': {0: 'path2', 1: 'Path3', 2: 'path2', 3: 'path6', 4: 'path6'},
'hierarcy': {0: 'path1/<random_string>/path2',
1: '<random_string>/Path_2/path3',
2: 'path1/Path2',
3: 'path1/path2/path4/path6',
4: 'path1/path2/path4/path5/path6'},
'unique_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}})
df3 = pd.DataFrame({'path': {0: 'path3', 1: 'path2', 2: 'path6'},
'hierarchy': {0: 'path1/path2/path3',
1: 'path1/path2',
2: 'path1/path2/path4/path5/path6'},
'unique_id': {0: 2, 1: 3, 2: 5}})
Any help is appreciated.