1

I've 2 dataframes:

Dataframe 1:

    path                      hierarchy
0  path3              path1/path2/path3
1  path2                    path1/path2
2  path6  path1/path2/path4/path5/path6

DataFrame 2:

    path                       hierarcy  unique_id
0  path2    path1/<random_string>/path2          1
1  Path3   <random_string>/Path_2/path3          2
2  path2                    path1/Path2          3
3  pAth6        path1/path2/path4/path6          4
4  path6  path1/path2/path4/path5/path6          5

Expected output:

    path                      hierarchy  unique_id
0  path3              path1/path2/path3          2
1  path2                    path1/path2          3
2  path6  path1/path2/path4/path5/path6          5

Now, I wanna fill in the unique_id's from the 2nd dataframe to the 1st one. But, there are some issues -

  1. I can't directly do merge(left) the 2 dataframes based on the path column as the correct path depends on the hierarchy column as well.
  2. The strings in the hierarchy column are not exactly the same. There are some differences for example Path_2 and path2. The same is also valid for the path column.

I tried fuzzy matching but it's the hierarchy column that's creating the issue. I want to start the matching from the right side of the hierarchy column and then move towards the left applying fuzzy matching on each level. I'm not sure if that's a good approach or not.

Basically, the requirement is to identify the most perfect match based on both path/hierarchy columns both.

Dictionaries to create the dataframe:

df1 = pd.DataFrame({'path': {0: 'path3', 1: 'path2', 2: 'path6'},
 'hierarchy': {0: 'path1/path2/path3',
  1: 'path1/path2',
  2: 'path1/path2/path4/path5/path6'}})


df2 = pd.DataFrame({'path': {0: 'path2', 1: 'Path3', 2: 'path2', 3: 'path6', 4: 'path6'},
 'hierarcy': {0: 'path1/<random_string>/path2',
  1: '<random_string>/Path_2/path3',
  2: 'path1/Path2',
  3: 'path1/path2/path4/path6',
  4: 'path1/path2/path4/path5/path6'},
 'unique_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}})


df3 = pd.DataFrame({'path': {0: 'path3', 1: 'path2', 2: 'path6'},
 'hierarchy': {0: 'path1/path2/path3',
  1: 'path1/path2',
  2: 'path1/path2/path4/path5/path6'},
 'unique_id': {0: 2, 1: 3, 2: 5}})

Any help is appreciated.

Shubham Sharma
  • 129
  • 1
  • 8
  • Will this help you https://stackoverflow.com/questions/68479358/fuzzy-wuzzy-to-find-a-match-and-other-columns-associated-with-match/68479623?noredirect=1#comment121024464_68479623 – Alexey Jul 27 '21 at 16:21
  • 1
    @Alexey This will not help already tried, Requirement is to do the fuzzy matching on each hierarchy level. and the path column. – Shubham Sharma Jul 27 '21 at 16:33

2 Answers2

0

Transform text in df1 and df2 to lowercase:

df1['path'] = df1['path'].str.lower()
df2['path'] = df2['path'].str.lower()

Merge:

result = pd.merge(df1, df2, on=['path'])

Result:

enter image description here

Piotr Żak
  • 2,046
  • 5
  • 18
  • 30
  • Hii, Thanks for the anser. But, it's not that simple result depends on the hierarchy column as well and the strings are not similar. So, I need to use the fuzzy-wuzzy comparison at each level. – Shubham Sharma Jul 27 '21 at 16:32
  • https://www.geeksforgeeks.org/how-to-do-fuzzy-matching-on-pandas-dataframe-column-using-python/ Maybe. that can helps. – Piotr Żak Jul 27 '21 at 16:34
0

I tried fuzzy matching, but didn't get satisfying results, because the order of the hierarchy is crucial here. Let me suggest an alternative approach that treats the hierarchies as vectors and then retrieves the idx of the vector in df2 with the closest spatial proximity. In the example below the paths are cleaned, split by / and are assigned a unique value. Subsequently the spatial distance is calculated of every hierarchy in df1 against the vectors with the same length in df2:

import pandas as pd
import numpy as np
from scipy import spatial

df1 = pd.DataFrame({'path': {0: 'path3', 1: 'path2', 2: 'path6'},
 'hierarchy': {0: 'path1/path2/path3',
  1: 'path1/path2',
  2: 'path1/path2/path4/path5/path6'}})


df2 = pd.DataFrame({'path': {0: 'path2', 1: 'Path3', 2: 'path2', 3: 'path6', 4: 'path6'},
 'hierarchy': {0: 'path1/<random_string>/path2',
  1: '<random_string>/Path_2/path3',
  2: 'path1/Path2',
  3: 'path1/path2/path4/path6',
  4: 'path1/path2/path4/path5/path6'},
 'unique_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}})

#clean up paths
df1['hierarchy'] = df1['hierarchy'].str.lower().str.replace('_','').str.split('/')
df2['hierarchy'] = df2['hierarchy'].str.lower().str.replace('_','').str.split('/')

#create unique integer ids per path and map them to the dataframes
unique_dirs = set(list(np.concatenate(df1['hierarchy'].to_list() + df2['hierarchy'].to_list())))
dir_map = dict(zip(unique_dirs, range(len(unique_dirs))))

df1 = df1.assign(mapped=[list(map(dir_map.get, x)) for x in df1.hierarchy])
df2 = df2.assign(mapped=[list(map(dir_map.get, x)) for x in df2.hierarchy])

#calculate spatial distance to vectors from df2 with the same length, then return the matched hierarchy and idx
def check(row):
  value = row['mapped']
  arr = np.array(df2[df2['mapped'].str.len() == len(value)]['mapped'].to_list())
  tree = spatial.KDTree(arr)
  distances, indices = tree.query(value)
  matches = df2[df2['mapped'].astype(str) == str(tree.data[indices].tolist())]
  return matches['hierarchy'].tolist()[0], matches['unique_id'].tolist()[0]

df1[['match', 'idx']] = df1.apply(check, axis=1, result_type="expand")

Result df1:

path hierarchy mapped match idx
0 path3 ['path1', 'path2', 'path3'] [0, 1, 3] ['<randomstring>', 'path2', 'path3'] 2
1 path2 ['path1', 'path2'] [0, 1] ['path1', 'path2'] 3
2 path6 ['path1', 'path2', 'path4', 'path5', 'path6'] [0, 1, 6, 5, 4] ['path1', 'path2', 'path4', 'path5', 'path6'] 5

The code could likely be computationally optimized and will not work if multiple similar hierarchies with random paths at the same place exist. In that case I wouldn't assign a random integer, but retrieve embeddings for the paths using a language model prior to calculating the distance.

RJ Adriaansen
  • 9,131
  • 2
  • 12
  • 26