0

Lets say i have 2 excel files each containing a column of names and dates

Excel 1:

Name
0      Bla bla bla June 04 2018 
1      Puppy Dog June 01 2017
2      Donald Duck February 24 2017
3      Bruno Venus April 24 2019

Excel 2:

                             Name
0        Pluto Feb 09 2019
1        Donald Glover Feb 22 2020
2        Dog Feb 22 2020
3        Bla Bla Feb 22 2020

I want to match each cell from column 1 to each cell in column 2 and then locate the biggest similarity.

The following function will give a percentage value of how much two input match each other.

SequenceMatcher code example:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


x = "Adam Clausen a Feb 09 2019"
y = "Adam Clausen Feb 08 2019"
print(similar(x,y))

Output:0.92

Shawn Atlas
  • 25
  • 1
  • 7
  • build a 2 dimensions dataframe with colum1 as index and column 2 as column, then compute each value with `similar` and sort the highest point ? – pyOliv Jun 02 '20 at 10:42
  • That is actually a smart idea, but thing is i want to match them as 2 separate columns because in the long term i want to take the two matches and their respective rows and the align them, but i thought i should just start simple here :/ – Shawn Atlas Jun 02 '20 at 10:53
  • If you have solved it, post the solution as an answer below and accept it. – deceze Jun 29 '20 at 14:24

2 Answers2

1

If u know how to load colums as dataframe..this code should get your job done..

from difflib import SequenceMatcher

col_1 = ['potato','tomato', 'apple']
col_2 = ['tomatoe','potatao','appel']

def similar(a,b):
    ratio = SequenceMatcher(None, a, b).ratio()
    matches = a, b
    return ratio, matches

for i in col_1:
    print(max(similar(i,j) for j in col_2))
pritam samanta
  • 434
  • 4
  • 10
  • Thanks, but the code only give me the overall max value. I wanted to take each cell and find what it matched the most with and then print the 2 matches allegnid right after each other :) – Shawn Atlas Jun 02 '20 at 12:12
  • the code works for this scenario but when i apply it for my df's i get only one match? – Shawn Atlas Jun 02 '20 at 13:06
  • I just edited so i could show the data frames how they appear – Shawn Atlas Jun 02 '20 at 13:10
  • with the new frame do you also only print 1 match? – Shawn Atlas Jun 03 '20 at 09:34
  • this is the output im getting with your new dataframe '(0.45614035087719296, ('Abilass Vethanayagam June 04 2018', 'Adam Clausen Feb 09 2019')) (0.42105263157894735, ('Abraham H, Tesfazghi June 01 2017', 'Adam Clausen Feb 09 2019')) (0.509090909090909, ('Achilles Cortel February 24 2017', 'Adam Honore Feb 22 2020')) (0.4074074074074074, ('Achraf El khadri April 24 2019', 'Adam Clausen Feb 09 2019'))' – pritam samanta Jun 03 '20 at 14:53
  • it is printing all the matches for me.. make sure that ur are supplying the input in correct format(string).. – pritam samanta Jun 03 '20 at 14:56
0

UPDATED/SOLVED PART

The following code does:

  • It takes the 2 input files and turn them into dataframes
  • It will then take a specific Column ( in this case they are both called Name) and use it as its match input
  • It takes one name from File 1 and runs through all the names in File 2
  • It then takes the name with the highest match and saves their respective row and save them next to each other in the Output File

Code:

import pandas as pd
import numpy as np
from difflib import SequenceMatcher

def similar(a, b):
    ratio = SequenceMatcher(None, a, b).ratio()
    return ratio

#Load Batchlog to Data frame

data1 = pd.read_excel (r'File1.xlsx')
data2 = pd.read_excel (r'File2.xlsx')

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df1['Name'] = df1['Name'].astype(str)
df2['Name'] = df2['Name'].astype(str)

#Function/LOOP
order = []
for index, row in df1.iterrows():
    maxima = [similar(row['Name'], j) for j in df2['Name']]

#best_Ratio=Best Match
    best_ratio = max(maxima)
    best_row = np.argmax(maxima)

#Rearrange new order and save in Output File
    order.append(best_row)

df2 = df2.iloc[order].reset_index()

pd.concat([df1, df2], axis=1)

dfFinal=pd.concat([df1, df2], axis=1)

dfFinal.to_excel("OUTPUT.xlsx")  
#Thank you for the help!
Shawn Atlas
  • 25
  • 1
  • 7