I have been trying the last 2 weeks to solve this problem, and i am almost at the goal.
Case: Overall depiction of what i am trying
- I have 2 dataframes extracted from 2 different excel sheets for this example let us say 3x3 (DF1 and DF2)
- I want to match the cells from Column2 in DF1 with Column2 in DF2
- I need to match the cells one by one
Example: Let us say i have Cell X1 and i match it which each cell in Y(1,2,3) X1 match the most with Y3.
- I want to Extract the Row X1 is located in and the Row Y3 is located in and save them aligned next to each other in a single row potentially in a 3. excel sheet
UPDATED What i have:
This code is able to match with sequencematcher and print the matches, however i only get one output match instead of a list of maximum matches:
import pandas as pd
from difflib import SequenceMatcher
data1 = {'Fruit': ['Apple','Pear','mango','Pinapple'],
'nr1': [22000,25000,27000,35000],
'nr2': [1,2,3,4]}
data2 = {'Fruit': ['Apple','Pear','mango','Pinapple'],
'nr1': [22000,25000,27000,35000],
'nr2': [1,2,3,4]}
df1 = pd.DataFrame(data1, columns = ['Fruit', 'nr1', 'nr2'])
df2 = pd.DataFrame(data2, columns = ['nr1','Fruit', 'nr2'])
#Single out specefic columns to match
col1=(df1.iloc[:,[0]])
col2=(df2.iloc[:,[1]])
#function to match 2 values similarity
def similar(a,b):
ratio = SequenceMatcher(None, a, b).ratio()
matches = a, b
return ratio, matches
for i in col1:
print(max(similar(i,j) for j in col2))
Output: (1.0, ('Fruit', 'Fruit'))
How do i fix so that it will give me all the max matches and how do i extract the respective rows the matches are located in?