0

I have been trying the last 2 weeks to solve this problem, and i am almost at the goal.

Case: Overall depiction of what i am trying

  • I have 2 dataframes extracted from 2 different excel sheets for this example let us say 3x3 (DF1 and DF2)
  • I want to match the cells from Column2 in DF1 with Column2 in DF2
  • I need to match the cells one by one

Example: Let us say i have Cell X1 and i match it which each cell in Y(1,2,3) X1 match the most with Y3.

  • I want to Extract the Row X1 is located in and the Row Y3 is located in and save them aligned next to each other in a single row potentially in a 3. excel sheet

UPDATED What i have:

This code is able to match with sequencematcher and print the matches, however i only get one output match instead of a list of maximum matches:

import pandas as pd
from difflib import SequenceMatcher

data1 = {'Fruit': ['Apple','Pear','mango','Pinapple'],
        'nr1': [22000,25000,27000,35000],
        'nr2': [1,2,3,4]}

data2 = {'Fruit': ['Apple','Pear','mango','Pinapple'],
        'nr1': [22000,25000,27000,35000],
        'nr2': [1,2,3,4]}

df1 = pd.DataFrame(data1, columns = ['Fruit', 'nr1', 'nr2'])
df2 = pd.DataFrame(data2, columns = ['nr1','Fruit', 'nr2'])

#Single out specefic columns to match
col1=(df1.iloc[:,[0]])
col2=(df2.iloc[:,[1]])

#function to match 2 values similarity
def similar(a,b):
    ratio = SequenceMatcher(None, a, b).ratio()
    matches = a, b
    return ratio, matches

for i in col1:
    print(max(similar(i,j) for j in col2))

Output: (1.0, ('Fruit', 'Fruit'))

How do i fix so that it will give me all the max matches and how do i extract the respective rows the matches are located in?

Shawn Atlas
  • 25
  • 1
  • 7
  • If you can do this for a 1x3 dataframe, what stops you from just taking the respective columns from the two dataframes as you inputs? This way, you only have one-dimensional data and your function works. To get the index, just save the whole array. Then you can get use idxmax – Not_a_programmer Jun 03 '20 at 10:08
  • aah i just added your suggestion, but i have a bit of trouble still: How do i fix it so that it will give me all the best matches and how do i extract the respective rows the matches are located in? – Shawn Atlas Jun 03 '20 at 10:36

1 Answers1

0

This should work:

import pandas as pd
import numpy as np
from difflib import SequenceMatcher


def similar(a, b):
    ratio = SequenceMatcher(None, a, b).ratio()
    return ratio


data1 = {'Fruit': ['Apple', 'Pear', 'mango', 'Pinapple'],
         'nr1': [22000, 25000, 27000, 35000],
         'nr2': [1, 2, 3, 4]}

data2 = {'Fruit': ['Apple', 'mango', 'peer', 'Pinapple'],
         'nr1': [22000, 25000, 27000, 35000],
         'nr2': [1, 2, 3, 4]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

order = []
for index, row in df1.iterrows():
    maxima = [similar(row['Fruit'], j) for j in df2['Fruit']]
    best_ratio = max(maxima)
    best_row = np.argmax(maxima)

    order.append(best_row)

df2 = df2.iloc[order].reset_index()

pd.concat([df1, df2], axis=1)
  • hmm i don't know if i am doing something wrong but i wanted my code to give me following output after it matches example: apple it would take: Apple,22000 ,1,Apple,22000,1 Because if somebody would have spelled apple wrong it would still locate the biggest similarity and align them together in one line – Shawn Atlas Jun 03 '20 at 11:04
  • I edited my answer. Is this what you were looking for? – Not_a_programmer Jun 03 '20 at 11:19
  • i tried using this code with 2 dataframes with above 8000 rows and i get: 'float' object is not iterable – Shawn Atlas Jun 05 '20 at 14:30
  • In which line does the error occur? Are you sure your dataframes have the correct structure? This sounds like something went wrong with importing the data – Not_a_programmer Jun 06 '20 at 17:40
  • i tried printing the dataframe and it worked. it appears in the library not in the code: File "C:XX\Anaconda3\lib\difflib.py", line 311, in __chain_b for i, elt in enumerate(b): TypeError: 'float' object is not iterable – Shawn Atlas Jun 07 '20 at 09:44
  • So, the error is in `ratio = SequenceMatcher(None, a, b).ratio()`. `SequenceMatcher` expects `a` and `b` to be sequences (e.g., strings, lists). Now your error tells you that either `a` or `b` is a float instead. Therefore, the problem is in your data. Your `Fruit` column contains numbers instead of strings. If you want to treat these numbers as strings, you can change the type first, e.g. by using `df1['Fruit'] = df1['Fruit'].astype(str)`. See the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html) for more info. – Not_a_programmer Jun 08 '20 at 09:01