We have two dataframes.

Dataframe 1:

[image: dataframe 1]

Dataframe 2:

[image: dataframe 2]

For each value in the combined column of the second dataframe, we need to find the matching value in the first dataframe and add the corresponding id column from the first dataframe.

The expected output looks like this:

[image: expected output]

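!pip install fuzzywuzzy
import pandas as pd
from fuzzywuzzy import fuzz
data = pd.read_csv('dataframe1.csv')  # placeholder path for dataframe 1
df = pd.read_csv('dataframe2.csv')  # placeholder path for dataframe 2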

word = data['data'].tolist()
find = df['combined'].tolist()
df_final = pd.DataFrame(columns=['combined','id'])
for j in find:
    j = str(j)
    for i in word:
        if i:
            i = str(i)
            Token_Sort_Ratio = fuzz.token_sort_ratio(j,i)
            if Token_Sort_Ratio > 70:
                #print(i)
                final = data[data.data == i]
                df1 = df[df.combined == j]
                df_final['id']=df_final['id'].append(final['id'],ignore_index=True)
                df_final['combined']= df_final['combined'].append(df1['combined'],ignore_index=True)

But the data is not appended to df_final. Kindly help me with this. After that, we plan to join df_final and dataframe 2 on the combined column.

Please feel free to suggest any other solution apart from this.

Amol

1 Answer

import pandas as pd 
from fuzzywuzzy import fuzz

df1 = pd.DataFrame([['12','gandhi vidhalaya 225'],['45','balvidhya mandir a 456'],['65','jspm 4568'],[45,'coep 7896']], columns= ['id','data'])

df2 = pd.DataFrame([['june','gandhi vidhalaya dc 225'],['july','balvidhya mandir a 456'],['march','jspm d 4568'],['jan','coep 7896']], columns= ['month','combined']) 


# Compare every row of df1 with every row of df2; keep pairs scoring above 70
data = []
for i in range(df1.shape[0]):
    for j in range(df2.shape[0]):
        token_ratio = fuzz.ratio(df1['data'][i], df2['combined'][j])
        if token_ratio > 70:
            column_B = df2.iloc[j]['combined']
            column_A = df2.iloc[j]['month']
            data.append((column_B, column_A))

# Side-by-side concat: this relies on each df1 row having exactly one match, found in df1 row order
df_final = pd.concat([df1, pd.DataFrame(data, columns=['combined_text', 'month'])], axis=1)

# Drop the data column to get the desired result
df_final = df_final.drop(columns=['data'])

[image: resulting dataframe]

qaiser
  • Guys, if you have any other solution apart from fuzzywuzzy, please feel free to suggest – Amol Jan 16 '20 at 12:48
  • If there are null values in either dataset, the above code does not work, so remove the null values before implementing this – Amol Jan 16 '20 at 12:50
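
Following up on the two comments above, here is a rough sketch of one option that does not use fuzzywuzzy and drops null values first. It relies on difflib from the Python standard library, which scores similarity differently from fuzzywuzzy, so the 0.7 cutoff is only an assumed starting point. The sample rows with None and the helper name closest are made up for illustration.

```python
import difflib
import pandas as pd

# Illustrative data with one missing value in each frame
df1 = pd.DataFrame({'id': ['12', '65', '99'],
                    'data': ['gandhi vidhalaya 225', 'jspm 4568', None]})
df2 = pd.DataFrame({'month': ['june', 'march', 'may'],
                    'combined': ['gandhi vidhalaya dc 225', 'jspm d 4568', None]})

# Remove rows whose text is missing before matching
left = df1.dropna(subset=['data'])
right = df2.dropna(subset=['combined'])

choices = left['data'].tolist()

def closest(text, cutoff=0.7):
    """Return the most similar entry of choices, or None if none reaches cutoff."""
    found = difflib.get_close_matches(text, choices, n=1, cutoff=cutoff)
    return found[0] if found else None

# Attach the best-matching df1 text to each df2 row, then pull in the id
matched = right.assign(data=right['combined'].map(closest))
result = matched.merge(left, on='data', how='left')
print(result[['month', 'combined', 'id']])
```

Rows with null text are simply excluded here; if they need to appear in the output, they could be joined back to df2 afterwards.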