Store duplicated rows while comparing two dataframes in pandas

Question

Hello everyone (I am new to Python). I have two dataframes, df1 and df2. I want to find the rows that appear in both with the same (url, price, pourcent) and store them in a new dataframe. I also want to find the rows where the url appears in both but the price has changed, and store those in another new dataframe.

import pandas as pd

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'], ['www.sercos.com.tn/after/', '11.000', '5'], ['www.sercos.com.tn/new/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
              columns=['url', 'price', 'pourcent'])

df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'], ['www.sercos.com.tn/new/', '34.000', '10'], ['www.sercos.com.tn/before/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
              columns=['url', 'price', 'pourcent'])

DDaly · Answer 1 · 2021-03-05T22:43:15.467

Here's some code that might help get you started. This creates two sample dataframes, creates a new dataframe of matching urls then finally checks if the rows are exact matches or not.

import pandas as pd

#Sample df 1
df1 = pd.DataFrame({'url': ["urlone","urltwo","urlthree","urlfour"],
                   'price': [1, 2, 3, 4],
                   'percent': [0.5, 1, 3, 8]
                   })

#sample df 2
df2 = pd.DataFrame({'url': ["urlone","urlthree","urlfive","urlsix"],
                   'price': [1, 2, 3, 4],
                   'percent': [0.5, 1, 3, 8]
                   })


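##This tells you all of the matches between the two url columns and stores them in a variable called match
##pd.match(df2['url'], df1['url']) did this in older pandas; get_indexer gives the same result on current versions
##(get_indexer needs the urls in df1 to be unique)
match = pd.Index(df1['url']).get_indexer(df2['url'])

>>> print(match)
[ 0  2 -1 -1]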
##The index tells you where the matches are in df2
##The number tells you where the corresponding match is in df1
##A value of -1 means no match
##You can copy both over to df3

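##df3 for storing the duplicated rows
df3 = pd.DataFrame(columns=df1.columns)

#Iterate through match and collect the matching rows from df1 and df2
#(DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list and concatenated)
rows = []
for n, i in enumerate(match):
    if i >= 0: # negative numbers are not matches
        rows.append(df1.iloc[[i]])
        rows.append(df2.iloc[[n]])

if rows: # pd.concat raises an error on an empty list
    df3 = pd.concat(rows)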


#df3.duplicated will then tell you if the rows are exactly the same or not. 
df3.duplicated()
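
To illustrate what duplicated() reports on a frame like df3, here is a minimal sketch on a small hand-built frame. The name stacked and its values are made up for illustration (they mirror the sample data above), and splitting with keep=False is just one possible way to separate exact matches from changed rows.

```python
import pandas as pd

# Hypothetical stacked frame shaped like df3: each matching url contributes
# one row from df1 followed by one row from df2.
stacked = pd.DataFrame({
    'url': ["urlone", "urlone", "urlthree", "urlthree"],
    'price': [1, 1, 3, 2],
    'percent': [0.5, 0.5, 3.0, 1.0],
})

# Default keep='first': only the later copy of an identical row is flagged.
flag_later_copies = stacked.duplicated()

# keep=False flags every row that has an identical twin.
flag_all_copies = stacked.duplicated(keep=False)

exact_matches = stacked[flag_all_copies]   # same url, price and percent
changed_rows = stacked[~flag_all_copies]   # same url, but a value differs

print(exact_matches)
print(changed_rows)
```

Here both urlone rows end up in exact_matches, while the two urlthree rows (whose price and percent differ) end up in changed_rows.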

P.S. It's useful if you include the code in the text so other people can run it easily :)

Updated variation using your dataframes and using set instead of pd.match


import pandas as pd

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'], ['www.sercos.com.tn/after/', '11.000', '5'], ['www.sercos.com.tn/new/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
              columns=['url', 'price', 'pourcent'])

df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'], ['www.sercos.com.tn/new/', '34.000', '10'], ['www.sercos.com.tn/before/', '34.000', '0'], ['www.sercos.com.tn/now/', '14.750', '11']],
              columns=['url', 'price', 'pourcent'])


##This tells you all of the matches between the two columns and stores it in a variable called match_set
match_set = set(df2['url']).intersection(df1['url'])

print(match_set)
#List of urls that match

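##df3 for storing the rows whose url appears in both dataframes
#Iterate through match_set and collect the matching rows from df1 and df2
#(DataFrame.append was removed in pandas 2.0, so the pieces are gathered in a list and concatenated)
frames = []
for item in match_set:
    frames.append(df1.loc[df1['url'] == item])
    frames.append(df2.loc[df2['url'] == item])

df3 = pd.concat(frames) if frames else pd.DataFrame(columns=df1.columns)

#df3.duplicated will then tell you if the rows are exactly the same or not.
print(df3)
print(df3.duplicated())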
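
If the goal is just the two result dataframes described in the question, a merge may be a shorter route. This is only a sketch: it assumes each url appears at most once per dataframe and that comparing the prices as the strings they are stored as is good enough. The names same_rows, both and price_changed are made up here.

```python
import pandas as pd

df1 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '23.450', '12'],
                    ['www.sercos.com.tn/after/', '11.000', '5'],
                    ['www.sercos.com.tn/new/', '34.000', '0'],
                    ['www.sercos.com.tn/now/', '14.750', '11']],
                   columns=['url', 'price', 'pourcent'])

df2 = pd.DataFrame([['www.sercos.com.tn/corps-bains/', '13.890', '18'],
                    ['www.sercos.com.tn/new/', '34.000', '10'],
                    ['www.sercos.com.tn/before/', '34.000', '0'],
                    ['www.sercos.com.tn/now/', '14.750', '11']],
                   columns=['url', 'price', 'pourcent'])

# Rows present in both dataframes with the same url, price and pourcent.
same_rows = df1.merge(df2, on=['url', 'price', 'pourcent'], how='inner')

# Rows whose url is in both dataframes, with both versions side by side.
both = df1.merge(df2, on='url', how='inner', suffixes=('_df1', '_df2'))

# Of those, keep only the urls whose price differs between df1 and df2.
price_changed = both[both['price_df1'] != both['price_df2']]

print(same_rows)
print(price_changed)
```

With the dataframes from the question, same_rows contains only www.sercos.com.tn/now/ and price_changed contains only www.sercos.com.tn/corps-bains/. The www.sercos.com.tn/new/ row has the same price but a different pourcent, so it lands in neither result.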

OK, I've updated the question with the code. I get the error: module 'pandas' has no attribute 'match'. Maybe a version problem? — Eya Mila, Mar 05 '21 at 22:25
I'm running in a Jupyter environment using Python 3.8.5.final.0 and pandas 1.1.3 — Eya Mila, Mar 05 '21 at 22:35
Yes, it looks like pd.match has been removed from recent pandas versions. I've added a second piece of code that works without pd.match. — DDaly, Mar 05 '21 at 22:44
