
Suppose I have two DataFrames like below:

import pandas as pd

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "c": [6,7,8,9]
}
df1 = pd.DataFrame(data_dic)

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "d": [10,10,12,13]
}
df2 = pd.DataFrame(data_dic)

Result:

df1

   a  b  c
0  0  3  6
1  0  3  7
2  1  4  8
3  2  5  9

df2

   a  b   d
0  0  3  10
1  0  3  10
2  1  4  12
3  2  5  13

where both DataFrames share the same key values in columns 'a' and 'b'.

Q: How can I merge them without creating duplicates? [Merging the DataFrames and then deleting duplicates afterwards is unsatisfactory.]

I tried the code below, but it creates duplicate rows and significantly increases the file size.

df = pd.merge(df1, df2, on=['a', 'b'] , how='left', validate='many_to_many')
df = pd.merge(df1, df2, on=['a', 'b'] , how='inner')

Expected result:

   a  b  c   d
0  0  3  6  10
1  0  3  7  10
2  1  4  8  12
3  2  5  9  13

Thank you all for your hard work.

Malik Asad
Zaraki Kenpachi

3 Answers


Use GroupBy.cumcount to create a counter column in both DataFrames, then merge on that added column as well:

df1['g'] = df1.groupby(['a','b']).cumcount()
df2['g'] = df2.groupby(['a','b']).cumcount()

df = pd.merge(df1, df2, on=['a', 'b', 'g'] , how='inner')
print (df)
   a  b  c  g   d
0  0  3  6  0  10
1  0  3  7  1  10
2  1  4  8  0  12
3  2  5  9  0  13
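If the helper column g is not wanted in the final result, it can be dropped after the merge. A minimal sketch using the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
df2 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "d": [10, 10, 12, 13]})

# The counter column distinguishes repeated (a, b) pairs: 0 for the
# first occurrence of a pair, 1 for the second, and so on
df1['g'] = df1.groupby(['a', 'b']).cumcount()
df2['g'] = df2.groupby(['a', 'b']).cumcount()

# Merge on the keys plus the counter, then drop the helper column
df = pd.merge(df1, df2, on=['a', 'b', 'g'], how='inner').drop(columns='g')
print(df)
```

This yields the expected four-row result with columns a, b, c, d only.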

The difference from the other solutions is best seen with changed data in the second DataFrame (the second 10 changed to 11): this approach correctly merges the first duplicate (a, b) pair from df1 with the first (a, b) pair from df2, and likewise for all other duplicates as well as for unique pairs:

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "d": [10,11,12,13]
}
df2 = pd.DataFrame(data_dic)


df1['g'] = df1.groupby(['a','b']).cumcount()
df2['g'] = df2.groupby(['a','b']).cumcount()

df = pd.merge(df1, df2, on=['a', 'b', 'g'] , how='inner')
print (df)

   a  b  c  g   d
0  0  3  6  0  10
1  0  3  7  1  11
2  1  4  8  0  12
3  2  5  9  0  13
jezrael
  • Thank you again for the nice solution. For now I don't know why, but this approach creates an additional 5% of data in my raw data compared to the data created by removing duplicates first. – Zaraki Kenpachi Aug 08 '19 at 07:16
  • @ZarakiKenpachi - I tried to explain the difference between my solution and all the others: it also merges by duplicates, first with first, second with second. It is easiest to see if you test the output of `df = pd.merge( df1.drop_duplicates(), df2.drop_duplicates(), on=['a', 'b'], how='inner' )` against my changed data. – jezrael Aug 08 '19 at 07:18
  • I think this feature will be useful. I'm still testing the results. – Zaraki Kenpachi Aug 08 '19 at 07:33
  • @ZarakiKenpachi - Maybe I forgot one very important thing: my solution merges duplicates by the `a, b` column pairs, while `kitman0804`'s solution removes duplicates by all columns. – jezrael Aug 08 '19 at 07:35
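The comparison jezrael suggests in the comments can be run directly. A sketch using the modified data (second d value changed to 11), showing why dropping duplicates before the merge is not equivalent:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
# Modified data: the duplicated (a, b) pair now carries distinct d values
df2 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "d": [10, 11, 12, 13]})

# drop_duplicates removes nothing here (the rows differ in c and d),
# so the merge multiplies the duplicated (0, 3) pair: 2 x 2 = 4 rows
dedup = pd.merge(df1.drop_duplicates(), df2.drop_duplicates(),
                 on=['a', 'b'], how='inner')
print(len(dedup))  # 6

# The counter column keeps first-with-first, second-with-second pairing
df1['g'] = df1.groupby(['a', 'b']).cumcount()
df2['g'] = df2.groupby(['a', 'b']).cumcount()
counted = pd.merge(df1, df2, on=['a', 'b', 'g'], how='inner')
print(len(counted))  # 4
```

Only the counter-column merge preserves the row-for-row pairing (d = 10, 11, 12, 13).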

You can remove the duplicated rows before merging:

df = pd.merge(
    df1.drop_duplicates(), 
    df2.drop_duplicates(), 
    on=['a', 'b'], how='inner'
)
print(df)

#    a  b  c   d
# 0  0  3  6  10
# 1  0  3  7  10
# 2  1  4  8  12
# 3  2  5  9  13
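Note that by default drop_duplicates compares all columns, which is what makes this work on the question's data (only df2's fully identical row is removed). Passing subset=['a', 'b'] would instead keep just the first row per key pair and lose data. A sketch illustrating the difference:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
df2 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "d": [10, 10, 12, 13]})

# Default full-row comparison: df1 keeps all 4 rows (its c values differ),
# df2 drops its exact duplicate row (0, 3, 10)
df = pd.merge(df1.drop_duplicates(), df2.drop_duplicates(),
              on=['a', 'b'], how='inner')
print(df)

# subset=['a', 'b'] keeps only the first row per key pair instead
print(df1.drop_duplicates(subset=['a', 'b']))  # the row with c=7 is gone
```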
pe-perry

You could also drop duplicates after the merge:

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "c": [6,7,8,9]
}
df1 = pd.DataFrame(data_dic)

data_dic = {
    "a": [0,0,1,2],
    "b": [3,3,4,5],
    "d": [10,10,12,13]
}
df2 = pd.DataFrame(data_dic)


df3 = pd.merge(df1, df2, how='inner', on=['a', 'b']).drop_duplicates()

df3:

   a  b  c   d
0  0  3  6  10
2  0  3  7  10
4  1  4  8  12
5  2  5  9  13
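The gapped index (0, 2, 4, 5) in df3 is left over from the rows the merge duplicated and drop_duplicates then removed. Chaining reset_index(drop=True) restores a clean 0..n-1 index, as in this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "c": [6, 7, 8, 9]})
df2 = pd.DataFrame({"a": [0, 0, 1, 2], "b": [3, 3, 4, 5], "d": [10, 10, 12, 13]})

# The inner merge multiplies the duplicated (0, 3) rows, drop_duplicates
# removes the copies, and reset_index discards the gapped index left behind
df3 = (pd.merge(df1, df2, how='inner', on=['a', 'b'])
         .drop_duplicates()
         .reset_index(drop=True))
print(df3)
```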
Sundeep Pidugu