I have 2 data frames, each with a common value in column C. I am using pd.merge to merge these data frames, but i dont want duplicate rows for each matching pair. This will be more clearly visualised below:
DF1
df = pd.DataFrame(columns=['A','B','C'])
df['A'] = ['2_21','2_34']
df['B'] = ['X1','X2']
df['C'] = ['2','2']
print(df)
A B C
0 2_21 X1 2
1 2_34 X2 2
DF2
df2 = pd.DataFrame(columns=['D','E','C'])
df2['D'] = ['2_27','2_35']
df2['E'] = ['Y1','Y2']
df2['C'] = ['2','2']
print(df)
D E C
0 2_27 Y1 2
1 2_35 Y2 2`
Unwanted output/dataframe:
df1.merge(df2, how='left', on='C')
A B C D E
0 2_21 X1 2 2_27 Y1
1 2_21 X1 2 2_35 Y2
2 2_34 X2 2 2_27 Y1
3 2_34 X2 2 2_35 Y2
Desired output/dataframe
A B C D E
0 2_21 X1 2 2_27 Y1
1 2_34 X2 2 2_35 Y2
I have tried filtering the unwanted output/data frame by removing duplicates which was answered herebut this doesnt work as i need to split the columns up into subcolumns and remove duplicates of the subcolumns, not individual cells (which didnt seem practical after attempting).
I also cant simply remove the unwanted rows in the unwanted dataframe as there are many randomly occurring rows that need manual removal.
The pandas docs on merge are here but it doesnt seem to allow this.
Does anyone know of a possible way to achieve this? I really appreciate your help in advance as ive been stuck on this for a while. Thank you