I have 2 Df with an identical structure to be merged with a condition. I need to have all the id raw but not duplicated and when there is a same id I need to choice the one from the dataframe with val==1 or just the row from the df1 if df1.val == df2.val
df1
id val X1 X2 X3
0 az1 1 10 12 16
1 az2 0 15 17 19
3 az3 0 8 9 17
4 az5 1 3 2 67
5 az7 0 31 2 68
df2
id val X1 X2 X3
0 az1 1 12 10 15
1 az2 1 13 15 18
3 az4 1 10 9 47
5 az6 0 33 12 68
6 az7 0 1 26 90
I want a new df3 like this:
df3
id val X1 X2 X3
0 az1 1 10 12 16
1 az2 1 13 15 18
3 az3 0 8 9 17
4 az4 1 10 9 47
5 az5 1 3 2 67
6 az6 0 33 12 68
7 az7 0 31 2 68
It contains:
- all the df1 rows
- all the df2 rows where df2.val==1 AND where df2.id is not present in df1
- all the df2 rows where (df2.id == df1.id) AND df2.val==1 AND df1.val==0 (overwrite the df1 raw)
- all the df2 rows with df2.val == 0 AND where df2.id is not present in df1
With the command:
Unique=pd.concat([df1, df2]).drop_duplicates(subset ="id", keep = False, inplace = False).sort_values('id')
Unique
id val X1 X2 X3
0 az3 0 8 9 17
1 az4 1 10 9 47
2 az5 1 3 2 67
3 az6 0 33 12 68
Then:
Temp = pd.concat([df1, Unique]).drop_duplicates().sort_values('id')
Temp
id val X1 X2 X3
0 az1 1 10 12 16
1 az3 0 8 9 17
2 az4 1 10 9 47
3 az5 1 3 2 67
4 az6 0 33 12 68
5 az7 0 31 2 68
Finaly:
df2_1 =df2[df2["val"]==1]
df3 = pd.concat([df2_1, Temp]).drop_duplicates(subset ="id", keep = "first", inplace = False).sort_values('id')
df3
id val X1 X2 X3
0 az1 1 10 12 16
1 az2 1 13 15 18
3 az3 0 8 9 17
4 az4 1 10 9 47
5 az5 1 3 2 67
6 az6 0 33 12 68
7 az7 0 31 2 68
Any idea to do the job in more fast and elegant way?