I have the following dataframe:
Out[117]: mydata
author email ri oi
0 X1 NaN NaN 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab@ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2@ma.com NaN 0000-0001-8437-498X
4 X5 ab@ma.com NaN 0000-0001-8437-498X
where column ri
represents an author's ResearcherID, and oi
the ORCID. One author may has more than one email address, so column email
has duplicates.
Firstly, I'm trying to fill na in ri
if the corresponding rows in oi
share the same value, using a non-NaN value in ri
. The result I want is:
author email ri oi
0 X1 NaN K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab@ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab2@ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab@ma.com K-5448-2012 0000-0001-8437-498X
Secondly, merging emails and using the merged value to fill na in column email
, if the values in ri
(or oi
) are identical. I want to get a dataframe like the following one:
author email ri oi
0 X1 ab@ma.com;ab2@ma.com K-5448-2012 0000-0001-8437-498X
1 X2 NaN NaN NaN
2 X3 ab@ma.com;ab2@ma.com K-5448-2012 0000-0001-8437-498X
3 X4 ab@ma.com;ab2@ma.com K-5448-2012 0000-0001-8437-498X
4 X5 ab@ma.com;ab2@ma.com K-5448-2012 0000-0001-8437-498X
I've tried the following code:
final_df = pd.DataFrame()
na_df = mydata[mydata.oi.isna()]
for i in set(mydata.oi.dropna()):
fill_df = mydata[mydata.oi == i]
fill_df.ri = fill_df.ri.fillna(method='ffill')
fill_df.ri = fill_df.ri.fillna(method='bfill')
null_df = pd.concat([null_df, fill_df])
final_df = pd.concat([final_df, na_df])
This code returned the one I want in the the frist step, but is there an elegent way to approach this? Furthermore, how to get the merged value in email
and then use the merged value as an input in the process of filling na
?