1

I have the following dataframe:

Out[117]: mydata
  author           email           ri                   oi
0      X1             NaN          NaN  0000-0001-8437-498X
1      X2             NaN          NaN                  NaN
2      X3       ab@ma.com  K-5448-2012  0000-0001-8437-498X
3      X4      ab2@ma.com          NaN  0000-0001-8437-498X
4      X5       ab@ma.com          NaN  0000-0001-8437-498X

where column ri represents an author's ResearcherID, and oi the ORCID. One author may has more than one email address, so column email has duplicates.

Firstly, I'm trying to fill na in ri if the corresponding rows in oi share the same value, using a non-NaN value in ri. The result I want is:

  author           email           ri                   oi
0      X1             NaN  K-5448-2012  0000-0001-8437-498X
1      X2             NaN          NaN                  NaN
2      X3       ab@ma.com  K-5448-2012  0000-0001-8437-498X
3      X4      ab2@ma.com  K-5448-2012  0000-0001-8437-498X
4      X5       ab@ma.com  K-5448-2012  0000-0001-8437-498X

Secondly, merging emails and using the merged value to fill na in column email, if the values in ri (or oi) are identical. I want to get a dataframe like the following one:

  author                  email           ri                       oi
0      X1   ab@ma.com;ab2@ma.com  K-5448-2012           0000-0001-8437-498X
1      X2                    NaN          NaN                           NaN
2      X3   ab@ma.com;ab2@ma.com  K-5448-2012           0000-0001-8437-498X
3      X4   ab@ma.com;ab2@ma.com  K-5448-2012           0000-0001-8437-498X
4      X5   ab@ma.com;ab2@ma.com  K-5448-2012           0000-0001-8437-498X

I've tried the following code:

final_df = pd.DataFrame()
na_df = mydata[mydata.oi.isna()]
for i in set(mydata.oi.dropna()):
    fill_df = mydata[mydata.oi == i]
    fill_df.ri = fill_df.ri.fillna(method='ffill')
    fill_df.ri = fill_df.ri.fillna(method='bfill')
    null_df = pd.concat([null_df, fill_df])
final_df = pd.concat([final_df, na_df])

This code returned the one I want in the the frist step, but is there an elegent way to approach this? Furthermore, how to get the merged value in email and then use the merged value as an input in the process of filling na?

bryantz
  • 23
  • 4
  • 1
    Well written question. However, I think you have more than one problem statement you are trying to address in this question. It would have been best if you had two separate ones. It will be difficult to close the question if only a partial answer is provided. Anyways, let me look into it. – Joe Ferndz Aug 29 '20 at 04:00
  • yeah, I agree with @Ferndz. Actually, this question is under the context of [Author Name Disambiguation](https://en.wikipedia.org/wiki/Author_name_disambiguation). I just tried to provide more details related to this background. – bryantz Aug 29 '20 at 07:35

1 Answers1

2

Try 2 transform. One for each column. On ri, use first. On email, use combination of dropna, unique, and join

g = df.dropna(subset=['oi']).groupby('oi')
df['ri']    = g.ri.transform('first')
df['email'] = g.email.transform(lambda x: ';'.join(x.dropna().unique()))

Out[79]:
  author                 email           ri                   oi
0     X1  ab@ma.com;ab2@ma.com  K-5448-2012  0000-0001-8437-498X
1     X2                   NaN          NaN                  NaN
2     X3  ab@ma.com;ab2@ma.com  K-5448-2012  0000-0001-8437-498X
3     X4  ab@ma.com;ab2@ma.com  K-5448-2012  0000-0001-8437-498X
4     X5  ab@ma.com;ab2@ma.com  K-5448-2012  0000-0001-8437-498X
Andy L.
  • 24,909
  • 4
  • 17
  • 29
  • The `first()` function returns the first value of each group, consider the situation if the first value is `NaN`, then it will re-write all values with `NaN`. However, the demand here is filling NaN with a non-NaN value in each group (if exists). – bryantz Aug 31 '20 at 08:19
  • 1
    @bryantz: `groupby.first` returns the first `non-NaN` value. So, as long as there is at least one `non-NaN` value in the group, it will return the first `non-NaN`. You may change your sample to verify it. – Andy L. Aug 31 '20 at 10:33