1

I would like to replace missing values based on the values of the column Submitted.

Find below what I have:

Year Country Submitted Age12 Age14
2018 CHI 1 267 NaN
2019 CHI NaN NaN NaN
2020 CHI 1 244 203
2018 ALB 1 163 165
2019 ALB 1 NaN NaN
2020 ALB 1 161 NaN
2018 GER 1 451 381
2019 GER NaN NaN NaN
2020 GER 1 361 321

An this is what I would like to have:

Year Country Submitted Age12 Age14
2018 CHI 1 267 NaN
2019 CHI NaN 267 NaN
2020 CHI 1 244 203
2018 ALB 1 163 165
2019 ALB 1 NaN NaN
2020 ALB 1 161 NaN
2018 GER 1 451 381
2019 GER NaN 451 381
2020 GER 1 361 321

I tried using the command df.fillna(axis=0, method='ffill') But this replace all values NaN by the previous, but this is not what I want because some values should be kept as NaN if the "Submitted" column value is 1.

I would like to change the values by the previous row only if the respective "Submitted" value is "NaN".

Thank you

CelloRibeiro
  • 160
  • 11
  • Hi @MarceloRiberio Do you actually want the column `Submitted` be with NaN values filled up or not ? Your expected output shows it should NOT be changed. However, the accepted answer will change it as well. In case you really need to change this column as well, please update your question. If not, better select another solution. – SeaBean Sep 05 '21 at 07:29
  • 1
    Hi @SeaBean thanks for your comment. Keeping `Submitted` either NaN or 1 works for me. The solutions posted here worked in this way – CelloRibeiro Sep 05 '21 at 09:45

4 Answers4

1

Try using where together with what you did:

 df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill'))

This will replace the entries only when Submitted is null.

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • 1
    Thank @Ami Tavory this solution is very elegant and almost what I need. However, it is taking the forward instead of the previous value. Thus, 2019 "Submitted" NaN cells are being replaced by 2020 values of the respective countries instead of 2018 values. Do you know how to fix this in your piece of code? – CelloRibeiro Sep 04 '21 at 16:11
  • @MarceloRibeiro change `method=bfill`? – Ami Tavory Sep 04 '21 at 16:35
  • `ffill` worked, but I found another issue with the solution. It is actually retrieving the last available data. I posted just a sample of my dataset (which contains 1070 columns and more years and countries). But, in the example above, your solution is filling NaN with the previous value only if the latter is non NaN, otherwise it takes the previous non NaN available value. E.g., in the table above country CHI Year 2019 becomes "Submitted"=1, "Age12"=267 and "Age14"=value of Year 2017 or before (the last cell with a number different than NaN) and I would like to have NaN instead. – CelloRibeiro Sep 04 '21 at 17:45
0

You can do a conditional ffill() using np.where

import numpy as np
(
    df.assign(Age12=np.where(df.Submitted.isna(), df.Age12.ffill(), df.Age12))
    .assign(Age14=np.where(df.Submitted.isna(), df.Age14.ffill(), df.Age14))
)
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
  • Thank you for the answer, I however have several columns (I just posted a sample of my dataset here) and the answer from Ami Tavory suited well – CelloRibeiro Sep 04 '21 at 16:13
0

You can use .filter() to select the related columns and put the columns in the list cols. Then, use .mask() to change the values of the selected columns by forward fill using ffill() when Submitted is NaN, as follows:

cols = df.filter(like='Age').columns

df[cols] = df[cols].mask(df['Submitted'].isna(), df[cols].ffill())

Result:

print(df)

   Year Country  Submitted  Age12  Age14
0  2018     CHI        1.0  267.0    NaN
1  2019     CHI        NaN  267.0    NaN
2  2020     CHI        1.0  244.0  203.0
3  2018     ALB        1.0  163.0  165.0
4  2019     ALB        1.0    NaN    NaN
5  2020     ALB        1.0  161.0    NaN
6  2018     GER        1.0  451.0  381.0
7  2019     GER        NaN  451.0  381.0
8  2020     GER        1.0  361.0  321.0
SeaBean
  • 22,547
  • 3
  • 13
  • 25
0

I just used a for loop to check and update the values in the dataframe

import pandas as pd
new_data = [[2018,'CHI',1,267,30], [2019,'CHI','NaN','NaN','NaN'], [2020,'CHI',1,244,203]]
df = pd.DataFrame(new_data, columns = ['Year','Country','Submitted','Age12','Age14'])
prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']
for index, row in df.iterrows():
    if(row['Submitted']=='NaN'):
        df.at[index,'Age12']=prevValue12
        df.at[index,'Age14']=prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']
print(df)

output

Year Country Submitted Age12 Age14
0  2018     CHI         1   267    30
1  2019     CHI       NaN   267    30
2  2020     CHI         1   244   203
Shreyas Prakash
  • 604
  • 4
  • 11
  • Thank you for the answer, I however have several columns (I just post a sample of my dataset here) and the answer from Ami Tavory suited well – CelloRibeiro Sep 04 '21 at 16:13