Replace missing values based on value of a specific column in Python

Question

I would like to replace missing values based on the values of the column Submitted.

Find below what I have:

Year	Country	Submitted	Age12	Age14
2018	CHI	1	267	NaN
2019	CHI	NaN	NaN	NaN
2020	CHI	1	244	203
2018	ALB	1	163	165
2019	ALB	1	NaN	NaN
2020	ALB	1	161	NaN
2018	GER	1	451	381
2019	GER	NaN	NaN	NaN
2020	GER	1	361	321

An this is what I would like to have:

Year	Country	Submitted	Age12	Age14
2018	CHI	1	267	NaN
2019	CHI	NaN	267	NaN
2020	CHI	1	244	203
2018	ALB	1	163	165
2019	ALB	1	NaN	NaN
2020	ALB	1	161	NaN
2018	GER	1	451	381
2019	GER	NaN	451	381
2020	GER	1	361	321

I tried using the command df.fillna(axis=0, method='ffill') But this replace all values NaN by the previous, but this is not what I want because some values should be kept as NaN if the "Submitted" column value is 1.

I would like to change the values by the previous row only if the respective "Submitted" value is "NaN".

Thank you

Hi @MarceloRiberio Do you actually want the column `Submitted` be with NaN values filled up or not ? Your expected output shows it should NOT be changed. However, the accepted answer will change it as well. In case you really need to change this column as well, please update your question. If not, better select another solution. — SeaBean, Sep 05 '21 at 07:29
Hi @SeaBean thanks for your comment. Keeping `Submitted` either NaN or 1 works for me. The solutions posted here worked in this way — CelloRibeiro, Sep 05 '21 at 09:45

score 1 · Accepted Answer · answered Sep 04 '21 at 12:57

1

Try using where together with what you did:

 df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill'))

This will replace the entries only when Submitted is null.

answered Sep 04 '21 at 12:57

Ami Tavory

74,578
11
141
185

1

Thank @Ami Tavory this solution is very elegant and almost what I need. However, it is taking the forward instead of the previous value. Thus, 2019 "Submitted" NaN cells are being replaced by 2020 values of the respective countries instead of 2018 values. Do you know how to fix this in your piece of code? – CelloRibeiro Sep 04 '21 at 16:11
@MarceloRibeiro change `method=bfill`? – Ami Tavory Sep 04 '21 at 16:35
`ffill` worked, but I found another issue with the solution. It is actually retrieving the last available data. I posted just a sample of my dataset (which contains 1070 columns and more years and countries). But, in the example above, your solution is filling NaN with the previous value only if the latter is non NaN, otherwise it takes the previous non NaN available value. E.g., in the table above country CHI Year 2019 becomes "Submitted"=1, "Age12"=267 and "Age14"=value of Year 2017 or before (the last cell with a number different than NaN) and I would like to have NaN instead. – CelloRibeiro Sep 04 '21 at 17:45

score 0 · Answer 2 · answered Sep 04 '21 at 12:58

0

You can do a conditional ffill() using np.where

import numpy as np
(
    df.assign(Age12=np.where(df.Submitted.isna(), df.Age12.ffill(), df.Age12))
    .assign(Age14=np.where(df.Submitted.isna(), df.Age14.ffill(), df.Age14))
)

answered Sep 04 '21 at 12:58

Allen Qin

19,507
8
51
67

Thank you for the answer, I however have several columns (I just posted a sample of my dataset here) and the answer from Ami Tavory suited well – CelloRibeiro Sep 04 '21 at 16:13

score 0 · Answer 3 · answered Sep 04 '21 at 13:06

You can use .filter() to select the related columns and put the columns in the list cols. Then, use .mask() to change the values of the selected columns by forward fill using ffill() when Submitted is NaN, as follows:

cols = df.filter(like='Age').columns

df[cols] = df[cols].mask(df['Submitted'].isna(), df[cols].ffill())

Result:

print(df)

   Year Country  Submitted  Age12  Age14
0  2018     CHI        1.0  267.0    NaN
1  2019     CHI        NaN  267.0    NaN
2  2020     CHI        1.0  244.0  203.0
3  2018     ALB        1.0  163.0  165.0
4  2019     ALB        1.0    NaN    NaN
5  2020     ALB        1.0  161.0    NaN
6  2018     GER        1.0  451.0  381.0
7  2019     GER        NaN  451.0  381.0
8  2020     GER        1.0  361.0  321.0

score 0 · Answer 4 · answered Sep 04 '21 at 13:26

I just used a for loop to check and update the values in the dataframe

import pandas as pd
new_data = [[2018,'CHI',1,267,30], [2019,'CHI','NaN','NaN','NaN'], [2020,'CHI',1,244,203]]
df = pd.DataFrame(new_data, columns = ['Year','Country','Submitted','Age12','Age14'])
prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']
for index, row in df.iterrows():
    if(row['Submitted']=='NaN'):
        df.at[index,'Age12']=prevValue12
        df.at[index,'Age14']=prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']
print(df)

output

Year Country Submitted Age12 Age14
0  2018     CHI         1   267    30
1  2019     CHI       NaN   267    30
2  2020     CHI         1   244   203

Thank you for the answer, I however have several columns (I just post a sample of my dataset here) and the answer from Ami Tavory suited well — CelloRibeiro, Sep 04 '21 at 16:13

Replace missing values based on value of a specific column in Python

4 Answers4

Linked