Missing values NaN replace conditional Python

Question

I have a dataset containing the values of several indicators of a series of countries and years (3072,1134 rows and columns), but some NaN.

Find below a sample of the dataset:

Year	Country	Submitted	Age12	Age14	...
2017	CHI	1	267	136	...
2018	CHI	1	267	NaN	...
2019	CHI	NaN	NaN	NaN	...
2020	CHI	1	244	203	...
2017	ALB	1	160	148	...
2018	ALB	1	163	165	...
2019	ALB	1	NaN	NaN	...
2020	ALB	1	161	NaN	...
2017	GER	1	NaN	320	...
2018	GER	1	451	381	...
2019	GER	NaN	NaN	NaN	...
2020	GER	1	361	321
2017	BRI	1	NaN	NaN	...
2018	BRI	1	NaN	NaN	...
2019	BRI	NaN	NaN	NaN	...
2020	BRI	1	322	298

I would like to replace missing values based on the values of the column Submitted.

I would like to change the values by the previous row only if the respective Submitted value is "NaN".

And this is what I would like to have:

Year	Country	Submitted	Age12	Age14	...
2017	CHI	1	267	136	...
2018	CHI	1	267	NaN	...
2019	CHI	1	267	NaN	...
2020	CHI	1	244	203	...
2017	ALB	1	160	148	...
2018	ALB	1	163	165	...
2019	ALB	1	NaN	NaN	...
2020	ALB	1	161	NaN	...
2017	GER	1	NaN	320	...
2018	GER	1	451	381	...
2019	GER	1	451	381	...
2020	GER	1	361	321	...
2017	BRI	1	NaN	NaN	...
2018	BRI	1	NaN	NaN	...
2019	BRI	1	NaN	NaN	...
2020	BRI	1	322	298

With some help from the community, I tried using

df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill'))

and

df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill',limit=2))

and

df = df.where(~df.Sumbitted.isnull(), interpolate(method="pad", limit=2))

However, using these options I get the following instead. The issue is that the code retrieves the previous non-NaN value, but in the cases that the previous value is a NaN the code retrieves the value before the previous until its finds a non-NaN value.

Output:

Year	Country	Submitted	Age12	Age14	...
2017	CHI	1	267	136	...
2018	CHI	1	267	NaN	...
2019	CHI	1	267	136	...
2020	CHI	1	244	203	...
2017	ALB	1	160	148	...
2018	ALB	1	163	165	...
2019	ALB	1	NaN	NaN	...
2020	ALB	1	161	NaN	...
2017	GER	1	NaN	320	...
2018	GER	1	451	381	...
2019	GER	1	451	381	...
2020	GER	1	361	321
2017	BRI	1	NaN	NaN	...
2018	BRI	1	NaN	NaN	...
2019	BRI	1	361	321	...
2020	BRI	1	322	298

See the wrong outputs of CHI 2019 Age14 and BRI Age12 and Age14

Would you know how to solve this issue by fixing this code? Solutions using a loop or selecting these 2 columns (Age14 and Age12) are not optimal as I have several columns ("..." in the example) and need a systematic solution.

Thank you

Please explain what criteria you want to filter your data on? do you want to replace null values with other values or delete them? — ezzeddin, Sep 04 '21 at 20:19
@ezzeddin OP wants to `change the values by the previous row only if the respective "Submitted" value is "NaN".` See his [last post](https://stackoverflow.com/questions/69055417/replace-missing-values-based-on-value-of-a-specific-column-in-python) for more details. The only difference from the last post is that column `Submitted` also got `NaN` values filled up by previous one row. That maybe why OP selected that answer (though the answer has different output than the expected output by filling also the `Submitted` column while the expected output was to keep that column unchanged). — SeaBean, Sep 04 '21 at 20:40
Thanks for accepting my answer. Please consider also upvoting my answer (see [How to upvote on Stack Overflow?](https://meta.stackexchange.com/questions/173399/how-can-i-upvote-answers-and-comments)). — SeaBean, Sep 05 '21 at 12:18

SeaBean · Accepted Answer · 2021-09-05T08:09:27.093

You can use the parameter limit=1 to limit filling only from the previous one row value and not from further up:

df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill',limit=1))

Or better still, simplify your codes, as follows:

Use .mask() instead of .where so that your code don't need extra step to negate the boolean mask. Also, use ffill() which is a simplified form of .fillna(axis=0, method='ffill'):

df = df.mask(df.Submitted.isna(), df.ffill(limit=1))

Result:

print(df)

    Year Country  Submitted  Age12  Age14
0   2017     CHI        1.0  267.0  136.0
1   2018     CHI        1.0  267.0    NaN
2   2019     CHI        1.0  267.0    NaN
3   2020     CHI        1.0  244.0  203.0
4   2017     ALB        1.0  160.0  148.0
5   2018     ALB        1.0  163.0  165.0
6   2019     ALB        1.0    NaN    NaN
7   2020     ALB        1.0  161.0    NaN
8   2017     GER        1.0    NaN  320.0
9   2018     GER        1.0  451.0  381.0
10  2019     GER        1.0  451.0  381.0
11  2020     GER        1.0  361.0  321.0
12  2017     BRI        1.0    NaN    NaN
13  2018     BRI        1.0    NaN    NaN
14  2019     BRI        1.0    NaN    NaN
15  2020     BRI        1.0  322.0  298.0

Missing values NaN replace conditional Python

1 Answers1