I have a dataset containing the values of several indicators of a series of countries and years (3072,1134 rows and columns), but some NaN.
Find below a sample of the dataset:
Year | Country | Submitted | Age12 | Age14 | ... |
---|---|---|---|---|---|
2017 | CHI | 1 | 267 | 136 | ... |
2018 | CHI | 1 | 267 | NaN | ... |
2019 | CHI | NaN | NaN | NaN | ... |
2020 | CHI | 1 | 244 | 203 | ... |
2017 | ALB | 1 | 160 | 148 | ... |
2018 | ALB | 1 | 163 | 165 | ... |
2019 | ALB | 1 | NaN | NaN | ... |
2020 | ALB | 1 | 161 | NaN | ... |
2017 | GER | 1 | NaN | 320 | ... |
2018 | GER | 1 | 451 | 381 | ... |
2019 | GER | NaN | NaN | NaN | ... |
2020 | GER | 1 | 361 | 321 | |
2017 | BRI | 1 | NaN | NaN | ... |
2018 | BRI | 1 | NaN | NaN | ... |
2019 | BRI | NaN | NaN | NaN | ... |
2020 | BRI | 1 | 322 | 298 |
I would like to replace missing values based on the values of the column Submitted
.
I would like to change the values by the previous row only if the respective Submitted
value is "NaN".
And this is what I would like to have:
Year | Country | Submitted | Age12 | Age14 | ... |
---|---|---|---|---|---|
2017 | CHI | 1 | 267 | 136 | ... |
2018 | CHI | 1 | 267 | NaN | ... |
2019 | CHI | 1 | 267 | NaN | ... |
2020 | CHI | 1 | 244 | 203 | ... |
2017 | ALB | 1 | 160 | 148 | ... |
2018 | ALB | 1 | 163 | 165 | ... |
2019 | ALB | 1 | NaN | NaN | ... |
2020 | ALB | 1 | 161 | NaN | ... |
2017 | GER | 1 | NaN | 320 | ... |
2018 | GER | 1 | 451 | 381 | ... |
2019 | GER | 1 | 451 | 381 | ... |
2020 | GER | 1 | 361 | 321 | ... |
2017 | BRI | 1 | NaN | NaN | ... |
2018 | BRI | 1 | NaN | NaN | ... |
2019 | BRI | 1 | NaN | NaN | ... |
2020 | BRI | 1 | 322 | 298 |
With some help from the community, I tried using
df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill'))
and
df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill',limit=2))
and
df = df.where(~df.Sumbitted.isnull(), interpolate(method="pad", limit=2))
However, using these options I get the following instead. The issue is that the code retrieves the previous non-NaN value, but in the cases that the previous value is a NaN the code retrieves the value before the previous until its finds a non-NaN value.
Output:
Year | Country | Submitted | Age12 | Age14 | ... |
---|---|---|---|---|---|
2017 | CHI | 1 | 267 | 136 | ... |
2018 | CHI | 1 | 267 | NaN | ... |
2019 | CHI | 1 | 267 | 136 | ... |
2020 | CHI | 1 | 244 | 203 | ... |
2017 | ALB | 1 | 160 | 148 | ... |
2018 | ALB | 1 | 163 | 165 | ... |
2019 | ALB | 1 | NaN | NaN | ... |
2020 | ALB | 1 | 161 | NaN | ... |
2017 | GER | 1 | NaN | 320 | ... |
2018 | GER | 1 | 451 | 381 | ... |
2019 | GER | 1 | 451 | 381 | ... |
2020 | GER | 1 | 361 | 321 | |
2017 | BRI | 1 | NaN | NaN | ... |
2018 | BRI | 1 | NaN | NaN | ... |
2019 | BRI | 1 | 361 | 321 | ... |
2020 | BRI | 1 | 322 | 298 |
See the wrong outputs of CHI 2019 Age14 and BRI Age12 and Age14
Would you know how to solve this issue by fixing this code? Solutions using a loop or selecting these 2 columns (Age14 and Age12) are not optimal as I have several columns ("..." in the example) and need a systematic solution.
Thank you