0

I have a dataset containing the values of several indicators of a series of countries and years (3072,1134 rows and columns), but some NaN.

Find below a sample of the dataset:

Year Country Submitted Age12 Age14 ...
2017 CHI 1 267 136 ...
2018 CHI 1 267 NaN ...
2019 CHI NaN NaN NaN ...
2020 CHI 1 244 203 ...
2017 ALB 1 160 148 ...
2018 ALB 1 163 165 ...
2019 ALB 1 NaN NaN ...
2020 ALB 1 161 NaN ...
2017 GER 1 NaN 320 ...
2018 GER 1 451 381 ...
2019 GER NaN NaN NaN ...
2020 GER 1 361 321
2017 BRI 1 NaN NaN ...
2018 BRI 1 NaN NaN ...
2019 BRI NaN NaN NaN ...
2020 BRI 1 322 298

I would like to replace missing values based on the values of the column Submitted.

I would like to change the values by the previous row only if the respective Submitted value is "NaN".

And this is what I would like to have:

Year Country Submitted Age12 Age14 ...
2017 CHI 1 267 136 ...
2018 CHI 1 267 NaN ...
2019 CHI 1 267 NaN ...
2020 CHI 1 244 203 ...
2017 ALB 1 160 148 ...
2018 ALB 1 163 165 ...
2019 ALB 1 NaN NaN ...
2020 ALB 1 161 NaN ...
2017 GER 1 NaN 320 ...
2018 GER 1 451 381 ...
2019 GER 1 451 381 ...
2020 GER 1 361 321 ...
2017 BRI 1 NaN NaN ...
2018 BRI 1 NaN NaN ...
2019 BRI 1 NaN NaN ...
2020 BRI 1 322 298

With some help from the community, I tried using

df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill'))

and

df = df.where(~df.Sumbitted.isnull(), df.fillna(axis=0, method='ffill',limit=2))

and

df = df.where(~df.Sumbitted.isnull(), interpolate(method="pad", limit=2))

However, using these options I get the following instead. The issue is that the code retrieves the previous non-NaN value, but in the cases that the previous value is a NaN the code retrieves the value before the previous until its finds a non-NaN value.

Output:

Year Country Submitted Age12 Age14 ...
2017 CHI 1 267 136 ...
2018 CHI 1 267 NaN ...
2019 CHI 1 267 136 ...
2020 CHI 1 244 203 ...
2017 ALB 1 160 148 ...
2018 ALB 1 163 165 ...
2019 ALB 1 NaN NaN ...
2020 ALB 1 161 NaN ...
2017 GER 1 NaN 320 ...
2018 GER 1 451 381 ...
2019 GER 1 451 381 ...
2020 GER 1 361 321
2017 BRI 1 NaN NaN ...
2018 BRI 1 NaN NaN ...
2019 BRI 1 361 321 ...
2020 BRI 1 322 298

See the wrong outputs of CHI 2019 Age14 and BRI Age12 and Age14

Would you know how to solve this issue by fixing this code? Solutions using a loop or selecting these 2 columns (Age14 and Age12) are not optimal as I have several columns ("..." in the example) and need a systematic solution.

Thank you

SeaBean
  • 22,547
  • 3
  • 13
  • 25
CelloRibeiro
  • 160
  • 11
  • Please explain what criteria you want to filter your data on? do you want to replace null values with other values or delete them? – ezzeddin Sep 04 '21 at 20:19
  • 1
    @ezzeddin OP wants to `change the values by the previous row only if the respective "Submitted" value is "NaN".` See his [last post](https://stackoverflow.com/questions/69055417/replace-missing-values-based-on-value-of-a-specific-column-in-python) for more details. The only difference from the last post is that column `Submitted` also got `NaN` values filled up by previous one row. That maybe why OP selected that answer (though the answer has different output than the expected output by filling also the `Submitted` column while the expected output was to keep that column unchanged). – SeaBean Sep 04 '21 at 20:40
  • Thanks for accepting my answer. Please consider also upvoting my answer (see [How to upvote on Stack Overflow?](https://meta.stackexchange.com/questions/173399/how-can-i-upvote-answers-and-comments)). – SeaBean Sep 05 '21 at 12:18

1 Answers1

1

You can use the parameter limit=1 to limit filling only from the previous one row value and not from further up:

df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill',limit=1))

Or better still, simplify your codes, as follows:

Use .mask() instead of .where so that your code don't need extra step to negate the boolean mask. Also, use ffill() which is a simplified form of .fillna(axis=0, method='ffill'):

df = df.mask(df.Submitted.isna(), df.ffill(limit=1))

Result:

print(df)

    Year Country  Submitted  Age12  Age14
0   2017     CHI        1.0  267.0  136.0
1   2018     CHI        1.0  267.0    NaN
2   2019     CHI        1.0  267.0    NaN
3   2020     CHI        1.0  244.0  203.0
4   2017     ALB        1.0  160.0  148.0
5   2018     ALB        1.0  163.0  165.0
6   2019     ALB        1.0    NaN    NaN
7   2020     ALB        1.0  161.0    NaN
8   2017     GER        1.0    NaN  320.0
9   2018     GER        1.0  451.0  381.0
10  2019     GER        1.0  451.0  381.0
11  2020     GER        1.0  361.0  321.0
12  2017     BRI        1.0    NaN    NaN
13  2018     BRI        1.0    NaN    NaN
14  2019     BRI        1.0    NaN    NaN
15  2020     BRI        1.0  322.0  298.0
SeaBean
  • 22,547
  • 3
  • 13
  • 25