I have a pandas DataFrame containing several NaNs, like this one:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1,np.nan,np.nan, 2, 3, np.nan], [np.nan,np.nan,0,0,np.nan,2], [np.nan,4,np.nan,np.nan,5,np.nan]]).transpose())
print(df)
       0      1      2
0    1.0    NaN    NaN
1    NaN    NaN    4.0
2    NaN    0.0    NaN
3    2.0    0.0    NaN
4    3.0    NaN    5.0
5    NaN    2.0    NaN

Now, I would like to fill the NaN values subject to two restrictions:

  1. Only fill the NaNs surrounded by valid values (= don't replace leading or trailing NaNs)
  2. Use method "pad" (= ffill), i.e. replace each NaN with the preceding valid number in that column

Desired solution:

       0      1      2
0    1.0    NaN    NaN
1    1.0    NaN    4.0
2    1.0    0.0    4.0
3    2.0    0.0    4.0
4    3.0    0.0    5.0
5    NaN    2.0    NaN

I found pandas' interpolate function, which sounded promising, but I can only satisfy one of the two restrictions at a time.

When I use

df_padded = df.interpolate(method='pad')

the right values are used (the preceding number in the respective column), but the trailing NaNs at the end of columns 0 and 2 are also replaced, which I do not want.

       0      1      2
0    1.0    NaN    NaN
1    1.0    NaN    4.0
2    1.0    0.0    4.0
3    2.0    0.0    4.0
4    3.0    0.0    5.0
5    3.0    2.0    5.0

When I use

df_padded = df.interpolate(limit_area='inside')

it fills the right NaNs, but it interpolates linearly, which is also not what I want.

         0      1        2
0    1.000    NaN      NaN
1    1.333    NaN    4.000
2    1.667    0.0    4.333
3    2.000    0.0    4.667
4    3.000    1.0    5.000
5      NaN    2.0      NaN

So I thought that using both parameters in the function call would give me the right output:

df_padded = df.interpolate(method='pad', limit_area='inside')

but instead nothing happens at all.

       0      1      2
0    1.0    NaN    NaN
1    NaN    NaN    4.0
2    NaN    0.0    NaN
3    2.0    0.0    NaN
4    3.0    NaN    5.0
5    NaN    2.0    NaN

What am I doing wrong?

pa1ric6

2 Answers

Ideally the interpolate call should have worked; I believe this might be a bug. Nevertheless, here is an alternative approach you can use:

s1, s2 = df.ffill(), df.bfill()
result = s1.mask(s1.isna() | s2.isna())

Result

     0    1    2
0  1.0  NaN  NaN
1  1.0  NaN  4.0
2  1.0  0.0  4.0
3  2.0  0.0  4.0
4  3.0  0.0  5.0
5  NaN  2.0  NaN
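
To make the reasoning behind this approach explicit, here is a minimal self-contained sketch using the DataFrame from the question. The intermediate names `forward`, `backward` and `outside` are introduced here for illustration only. The idea is that a forward fill leaves NaN only above the first valid value of a column, and a backward fill leaves NaN only below the last one, so any cell that is NaN in either result lies outside the valid range and should stay empty.

```python
import numpy as np
import pandas as pd

# Example data from the question: three columns with leading,
# interior and trailing NaNs.
df = pd.DataFrame(
    np.array([
        [1, np.nan, np.nan, 2, 3, np.nan],
        [np.nan, np.nan, 0, 0, np.nan, 2],
        [np.nan, 4, np.nan, np.nan, 5, np.nan],
    ]).transpose()
)

forward = df.ffill()   # still NaN only where no valid value precedes
backward = df.bfill()  # still NaN only where no valid value follows

# A cell is a leading or trailing gap if either fill left it empty.
outside = forward.isna() | backward.isna()

# Keep the forward-filled values, but blank out everything outside.
result = forward.mask(outside)
print(result)
```

Because both fills run column-wise by default, each column gets its own leading and trailing region, which is what the desired output in the question requires.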
Shubham Sharma
  • @Shubham_Sharma thanks for your quick reply and for showing me an alternative solution. I might create a bug report on GitHub to resolve the issue. – pa1ric6 Mar 17 '23 at 07:43

You just need a little trick:

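# Linear interpolation restricted to the inside; used only to generate a mask
df_padded = df.interpolate(limit_area='inside')

# This is the actual fill with the previous valid value
# (df.ffill() replaces the deprecated df.fillna(method="ffill"))
df_fill = df.ffill()

# Wherever the inside-only interpolation left a NaN, restore the NaN
mask = df_padded.isna()
df_fill[mask] = np.nan
df_fill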

You get:

     0    1    2
0  1.0  NaN  NaN
1  1.0  NaN  4.0
2  1.0  0.0  4.0
3  2.0  0.0  4.0
4  3.0  0.0  5.0
5  NaN  2.0  NaN
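
If you would rather not run a linear interpolation just to obtain the mask, the same restriction can be sketched per column with `first_valid_index` and `last_valid_index`. This is only an illustrative variant, not part of the answer above: the helper name `ffill_inside` is made up here, and the label-based slice assumes a unique, sorted index such as the default one in the question.

```python
import numpy as np
import pandas as pd


def ffill_inside(col: pd.Series) -> pd.Series:
    """Forward-fill only between the first and last valid value of a column.

    Hypothetical helper for illustration; assumes a unique, sorted index.
    """
    first, last = col.first_valid_index(), col.last_valid_index()
    if first is None:
        # Column is entirely NaN: nothing to fill.
        return col.copy()
    out = col.copy()
    # .loc slicing is inclusive on both ends, so the last valid row is kept.
    out.loc[first:last] = col.loc[first:last].ffill()
    return out


df = pd.DataFrame(
    np.array([
        [1, np.nan, np.nan, 2, 3, np.nan],
        [np.nan, np.nan, 0, 0, np.nan, 2],
        [np.nan, 4, np.nan, np.nan, 5, np.nan],
    ]).transpose()
)

# apply() calls the helper once per column.
result_inside = df.apply(ffill_inside)
print(result_inside)
```

Rows before the first valid value and after the last valid value of each column are never touched, so leading and trailing NaNs survive by construction.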
HMH1013