python pandas stop fillna at last non NaN value

Question

I have a DataFrame whose index is a series of increasing dates and whose columns are observations of variables. The data is sparse. My goal is to propagate a known value forward in time to fill NaN, but I want to stop at the last non-NaN value, as that last value signifies the "death" of the variable.

e.g. for the dataset

	a	b	c
2020-01-01	NaN	11	NaN
2020-02-01	1	NaN	NaN
2020-03-01	NaN	NaN	14
2020-04-01	2	NaN	NaN
2020-05-01	NaN	NaN	NaN
2020-06-01	NaN	NaN	15
2020-07-01	3	NaN	NaN
2020-08-01	NaN	NaN	NaN

I want to output

	a	b	c
2020-01-01	NaN	11	NaN
2020-02-01	1	NaN	NaN
2020-03-01	1	NaN	14
2020-04-01	2	NaN	14
2020-05-01	2	NaN	14
2020-06-01	2	NaN	15
2020-07-01	3	NaN	NaN
2020-08-01	NaN	NaN	NaN

I can identify the index of the last observation using df.notna()[::-1].idxmax(), but I can't figure out how to use this to limit the fillna function.

I'd be grateful for any suggestions. Many thanks

jezrael · Accepted Answer · 2021-01-25T14:17:48.217

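Use DataFrame.where to forward fill by mask. The mask comes from back filling and then testing for missing values: after a back fill, only the positions after each column's last non-NaN value are still NaN, so those are left untouched and every other position takes the forward-filled value: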

df = df.where(df.bfill().isna(), df.ffill())
print (df)
              a     b     c
2020-01-01  NaN  11.0   NaN
2020-02-01  1.0   NaN   NaN
2020-03-01  1.0   NaN  14.0
2020-04-01  2.0   NaN  14.0
2020-05-01  2.0   NaN  14.0
2020-06-01  2.0   NaN  15.0
2020-07-01  3.0   NaN   NaN
2020-08-01  NaN   NaN   NaN
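
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the question's sample data is rebuilt by hand with a monthly DatetimeIndex; the names idx, after_last and result are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt by hand (assumed monthly index).
idx = pd.date_range("2020-01-01", periods=8, freq="MS")
df = pd.DataFrame(
    {
        "a": [np.nan, 1, np.nan, 2, np.nan, np.nan, 3, np.nan],
        "b": [11, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
        "c": [np.nan, np.nan, 14, np.nan, np.nan, 15, np.nan, np.nan],
    },
    index=idx,
)

# After a back fill, a cell is still NaN only if no later observation
# exists in that column, i.e. it lies after the last non-NaN value.
after_last = df.bfill().isna()

# Keep the original (NaN) cells after the last observation;
# everywhere else take the forward-filled value.
result = df.where(after_last, df.ffill())
print(result)
```

Leading NaNs (before the first observation in a column) stay NaN as well, because ffill has nothing to carry forward there.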

Your approach can be used too: convert the Series of last-observation labels to a NumPy array and compare it with the index using broadcasting:

mask = df.notna()[::-1].idxmax().to_numpy() < df.index.to_numpy()[:, None]
df = df.where(mask, df.ffill())
print (df)
              a     b     c
2020-01-01  NaN  11.0   NaN
2020-02-01  1.0   NaN   NaN
2020-03-01  1.0   NaN  14.0
2020-04-01  2.0   NaN  14.0
2020-05-01  2.0   NaN  14.0
2020-06-01  2.0   NaN  15.0
2020-07-01  3.0   NaN   NaN
2020-08-01  NaN   NaN   NaN
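
To see what the broadcast comparison produces, here is a small sketch on a hypothetical 4-row, 2-column frame (not the question's data). It assumes the index is sorted in increasing order, as in the question, since the mask is built by comparing index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame, for illustration only.
df = pd.DataFrame(
    {"a": [1, np.nan, 2, np.nan], "b": [np.nan, 5, np.nan, np.nan]},
    index=pd.date_range("2020-01-01", periods=4, freq="MS"),
)

# Index label of each column's last non-NaN value.
last_obs = df.notna()[::-1].idxmax()

# (n_cols,) compared with (n_rows, 1) broadcasts to (n_rows, n_cols):
# True where the row's date is strictly after the column's last observation.
mask = last_obs.to_numpy() < df.index.to_numpy()[:, None]
print(mask)

# Keep original cells where mask is True, forward fill the rest.
result = df.where(mask, df.ffill())
print(result)
```

The mask plays the same role as df.bfill().isna() in the first snippet; it is just computed from the index labels instead of from a back fill.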

Thanks both @jezrael and @sophods. I've accepted this answer because, for my dataset (3500 x 2000), it is quicker than the `.apply` approach: 400ms vs 1500ms. - JohnnieL, Jan 25 '21 at 14:18

sophocles · Answer 2 · 2021-01-25T14:23:49.737

You can use Series.last_valid_index, which is designed for exactly this (it returns the index of the last non-NA/null value), to ffill only up to that point:

Assuming your dataset is called df:

df.apply(lambda x: x.loc[:x.last_valid_index()].ffill())

       index    a     b     c
0 2020-01-01  NaN 11.00   NaN
1 2020-02-01 1.00   NaN   NaN
2 2020-03-01 1.00   NaN 14.00
3 2020-04-01 2.00   NaN 14.00
4 2020-05-01 2.00   NaN 14.00
5 2020-06-01 2.00   NaN 15.00
6 2020-07-01 3.00   NaN   NaN
7 2020-08-01  NaN   NaN   NaN
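
If the dates are the actual index rather than a column (as in the question), a variant along these lines should work. This is a sketch on a hypothetical frame; the helper name ffill_until_last_obs and the trailing reindex are my additions. The reindex is there because each column comes back truncated at its own last observation, so rows that are NaN in every column at the end may otherwise be missing from the combined result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the dates as the index.
df = pd.DataFrame(
    {
        "a": [np.nan, 1, np.nan, 2, np.nan],
        "b": [11, np.nan, np.nan, np.nan, np.nan],
    },
    index=pd.date_range("2020-01-01", periods=5, freq="MS"),
)

def ffill_until_last_obs(col):
    # .loc slicing is label-based and includes the end label,
    # so the slice stops exactly at the last non-NaN value.
    return col.loc[: col.last_valid_index()].ffill()

# reindex restores any rows dropped because every column ended earlier.
result = df.apply(ffill_until_last_obs).reindex(df.index)
print(result)
```

On wide frames the per-column apply is likely to be slower than the mask-based approaches, which matches the timings reported in the comment on the accepted answer.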
