
I want to fill missing values in my pandas Series, but only where there are fewer than 3 consecutive NaNs.

Original series with missing values:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(20))
s[[1, 3, 5, 7, 12, 13, 14, 15, 18]] = np.nan

Gives:

0     0.444025
1          NaN
2     0.631753
3          NaN
4    -0.577121
5          NaN
6     1.299953
7          NaN
8    -0.252173
9     0.287641
10    0.941953
11   -1.624728
12         NaN
13         NaN
14         NaN
15         NaN
16    0.998952
17    0.195698
18         NaN
19   -0.788995

BUT, Series.fillna() with a limit only fills that number of values in total across the whole series, not runs of up to that many CONSECUTIVE NaNs, as I expected:

s.fillna(value=0, limit=3)  # fills positions 1, 3 and 5 only; 7 and 18 stay NaN

The desired output would fill the NaNs at positions 1, 3, 5, 7, and 18 with 0, and leave the run of 4 NaNs at positions 12-15 in place.

The documentation and other posts on SO (e.g. here) have not resolved this issue. The documentation seems to imply that limit applies to runs of consecutive NaNs, not to the total number of values filled in the entire series. Thanks!
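
For reference (based on my reading of the pandas docs, so worth verifying against your version): limit means different things depending on whether you fill with a method or with a value.

s.ffill(limit=3)            # method fill: limit caps consecutive fills per gap, so it
                            # fills 12, 13, 14 from position 11 and leaves 15 as NaN
s.fillna(value=0, limit=3)  # value fill: limit caps the TOTAL number of NaNs filled,
                            # so only the first three (positions 1, 3, 5) become 0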

EHB
  • Thank you for the solutions. I am just surprised there is not a simpler way to do this! – EHB Mar 21 '18 at 01:37
  • Would it be simpler to use "shift"? i.e., first store the location of all long NAN gaps with `long_nan_gaps= s.index[s.shift(1).isnull() & s.shift(-1).isnull() & s.isnull()]`, then fill ALL NANs with 0, and then restore the saved locations to NAN after the fact? I appreciate the many solutions; just wondering if one you have proposed below is better than what I had figured out here for myself (that I thought was too ugly/ confusing). – EHB Mar 21 '18 at 01:40
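
Regarding the "shift" idea in the last comment, here is a minimal sketch of it (mine, not taken from the answers below; it assumes a recent pandas for shift's fill_value). Note that the both-neighbours test alone only flags the interior of a run, so the endpoints of a long gap would still get filled; labelling whole runs avoids that:

isna = s.isnull()
# a new gap starts at each NaN not preceded by a NaN; cumsum labels the gaps
gap_id = (isna & ~isna.shift(1, fill_value=False)).cumsum()
gap_len = isna.groupby(gap_id).transform('sum')   # gap length, broadcast back

long_gaps = s.index[isna & gap_len.gt(3)]         # store the long-gap locations
out = s.fillna(0)                                 # fill ALL NaNs with 0 ...
out[long_gaps] = np.nan                           # ... then restore the long gaps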

4 Answers


We start by finding the non-null values via pd.Series.notna.

Taking the cumulative sum of that mask increments a counter at every non-null value, generating convenient group ids: each run of contiguous NaNs shares the id of the non-null value just before it.

Every group therefore begins with a non-null value (except possibly the first, if the series starts with NaN). So I take the negation of mask and sum the number of null values within each group.

Now I fillna and use pd.Series.where to mask the spots where a run of NaNs was too long.

mask = s.notna()
# NaNs per run, broadcast back to every row of the run's group
c_na = (~mask).groupby(mask.cumsum()).transform('sum')
# fill everything with 0, then knock out the groups whose run is too long
filled = s.fillna(0).where(c_na.le(3))
# take values from `filled` only where s was NaN, keeping originals elsewhere
s.fillna(filled)

0     1.418895
1     0.000000
2    -0.553732
3     0.000000
4    -0.101532
5     0.000000
6    -1.334803
7     0.000000
8     1.159115
9     0.309093
10   -0.047970
11    0.051567
12         NaN
13         NaN
14         NaN
15         NaN
16    0.623673
17   -0.786857
18    0.000000
19    0.310688
dtype: float64

Here is a fancy NumPy/pandas way using np.bincount and pd.factorize

v = s.values
m = np.isnan(v)                      # boolean mask of the NaNs
# the group id increments at every non-NaN; factorize relabels the ids 0..n-1
f, u = pd.factorize((~m).cumsum())
# np.bincount(f, weights=m) counts the NaNs in each group
filled = np.where(
    ~m, v,
    np.where(np.bincount(f, weights=m)[f] <= 3, 0, np.nan)
)

pd.Series(filled, s.index)

0     1.418895
1     0.000000
2    -0.553732
3     0.000000
4    -0.101532
5     0.000000
6    -1.334803
7     0.000000
8     1.159115
9     0.309093
10   -0.047970
11    0.051567
12         NaN
13         NaN
14         NaN
15         NaN
16    0.623673
17   -0.786857
18    0.000000
19    0.310688
dtype: float64
piRSquared
  • You can save one `fillna` with `masked = s.groupby(m.cumsum()).transform('size').gt(3); s.fillna(0).mask(masked)` – cs95 Mar 21 '18 at 00:29
  • I think it should be `s.fillna(0).mask(sumna.ge(3))`, right? – BENY Mar 21 '18 at 01:13
  • Could you elaborate a little on what the code is doing? I see it works but don't understand exactly what's going on in the `groupby` – EHB Mar 21 '18 at 01:36
  • Just got to a computer. I'll elaborate in a few minutes – piRSquared Mar 21 '18 at 01:37
  • @piRSquared, no worries, I'm about to leave mine, will look tomorrow. Cheers! – EHB Mar 21 '18 at 01:43
  • @cᴏʟᴅsᴘᴇᴇᴅ `size` will capture the first non-null plus subsequent nulls and will typically be one more than the number of nulls. Mind you, we are after the number of nulls per group. `'count'` would do better, or the inverse rather. Working on this now (-: – piRSquared Mar 21 '18 at 01:52
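
Pulling the comment thread together (my own consolidation, so treat it as a sketch): counting with 'sum' over the inverted mask avoids the off-by-one that 'size' introduces, and masking only the positions that were NaN to begin with keeps each run's leading value intact:

m = s.notna()
# NaNs per run, broadcast to every member of the run's group
long_run = s.isnull().groupby(m.cumsum()).transform('sum').gt(3)
s.fillna(0).mask(s.isnull() & long_run)   # fill short gaps, keep long gaps as NaN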

First, build a cumulative count of the non-null values; all NaNs in a consecutive run then share the same count.

df = s.to_frame('value').assign(na_ct=s.notna().cumsum())   # run id per row

Then we can group by the na_ct column, check the number of rows in each group, and decide whether to fill the NaNs or not.

# each group is one non-null value plus its trailing NaNs, so len(x) > 4 means
# more than 3 consecutive NaNs; group_keys=False keeps the original flat index
df.groupby(df.na_ct, group_keys=False).apply(lambda x: x if len(x)>4 else x.fillna(0)).value
Out[76]: 
0     0.195634
1     0.000000
2    -0.818349
3     0.000000
4    -2.347686
5     0.000000
6    -0.464040
7     0.000000
8     0.179321
9     0.356661
10    0.471832
11   -1.217082
12         NaN
13         NaN
14         NaN
15         NaN
16   -0.112744
17   -2.630191
18    0.000000
19   -0.313592
Name: value, dtype: float64
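
If you prefer to avoid groupby.apply (which runs a Python function per group), the same length test can be phrased with transform. A sketch reusing the na_ct column (my variation, not part of the original answer):

# 'size' counts the leading non-null plus its NaNs, hence le(4) for <= 3 NaNs
grp_len = df.groupby('na_ct')['value'].transform('size')
df['value'].mask(df['value'].isnull() & grp_len.le(4), 0)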
Allen Qin

Maybe try this?

t = s[s.isnull()]                                  # positions of the NaNs
# index jumps greater than 1 start a new gap; cumsum labels each run
v = pd.Series(t.index, index=t.index).diff().ne(1).cumsum()
# keep only the labels of gaps longer than 3
z = v[v.isin(v.value_counts()[v.value_counts().gt(3)].index.values)]
s.fillna(0).mask(s.index.isin(z.index))            # fill all, then restore long gaps
Out[348]: 
0    -0.781728
1     0.000000
2    -1.114552
3     0.000000
4     1.242452
5     0.000000
6     0.599486
7     0.000000
8     0.757384
9    -1.559661
10    0.527451
11   -0.426890
12         NaN
13         NaN
14         NaN
15         NaN
16   -1.264962
17    0.703790
18    0.000000
19    0.953616
dtype: float64
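
To see why the diff().ne(1).cumsum() line labels the gaps, here is the trace for the example's NaN positions (deterministic, since the NaN locations are fixed):

# t.index (NaN positions): [1, 3, 5, 7, 12, 13, 14, 15, 18]
# .diff():                 [NaN, 2, 2, 2, 5, 1, 1, 1, 3]
# .ne(1):                  [True, True, True, True, True, False, False, False, True]
# .cumsum() (gap labels):  [1, 2, 3, 4, 5, 5, 5, 5, 6]
# gap 5 has four members (positions 12-15), so it alone is masked back to NaN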
BENY

You can try it with the rolling operator in the following fashion:

1) Create a function that returns 0 for the centre of a window only when the window holds fewer than X missing values

def fillnaiflessthan(series, count):
    # fill the centre of the window only when it holds fewer than `count` NaNs
    centre = series.iloc[len(series) // 2]
    if pd.isnull(centre) and series.isnull().sum() < count:
        return 0
    return centre

2) Then apply it inside rolling (raw=False so the function receives a Series)

s.rolling(window=5, center=True, min_periods=1).apply(
    lambda x: fillnaiflessthan(x, 4), raw=False
)

Note that the window counts all the NaNs it can see rather than consecutive ones, and the centre shortcut is off for the clipped windows at the edges of the series, so this only approximates the desired output.
Julien Perrenoud