I would like to fill NaN values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive NaNs within a column, I want them to be filled by the preceding non-NaN value, but only if the length of the NaN sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer NaNs will be filled with the preceding non-NaN value, whereas a sequence of 4 or more NaNs will be left as is.

That is, if the input DataFrame is

    2   5   4
    nan nan nan
    nan nan nan
    5   nan nan
    9   3   nan
    7   9   1

I want the output to be:

    2   5   4
    2   5   nan
    2   5   nan
    5   5   nan
    9   3   nan
    7   9   1

The fillna function, when applied to a DataFrame, has the method and limit options, but these are unfortunately not sufficient to achieve the task. I tried specifying method='ffill' and limit=3, but that fills in the first 3 NaNs of any sequence, not selectively as described above.
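
For concreteness, here is a sketch of what that gives on the example frame above (df.ffill(limit=3) is equivalent to fillna(method='ffill', limit=3)):

    df.ffill(limit=3)
    #      0    1    2
    # 0  2.0  5.0  4.0
    # 1  2.0  5.0  4.0
    # 2  2.0  5.0  4.0
    # 3  5.0  5.0  4.0
    # 4  9.0  3.0  NaN   <- the first 3 NaNs of the length-4 run got filled
    # 5  7.0  9.0  1.0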

I suppose this could be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?

splinter

1 Answer


Working with contiguous groups is still a little awkward in pandas... or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)

One way to get what you want would be to use the compare-cumsum-groupby pattern:

In [68]: nulls = df.isnull()
    ...: groups = (nulls != nulls.shift()).cumsum()
    ...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
    ...: df.where(~to_fill, df.ffill())
    ...: 
Out[68]: 
     0    1    2
0  2.0  5.0  4.0
1  2.0  5.0  NaN
2  2.0  5.0  NaN
3  5.0  5.0  NaN
4  9.0  3.0  NaN
5  7.0  9.0  1.0
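
To unpack that for the third column (the run of four NaNs), here is a sketch of the intermediates (s is just shorthand for that column):

    s = df[2]                                    # 4, nan, nan, nan, nan, 1
    nulls = s.isnull()                           # F, T, T, T, T, F
    # Each switch between null and non-null starts a new group label:
    groups = (nulls != nulls.shift()).cumsum()   # 1, 2, 2, 2, 2, 3
    # Group 2 (the NaN run) has size 4 > 3, so its rows are excluded:
    to_fill = groups.groupby(groups).transform(len) <= 3
    # -> True, False, False, False, False, True

Note that to_fill is also True for short runs of non-NaN values; that is harmless, since df.ffill() leaves non-NaN cells unchanged anyway.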

Okay, another alternative which I don't like because it's too tricky:

    import numpy as np

    def method_2(df):
        nulls = df.isnull()
        # A limited ffill leaves NaNs only in the tails of runs longer than 3.
        filled = df.ffill(limit=3)
        unfilled = nulls & filled.isnull()
        # Mark non-null cells 2.0 and null cells NaN, then let the long-run
        # tails (True == 1) backfill over the rest of their run.
        nf = nulls.replace({False: 2.0, True: np.nan})
        do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
        return df.where(do_not_fill, df.ffill())

This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
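
For instance, a minimal sketch of that shift-based idea (the helper name fill_short_runs and the rolling-window trick are mine, so treat it as an illustration rather than tested code): flag every NaN that is covered by a window of limit + 1 consecutive NaNs, and fill the rest.

    def fill_short_runs(df, limit=3):
        nulls = df.isnull()
        # True at row i if rows i-limit..i are all NaN, i.e. row i closes
        # a window of limit + 1 consecutive NaNs.
        window = nulls.astype(int).rolling(limit + 1).sum().eq(limit + 1)
        # A NaN belongs to a run longer than limit iff some such window
        # covers it; shift the flags back over the window's members.
        is_long = window.copy()
        for k in range(1, limit + 1):
            is_long |= window.shift(-k, fill_value=False)
        to_fill = nulls & ~is_long
        return df.where(~to_fill, df.ffill())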

DSM
  • Thank you @DSM. That is a very nice solution, which indeed gives us what we are looking for. Just one comment: it is quite slow. I used it with a limit of 3 on a DataFrame of size 530x11500, and it took around 32 seconds. So while this solution is great, an alternative that lowers the run time would be highly appreciated – splinter Nov 05 '16 at 21:09
  • Much faster indeed! Wall time: 9.01 s – splinter Nov 05 '16 at 23:23