
I want to forward fill a column with a limit, but I want the limit to be based on the index, not a simple number of rows like `limit` allows.

For example, say I have the dataframe given by:

df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
})

which looks like

In [27]: df
Out[27]:
   data  group
0   0.0      0
1   1.0      0
2   NaN      0
3   3.0      1
4   NaN      1
5   5.0      0
6   NaN      0
7   NaN      0
8   NaN      1
9   NaN      1

If I group by the group column and forward fill in that group with limit=2, then my resulting data frame will be

In [35]: df.groupby('group').ffill(limit=2)
Out[35]:
   group  data
0      0   0.0
1      0   1.0
2      0   1.0
3      1   3.0
4      1   3.0
5      0   5.0
6      0   5.0
7      0   5.0
8      1   3.0
9      1   NaN

What I actually want to do here, however, is only forward fill onto rows whose indexes are within, say, 2 of the last non-NaN index in each group, as opposed to the next 2 rows of each group. For example, if we just look at the groups of the dataframe:

In [36]: for i, group in df.groupby('group'):
    ...:     print(group)
    ...:
   data  group
0   0.0      0
1   1.0      0
2   NaN      0
5   5.0      0
6   NaN      0
7   NaN      0
   data  group
3   3.0      1
4   NaN      1
8   NaN      1
9   NaN      1

I would want the second group here to be forward filled only to index 4, not to indexes 8 and 9. The first group's NaN values are all within 2 indexes of the last non-NaN values, so they would be filled completely. The resulting dataframe would look like:

   group  data
0      0   0.0
1      0   1.0
2      0   1.0
3      1   3.0
4      1   3.0
5      0   5.0
6      0   5.0
7      0   5.0
8      1   NaN
9      1   NaN

FWIW, in my actual use case my index is a sorted DatetimeIndex.

I currently have a solution that sort of works: it loops through the dataframe filtered on the group indexes, creates a time range for every event with a non-NaN value based on the index, and then combines those ranges. But this is far too slow to be practical.
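For concreteness, a DatetimeIndex version of the example frame above might look like the following (the dates here are invented purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical DatetimeIndex analogue of the example frame above;
# the dates are made up for illustration.
idx = pd.date_range('2019-01-01', periods=10, freq='D')
df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
}, index=idx)

# Goal: within each group, forward fill only onto rows whose timestamp
# is within 2 days of the last non-NaN row of that group.
```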

alkasm
  • It would be helpful to revise your example to use a DateTimeIndex, and to devise values which actually exercise all the conditions you want satisfied by a valid answer. – unutbu Jan 25 '19 at 03:01
  • @unutbu indeed, it's just difficult to mock up properly. But I agree such an example would be more beneficial to future readers. I was trying to be more general, but this just adds confusion for limited (if any) benefit. – alkasm Jan 25 '19 at 03:02

2 Answers

import numpy as np
import pandas as pd
df = pd.DataFrame({
    'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
    'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})

df = df.reset_index()
df['stop_index'] = df['index'] + 2
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
print(df)
#    index  data  group  stop_index   mask
# 0      0   0.0      0         2.0   True
# 1      1   1.0      0         3.0   True
# 2      2   1.0      1         4.0   True
# 3      3   3.0      0         5.0   True
# 4      4   1.0      1         4.0   True
# 5      5  22.0      0         7.0   True
# 6      6   NaN      1         4.0  False
# 7      7   5.0      0         9.0   True
# 8      8   NaN      1         4.0  False
# 9      9   NaN      1         4.0  False

# clean up df
df = df[['data', 'group']]
print(df)

yields

   data  group
0   0.0      0
1   1.0      0
2   1.0      1
3   3.0      0
4   1.0      1
5  22.0      0
6   NaN      1
7   5.0      0
8   NaN      1
9   NaN      1

This copies the index into a column, then makes a second stop_index column which is the index augmented by the size of the (time) window.

df = df.reset_index()
df['stop_index'] = df['index'] + 2

Then it sets stop_index to null on the rows where data is null:

df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))

Then it forward-fills stop_index on a per-group basis:

df['stop_index'] = df.groupby('group')['stop_index'].ffill()

Now (at last) we can define the desired mask: the places where we actually want to forward-fill data:

df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
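Since the asker's real index is a sorted DatetimeIndex, the same recipe should carry over by replacing the `+ 2` with a `pd.Timedelta`. A sketch, assuming daily timestamps (the dates are invented) and the data from the question:

```python
import numpy as np
import pandas as pd

# Same data as the question, but on an invented daily DatetimeIndex.
idx = pd.date_range('2019-01-01', periods=10, freq='D')
df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
}, index=idx)

tmp = df.reset_index()
# stop_index is the timestamp plus the window, kept only on non-NaN rows...
tmp['stop_index'] = tmp['index'] + pd.Timedelta(days=2)
tmp['stop_index'] = tmp['stop_index'].where(tmp['data'].notna())
# ...then forward-filled per group, so each row carries the cutoff
# implied by the last valid observation in its group.
tmp['stop_index'] = tmp.groupby('group')['stop_index'].ffill()
mask = tmp['index'] <= tmp['stop_index']
tmp.loc[mask, 'data'] = tmp.groupby('group')['data'].ffill()
out = tmp.set_index('index')[['data', 'group']]
print(out)
```

On this data `out['data']` matches the desired output in the question: group 1 is filled only through the timestamp at position 4, while positions 8 and 9 stay NaN.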
unutbu
  • Hi, would you like to test against the testing data in my answer? I cannot match the output. – BENY Jan 25 '19 at 01:45

IIUC

l = []
for i, group in df.groupby('group'):
    idx = group.index
    l.append(group.reindex(df.index).ffill(limit=2).loc[idx])

pd.concat(l).sort_index()
   data  group
0   0.0    0.0
1   1.0    0.0
2   1.0    0.0
3   3.0    1.0
4   3.0    1.0
5   5.0    0.0
6   5.0    0.0
7   5.0    0.0
8   NaN    1.0
9   NaN    1.0

Testing data

   data  group
0   0.0      0
1   1.0      0
2   1.0      1
3   3.0      0
4   NaN      1
5   22       0
6   NaN      1
7   5.0      0
8   NaN      1
9   NaN      1

My method on the testing data

   data  group
0   0.0    0.0
1   1.0    0.0
2   1.0    1.0
3   3.0    0.0
4   1.0    1.0
5  22.0    0.0
6   NaN    1.0  # unchanged here, since the previous two rows have no valid value for group 1
7   5.0    0.0
8   NaN    1.0
9   NaN    1.0

Output with unutbu's method

   data  group
0   0.0      0
1   1.0      0
2   1.0      1
3   3.0      0
4   1.0      1
5  22.0      0
6   1.0      1  # mismatch here
7   5.0      0
8   NaN      1
9   NaN      1
BENY
  • I was thinking that reindex was probably involved in a solution. Can you expound on the method chains a bit? – alkasm Jan 25 '19 at 01:48
  • @AlexanderReynolds reindex from the sub-df back to the original df's index; any row not present in the sub-df becomes NaN. Then we just need a normal ffill with limit, since the index is contiguous after the reindex. – BENY Jan 25 '19 at 01:52
    Oh that makes a lot of sense! And yes, you've highlighted the problem with the other answer and understood me correctly. I'm going to give it a little bit of time to see if there's any way I can do it without using the indexes from the groupby explicitly. BTW you can use `df.groupby(...).groups` which is a dictionary where the values are the indexes, instead of pulling them out with `group.index` manually. So `for idx in df.groupby(...).groups.values()`. – alkasm Jan 25 '19 at 02:09
  • Hmm...actually, on second thought I don't think this is the answer either. The problem is this still indexes on the *number of rows* that have this feature as opposed to some arbitrary cutoff based on the *value* of the index, no? – alkasm Jan 25 '19 at 02:45
  • @AlexanderReynolds Can you run the method on some edge cases to see whether it works or not? – BENY Jan 25 '19 at 03:00
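The `df.groupby(...).groups` variant that alkasm suggests in the comments might look like the following sketch, equivalent to the loop in the answer. Note the caveat raised in the last comments: `ffill(limit=...)` still counts rows after the reindex, so this matches the index-based goal only when the index is contiguous or evenly spaced.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1],
})

parts = []
# .groups maps each group key to the index labels of its rows.
for idx in df.groupby('group').groups.values():
    # Reindex the sub-frame onto the full index so ffill(limit=2)
    # measures distance in the original index, then slice back down.
    parts.append(df.loc[idx].reindex(df.index).ffill(limit=2).loc[idx])

out = pd.concat(parts).sort_index()
print(out)
```

On the question's data this reproduces the answer's output above: group 1 is filled only at index 4, and indexes 8 and 9 stay NaN.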