
This isn't a duplicate. I have already referred to post_1 and post_2.

My question is different and is not about an aggregation function. It is about also displaying the grouped-by column during an ffill operation. Though the code works fine, I am sharing the full code so you get an idea of the context. The problem is in the commented line; look out for that line below.

I have a dataframe as given below:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00','2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month

What this code (written with the help of Jezrael from the forum) does is add missing dates based on a threshold value. The only issue is that I don't see the grouped-by column in the output.

df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')

df1 = (df.set_index('date')
         .groupby('subject_id')
         .resample('d')
         .last()
         .index
         .to_frame(index=False))
df2 = df1.merge(df, how='left') 
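As a minimal sketch of what the resample/merge step above does (using a made-up two-row frame, not my actual data): resampling each group to daily frequency builds a complete (subject_id, date) index, and the left merge brings the original rows back so any missing day shows up as a NaN row.

```python
import pandas as pd

# Toy frame with a one-day gap (2173-04-04 is missing)
df = pd.DataFrame({'subject_id': [1, 1],
                   'date': pd.to_datetime(['2173-04-03', '2173-04-05']),
                   'val': [5, 1]})

df1 = (df.set_index('date')
         .groupby('subject_id')
         .resample('d')
         .last()
         .index
         .to_frame(index=False))
df2 = df1.merge(df, how='left')   # missing day appears with NaN val
print(df2)
```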

thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())

df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
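To illustrate the gap-length trick used above on a toy series (invented values, not my data): `mask` is True on rows that already existed, so `mask.cumsum()` is constant across each run of inserted NaN rows, and mapping `value_counts` back gives each run its length, which is then compared against the threshold.

```python
import pandas as pd
import numpy as np

day = pd.Series([3, np.nan, np.nan, 4, np.nan, 5])
mask = day.notna()
s = mask.cumsum().mask(mask)       # NaN on original rows, run-id on gap rows
count = s.map(s.value_counts())    # gap length on gap rows, NaN elsewhere
print(count.tolist())
```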

df2 = df2.groupby(df2['subject_id']).ffill()  # problem is here

dates = df2['time_1'].dt.normalize() 
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)

As shown in the code above, I tried the approaches below:

df2 = df2.groupby(df2['subject_id']).ffill()  # doesn't help
df2 = df2.groupby(df2['subject_id']).ffill().reset_index()  # doesn't help
df2 = df2.groupby('subject_id',as_index=False).ffill()  # doesn't help
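A minimal reproduction of the behavior on a toy frame (invented values): `ffill` on a groupby is a transformation, and transformations return only the non-grouping columns, which is why the key column vanishes from the result.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'subject_id': [1, 1, 2, 2],
                   'val': [5.0, np.nan, 8.0, np.nan]})
out = df.groupby('subject_id').ffill()
print(out.columns.tolist())   # the grouping key is not in the output
```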

Incorrect output without subject_id


I expect my output to have subject_id column as well

The Great

1 Answer


Here are 2 possible solutions. First, specify all the non-key columns in a list after groupby and assign back:

cols = df2.columns.difference(['subject_id'])
df2[cols] = df2.groupby('subject_id')[cols].ffill()
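On a small toy frame (invented values), the first approach looks like this: because only the non-key columns are assigned back, `subject_id` is never dropped from `df2`.

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'subject_id': [1, 1, 2, 2],
                    'val': [5.0, np.nan, 8.0, np.nan]})
cols = df2.columns.difference(['subject_id'])
df2[cols] = df2.groupby('subject_id')[cols].ffill()
print(df2)   # subject_id column is kept, val is forward filled per group
```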

Or create an index from the subject_id column and group by the index:

# newer pandas versions
df2 = df2.set_index('subject_id').groupby('subject_id').ffill().reset_index()

# older pandas versions
df2 = df2.set_index('subject_id').groupby(level=0).ffill().reset_index()
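A quick toy check of the second approach (invented values): moving `subject_id` into the index means the ffill output keeps it, and `reset_index` restores it as a column.

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'subject_id': [1, 1, 2, 2],
                    'val': [5.0, np.nan, 8.0, np.nan]})
out = df2.set_index('subject_id').groupby(level=0).ffill().reset_index()
print(out)   # subject_id survives as a column
```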

dates = df2['time_1'].dt.normalize() 
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
     subject_id       date              time_1  val  day  month  count
0             1 2173-04-03 2173-04-03 12:35:00    5    3    4.0    NaN
1             1 2173-04-03 2173-04-03 12:50:00    5    3    4.0    NaN
2             1 2173-04-04 2173-04-04 12:50:00    5    4    4.0    1.0
3             1 2173-04-05 2173-04-05 12:59:00    5    5    4.0    1.0
32            1 2173-05-04 2173-05-04 13:14:00    5    4    5.0    1.0
33            1 2173-05-05 2173-05-05 13:37:00    1    5    5.0    1.0
95            1 2173-07-06 2173-07-06 13:39:00    6    6    7.0    1.0
96            1 2173-07-07 2173-07-07 13:39:00    6    7    7.0    1.0
97            1 2173-07-08 2173-07-08 11:30:00    5    8    7.0    1.0
98            2 2173-04-08 2173-04-08 16:00:00    5    8    4.0    NaN
99            2 2173-04-09 2173-04-09 22:00:00    8    9    4.0    NaN
100           2 2173-04-10 2173-04-10 22:00:00    8   10    4.0    1.0
101           2 2173-04-11 2173-04-11 04:00:00    3   11    4.0    1.0
102           2 2173-04-12 2173-04-12 04:00:00    3   12    4.0    1.0
103           2 2173-04-13 2173-04-13 04:30:00    4   13    4.0    1.0
104           2 2173-04-14 2173-04-14 08:00:00    6   14    4.0    1.0
jezrael