
I want to resample the `data` column using forward fill (`ffill`) and backward fill (`bfill`) at a frequency of 1 minute, while grouping `df` by the `id` column.

df:

          id   timestamp                data
      1    1   2017-01-02 13:14:53.040  10.0
      2    1   2017-01-02 16:04:43.240  11.0
                           ...
      4    2   2017-01-02 15:22:06.540   1.0
      5    2   2017-01-03 13:55:34.240   2.0
                           ...

I used:

pd.DataFrame(df.set_index('timestamp').groupby('id', sort=True)['data'].resample('1min').ffill().bfill())

How can I add an additional condition so that resampling happens within the window of the past 10 days from now? That is, the last timestamp reading is now and the first timestamp reading is `datetime.datetime.now() - pd.to_timedelta("10day")`. The goal is to have the same number of readings for each `id` group.


Update:

Tried:

start = datetime.datetime.now() - pd.to_timedelta("10day")
end = datetime.datetime.now()

r = pd.to_datetime(pd.date_range(start=start, end=end, freq='1h'))

pd.DataFrame(df.reset_index().set_index('timestamp').groupby('id', sort=True).reindex(r)['data'].resample('1h').ffill().bfill())

and returned:

AttributeError: 'DataFrameGroupBy' object has no attribute 'reindex'

so it seems I'm not supposed to apply `reindex` to a GroupBy object; is there a way to work around it?
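For what it's worth, here is an untested sketch of the workaround I have in mind (the sample frame and the time window below are made up): since `reindex` isn't defined on the GroupBy object itself, it can instead be applied per group inside `apply`, after resampling each group:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mimicking the df in the question.
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'timestamp': pd.to_datetime([
        '2017-01-02 13:14:53.040', '2017-01-02 16:04:43.240',
        '2017-01-02 15:22:06.540', '2017-01-03 13:55:34.240']),
    'data': [10.0, 11.0, 1.0, 2.0],
})

# Shared hourly index covering the same window for every group.
r = pd.date_range(start='2017-01-02', end='2017-01-03 23:00:00', freq='1h')

# reindex is not available on the GroupBy object, but it can be applied
# per group: resample first, then reindex onto the shared index.
out = (df.set_index('timestamp')
         .groupby('id')['data']
         .apply(lambda g: g.resample('1h').last().reindex(r).ffill().bfill()))
```

This gives a MultiIndex (`id`, time) Series in which every `id` group covers exactly the same timestamps and hence has the same number of readings.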

nilsinelabore
  • Filter the entire dataframe for the last 10 days, then groupby and resample, then concat back with the portion of the original dataframe older than 10 days. – Scott Boston Mar 15 '21 at 01:36
  • Hi @ScottBoston, thanks for the suggestion. I thought about that, but filtering the dataframe for 10 days doesn't give me the exact "10 day time interval in actual data", since the actual data for each `id` were not recorded at the same time and may have different start times. – nilsinelabore Mar 15 '21 at 01:41
  • `df[df['timestamp'] >= (pd.Timestamp('now') - pd.Timedelta('10d')).normalize()]` – Scott Boston Mar 15 '21 at 01:48
  • @ScottBoston The definition of `pandas.Series.dt.normalize` states `Convert times to midnight.`, which I think stretches the boundary timestamp to midnight, but it doesn't really matter: the actual start times for different `id`s are still going to differ as long as the start time is not specified for backward fill? I'd like different `id` groups to start at the same time after resampling and end up with the same number of readings. CMIIW, thanks – nilsinelabore Mar 15 '21 at 02:15
  • @nilsinelabore that sounds like a pivot thing. Resample your data for each id, then pivot. Another way is to generate the common time windows, then `merge_asof`. – Quang Hoang Mar 15 '21 at 03:05
  • @QuangHoang thanks for the ideas. Could you please advise a good way to generate common time windows? I'm thinking about using `date_range` with custom function, something like `def w(x): r = pd.to_datetime(pd.date_range(start='2017-12-23', end='2017-01-02 23:00:00', freq='1h')) return x.reindex(r, method='ffill')` but I'm struggling with indexing terms like `set_index`, `reset_index`, `reindex` ... – nilsinelabore Mar 15 '21 at 03:12

1 Answer


Without the data I can't really test this, so take it as a suggestion posted as an answer for proper formatting. Since you are looking to resample with bfill/ffill, I think `merge_asof` would work:

import numpy as np
import pandas as pd

# common time window (start must precede end; '2016-12-23' is assumed
# here so the range actually covers the sample data)
r = pd.date_range(start='2016-12-23', end='2017-01-02 23:00:00', freq='1h')

# unique ids
unique_ids = df['id'].unique()

# new time reference: every id paired with every timestamp in r
new_df = pd.DataFrame({'id': np.repeat(unique_ids, len(r)),
                       'time': np.tile(r, len(unique_ids)),
                      })

# merge_asof needs both frames sorted on the merge key, and the key
# names must match, hence the rename of 'timestamp' to 'time'.
# The default is `direction='backward'`;
# change to `direction='forward'` if you want to *floor* time.
out = pd.merge_asof(new_df.sort_values('time'),
                    df.rename(columns={'timestamp': 'time'}).sort_values('time'),
                    on='time', by='id')
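To illustrate on a made-up frame (the column names, window, and values below are hypothetical, chosen to mirror the question's sample): for every hourly slot in the common window, `merge_asof` picks the most recent earlier reading for the matching `id`, so each group ends up with the same number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with irregular timestamps per id.
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'time': pd.to_datetime(['2017-01-02 13:14:53', '2017-01-02 16:04:43',
                            '2017-01-02 15:22:06', '2017-01-03 13:55:34']),
    'data': [10.0, 11.0, 1.0, 2.0],
})

# Common hourly window shared by all ids.
r = pd.date_range(start='2017-01-02 12:00', end='2017-01-02 18:00', freq='1h')
unique_ids = df['id'].unique()

# Every id paired with every timestamp in the window.
new_df = pd.DataFrame({'id': np.repeat(unique_ids, len(r)),
                       'time': np.tile(r, len(unique_ids))})

# Both frames must be sorted on the merge key for merge_asof.
out = pd.merge_asof(new_df.sort_values('time'),
                    df.sort_values('time'),
                    on='time', by='id').sort_values(['id', 'time'])
```

Slots before a group's first reading stay NaN (a subsequent `bfill` per group would fill those), which is why both fills are still needed if a complete series is required.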
                   
Quang Hoang