1

I want to resample data column using forward fill ffill at the frequency of 1min while grouping df by id column:

df:

          id   timestamp                data  

      1    1   2017-01-02 13:14:53.040  10.0
      2    1   2017-01-02 16:04:43.240  11.0  
                           ...
      4    2   2017-01-02 15:22:06.540   1.0  
      5    2   2017-01-03 13:55:34.240   2.0  
                           ...

Expected output:

           id   timestamp                data  

      1    1   2017-01-02 13:14:53.040  10.0
               2017-01-02 13:14:54.040  10.0
               2017-01-02 13:14:55.040  10.0
               2017-01-02 13:14:56.040  10.0
                           ...

      2    1   2017-01-02 16:04:43.240  11.0  
               2017-01-02 16:04:44.240  11.0
               2017-01-02 16:04:45.240  11.0
               2017-01-02 16:04:46.240  11.0
                           ...

      4    2   2017-01-02 15:22:06.540   1.0  
               2017-01-02 15:22:07.540   1.0
               2017-01-02 15:22:08.540   1.0
               2017-01-02 15:22:09.540   1.0
                           ...

      5    2   2017-01-03 13:55:34.240   2.0
               2017-01-03 13:55:35.240   2.0
               2017-01-03 13:55:36.240   2.0
               2017-01-03 13:55:37.240   2.0
                           ...

Something like this post but I tried:

df.set_index('timestamp').groupby('id').resample('1min').asfreq().drop(['id'], 1).reset_index()

and data column returned only NaN values:


    id  timestamp               data
0   1   2017-01-02 13:14:53.040 NaN
1   1   2017-01-02 13:14:54.040 NaN
2   1   2017-01-02 13:14:55.040 NaN
3   1   2017-01-02 13:14:56.040 NaN
4   1   2017-01-02 13:14:57.040 NaN
... ... ... ...

Edit:

  1. Second row of df timestamp changed from 2017-01-02 12:04:43.240 to 2017-01-02 16:04:43.240, ie., rows belonging to the same id should be sorted.
  2. I mistook second for min in expected output, but @jezrael's answer is correct.
nilsinelabore
  • 4,143
  • 17
  • 65
  • 122

1 Answers1

1

Use custom function with define how many need new rows by Timedelta with date_range and DataFrame.reindex:

def f(x):
    new = x.index[0] + pd.Timedelta(5, unit='Min')
    r = pd.date_range(x.index[0], new, freq='Min')
    
    return x.reindex(r, method='ffill')


df = (df.reset_index()
        .set_index('timestamp')
        .groupby(['index','id'], sort=False)['data']
        .apply(f)
        .reset_index(level=0, drop=True)
        .rename_axis(['id','timestamp'])
        .reset_index()
        )

print (df)
    id               timestamp  data
0    1 2017-01-02 13:14:53.040  10.0
1    1 2017-01-02 13:15:53.040  10.0
2    1 2017-01-02 13:16:53.040  10.0
3    1 2017-01-02 13:17:53.040  10.0
4    1 2017-01-02 13:18:53.040  10.0
5    1 2017-01-02 13:19:53.040  10.0
6    1 2017-01-02 12:04:43.240  11.0
7    1 2017-01-02 12:05:43.240  11.0
8    1 2017-01-02 12:06:43.240  11.0
9    1 2017-01-02 12:07:43.240  11.0
10   1 2017-01-02 12:08:43.240  11.0
11   1 2017-01-02 12:09:43.240  11.0
12   2 2017-01-02 15:22:06.540   1.0
13   2 2017-01-02 15:23:06.540   1.0
14   2 2017-01-02 15:24:06.540   1.0
15   2 2017-01-02 15:25:06.540   1.0
16   2 2017-01-02 15:26:06.540   1.0
17   2 2017-01-02 15:27:06.540   1.0
18   2 2017-01-03 13:55:34.240   2.0
19   2 2017-01-03 13:56:34.240   2.0
20   2 2017-01-03 13:57:34.240   2.0
21   2 2017-01-03 13:58:34.240   2.0
22   2 2017-01-03 13:59:34.240   2.0
23   2 2017-01-03 14:00:34.240   2.0

because if use ffill output is different:

df = df.set_index('timestamp').groupby('id', sort=False)['data'].resample('1min').ffill()
print (df)
id  timestamp          
1   2017-01-02 12:04:00     NaN
    2017-01-02 12:05:00    11.0
    2017-01-02 12:06:00    11.0
    2017-01-02 12:07:00    11.0
    2017-01-02 12:08:00    11.0

2   2017-01-03 13:51:00     1.0
    2017-01-03 13:52:00     1.0
    2017-01-03 13:53:00     1.0
    2017-01-03 13:54:00     1.0
    2017-01-03 13:55:00     1.0
Name: data, Length: 1425, dtype: float64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hi jezrael, thank you for the answer. I think the second approach was what I wanted. I tried `pd.DataFrame(df.set_index('timestamp').groupby('id', sort=False)['data'].resample('1min').ffill())` and get what I wanted, except for one thing, I need to have all `id` **start at** the timestamp exactly 10 days before the latest timestamp within each id group, to make sure that I get the same number of rows for each id. Do you know what't the best way to do it? I could ask a separate question. – nilsinelabore Mar 15 '21 at 00:00
  • Could you please have a look at this one?https://stackoverflow.com/questions/66631209/groupby-and-resample-using-forward-and-backward-fill-in-window-in-python The various indexing functions are a bit confusing... – nilsinelabore Mar 15 '21 at 05:00