
I have a pandas dataframe which includes a timestamp and 71 other columns, something like this:

              timestamp           | close_price | highest_price | volume | ...
              2018-09-29 00:00:20 | 1809        | 1811          | ...    |
              2018-09-29 00:00:34 | 1823        | 1832          | ...    |
              2018-09-29 00:00:59 | 1832        | 1863          | ...    |
              2018-09-29 00:01:09 | 1800        | 1802          | ...    |
              2018-09-29 00:01:28 | 1832        | 1845          | ...    |
              .
              .
              .

I want to bucket the data into 10-minute intervals and apply a different operation to each column. For example, each 10-minute interval of the close_price column should show the last value of the corresponding range in the original table; for the highest_price column I want the max value of the range, and for volume I want the mean of the values in that range. I already tried

dataTable = dataframe.resample("10min").agg({'first_price': 'first',
                                             'close_price': 'last',
                                             'highest_price': 'max',
                                             'volume': 'mean',
                                             # other attributes...
                                             })

but the result seems to be incorrect. Is there another way to do what I want? I would appreciate any comments or thoughts.

Note that there is no specific pattern in the timestamp values: a single minute can contain anywhere from 0 to 60 rows.

  • How are the results incorrect? It looks like you're on the right track based on this similar question: https://stackoverflow.com/questions/68229747/resample-dataframe-calculate-min-max-and-first-and-last-of-different-columns-w – Chris Sears Feb 07 '22 at 19:59
  • @ChrisSears Thank you for your comment. Yeah, it looks OK, but the problem is that I have 12808 rows in the input and 13207 rows in the output. – Stanisla Vavrika Feb 07 '22 at 20:13
  • Does your data span multiple days? If so, `resample()` could be creating output rows for overnight periods where you don't have any data points, leading to a lot of `NaN` values. – Chris Sears Feb 07 '22 at 20:31
  • @ChrisSears Oh, yes, you are right. I have so many `nan` values. Could you please write this up as a short answer so I can up-vote it? – Stanisla Vavrika Feb 07 '22 at 20:37

2 Answers


If your data spans multiple days, or any periods where you don't have data points, calling resample() will produce additional rows filled with NaN values for the empty intervals. I think your code is actually correct; you just got the wrong impression from seeing all the extra rows.
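A minimal sketch of this effect, using made-up data (not your actual columns): two bursts of rows separated by a 40-minute gap still produce a bin for every 10-minute step in between.

```python
import pandas as pd

# Two bursts of rows separated by a ~40-minute gap.
idx = pd.to_datetime([
    "2018-09-29 00:00:20", "2018-09-29 00:05:10",
    "2018-09-29 00:49:30", "2018-09-29 00:55:05",
])
df = pd.DataFrame({"close_price": [1809, 1823, 1832, 1800]}, index=idx)

out = df.resample("10min").agg({"close_price": "last"})
print(len(out))                          # 6 bins (00:00 through 00:50)
print(out["close_price"].isna().sum())   # 3 of them contain no data
```

So with only 4 input rows you get 6 output rows, half of them NaN; that is the same row-count inflation you are seeing.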

Chris Sears

Your approach is correct: dataframe.resample("10min").agg() does the calculations for you. You may get more output rows than you expect, and here is why: resample() steps through the time range in fixed 10-minute increments and performs the aggregations you asked for in each one. If a 10-minute interval contains no data, it still emits a row, filled with NULL values. If your data is not continuous, these empty intervals are what produce the extra NULL rows.

You can simply delete the NULL rows using dataTable.dropna()
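A short sketch with toy data (column names borrowed from the question, values made up): the empty intervals come back as all-NaN rows, and `dropna(how="all")` removes only those.

```python
import pandas as pd

idx = pd.to_datetime(["2018-09-29 00:00:20", "2018-09-29 00:30:15"])
df = pd.DataFrame({"close_price": [1809, 1832], "volume": [5, 7]}, index=idx)

dataTable = df.resample("10min").agg({"close_price": "last",
                                      "volume": "mean"})
# The 00:10 and 00:20 bins have no data, so their rows are all NaN.
cleaned = dataTable.dropna(how="all")
print(len(dataTable), len(cleaned))  # 4 2
```

Using `how="all"` is the safer choice here: it only drops rows where every aggregated column is NaN, so a real interval where just one column happens to be missing is kept.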

S.Fadaei