1

I have a data frame that looks like this:

date week id
20/07/21 12:46:00 1 d1
20/07/21 12:56:00 1 d1
20/07/21 13:09:00 1 d1
20/07/21 14:11:00 1 d1
20/07/21 14:42:00 1 d1

I want to group by date in in 30 minutes interval- so if 2 consecutive rows are more than 30 minutes apart they are on different groups. The output I need looks like this:

week id min_date max_date
1 d1 20/07/21 12:46:00 20/07/21 13:09:00
1 d1 20/07/21 14:11:00 20/07/21 14:11:00
1 d1 20/07/21 14:42:00 20/07/21 14:42:00

I used this code in order to group by:

x=df.groupby(['id','week', pd.Grouper(key='date', freq='30min',origin="start")]).agg({'date':[np.min, np.max]})

Something isn't working with the grouper, any suggestions how to improve it?

EDIT:

Here's an example of my data that causes an issue:

date week id
20/07/21 12:46:00 1 d1
20/07/21 12:56:00 1 d1
20/07/21 13:09:00 1 d1
22/07/21 07:11:00 1 d1
22/07/21 07:14:00 1 d1
22/07/21 07:27:00 1 d1
22/07/21 08:34:00 1 d1
22/07/21 08:36:00 1 d1

The output required is:

week id min_date max_date
1 d1 20/07/21 12:46:00 20/07/21 13:09:00
1 d1 20/07/21 07:11:00 20/07/21 07:27:00
1 d1 20/07/21 08:34:00 20/07/21 08:36:00

This is the output I get:

week id min_date max_date
1 d1 20/07/21 12:46:00 20/07/21 13:09:00
1 d1 20/07/21 07:11:00 20/07/21 08:36:00

I don't understand why it groups the last rows together when there is more than an hour difference between 20/07/21 07:27:00 and 20/07/21 08:34:00.

Thanks!

kri
  • 95
  • 6

1 Answers1

2

You can use:

df['date'] = pd.to_datetime(df['date'])

(df.groupby(df['date'].diff().gt(pd.Timedelta('30min')).cumsum())
 ['date'].agg(['min', 'max'])
)

Or maybe also group by id and week:

df['date'] = pd.to_datetime(df['date'])

(df.groupby(['week', 'id', df['date'].diff().gt(pd.Timedelta('30min')).cumsum()])
 ['date'].agg(['min', 'max'])
  .droplevel(-1).reset_index()
)

Output:

   week  id                 min                 max
0     1  d1 2021-07-20 12:46:00 2021-07-20 13:09:00
1     1  d1 2021-07-20 14:11:00 2021-07-20 14:11:00
2     1  d1 2021-07-20 14:42:00 2021-07-20 14:42:00
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Hi, thanks for helping! after checking the code on my data I see that in some cases the grouping doesn't work, it groups together rows that are 40 minutes or an hour apart, when I need separation of 30 minutes- if 2 rows are 30 minutes or plus apart so this will be a new group. Can you please help me understand the issue? – kri Feb 20 '23 at 12:26
  • Make sure the data is first sorted by `date`: `df = df.sort_values(by='date')` – mozway Feb 20 '23 at 12:28
  • It is sorted, and still have the same problem. anything else I can check? – kri Feb 20 '23 at 12:34
  • Can you try to provide a minimal example that reproduces the issue? Have you checked that the date conversion is correct (you have an ambiguous format)? – mozway Feb 20 '23 at 12:42
  • @kri I am sorry but when I try your new example I get exactly what you expect (3 rows) – mozway Feb 20 '23 at 14:19
  • Here is a reproducible input: `from pandas import Timestamp ; df = pd.DataFrame({'date': [Timestamp('2021-07-20 12:46:00'), Timestamp('2021-07-20 12:56:00'), Timestamp('2021-07-20 13:09:00'), Timestamp('2021-07-22 07:11:00'), Timestamp('2021-07-22 07:14:00'), Timestamp('2021-07-22 07:27:00'), Timestamp('2021-07-22 08:34:00'), Timestamp('2021-07-22 08:36:00')], 'week': [1, 1, 1, 1, 1, 1, 1, 1], 'id': ['d1', 'd1', 'd1', 'd1', 'd1', 'd1', 'd1', 'd1']})` – mozway Feb 20 '23 at 14:19