0

How do you group data by time buckets and count no of observation in the given bucket. If none, fill the empty time buckets with 0s.

I have the following data set in a dataframe: '''

df= 
   Time
   0:10
   5:00
   5:00
   5:02
   5:03
   5:05
   5:07
   5:09
   6:00
   6:00
   6:00

'''

I would like to create 5 min time bucket going from 00:00 to 23:59, and count how many times it appears in that time bucket. If none, then 0. Basically, each time represents a unit in a queue and and I want to calculate how many in the given time bucket.

From the above data (example set), i would like to get the following:

Time     Obs
00:00    0
00:05    0
00:10    1
00:15    0
...
05:00    2
05:05    3
05:10    2
06:00    3
...

I tried the following code

df['time_bucket'] = pd.to_datetime(df['Time']).dt.ceil('5min')

which did not work.

I tried the following as well:

df1= df.resample('5T', on ='time_bucket').count()

which results in :

                    Time    time_bucket
time_bucket     
2020-05-24 00:10:00     1   1
2020-05-24 00:15:00     0   0
2020-05-24 00:20:00     0   0
2020-05-24 00:25:00     0   0
2020-05-24 00:30:00     0   0

The time starts at 00:10 but not at 00:00; seems like it starts from the initial value of the time_bucket column.

Basically in the new column, I want to calculate the count. Eventually, I would like to create a function which takes a parameter, ex: time buckets (5, 10, 15) and create table for given time bucket with counts.

Buckbeak
  • 3
  • 3

1 Answers1

0

I could not find a standard way to address your specific issues, specifically about time (without date) buckets in native pandas functions.

First, instead of your dataset, which seems to be in string format, I used Time().

import random
import datetime
import pandas as pd
from collections import Counter
from datetime import time, timedelta

# generate 10k lines of random data
x = [None]*10000
x = [time(hour=random.randrange(0,24), minute=random.randrange(0,60)) for item in x]

# use Counter to aggregate minute-wise data
y = Counter(x)
z = [{'time':item, 'freq':y.get(item)} for item in y]
df = pd.DataFrame(z)

# create bins 
df['tbin']=df['time'].apply(lambda x: x.hour*12 + int(x.minute/5))
df['binStart']=df['time'].apply(lambda x: time(hour=x.hour, minute=(x.minute - x.minute%5)))
df['binEnd']=df['binStart'].apply(lambda a: (datetime.datetime.combine(datetime.datetime.now(), a)+timedelta(minutes=5)).time())

# grouping also orders the data
df_histogram=df.groupby(['tbin', 'binStart', 'binEnd'])['freq'].sum()

this is probably too late, but I was working on solving a similar but simpler problem, and came across your unanswered question which I thought would be more fun to solve than my own (which got solved in the process.)

Algo
  • 1
  • 1