Problem
I'm looking for suggestions on how to make this more pythonic and improve the efficiency.
I have a dataframe with events, each having at minimum a start and end timestamp. I am expanding the number of records so that the new table has one record for each hour the interval overlaps.
This is basically the same usecase as the IntervalMatch function found in QlikView.
Example: An event from 18:00-20:00 expands to two distinct records, one for 18:00-19:00 and another for 19:00-20:00.
Current solution
I have a fully working solution, but I think it is rather ugly and it is quite slow on large datasets with >100k rows and 10-20 columns.
import pandas as pd
from datetime import timedelta
def interval_match(df):
intervals = []
def perdelta(start, end, delta):
curr = start.replace(minute=0, second=0)
while curr < end:
yield curr
curr += delta
def interval_split(x):
for t in perdelta(x.Start, x.End, timedelta(hours=1)):
_ = ([x.id,
x.Start,
x.End,
max(t, x.Start),
min((t+timedelta(hours=1), x.End))])
intervals.append(_)
df.apply(interval_split, axis=1)
ndf = pd.DataFrame(intervals,
columns=['id',
'Start',
'End',
'intervalStart',
'intervalEnd'])
ndf['Duration'] = ndf.iEnd - ndf.iStart
return ndf
With some example data, the function interval_match()
can be used like this:
# Some example data
df = pd.DataFrame({'End': {0: pd.Timestamp('2016-01-01 09:24:20')},
'Start': {0: pd.Timestamp('2016-01-01 06:56:10')},
'id': {0: 1234562}})
# Running the function
interval_match(df).to_dict()
# Output
{'Duration': {0: Timedelta('0 days 00:03:50'),
1: Timedelta('0 days 01:00:00'),
2: Timedelta('0 days 01:00:00'),
3: Timedelta('0 days 00:24:20')},
'End': {0: Timestamp('2016-01-01 09:24:20'),
1: Timestamp('2016-01-01 09:24:20'),
2: Timestamp('2016-01-01 09:24:20'),
3: Timestamp('2016-01-01 09:24:20')},
'Start': {0: Timestamp('2016-01-01 06:56:10'),
1: Timestamp('2016-01-01 06:56:10'),
2: Timestamp('2016-01-01 06:56:10'),
3: Timestamp('2016-01-01 06:56:10')},
'intervalEnd':{0: Timestamp('2016-01-01 07:00:00'),
1: Timestamp('2016-01-01 08:00:00'),
2: Timestamp('2016-01-01 09:00:00'),
3: Timestamp('2016-01-01 09:24:20')},
'intervalStart': {0: Timestamp('2016-01-01 06:56:10'),
1: Timestamp('2016-01-01 07:00:00'),
2: Timestamp('2016-01-01 08:00:00'),
3: Timestamp('2016-01-01 09:00:00')},
'id': {0: 1234562,
1: 1234562,
2: 1234562,
3: 1234562}}
My desire is to
- Make this more efficient, preferrably using built in Pandas functions or some numpy magic.
- Not have to deal with the columns as I do in the interval_split function today. Just operate on, and expand the entire dataframe.
Appreciations for any suggestions or help.