Resampling and Normalizing Irregular Time Series Data in Pandas

Question

I have irregularly spaced time-series data. I have total energy usage and the duration over which the energy was used.

Start Date  Start Time      Duration (Hours)    Usage(kWh)
1/3/2016    12:28:00 PM     2.233333333         6.23
1/3/2016    4:55:00 PM      1.9                 11.45
1/4/2016    6:47:00 PM      7.216666667         11.93
1/4/2016    7:00:00 AM      3.45                9.45
1/4/2016    7:26:00 AM      1.6                 7.33
1/4/2016    7:32:00 AM      1.6                 4.54

I want to calculate the sum of all the load curves over a 15 minute window. I can round when necessary (e.g., closest 1 minute). I can't use resample immediately because it would average the usage into the next time stamp, which n the case of the first entry 1/3 12:28 PM, would take 6.23 kWH and spread it evenly until 4:55 PM, which is inaccurate. 6.23 kWh should be spread until 12:28 PM + 2.23 hrs ~= 2:42 PM.

Can you show your desired output? – IanS Aug 22 '16 at 11:57 — IanS, Aug 22 '16 at 11:57

score 4 · Accepted Answer · edited May 23 '17 at 10:28

Here is a straight-forward implementation which simply sets up a Series, result, whose index has minute-frequency, and then loops through the rows of df (using df.itertuples) and adds the appropriate amount of power to each row in the associated interval:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Duration (Hours)': [2.233333333, 1.8999999999999999, 7.2166666670000001, 3.4500000000000002, 1.6000000000000001, 1.6000000000000001], 'Start Date': ['1/3/2016', '1/3/2016', '1/4/2016', '1/4/2016', '1/4/2016', '1/4/2016'], 'Start Time': ['12:28:00 PM', '4:55:00 PM', '6:47:00 PM', '7:00:00 AM', '7:26:00 AM', '7:32:00 AM'], 'Usage(kWh)': [6.2300000000000004, 11.449999999999999, 11.93, 9.4499999999999993, 7.3300000000000001, 4.54]} ) 

df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='H')
df['start_date'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
df['end_date'] = df['start_date'] + df['duration']
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
df = df.drop(['Start Date', 'Start Time', 'Duration (Hours)'], axis=1)

result = pd.Series(0,
    index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='T'))

power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
    result.loc[row.start_date:row.end_date] += row[power_idx]

# The sum of the usage over 15 minute windows is computed using the `resample/sum` method:
usage = result.resample('15T').sum()
usage.plot(kind='line', label='usage')
plt.legend(loc='best')
plt.show()

A note regarding performance: Looping through the rows of df is not very fast especially if len(df) is big. For better performance, you may need a more clever method, which handles all the rows "at once" in a vectorized manner:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Here is an example using a larger DataFrame
N = 10**3
dates = pd.date_range('2016-1-1', periods=N*10, freq='H')
df = pd.DataFrame({'Duration (Hours)': np.random.uniform(1, 10, size=N), 
                   'start_date': np.random.choice(dates, replace=False, size=N), 
                   'Usage(kWh)': np.random.uniform(1,20, size=N)})
df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='H')
df['end_date'] = df['start_date'] + df['duration']
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)

def using_loop(df):
    result = pd.Series(0,
        index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='T'))
    power_idx = df.columns.get_loc('power (kW/min)')+1
    for row in df.itertuples():
        result.loc[row.start_date:row.end_date] += row[power_idx]
    usage = result.resample('15T').sum()
    return usage

def using_cumsum(df):
    result = pd.melt(df[['power (kW/min)','start_date','end_date']], 
                     id_vars=['power (kW/min)'], var_name='usage', value_name='date')
    result['usage'] = result['usage'].map({'start_date':1, 'end_date':-1})
    result['usage'] *= result['power (kW/min)']
    result = result.set_index('date')
    result = result[['usage']].resample('T').sum().fillna(0).cumsum()
    usage = result.resample('15T').sum()
    return usage

usage = using_cumsum(df)
usage.plot(kind='line', label='usage')
plt.legend(loc='best')
plt.show()

With len(df) equal to 1000, using_cumsum is over 10x faster than using_loop:

In [117]: %timeit using_loop(df)
1 loop, best of 3: 545 ms per loop

In [118]: %timeit using_cumsum(df)
10 loops, best of 3: 52.7 ms per loop

score 1 · Answer 2 · answered Aug 24 '16 at 18:10

The solution I used below is the itertuples method. Please note using numpy's .sum function did not work for me. I instead used the pandas resample keyword, "how" and set it equal to sum.

I also renamed the columns in my files to make the import easier.

I was not time/resource constrained so I went with the itertuples method because it was easy for me to implement.

Itertuples Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#load data
df = pd.read_excel(r'C:\input_file.xlsx', sheetname='sheet1')

#convert columns
df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='H')
df['end_date'] = df['start_date'] + df['duration']
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
df = df.drop(['Duration (Hours)'], axis=1)

#create result df with timestamps
result = pd.Series(0, index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='T'))

#iterate through to calculate total energy at each minute
power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
    result.loc[row.start_date:row.end_date] += row[power_idx]

# The sum of the usage over 15 minute windows is computed using the `resample/sum` method
usage = result.resample('15T', how='sum')

#plot
plt.plot(usage)
plt.show()

#write to file
usage.to_csv(r'C:\output_folder\output_file.csv')

Resampling and Normalizing Irregular Time Series Data in Pandas

2 Answers2