Here's a hard one: (I think)
I have a time series-indexed dataframe with multiple columns. I need to resample the data to be every minute and calculate some metric using the columns. The issue is, the values I want to calculate for each minute, depend on the values in ALL of the columns during that minute. I have tried to use resample('T').apply(func)
but it seems like the func
is applied to each column separately, not receiving the other columns as input. Instead, I want every call to func
to receive a dataframe inclusive of ALL columns for the time indices in that minute, then I'd be able to use use all the values from that minute, and generate a new value based on all of them.
Bonus: I actually need to do this with multiple funcs so the resulting dataframe has multiple columns, with different names than the original, where each value in each column is calculated using the values in all columns over the minute it represents.
Surely there's a way to do this without iterating through rows with the almighty pandas... any ideas ?
EXAMPLE: desired input & output are as follows: (using random data and functions since the real code is for work and can't be shared so the contents of func1, func2 don't matter, just their use of all columns):
import pandas as pd
# functions to calculate (some random calculation using all columns)
def func1(df):
return df.iloc[:,0].sum() + df.iloc[:,1].mean()
def func2(df):
return (df.iloc[:,1] - df.iloc[:,2]).mean()
# Dataframe defined with datetime index with precision of 10s
ind = pd.date_range(start='09:00:00', end='09:03:00', freq='10S')
df1 = pd.DataFrame({1: range(len(ind)), 2: range(1, 1+len(ind)), 3:range(2, 2+len(ind))}, index=ind)
print(df1)
# New dataframe with DatetimeIndex at frequency of 1min
new_ind = pd.date_range(start='09:00:00', end='09:03:00', freq='T')
# Different column names to original:
df2 = pd.DataFrame(columns=[4,5], index=new_ind)
for m in range(len(new_ind)-1):
minute_strt, minute_end = new_ind[m], new_ind[m+1]
# The following chunk to be processed includes data from ALL
# columns of original dataframe within a one-minute span
df_to_process = df1.loc[minute_strt:minute_end, :]
df2.iloc[m+1, :] = [func1(df_to_process), func2(df_to_process)]
print(df2)