I have a 1 minute OHLCV historical series that starts every day at 09:00 and ends at 17:54, only business days, which needs to be grouped into another frequencies.
But if I run resample or groupby, the result in some frequencies changes the start dates, end dates and mainly times.
Example: If the frequency is changed to '120T', where the start time should be 09:00, the result generates a start at 08:00 (and other arbitrary craziness to the origin values). I'm thinking about to manually write the code to do this.
Example below. Please test the the two code snippet below changing the frequency to better understand the problem:
import pandas as pd
import numpy as np
# Generate OHLCV timeseries with a frequency of 1 minute, in 1 year, with random values (just to test resample syntaxes)
np.random.seed(123)
start_datetime = pd.Timestamp('2022-01-02 00:00')
end_datetime = pd.Timestamp('2023-01-02 23:59')
total_minutes = int((end_datetime - start_datetime).total_seconds() // 60)
datetime_range = pd.date_range(start=start_datetime, periods=total_minutes + 1, freq='1min')
open_prices = np.random.randint(100, 501, total_minutes + 1)
high_prices = np.random.randint(100, 501, total_minutes + 1)
low_prices = np.random.randint(100, 501, total_minutes + 1)
close_prices = np.random.randint(100, 501, total_minutes + 1)
volume = np.random.randint(100, 501, total_minutes + 1)
df = pd.DataFrame({
'datetime': datetime_range,
'Open': open_prices,
'High': high_prices,
'Low': low_prices,
'Close': close_prices,
'Volume': volume
})
# Filter: for weekdays only
df = df[df['datetime'].dt.weekday.isin([5, 6]) == False]
# Filter: business hours only
df = df[df['datetime'].dt.time <= pd.Timestamp('17:54').time()]
df = df[df['datetime'].dt.time >= pd.Timestamp('09:00').time()]
# Reset the index of the DataFrame
df = df.reset_index(drop=True)
display(df.info())
display(df)
df.to_csv("df_mytest_resample.csv", index=True,sep=';',decimal=',')
### The focus of the problem is to make the resamples with correct syntax (if that's possible) so that all the frequencies below always refer to the real dates contained in the original df.
### Grouping calculations work correctly. (The only thing missing was this one, group shitting calculus)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M')
df = df.set_index('datetime')
agg_dict = {'Open':'first', 'High':'max', 'Low':'min', 'Close':'last', 'Volume':'sum'}
df_15T = df.resample('15T', label='right', closed='right',origin='').agg(agg_dict)
df_D = df.resample('D', label='right', closed='right',origin='').agg(agg_dict)
df_S = df.resample('W', label='right', closed='right',origin='').agg(agg_dict)
df_M = df.resample('M', label='right', closed='right',origin='').agg(agg_dict) # Month is crazy result
display('15-minute grouping - added 6 minutes past the time', df_15T.tail(30))
display('grouping in days - see that the date added +1 to the real date of the original dataframe and put some NaN, but the values for the dates exist', df_D.tail(30))
display('grouping in weeks - total chaos in OHLCV results', df_S.tail(30))
display('grouping in months - nonsense', df_M.tail(30))
Is there a solution using groupby or resample arguments, without writing an algorithm manually?
Given the example above, I don't know what else to do.