Python Pandas resample year data using day time frame data

Question

This is small sample of the daily ohlc data I have stored in df1.

date                open    close   high    low
2019-01-01 00:00:00 3700    3800    3806    3646
2019-01-02 00:00:00 3800    3857    3880    3750
2019-01-03 00:00:00 3858    3766    3863    3729
2019-01-04 00:00:00 3768    3791    3821    3706
2019-01-05 00:00:00 3789    3772    3839    3756
2019-01-06 00:00:00 3776    3988    4023    3747
2019-01-07 00:00:00 3985    3972    4018    3928

I want to create a dataframe (df2) that represents what the active year candle looks like as it progresses. The close is based on the current days close, the high is the max from Jan1 to current day, the low is the min from Jan1 to current day and the open is based on the open of the year. Which should look something like this:

date                open    close   high    low
2019-01-01 00:00:00 3700    3800    3806    3646
2019-01-02 00:00:00 3700    3857    3880    3646
2019-01-03 00:00:00 3700    3766    3880    3646
2019-01-04 00:00:00 3700    3791    3880    3646
2019-01-05 00:00:00 3700    3772    3880    3646
2019-01-06 00:00:00 3700    3988    4023    3646
2019-01-07 00:00:00 3700    3972    4023    3646

Would love to put some code but I'm lost here, I thought resample would help me but it just summarizes the entire year into one row of data. I also want think I can figure this out iterating through everyday and resampling but I know that slows computation a lot so I'm hoping to see if this is possible with vectorization. This is my first time posting so let me know if there is any guidelines I need to improve on.

---------------EDIT------------------

Here is my full code with year working but other time frames not working, hopefully it will be easier to replicate the bad results as im pulling data from public source yfinance.

import pandas as pd
import yfinance as yf

#not working
def resample_active_week(df):
    df2 = pd.DataFrame()

    # high is the max from Jan1 to current day
    df2['high'] = df.groupby(df.index.isocalendar().week)['high'].cummax()

    # low is the min from Jan1 to current day 
    df2['low'] = df.groupby(df.index.isocalendar().week)['low'].cummin()

    #close
    df2['close'] = df['close']

    # open is based on the open of the current week
    df2['open'] = df.groupby(df.index.isocalendar().week)['open'].head(1)
    df2=df2.fillna(method='ffill')

    return df2
#not working    
def resample_active_month(df):
    df2 = pd.DataFrame()

    # high is the max from Jan1 to current day
    df2['high'] = df.groupby(df.index.month)['high'].cummax()

    # low is the min from Jan1 to current day 
    df2['low'] = df.groupby(df.index.month)['low'].cummin()

    #close
    df2['close'] = df['close']

    # open is based on the open of the current month
    df2['open'] = df.groupby(df.index.month)['open'].head(1)
    df2=df2.fillna(method='ffill')

    return df2

#not working
def resample_active_quarter(df):
    df2 = pd.DataFrame()

    # high is the max from Jan1 to current day
    df2['high'] = df.groupby(df.index.quarter)['high'].cummax()

    # low is the min from Jan1 to current day 
    df2['low'] = df.groupby(df.index.quarter)['low'].cummin()

    #close
    df2['close'] = df['close']

    # open is based on the open of the current quarter
    df2['open'] = df.groupby(df.index.quarter)['open'].head(1)
    df2=df2.fillna(method='ffill')

    return df2
#working
def resample_active_year(df):
    df2 = pd.DataFrame()
    
    # high is the max from Jan1 to current day
    df2['high'] = df.groupby(df.index.year)['high'].cummax()

    # low is the min from Jan1 to current day 
    df2['low'] = df.groupby(df.index.year)['low'].cummin()

    #close
    df2['close'] = df['close']

    # open is based on the open of the current year
    df2['open'] = df.groupby(df.index.year)['open'].head(1)
    df2=df2.fillna(method='ffill')

    return df2

df = yf.download(tickers='BTC-USD', period = 'max', interval = '1d',auto_adjust = True)
df.rename(columns={'Open':'open', 'High':'high','Low':'low','Close':'close'}, inplace=True)
df = df.drop(['Volume'],axis=1)

df2 = resample_active_week(df)
df3 = resample_active_month(df)
df4 = resample_active_quarter(df)
df5 = resample_active_year(df)

with pd.ExcelWriter('ResampleOut.xlsx', engine="openpyxl", mode="w") as writer:
            df.to_excel(writer, sheet_name='df_original')
            df2.to_excel(writer, sheet_name='df2_week')
            df3.to_excel(writer, sheet_name='df3_month')
            df4.to_excel(writer, sheet_name='df4_quarter')
            df5.to_excel(writer, sheet_name='df5_year')

I can change the initial price and the bottom price at once, but only the high price needs to be processed in a loop. — r-beginners, Apr 26 '21 at 05:04
Can't you get it with this? `df.groupby(df.index.year)['open'].first()` — r-beginners, May 05 '21 at 02:42
@r-beginners i tried that and it finished execution but the resulting column was blank. — psychoMLM, May 05 '21 at 13:36
The code in the comment could not be rewritten, so please try the following code. `df['date'] = pd.to_datetime(df['date']);df['date'] = pd.to_datetime(df['date']);df['open'] = open_price.values[0]` — r-beginners, May 05 '21 at 13:52

imdevskp · Accepted Answer · 2021-04-28T16:34:05.993

0

# set date as the index
df = df.set_index('date')

# high is the max from Jan1 to current day
df['max'] = df.groupby(df.index.year)['max'].cummax()

# low is the min from Jan1 to current day 
df['min'] = df.groupby(df.index.year)['min'].cummin()

# open is based on the open of the year
for ind, row in df.iterrows():
    row['open'] = df.loc[ind.replace(month=1, day=1), 'open']

# OPTIONAL: reset index
df = df.reset_index()

edited Apr 28 '21 at 16:34

answered Apr 26 '21 at 06:02

imdevskp

2,103
2
9
23

When I try df['max'] = df.groupby(df.index.year)['max'].cummax(), i get 'RangeIndex' object has no attribute 'year'. I assume this is because my date column is not my index. So i tried this df['max'] = df.groupby(df.date.year)['max'].cummax() and I get 'Series' object has no attribute 'year'. for your open i get 'int' object has no attribute 'replace'. I double checked and I did type ind instead of int. – psychoMLM Apr 28 '21 at 16:21
set date as the index. See the updated code. at the end, you can reset the index if you want – imdevskp Apr 28 '21 at 16:34
ok, the min and max work perfectly. Thank you for your help. The problem im having with your 'open' code is that first of all it relies on for loop iterating versus vectorization. Is there no way to make it work with vectorization? Secondly my data starts august 2011 so when it starts it crashes as it is looking for Jan1st2011. I know i can modify my data to always start on Jan1 but is there a way to dynamically accomplish this. So for my case it would leave everything blank from august 2011 until end of 2011. Then starting 2012 your loop would catch it all up to most current row. – psychoMLM May 04 '21 at 15:46
I got the open to work with vectorization with df2['open'] = df.groupby(df.index.year)['open'].head(1) df2=df2.fillna(method='ffill') I want to add that this method was easy to change to month. The problem now is i assumed week and quarter would work as well. The code finishes without problems and the output is initially correct in first 500ish rows of data, then after that it stops being accurate without a pattern i can see. Are you familiar with why this works for month and year but not year and quarter? – psychoMLM May 05 '21 at 03:24
@psychoMLM can you include the unexpected output in the *question section* by editing (posting on comment section will be hard to read) along with the output of `df.info()` – imdevskp May 05 '21 at 04:20

Python Pandas resample year data using day time frame data

1 Answers1