
I'm trying to downsample a dataframe that has minute-by-minute data into 5-minute bins. Here is my current code:

import pandas as pd

df = pd.read_csv('stockPrices/closingPrices-apr3.csv', index_col='date', parse_dates=True)
df['close'] = df['close'].shift()  # shift closes down one row
df5min = df.resample('5T').last()  # last value in each 5-minute bin
print(df5min.tail())

The link to the csv file is here: https://drive.google.com/file/d/1uvkUaJwrQNsmte5IQIsJ_g5GS8RjVd8B/view?usp=sharing

The output should stop at 2019-04-03 14:40:00, because the last value in the file is at 14:48:00 and a 5-minute bin covering 14:45-14:49 is not possible. However, I get the following datetime index values that don't exist in my csv file:

2019-04-03 14:45:00  286.35
2019-04-03 14:50:00  286.52
2019-04-03 14:55:00  286.32
2019-04-03 15:00:00  286.45
2019-04-03 15:05:00  280.64
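
As far as I understand, resample only creates bins spanning the source index, so even a 14:45 bin covering 14:45-14:48 might be expected, but 14:50 onward shouldn't be possible. A toy check with made-up one-minute data ending at 14:48:

import pandas as pd

idx = pd.date_range('2019-04-03 14:31', '2019-04-03 14:48', freq='T')
s = pd.Series(range(len(idx)), index=idx)
print(s.resample('5T').last().index.max())
# 2019-04-03 14:45:00 -- no bins are created past the last timestamp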

The only fix I've found so far is the following, but then all my data from the previous days gets cut off at 14:40:

df5min = df.resample('5T').last().between_time(start_time='9:30',end_time='14:40')
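
For completeness: if the unwanted trailing bins were simply empty, dropping all-NaN rows after resampling would trim them without clipping earlier days (a minimal sketch), but as shown above those bins contain actual values:

df5min = df.resample('5T').last().dropna(how='all')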

Any help on this is appreciated.

1 Answer


The solution below will still produce a row that you probably don't want for 4/3/2019 15:05:

import pandas as pd

df = pd.read_csv('./closingPrices-apr3.csv', index_col='date', parse_dates=True)
df.sort_index(inplace=True)          # the file's index is not actually sorted
df = df.shift(5)                     # shift values down five one-minute rows
df_5min = df.resample('5T').first()  # first value in each 5-minute bin
– Aiden Zhao
  • Thank you for this fix, I'll use it for the time being. Does the .resample() method have a bug that requires such an elaborate workaround? – way_of_the_fist Apr 03 '19 at 22:35
  • Not sure if it's a bug; I tried it and it didn't do what I thought it would. – Aiden Zhao Apr 03 '19 at 22:36
  • I think your older solution worked better. There shouldn't be any values after 14:48:00. I get values at 14:50 - 15:05 in 5-minute increments with your new solution, none of which appear in the csv. – way_of_the_fist Apr 03 '19 at 23:20
  • Actually, your file is not sorted. The rows look sorted, but try sorting it yourself, or find the max of the index: df[np.array(list(df.index)) == max(list(df.index))] – Aiden Zhao Apr 03 '19 at 23:59
  • Look at rows 13139, 3755, 5630, 15033: empty values and data after 14:48:00. – Aiden Zhao Apr 04 '19 at 00:04
  • Not only that, your data has duplicates, so we should also remove the duplicates first and then do the shift. – Aiden Zhao Apr 04 '19 at 00:06
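
Putting the comment thread together (unsorted rows, duplicate timestamps, empty rows, and stray data after 14:48), a cleanup sketch might look like the following; keeping the last of each duplicate pair is an assumption, so check which row is actually correct for your data:

import pandas as pd

df = pd.read_csv('./closingPrices-apr3.csv', index_col='date', parse_dates=True)
df = df.dropna(subset=['close'])             # drop the empty rows flagged above
df = df[~df.index.duplicated(keep='last')]   # assumption: keep the last duplicate
df = df.sort_index()                         # the file is not actually sorted
df_5min = df['close'].resample('5T').last()  # last close in each 5-minute bin
print(df_5min.tail())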