2

I'm having trouble with Pandas groupby functionality and Time Series. I've read the documentation, but I can't see to figure out how to apply aggregate functions to multiple columns and calculate the mean of the volume (average) of the „aggregate “ correctly.

This is my code for import the CSV File:

#CSV Import
import pandas as pd
path = r'Z:\Python\30_Min_Data.txt'

from datetime import datetime
customdateparse = lambda x: datetime.strptime(x, '%Y/%m/%d %H:%M:%S.%f')
df = pd.read_csv(
        path,
        parse_dates={'DateTime': [0, 1]},
        date_parser=customdateparse)

# Set the Date as the Index --> needed for Resampling
df.set_index('DateTime', inplace=True)
df.sort_index()     

This is my DataFrame after the Import:

df
Out[3]: 
                     Volume Session
DateTime                           
2020-12-16 08:00:00    1000    PRTH
2020-12-16 08:30:00    5000    PRTH
2020-12-16 09:00:00    1000     RTH
2020-12-16 09:30:00    3000     RTH
2020-12-17 08:00:00    2000    PRTH
2020-12-17 08:30:00    2000    PRTH
2020-12-17 09:00:00    2000     RTH
2020-12-17 09:30:00    2000     RTH
2020-12-18 08:00:00    1000    PRTH
2020-12-18 08:30:00    1000    PRTH
2020-12-18 09:00:00    1000     RTH
2020-12-18 09:30:00    1000     RTH
2019-11-18 08:00:00    1000    PRTH
2019-11-18 08:30:00    1000    PRTH
2019-11-18 09:00:00    1000     RTH
2019-11-18 09:30:00    1000     RTH

This is what I tried: It calculates the averages per day, because of the Time Series Resampling. I would expect that it sums up the values first and at the end calculates the means. But it does the mean on the whole bunch of data per day.

#2.Volume: Average per Year & Session & Day
funcs_year    = lambda idx: idx.year
(df
   .groupby([funcs_year,'Session', pd.Grouper(freq='D')])
    ['Volume']
   .mean()
)

Out[6]: 
      Session   DateTime  
2019   PRTH     2019-11-18    1000
       RTH      2019-11-18    1000
2020   PRTH     2020-12-16    3000
                2020-12-17    2000
                2020-12-18    1000
       RTH      2020-12-16    2000
                2020-12-17    2000
                2020-12-18    1000
Name: Volume, dtype: int64

This how I want the result to be correctly calculated and displayed (i calculated it manually): Average (mean) volume traded per day (shown separately for the year and the session):

Year    Session     Mean Volume
2020    RTH         3.333,33
        PRTH        4.000,00
2019    RTH         2.000,00
        PRTH        2.000,00

Anyone know what I'm missing / doing wrong?

udi76623
  • 23
  • 5
  • Let me know if following works for you or not `df.groupby([df.index.strftime('%Y'),'Session']).agg({'Volume':['sum','mean']})` – k33da_the_bug Dec 31 '20 at 12:15
  • Thanks for your effort. The sum is correctly, but the mean unfortunately not. For the mean it takes the sum and then it divides it with the rows found in the timeseries. Example for Year 2019 and Session = PRTH: Sum is 2000 and then it calculates the mean 2000 / 2 = 1000. It doesn’t take into account that we did a resampling and it should calculate the mean per Days count and not Rows count. Because both records are of the same date, if we do a resampling by day (like I did) it is just one Day and not two Rows. Correct Mean for Year 2019 and Session = PRTH would be: 2000 / 1 = 2000 – udi76623 Dec 31 '20 at 12:44
  • Correct me if I am wrong, what I understand from your question is You want to calculate sum of volume based on Year combined with mean volume traded per day right ? – k33da_the_bug Dec 31 '20 at 13:38
  • @k33da_lets_debug: Correct. See my example above, how the result should be. Thnx – udi76623 Dec 31 '20 at 13:58
  • ok, see answer section. I have provided alternate solution using groupby chaining. – k33da_the_bug Dec 31 '20 at 16:57

2 Answers2

3

Does this work for you:

df['Year']=df['DateTime'].dt.year
(df
   .groupby(['Year','Session'])
   .apply(lambda x: x['Volume'].sum()/len(x['DateTime'].dt.date.unique()))
)

Note that 'DateTime' should be a column now.

I think this computes the average volume per day for each year and Session. Can you give it a shot?

flow_me_over
  • 182
  • 9
  • Thnx, but it gives me KeyError: 'DateTime' `code` File "C:\Users\UL\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 2900, in get_loc raise KeyError(key) from err – udi76623 Dec 31 '20 at 13:21
  • Nice solution. I was working on this too. – KJDII Dec 31 '20 at 13:23
  • @udi76623: you have put 'DateTime' as the index. Remove the set_index() at the start of your script. I think this will solve your issue. – flow_me_over Dec 31 '20 at 13:25
  • @flow_me_over: Thnx, but without setting the Index and running your solution I get: `File "pandas\_libs\lib.pyx", line 2403, in pandas._libs.lib.map_infer File "", line 1, in funcs_year = lambda idx: idx.year AttributeError: 'int' object has no attribute 'year' ` – udi76623 Dec 31 '20 at 13:52
  • @udi76623: Ah yes, then instead of funcs_year in the groupby, you could try df['DateTime'].dt.year. Or, make a column df['Year']=df['DateTime'].dt.year, and then do the groupby on ['Year', 'Session']. I have changed the example to adding an additional column; see above. – flow_me_over Dec 31 '20 at 14:05
  • @flow_me_over: Thanks it works as it supposed to be! – udi76623 Jan 01 '21 at 11:14
1

Following should also work, based on your question the 'sum' displays 'sum of Volume' based on 'Year' and 'mean' displays 'mean of volums' based on 'Daily mean' both being grouped by 'Session' and 'DateTime'. (Just used some groupy chaining with joins)

import pandas as pd

data = { 
'DateTime':['2020-12-16 08:00:00','2020-12-16 08:30:00','2020-12-16 09:00:00','2020-12-16 09:30:00','2020-12-17 08:00:00','2020-12-17 08:30:00','2020-12-17 09:00:00','2020-12-17 09:30:00','2020-12-18 08:00:00','2020-12-18 08:30:00','2020-12-18 09:00:00','2020-12-18 09:30:00','2019-11-18 08:00:00','2019-11-18 08:30:00','2019-11-18 09:00:00','2019-11-18 09:30:00'],
'Volume':[1000,500,1000,3000,2000,2000,2000,2000,1000,1000,1000,1000,1000,1000,1000,1000],
'Session':['PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH','PRTH','PRTH','RTH','RTH']
}

df = pd.DataFrame(data)
df['DateTime'] = pd.to_datetime(df['DateTime'])
df.index = pd.to_datetime(df['DateTime'])


#See below code 
x = df.groupby([df.index.strftime('%Y'),'Session',df.index.strftime('%Y-%m-%d')]).agg({'Volume':['sum','mean']}).groupby(['DateTime','Session'],level=2).agg(['sum','mean'])
x['Volume'].drop('mean',axis=1,level=0)
k33da_the_bug
  • 812
  • 8
  • 16