
Hoping answers to my problem will be relevant to others: I have a data frame on which I want to calculate z-scores for some columns, ignoring two columns but retaining all columns in the resulting dataframe (I'm also new to Python/data science, so please don't assume high proficiency).

Dataframe:

date,market_exchange,open_int,large_spec_long,large_spec_short
2021-12-28,WHEAT-SRW - CHICAGO BOARD OF TRADE,356439,100581,94646
2021-12-21,WHEAT-SRW - CHICAGO BOARD OF TRADE,357381,103696,94564
2021-12-14,WHEAT-SRW - CHICAGO BOARD OF TRADE,361469,109485,97520
2021-12-07,WHEAT-SRW - CHICAGO BOARD OF TRADE,359186,111973,96306
2021-11-30,WHEAT-SRW - CHICAGO BOARD OF TRADE,371575,115857,97563
2021-11-23,WHEAT-SRW - CHICAGO BOARD OF TRADE,425301,139196,97707
2021-11-16,WHEAT-SRW - CHICAGO BOARD OF TRADE,427786,132167,95406
2021-11-09,WHEAT-SRW - CHICAGO BOARD OF TRADE,405438,117465,98262
2021-11-02,WHEAT-SRW - CHICAGO BOARD OF TRADE,409137,109768,91872
2021-10-26,WHEAT-SRW - CHICAGO BOARD OF TRADE,405625,102322,101559

I want to exclude 'date' and 'market_exchange' from the z-score calculation, but not delete these columns from the resulting new data frame.

I then want to calculate the z-score on open_int, large_spec_long and large_spec_short, but retain the dataframe structure, so that when I print it out or export it to CSV I have new columns named with a '_z' suffix. For example, for the 'open_int' column, a new 'open_int_z' column (containing the z-scores of open_int) would be added to the end of the dataframe; the same would apply for 'large_spec_long' and so on, each adding a new 'columnname_z' column to the end of the data frame.

I have tried methods from a number of Stack answers; this one is closest, but cols.remove doesn't accept more than one argument and I'm not sure it's correct just to add another cols.remove call as below:


#sort by date then group market_exchange
fldf.sort_values(by="date", ascending=True).groupby(by="market_exchange", dropna=False) #dropna could be unrequired
# fldf.head(3)
cols = list(fldf.columns)
cols.remove('date')
cols.remove('market_exchange')

for col in cols:
    col_zscore = col + '_z'
    fldf[col_zscore] = (fldf[col] - fldf[col].mean())/fldf[col].std(ddof=0)
fldf.head(10)

Is this the correct way of doing this? I also want to make sure I am not "looking ahead" in my z-score calculation.

I have been trying to adapt the approach from this thread: Pandas - Compute z-score for all columns

1 Answer

First sort by both columns and assign back, then get the column names with Index.difference, and compute the z-score per group into new columns with GroupBy.transform:

fldf = fldf.sort_values(['market_exchange',"date"])

cols = fldf.columns.difference(['date','market_exchange'], sort=False)

g = fldf.groupby(by="market_exchange")[cols]
fldf = (fldf.join(fldf[cols].sub(g.transform('mean'))
                            .div(g.transform('std', ddof=0)).add_suffix('_z')))
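To make this concrete, here is a minimal self-contained run of the approach above on the sample data from the question (the CSV is read from an in-memory string purely for illustration):

```python
import io

import pandas as pd

csv_data = """date,market_exchange,open_int,large_spec_long,large_spec_short
2021-12-28,WHEAT-SRW - CHICAGO BOARD OF TRADE,356439,100581,94646
2021-12-21,WHEAT-SRW - CHICAGO BOARD OF TRADE,357381,103696,94564
2021-12-14,WHEAT-SRW - CHICAGO BOARD OF TRADE,361469,109485,97520
2021-12-07,WHEAT-SRW - CHICAGO BOARD OF TRADE,359186,111973,96306
2021-11-30,WHEAT-SRW - CHICAGO BOARD OF TRADE,371575,115857,97563
2021-11-23,WHEAT-SRW - CHICAGO BOARD OF TRADE,425301,139196,97707
2021-11-16,WHEAT-SRW - CHICAGO BOARD OF TRADE,427786,132167,95406
2021-11-09,WHEAT-SRW - CHICAGO BOARD OF TRADE,405438,117465,98262
2021-11-02,WHEAT-SRW - CHICAGO BOARD OF TRADE,409137,109768,91872
2021-10-26,WHEAT-SRW - CHICAGO BOARD OF TRADE,405625,102322,101559"""

fldf = pd.read_csv(io.StringIO(csv_data))

# sort by group key then date so rows are in chronological order per market
fldf = fldf.sort_values(['market_exchange', 'date'])

# all columns except the two identifier columns
cols = fldf.columns.difference(['date', 'market_exchange'], sort=False)

# z-score per group: subtract the group mean, divide by the group (population) std
g = fldf.groupby('market_exchange')[cols]
fldf = fldf.join(fldf[cols].sub(g.transform('mean'))
                           .div(g.transform('std', ddof=0))
                           .add_suffix('_z'))
```

By construction, each `*_z` column has mean 0 and population standard deviation 1 within each `market_exchange` group.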

Your solution:

fldf = fldf.sort_values(['market_exchange',"date"])


cols = fldf.columns.difference(['date','market_exchange'])
g = fldf.groupby(by="market_exchange")

for col in cols:
    fldf[col + '_z'] = ((fldf[col] - g[col].transform('mean'))
                        / g[col].transform('std', ddof=0))

Output:

print (fldf)
         date                     market_exchange  open_int  large_spec_long  \
9  2021-10-26  WHEAT-SRW - CHICAGO BOARD OF TRADE    405625           102322   
8  2021-11-02  WHEAT-SRW - CHICAGO BOARD OF TRADE    409137           109768   
7  2021-11-09  WHEAT-SRW - CHICAGO BOARD OF TRADE    405438           117465   
6  2021-11-16  WHEAT-SRW - CHICAGO BOARD OF TRADE    427786           132167   
5  2021-11-23  WHEAT-SRW - CHICAGO BOARD OF TRADE    425301           139196   
4  2021-11-30  WHEAT-SRW - CHICAGO BOARD OF TRADE    371575           115857   
3  2021-12-07  WHEAT-SRW - CHICAGO BOARD OF TRADE    359186           111973   
2  2021-12-14  WHEAT-SRW - CHICAGO BOARD OF TRADE    361469           109485   
1  2021-12-21  WHEAT-SRW - CHICAGO BOARD OF TRADE    357381           103696   
0  2021-12-28  WHEAT-SRW - CHICAGO BOARD OF TRADE    356439           100581   

   large_spec_short  open_int_z  large_spec_long_z  large_spec_short_z  
9            101559    0.634538          -0.992457            2.013090  
8             91872    0.760503          -0.372972           -1.872693  
7             98262    0.627831           0.267395            0.690552  
6             95406    1.429391           1.490557           -0.455086  
5             97707    1.340261           2.075348            0.467923  
4             97563   -0.586741           0.133614            0.410159  
3             96306   -1.031100          -0.189523           -0.094066  
2             97520   -0.949215          -0.396517            0.392910  
1             94564   -1.095840          -0.878144           -0.792841  
0             94646   -1.129627          -1.137303           -0.759948  

EDIT: It seems you need an expanding mean and std per group, though I am not sure whether that is still a z-score:

fldf = fldf.sort_values(['market_exchange',"date"])

cols = fldf.columns.difference(['date','market_exchange'], sort=False)

g = fldf.groupby(by="market_exchange")[cols]
fldf = (fldf.join(fldf[cols].sub(g.expanding().mean().droplevel(0))
                            .div(g.expanding().std(ddof=1).droplevel(0)).add_suffix('_z')))

print (fldf)
        date                     market_exchange  open_int  large_spec_long  \
9 2021-10-26  WHEAT-SRW - CHICAGO BOARD OF TRADE    405625           102322   
8 2021-11-02  WHEAT-SRW - CHICAGO BOARD OF TRADE    409137           109768   
7 2021-11-09  WHEAT-SRW - CHICAGO BOARD OF TRADE    405438           117465   
6 2021-11-16  WHEAT-SRW - CHICAGO BOARD OF TRADE    427786           132167   
5 2021-11-23  WHEAT-SRW - CHICAGO BOARD OF TRADE    425301           139196   
4 2021-11-30  WHEAT-SRW - CHICAGO BOARD OF TRADE    371575           115857   
3 2021-12-07  WHEAT-SRW - CHICAGO BOARD OF TRADE    359186           111973   
2 2021-12-14  WHEAT-SRW - CHICAGO BOARD OF TRADE    361469           109485   
1 2021-12-21  WHEAT-SRW - CHICAGO BOARD OF TRADE    357381           103696   
0 2021-12-28  WHEAT-SRW - CHICAGO BOARD OF TRADE    356439           100581   

   large_spec_short  open_int_z  large_spec_long_z  large_spec_short_z  
9            101559         NaN                NaN                 NaN  
8             91872    0.707107           0.707107           -0.707107  
7             98262   -0.621640           1.005479            0.209335  
6             95406    1.480783           1.312050           -0.331934  
5             97707    0.968899           1.240242            0.207434  
4             97563   -1.782029          -0.260805            0.155496  
3             96306   -1.597471          -0.496349           -0.218924  
2             97520   -1.235904          -0.629503            0.180499  
1             94564   -1.179203          -0.970937           -0.811118  
0             94646   -1.071659          -1.078940           -0.720950  
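A rolling-window variant of the expanding computation follows the same pattern; the window length used here (4) is only an illustrative assumption, not something from the question:

```python
import pandas as pd

fldf = pd.DataFrame({
    'date': pd.to_datetime(['2021-10-26', '2021-11-02', '2021-11-09',
                            '2021-11-16', '2021-11-23', '2021-11-30',
                            '2021-12-07', '2021-12-14', '2021-12-21',
                            '2021-12-28']),
    'market_exchange': ['WHEAT-SRW - CHICAGO BOARD OF TRADE'] * 10,
    'open_int': [405625, 409137, 405438, 427786, 425301,
                 371575, 359186, 361469, 357381, 356439],
    'large_spec_long': [102322, 109768, 117465, 132167, 139196,
                        115857, 111973, 109485, 103696, 100581],
    'large_spec_short': [101559, 91872, 98262, 95406, 97707,
                         97563, 96306, 97520, 94564, 94646],
})

fldf = fldf.sort_values(['market_exchange', 'date'])
cols = fldf.columns.difference(['date', 'market_exchange'], sort=False)

# rolling mean/std per group; window=4 is an arbitrary illustrative choice
g = fldf.groupby('market_exchange')[cols]
roll_mean = g.rolling(4).mean().droplevel(0)
roll_std = g.rolling(4).std(ddof=0).droplevel(0)

fldf = fldf.join(fldf[cols].sub(roll_mean).div(roll_std).add_suffix('_z'))

# the first window-1 rows of each group have no full window yet, so they are NaN
```

Note that both the rolling and the expanding window still include the current row; to avoid look-ahead entirely, the rolling statistics could additionally be shifted by one row within each group before the subtraction and division.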
jezrael
  • Thanks @jezrael, how would you ensure the zscore function is not looking ahead for example there should be a NaN for the latest date of a grouped market_exchange I think? – TimExcellent Nov 10 '22 at 09:17
  • @TimExcellent - Not understand. – jezrael Nov 10 '22 at 09:20
  • @TimExcellent - What means `I am not "looking ahead" in my zscore as well?` ? need per each group set last row to NaNs? Or something else? – jezrael Nov 10 '22 at 09:50
  • I think I need a rolling window for the mean and std, – TimExcellent Nov 11 '22 at 03:28
  • @TimExcellent - Without expected output hard to know what need. If need `rolling` what is window? Or need `expanding` like in edited answer? – jezrael Nov 11 '22 at 06:40
  • I worked out how to do this in R, its a lot easier... dont want to comment much here, but seems like both langs have good cross over points to work together – TimExcellent Nov 15 '22 at 01:29