Hoping answers to my problem will be relevant to others, but I have a data frame that I want to calculate zscores on some columns, ignore two columns but retain all columns in the dataframe, (also new to python/data science so please dont assume high proficiency):
Dataframe:
date,market_exchange,open_int,large_spec_long,large_spec_short
2021-12-28,WHEAT-SRW - CHICAGO BOARD OF TRADE,356439,100581,94646
2021-12-21,WHEAT-SRW - CHICAGO BOARD OF TRADE,357381,103696,94564
2021-12-14,WHEAT-SRW - CHICAGO BOARD OF TRADE,361469,109485,97520
2021-12-07,WHEAT-SRW - CHICAGO BOARD OF TRADE,359186,111973,96306
2021-11-30,WHEAT-SRW - CHICAGO BOARD OF TRADE,371575,115857,97563
2021-11-23,WHEAT-SRW - CHICAGO BOARD OF TRADE,425301,139196,97707
2021-11-16,WHEAT-SRW - CHICAGO BOARD OF TRADE,427786,132167,95406
2021-11-09,WHEAT-SRW - CHICAGO BOARD OF TRADE,405438,117465,98262
2021-11-02,WHEAT-SRW - CHICAGO BOARD OF TRADE,409137,109768,91872
2021-10-26,WHEAT-SRW - CHICAGO BOARD OF TRADE,405625,102322,101559
I want to ignore 'date' and 'market_exchange' from the zscore calculation, but not delete these columns in the resulting new data frame.
I then want to calculate the zscore on open_int, large_spec_long, large_spec_short but retain the dataframe structure so that when I print out or export to csv I have new columns that append a '_z' so that for example on 'open_int' column, 'open_int_z' (being a new column containing the zscore of open_int column) column would be added to the end of the dataframe - the same would apply for the 'large_spec_long' etc adding a new 'columnname_z' to the end of the data frame.
I have tried methods from a number of stack answers, this one is closest but cols.remove doesnt accept more than one argument and I'm not sure its correct just to add another cols.remove as per below:
#sort by date then group market_exchange
fldf.sort_values(by="date", ascending=True).groupby(by="market_exchange", dropna=False) #dropna could be unrequired
# fldf.head(3)
cols = list(fldf.columns)
cols.remove('date')
cols.remove('market_exchange')
for col in cols:
col_zscore = col + '_z'
fldf[col_zscore] = (fldf[col] - fldf[col].mean())/fldf[col].std(ddof=0)
fldf.head(10)
Is this the correct way of doing this, I am also wanting to make sure I am not "looking ahead" in my zscore as well?
I have been trying to adapt from this thread Pandas - Compute z-score for all columns