I have data that is grouped, and split into training and test sets. I am looking to compute z
-scores. On the training set, this is easy, as I can use built-in functions to compute the mean and standard deviation.
Here is an example, where I am looking for the z-scores by place: import pandas as pd import numpy as np # My example dataframe
train = pd.DataFrame({'place': ['Winterfell','Winterfell','Winterfell','Winterfell','Dorne', 'Dorne','Dorne'],
'temp' : [ 23 , 10 , 0 , -32, 90, 110, 100 ]})
test = pd.DataFrame({'place': ['Winterfell', 'Winterfell', 'Dorne'],
'temp' : [6, -8, 100]})
# get the z-scores by group for the training set
train.loc[: , 'z' ] = train.groupby('place')['temp'].transform(lambda x: (x - x.mean()) / x.std())
Now the training dataframe takes the form:
| Place | temp | z |
|------------|------|-------|
| Winterfell | 23| 0.969 |
| Winterfell | 10| 0.415 |
| Winterfell | 0|-0.011 |
| Winterfell | -32|-1.374 |
| Dorne | 90| 1.000 |
| Dorne | 110|-1.000 |
| Dorne | 100| 0.000 |
which is what I want.
The problem is that I now want to use the mean and standard deviations from the training set to calculate the z-scores in the test set. I can get the mean and standard deviation easily enough:
summary = train.groupby('place').agg({'temp' : [np.mean, np.std]} ).xs('temp',axis=1,drop_level=True)
print(summary)
mean std
place
Dorne 100.00 10.000000
Winterfell 0.25 23.471614
I have some complicated ways of doing what I want, but since this is a task I have to do often, I am looking for a tidy way of doing it. Here is what I have tried so far:
Making a dictionary
dict
out of the summary table, where I can extract the mean and standard devation as a tuple. Then on the test set, I can do an apply:test.loc[: , 'z'] = test.apply(lambda row: (row.temp - dict[row.place][0]) / dict[row.place][1] ,axis = 1)
Why I don't like it:
- dictionary makes it hard to read, need to know what the structure of
dict
is. If a place appears in the test set but not the training set, instead of getting a NaN, the code will throw an error.
Using an index
test.set_index('place', inplace = True) test.loc[:, 'z'] = (test['temp'] - summary['mean'])/summary['std']
Why I don't like it: - Looks like it should work, but instead gives me only NaNs
The final result should be Is there a standard pythonic way of doing this sort of combination?