3

I have data that is grouped, and split into training and test sets. I am looking to compute z-scores. On the training set, this is easy, as I can use built-in functions to compute the mean and standard deviation.

Here is an example, where I am looking for the z-scores by place: import pandas as pd import numpy as np # My example dataframe

train = pd.DataFrame({'place':     ['Winterfell','Winterfell','Winterfell','Winterfell','Dorne', 'Dorne','Dorne'],
                      'temp' : [ 23 , 10 , 0 , -32, 90, 110, 100 ]})
test  = pd.DataFrame({'place': ['Winterfell', 'Winterfell', 'Dorne'],
                      'temp' : [6, -8, 100]})

# get the z-scores by group for the training set
train.loc[: , 'z' ] = train.groupby('place')['temp'].transform(lambda x: (x - x.mean()) / x.std())

Now the training dataframe takes the form:

|    Place   | temp |   z   |
|------------|------|-------|
| Winterfell |    23| 0.969 |
| Winterfell |    10| 0.415 |
| Winterfell |     0|-0.011 |
| Winterfell |   -32|-1.374 |
|      Dorne |    90| 1.000 |
|      Dorne |   110|-1.000 |
|      Dorne |   100| 0.000 | 

which is what I want.

The problem is that I now want to use the mean and standard deviations from the training set to calculate the z-scores in the test set. I can get the mean and standard deviation easily enough:

summary = train.groupby('place').agg({'temp' : [np.mean, np.std]} ).xs('temp',axis=1,drop_level=True)

print(summary)

          mean        std
place                        
Dorne       100.00  10.000000
Winterfell    0.25  23.471614

I have some complicated ways of doing what I want, but since this is a task I have to do often, I am looking for a tidy way of doing it. Here is what I have tried so far:

  1. Making a dictionary dict out of the summary table, where I can extract the mean and standard devation as a tuple. Then on the test set, I can do an apply:

    test.loc[: , 'z'] = test.apply(lambda row: (row.temp - dict[row.place][0]) / dict[row.place][1] ,axis = 1)
    

Why I don't like it:

  • dictionary makes it hard to read, need to know what the structure of dict is.
  • If a place appears in the test set but not the training set, instead of getting a NaN, the code will throw an error.

    1. Using an index

      test.set_index('place', inplace = True)
      test.loc[:, 'z'] = (test['temp'] - summary['mean'])/summary['std']
      

Why I don't like it: - Looks like it should work, but instead gives me only NaNs

The final result should be Is there a standard pythonic way of doing this sort of combination?

Damien Martin
  • 132
  • 1
  • 9
  • This answer may help you:https://stackoverflow.com/questions/24761998/pandas-compute-z-score-for-all-columns – walker_4 Aug 29 '17 at 22:27
  • Thanks! I saw that one while composing my solution, although that one focuses on computing z-scores from the data in the data frame, rather than using the means from a separate data frame. The time series example comes close to doing what I am looking for, though. – Damien Martin Aug 29 '17 at 22:37

1 Answers1

4

Option 1
pd.Series.map

test.assign(z=
    (test.temp - test.place.map(summary['mean'])) / test.place.map(summary['std'])
)

        place  temp         z
0  Winterfell     6  0.244977
1  Winterfell    -8 -0.351488
2       Dorne   100  0.000000

Option 2
pd.DataFrame.eval

test.assign(z=
    test.join(summary, on='place').eval('(temp - mean) / std')
)

        place  temp         z
0  Winterfell     6  0.244977
1  Winterfell    -8 -0.351488
2       Dorne   100  0.000000
piRSquared
  • 285,575
  • 57
  • 475
  • 624