
I am attempting to calculate z-scores for a series of columns at once, but inspecting the data reveals that the mean values for some columns are NOT 0, as one would expect after a z-score transformation.

As you can see by running the code below, columns a and d do not have a mean of 0 in the newly created *_zscore columns.

import pandas as pd
df = pd.DataFrame({'a': [500,4000,20], 'b': [10,20,30], 'c': [30,40,50], 'd':[50,400,20] })

cols = list(df.columns)
for col in cols:
    col_zscore = col + '_zscore'
    df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)

print(df.describe())

My actual data is obviously different, but the results are similar (i.e., non-zero means). I have also tried

from scipy import stats
stats.zscore(df)

which leads to a similar result. Doing the same transformation in R (i.e., scaled.df <- scale(df)) works, though.

Does anyone have an idea what is going on here? The affected columns contain larger values, but it should still be possible to z-transform them.

EDIT: as Rob pointed out, the results are essentially 0.

  • I think it's simply a rounding error. The non-zero values are basically 0 down to the 16th digit after the decimal point. Maybe R uses different floating-point arithmetic. – luigigi Feb 07 '20 at 09:21

1 Answer


Your mean values are of the order of 10^-17, which for all practical purposes is equal to zero. The reason you do not get exactly zero has to do with the way floating-point numbers are represented (finite precision).

I'm surprised that you don't see it in R, but that may have to do with the example you use and the fact that scale is implemented a bit differently in R (e.g., it normalizes with ddof=1). But in R you can see the same thing happening:

> mean(scale(c(5000,40000,2000)))
[1] 7.401487e-17
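The same check works in pandas directly. Here is a minimal sketch (using the toy data from the question) that shows the residual means are tiny floating-point leftovers, which numpy treats as zero within tolerance:

```python
import numpy as np
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({'a': [500, 4000, 20], 'b': [10, 20, 30],
                   'c': [30, 40, 50], 'd': [50, 400, 20]})

# Column-wise z-scores, population standard deviation (ddof=0),
# matching both the question's loop and scipy.stats.zscore.
zscores = (df - df.mean()) / df.std(ddof=0)

# The means are not exactly 0, but they are zero up to
# floating-point rounding error.
print(zscores.mean())
print(np.allclose(zscores.mean(), 0.0))  # zero within default tolerance
```

If an exact 0 matters for display, rounding (e.g. zscores.mean().round(10)) makes the residue disappear.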