
I am trying to calculate z-scores for a dataset using scipy.stats, and I am running into a subtle error that I cannot figure out. The code runs, but it appears to produce values that are slightly off, which I am concerned is adversely affecting a PCA I am running on the normalized dataset.

I have the following data in a list:

mylist = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]

I run the following commands to Z-score normalize the data using scipy.stats:

import scipy.stats
zscore_list = scipy.stats.zscore(mylist)

Result:

[-1.11793077, -0.36479846,  0.31772769,  1.61217384, -1.17676923, 0.72959692]
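
For what it's worth, a quick cross-check with plain NumPy (whose np.std also defaults to ddof=0) reproduces the same numbers:

import numpy as np

arr = np.asarray(mylist)
print((arr - arr.mean()) / arr.std())  # same values as scipy.stats.zscore above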

However, when I calculate the same data manually, I get a different result:

import statistics as stats

mean = stats.mean(mylist)
stdev = stats.stdev(mylist)
for x in mylist:
    print((x - mean) / stdev)

Result:

-1.0205264990693814
-0.33301391022264026
0.29004437341971895
1.471706635500054
-1.074238420073032
0.6660278204452793

I have tried various things to address the issue, including converting mylist into a NumPy array and passing axis=None and ddof=0 to scipy.stats.zscore, but nothing changes the result.


1 Answer


UPDATE: I figured it out, and I'm leaving this up for anyone who runs into the same issue.

There are two ways to calculate standard deviation: the population standard deviation, where the sum of squared deviations from the mean is divided by the population size N (before taking the square root), and the sample standard deviation, where it is divided by n - 1 instead (Bessel's correction).
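
A quick illustration with the standard-library statistics module, where pstdev divides by N and stdev divides by n - 1:

import statistics

data = [0.565, 0.629, 0.687, 0.797, 0.56, 0.722]
print(statistics.pstdev(data))  # population std, divide by N:    ~0.0850
print(statistics.stdev(data))   # sample std, divide by n - 1:    ~0.0931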

As it turns out, NumPy (and hence SciPy) uses the population standard deviation by default (ddof=0). For most applications where you are inferring population-level statistics from a sample, the sample standard deviation is the more appropriate choice. Setting ddof=1 in my call to scipy.stats.zscore corrected the calculation.
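
In code, the fix is just the ddof argument; the values match the manual loop above:

import scipy.stats

print(scipy.stats.zscore(mylist, ddof=1))
# matches the manual calculation: -1.0205..., -0.3330..., 0.2900..., 1.4717..., -1.0742..., 0.6660...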

I hope this is helpful to someone!
