
I have a matrix of size (61964, 25). Here is a sample:

array([[  1.,   0.,   0.,   4.,   0.,   1.,   0.,   0.,   0.,   0.,   3.,
          0.,   2.,   1.,   0.,   0.,   3.,   0.,   3.,   0.,  14.,   0.,
          2.,   0.,   4.],
       [  0.,   0.,   0.,   1.,   2.,   0.,   0.,   0.,   0.,   0.,   1.,
          0.,   2.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   5.,   0.,
          0.,   0.,   1.]])

Scikit-learn provides a useful function for this, provided that our data are normally distributed:

from sklearn import preprocessing

# standardizes each selected column to zero mean and unit variance (column-wise)
X_2 = preprocessing.scale(X[:, :3])

My problem, however, is that I have to work on a row basis, and each row consists of only 25 observations, so the normal distribution is not applicable here. The solution is to use the t-distribution, but how can I do that in Python?

Normally, values go from 0 to, say, 20. When I see unusually high numbers, I filter out the whole row. The following histogram shows what my actual distribution looks like:

[Histogram of the row-value distribution]

  • Python 3.4 has a new module, statistics (https://docs.python.org/3/library/statistics.html), which will do the trick for you. – tommy.carstensen Feb 09 '15 at 12:01
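For reference, here is a minimal sketch of what that comment suggests, using only the standard-library statistics module on the first sample row from the question. This is just an illustration (note that statistics.stdev uses the sample standard deviation):

import statistics

# first sample row from the question
row = [1.0, 0.0, 0.0, 4.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 3.0,
       0.0, 2.0, 1.0, 0.0, 0.0, 3.0, 0.0, 3.0, 0.0, 14.0, 0.0,
       2.0, 0.0, 4.0]

mean = statistics.mean(row)
stdev = statistics.stdev(row)   # sample standard deviation (n - 1 in the denominator)

# standard score of every value in the row
z_scores = [(x - mean) / stdev for x in row]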

1 Answer


scipy.stats has the function zscore, which allows you to calculate how many standard deviations a value is above the mean (often referred to as the standard score or Z score).

If arr is the example array from your question, then you can compute the Z score across each row of 25 as follows:

>>> import scipy.stats as stats
>>> stats.zscore(arr, axis=1)
array([[-0.18017365, -0.52666143, -0.52666143,  0.8592897 , -0.52666143,
        -0.18017365, -0.52666143, -0.52666143, -0.52666143, -0.52666143,
         0.51280192, -0.52666143,  0.16631414, -0.18017365, -0.52666143,
        -0.52666143,  0.51280192, -0.52666143,  0.51280192, -0.52666143,
         4.32416754, -0.52666143,  0.16631414, -0.52666143,  0.8592897 ],
       [-0.43643578, -0.43643578, -0.43643578,  0.47280543,  1.38204664,
        -0.43643578, -0.43643578, -0.43643578, -0.43643578, -0.43643578,
         0.47280543, -0.43643578,  1.38204664, -0.43643578, -0.43643578,
        -0.43643578, -0.43643578, -0.43643578, -0.43643578, -0.43643578,
         4.10977027, -0.43643578, -0.43643578, -0.43643578,  0.47280543]])

This calculation uses the population mean and standard deviation for each row. To use the sample standard deviation instead (as with the t-statistic), additionally specify ddof=1:

stats.zscore(arr, axis=1, ddof=1)
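
As a side note, since the question mentions filtering out rows that contain unusually high values, the same function can be used for that by thresholding the Z scores. This is only a sketch; the cutoff of 3 standard deviations is an arbitrary choice for illustration, not something from the question:

import numpy as np
import scipy.stats as stats

# Z score of every value relative to its own row (sample standard deviation)
z = stats.zscore(arr, axis=1, ddof=1)

# keep only the rows whose values all lie within the chosen cutoff
cutoff = 3  # arbitrary threshold for this illustration
filtered = arr[np.all(np.abs(z) < cutoff, axis=1)]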
  • Hi, thank you very much for your reply! I didn't know about this function! Btw, are you sure that I should use `ddof=1`? Also, why do I get skewed results, in fact on the positive side? Any ideas? Can it be because of the many zeros in the initial table? How can I avoid that? – user706838 Feb 09 '15 at 16:22
  • Only use `ddof=1` if you want to correct for the sample bias - `zscore` uses `ddof=0` by default (i.e. the population SD). Regarding your edit, I'm not sure I follow what you're trying to do there... you want to filter out rows which have anomalously high values? – Alex Riley Feb 09 '15 at 22:31