49

I need to use normaltest in scipy to test whether a dataset is normally distributed, but I can't seem to find any good examples of how to use scipy.stats.normaltest.

My dataset has more than 100 values.

Martin Thoma
The Demz

2 Answers

77
In [12]: import scipy.stats as stats

In [13]: x = stats.norm.rvs(size = 100)

In [14]: stats.normaltest(x)
Out[14]: (1.627533590094232, 0.44318552909231262)

normaltest returns a 2-tuple: the chi-squared statistic and the associated p-value. Under the null hypothesis that x came from a normal distribution, the p-value is the probability of seeing a chi-squared statistic that large (or larger).
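As a rough check of how the two numbers relate: the statistic is compared against a chi-squared distribution with two degrees of freedom, whose upper tail has the closed form exp(-statistic / 2). The sketch below plugs in the statistic from Out[14] above; the variable names are only illustrative.

```python
import math
from scipy import stats

statistic = 1.627533590094232  # statistic reported in Out[14]

# Upper-tail probability of a chi-squared distribution with 2 degrees of freedom.
p_from_chi2 = stats.chi2.sf(statistic, df=2)

# For 2 degrees of freedom the same tail probability is exp(-statistic / 2).
p_closed_form = math.exp(-statistic / 2)

print(p_from_chi2, p_closed_form)
```

Both values should be close to the p-value that normaltest reported for x.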

If the p-value is very small, it is unlikely that the data came from a normal distribution, so you reject the null hypothesis. For example:

In [15]: y = stats.uniform.rvs(size = 100)

In [16]: stats.normaltest(y)
Out[16]: (31.487039026711866, 1.4543748291516241e-07)
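Outside an interactive session, the same check might look like the sketch below. The helper name looks_non_normal, the significance level of 0.05, the seed, and the sample size of 1000 are all assumptions made for illustration, not part of scipy.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level, fixed before looking at the data


def looks_non_normal(sample, alpha=ALPHA):
    """Run stats.normaltest and return (reject, statistic, pvalue)."""
    statistic, pvalue = stats.normaltest(sample)
    # Reject the null hypothesis of normality when the p-value is below alpha.
    return bool(pvalue < alpha), float(statistic), float(pvalue)


rng = np.random.default_rng(0)  # fixed seed so reruns give the same numbers
normal_sample = rng.normal(size=1000)
uniform_sample = rng.uniform(size=1000)

for name, sample in [("normal", normal_sample), ("uniform", uniform_sample)]:
    reject, statistic, pvalue = looks_non_normal(sample)
    print(f"{name}: statistic={statistic:.3f}, p={pvalue:.3g}, reject normality: {reject}")
```

A p-value above the threshold does not prove the data are normal; it only means the test found no strong evidence against normality.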
unutbu
  • 9
    How do we quantify "very small" here? – AmanArora Jul 08 '15 at 10:21
  • 3
    It is an arbitrary choice: http://stats.stackexchange.com/a/55693/842. Just be sure to decide what your significance level is *before* applying a statistical test. – unutbu Jul 08 '15 at 10:30
13

First, I found out that scipy.stats.normaltest and scipy.stats.mstats.normaltest are almost the same. The mstats module is for masked arrays: arrays in which you can mark values as invalid so that they are left out of the calculation.

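import numpy as np
import numpy.ma as ma
from scipy.stats import mstats

# normaltest rejects very small samples and warns below 20 values,
# so this example uses more than 20 points.
x = np.array([1, 2, 3, -1, 5, 7, 3, 4, 2, 6, 5, 3, 4, 8, 2, 5, 3, 6, 4, 7, 1, 5])
mx = ma.masked_array(x, mask=(x == -1))  # mask the invalid value -1
z, pval = mstats.normaltest(mx)

# Threshold as originally posted; the conventional level is 0.05.
if pval < 0.055:
    print("Not normal distribution")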

"Traditionally, in statistics, you need a p-value of less than 0.05 to reject the null hypothesis." - http://mathforum.org/library/drmath/view/72065.html
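To illustrate the "almost the same" point, here is a small sketch comparing mstats.normaltest on a masked array with scipy.stats.normaltest on the same data with the invalid entries dropped. The sentinel value -999.0, the seed, and the sample parameters are hypothetical choices for this example; the two results are expected to agree closely, not guaranteed to be bit-identical.

```python
import numpy as np
import numpy.ma as ma
from scipy import stats
from scipy.stats import mstats

rng = np.random.default_rng(1)
values = rng.normal(loc=10.0, scale=2.0, size=200)
values[::25] = -999.0  # hypothetical sentinel marking invalid readings

invalid = values == -999.0
masked = ma.masked_array(values, mask=invalid)

# Masked entries are ignored by the mstats version of the test.
stat_masked, p_masked = mstats.normaltest(masked)

# Equivalent plain-array approach: drop the invalid entries first.
stat_plain, p_plain = stats.normaltest(values[~invalid])

print(f"mstats on masked array:  statistic={float(stat_masked):.4f}, p={float(p_masked):.4f}")
print(f"stats on filtered array: statistic={float(stat_plain):.4f}, p={float(p_plain):.4f}")
```

If the data contain no invalid values, the plain scipy.stats.normaltest is the simpler choice.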

Glorfindel
The Demz
  • 14
    Why `< 0.055` instead of `< 0.05`? – Olli May 05 '14 at 08:52
  • 2
    If the p-value is very small, it means it is unlikely that the data came from a normal distribution. 0.05 is the standard threshold, but to be more certain you can raise the threshold to 0.055 or something else. It's just a threshold for saying yes, it is a normal distribution. – The Demz Dec 24 '14 at 12:09
  • 6
    The Demz, raising the threshold to 0.055 would mean less certainty that the data came from a normal distribution. You would want to lower your p value threshold below the standard 0.05 to decrease the chances of erroneously rejecting the null hypothesis that the distribution is normal. – jeffhale Mar 17 '18 at 13:14