I have a dataset with 100 rows and 21 columns, where the columns are the variables. I want to know whether these variables came from a multivariate normal distribution, so I used the normaltest function from the SciPy library, but I can't understand the results. Here is my code:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.random(2100).reshape(100,21)) # dataset (100x21)
k2, p = stats.normaltest(df)

In this example, p is an array of 21 values, not a single value. Can anybody explain how to interpret this array?

Jimena

1 Answer

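If p[x] < 0.05, you may conclude that the values in column x are not normally distributed. In a normality test, the null hypothesis is that the population is normally distributed, and the p-value is the probability of seeing data at least this far from normal if that hypothesis were true. A p-value below 0.05 means such data would be unlikely under normality, so the hypothesis is rejected at the 5% level. Conversely, if p[x] > 0.05, the test finds no evidence against normality (which is not the same as proving that the data are normal). You can easily check this with data drawn from a normal distribution: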

import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.normal(0,1,2100).reshape(100,21)) # dataset (100x21)
k2, p = stats.normaltest(df) # one test per column
print(p)

The output is

    [0.97228661 0.49017509 0.97373345 0.97404468 0.03498392 0.61963074
     0.07712131 0.52632157 0.29887186 0.30822356 0.14416431 0.11015074
     0.81773481 0.52919266 0.81859869 0.24855451 0.16817784 0.0117747
     0.76860707 0.40384319 0.97038048]

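with most of them larger than 0.05. Two of the 21 values (0.03498392 and 0.0117747) fall below 0.05 even though every column was drawn from a normal distribution; with 21 separate tests at the 5% level, about one such false rejection is expected by chance.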
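As a rough illustration of reading the array column by column, here is a minimal sketch that counts how many columns are rejected at the 5% level. The seed, the dictionary of two datasets, and the names `datasets`, `rejections` and `p_by_column` are assumptions made for this example; the uniform dataset mirrors the one in the question.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Two hypothetical 100x21 datasets: one normal, one uniform (as in the question).
datasets = {
    "normal": pd.DataFrame(rng.normal(0, 1, size=(100, 21))),
    "uniform": pd.DataFrame(rng.random(size=(100, 21))),
}

rejections = {}
for name, df in datasets.items():
    # axis=0 is the default: each column is tested separately,
    # so p holds one p-value per column.
    k2, p = stats.normaltest(df.to_numpy(), axis=0)
    p_by_column = pd.Series(p, index=df.columns, name="p_value")
    rejections[name] = int((p_by_column < 0.05).sum())
    print(f"{name}: {rejections[name]} of {len(p_by_column)} columns rejected at the 5% level")
```

Each entry of `p_by_column` answers the question "is this one column plausibly normal?" and nothing more; none of them says anything about the joint distribution of the 21 columns.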

To test for multivariate normality, you may try the Henze-Zirkler test:

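import pingouin as pg
# Note: newer pingouin releases return three values (hz, pval, normal);
# with those, unpack as: hz, p, normal = pg.multivariate_normality(df, alpha=.05)
normal, p = pg.multivariate_normality(df, alpha=.05)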

where .05 is the significance level (you may change it if you want; it only affects the True/False decision in normal, not the p-value you obtain).
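Per-column tests cannot stand in for a multivariate test, because a jointly normal dataset needs every linear combination of its columns to be normal, not just each column on its own. The following sketch uses a hypothetical two-column construction (the variables `x`, `sign` and `y`, the sample size and the seed are all assumptions for illustration): both columns are standard normal marginally, yet their sum is clearly not normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed, only for reproducibility
n = 2000

x = rng.normal(0, 1, size=n)
sign = rng.choice([-1.0, 1.0], size=n)  # random sign, independent of x
y = sign * x  # marginally standard normal, but tied to x

# Univariate tests on each column separately.
_, p_x = stats.normaltest(x)
_, p_y = stats.normaltest(y)

# x + y is exactly 0 whenever sign == -1 (about half the rows),
# so this linear combination cannot be normally distributed.
_, p_sum = stats.normaltest(x + y)

print(f"p for x: {p_x:.3f}, p for y: {p_y:.3f}, p for x + y: {p_sum:.3g}")
```

The p-value for `x + y` comes out far below 0.05, which is the kind of joint departure from normality that a dedicated multivariate test is designed to pick up.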

tianlinhe
  • Thank you so much for your response. However, I don't see why the output is an array of 21 values. What does each element of the array represent? – Jimena Mar 17 '20 at 08:17
  • This is because with `stats.normaltest()` you test whether the numbers in each column are normally distributed, so 21 p-values = 21 columns. I will update my answer with a possible test of multivariate normality; could you try that and let me know? (I somehow have a problem installing the package with pip.) – tianlinhe Mar 17 '20 at 08:41
  • If I've understood correctly, _stats.normaltest()_ is not a multivariate normality test. It's a univariate normality test that can be applied to several variables together. I've already tried your test and this is my output: Normal: `False` p: `0.04797787634013723` – Jimena Mar 17 '20 at 10:40
  • Yes, it should not have been used in your case. This is why you got this array of p-values from *stats.normaltest()*. – tianlinhe Mar 17 '20 at 10:46
  • I have used the pingouin package and pg.multivariate_normality(x, $\alpha$), but I find that if the dimension of x is greater than 50, the function returns nan as the p-value. I don't know why. – user1388672 Aug 11 '20 at 07:58