normality test with pre-aggregated data

Question

Using Spark, I aggregated the data for each group (cohort) so that it contains only the mean, standard deviation, and variance.

Now, in a second step using Python, I would like to test for normality with scipy.stats.normaltest (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html) and afterward for significance using either the t-test (stats.ttest_ind) or the Wilcoxon signed-rank test (stats.wilcoxon).

However, all these methods expect the data to be fed in as raw record-oriented values. How can I use them with the pre-aggregated data?

You need the raw (un-aggregated) data. The mean and standard deviation alone will tell you *nothing* about whether the original data was from a normal distribution. - Warren Weckesser, Sep 06 '19 at 21:28

score 2 · Accepted Answer · answered Sep 06 '19 at 16:19

Mean, standard deviation and variance are not enough to test for normality in each cohort. The standard deviation is just the square root of the variance, so you effectively have only two independent statistics, and many differently shaped distributions share the same mean and variance.
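To make that point concrete, here is a minimal sketch (the simulated samples and the helper `rescale` are purely illustrative, not part of the question's data). It builds one normal and one strongly skewed sample, forces both to the same mean and sample standard deviation, and then runs `scipy.stats.normaltest` on each:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_raw = rng.normal(size=5000)
skewed_raw = rng.exponential(size=5000)


def rescale(x, mean, std):
    """Shift and scale x so it has exactly the given mean and sample std."""
    return (x - x.mean()) / x.std(ddof=1) * std + mean


# Two cohorts with identical mean and standard deviation (illustrative values).
a = rescale(normal_raw, 10.0, 2.0)
b = rescale(skewed_raw, 10.0, 2.0)

print("cohort a:", a.mean(), a.std(ddof=1))
print("cohort b:", b.mean(), b.std(ddof=1))

# The normality test still needs the raw values; it separates the two
# cohorts even though their aggregated mean/std are indistinguishable.
_, p_a = stats.normaltest(a)
_, p_b = stats.normaltest(b)
print("normaltest p-values:", p_a, p_b)
```

The skewed cohort is rejected decisively, which is information the mean and standard deviation alone cannot carry.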

When aggregating, you could also (or instead) compute the skewness and kurtosis of each cohort and store the number of observations. The Jarque–Bera test is a normality test that depends only on the skewness, kurtosis and number of observations.
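A minimal sketch of the Jarque–Bera route from summary statistics follows. The function name `jarque_bera_from_stats` and the simulated cohorts are hypothetical. It assumes the stored skewness and kurtosis are the biased (population) moment estimates and that the kurtosis is the excess kurtosis (0 for a normal distribution); check which convention your aggregation step uses before relying on it. The p-value uses the asymptotic chi-squared distribution with 2 degrees of freedom, so it is only trustworthy for reasonably large cohorts. For the significance part of the question, `scipy.stats.ttest_ind_from_stats` accepts means, sample standard deviations and counts directly; a Wilcoxon test, by contrast, is rank-based and cannot be computed from these aggregates.

```python
import numpy as np
from scipy import stats


def jarque_bera_from_stats(n, skewness, excess_kurtosis):
    """Jarque-Bera statistic and asymptotic p-value from summary statistics.

    Assumes biased (population) skewness and excess kurtosis.
    """
    statistic = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    p_value = stats.chi2.sf(statistic, df=2)
    return statistic, p_value


# Simulated cohorts, used here only to compare against the raw-data functions.
rng = np.random.default_rng(0)
x = rng.exponential(size=2000)
y = rng.normal(loc=1.1, scale=1.0, size=1500)

# Normality test from (count, skewness, excess kurtosis) only.
jb_stat, jb_p = jarque_bera_from_stats(len(x), stats.skew(x), stats.kurtosis(x))
ref_stat, ref_p = stats.jarque_bera(x)
print(jb_stat, ref_stat)

# Welch t-test from (mean, sample std, count) only.
t_stat, t_p = stats.ttest_ind_from_stats(
    mean1=x.mean(), std1=x.std(ddof=1), nobs1=len(x),
    mean2=y.mean(), std2=y.std(ddof=1), nobs2=len(y),
    equal_var=False,
)
raw_t, raw_p = stats.ttest_ind(x, y, equal_var=False)
print(t_stat, raw_t)
```

Both summary-based results agree with the raw-data versions, so only the per-cohort count, mean, standard deviation, skewness and kurtosis need to leave the aggregation step.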
