
I have a vector of 30 samples, and I want to test the hypothesis that the sample comes from a normally distributed population.

> N.concentration
  [1] 0.164 0.045 0.069 0.100 0.050 0.080 0.043 0.036 0.057 0.154 0.133 0.193
  [13] 0.129 0.121 0.081 0.178 0.041 0.040 0.116 0.078 0.104 0.095 0.116 0.038
  [25] 0.141 0.100 0.104 0.078 0.121 0.104

I made a frequency vector using hist:

> N.hist <- hist(N.concentration, breaks=10)
> N.freq <- N.hist$counts
> N.freq
  [1] 3 5 4 4 5 4 2 2 1

I'm using chisq.test to check the goodness of fit of N.freq to a normal distribution. However, the function requires an argument p, "a vector of probabilities of the same length as x", as described in the chisq.test documentation. I'm trying to generate that vector but, honestly, I don't know exactly what to generate. This is my attempt:

> d <- length(N.freq) %/% 2
> p <- dnorm(c(-d:d))
> p
  [1] 0.0001338302 0.0044318484 0.0539909665 0.2419707245 0.3989422804
  [6] 0.2419707245 0.0539909665 0.0044318484 0.0001338302
> chisq.test(N.freq, p = p)
   Error in chisq.test(N.freq, p = p) : 
   probabilities must sum to 1.

I thought about using rescale.p=TRUE but I'm not sure if this will produce a valid test.


EDIT: If I use rescale.p = TRUE, I get a warning message:

> chisq.test(N.freq, p=p, rescale.p=TRUE)

Chi-squared test for given probabilities

data:  N.freq
X-squared = 2697.7, df = 8, p-value < 2.2e-16

Warning message:
In chisq.test(N.freq, p = p, rescale.p = TRUE) :
Chi-squared approximation may be incorrect
rvbarreto
  • I got a warning message if I use it – rvbarreto Nov 20 '16 at 19:30
  • I have to estimate them. The data were collected in the field, and there is no prior estimate of the mean or variance of the actual population. I want to check whether I can consider it normally distributed so I can run a t-test against another data set. – rvbarreto Nov 20 '16 at 19:35

2 Answers


As I said, to test normality we have to know the mean and standard deviation of the normal distribution under the null hypothesis. Since no values are given, we have to estimate them from your 30 data points.

x <- c(0.164, 0.045, 0.069, 0.1, 0.05, 0.08, 0.043, 0.036, 0.057, 
0.154, 0.133, 0.193, 0.129, 0.121, 0.081, 0.178, 0.041, 0.04, 
0.116, 0.078, 0.104, 0.095, 0.116, 0.038, 0.141, 0.1, 0.104, 
0.078, 0.121, 0.104)

mu <- mean(x)
sig <- sd(x)

Now, as you have done, we need to bin the data:

h <- hist(x, breaks = 10)
str(h)
#List of 6
# $ breaks  : num [1:10] 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
# $ counts  : int [1:9] 3 5 4 4 5 4 2 2 1
# $ density : num [1:9] 5 8.33 6.67 6.67 8.33 ...
# $ mids    : num [1:9] 0.03 0.05 0.07 0.09 0.11 0.13 0.15 0.17 0.19
# $ xname   : chr "x"
# $ equidist: logi TRUE
# - attr(*, "class")= chr "histogram"

To get the probabilities under the null hypothesis, we need the probability of each bin, i.e., the probability mass between consecutive breaks:

p <- diff(pnorm(h$breaks, mu, sig))
#[1] 0.05675523 0.10254734 0.15053351 0.17953337 0.17396679 0.13696059 0.08760419
#[8] 0.04552387 0.01921839
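
One reason these probabilities do not sum to 1 is that h$breaks only cover the observed range of the data, so the two tails of the fitted normal are left out; rescale.p = TRUE below takes care of that. As a sketch of an alternative (br and p2 are just illustrative names), you can extend the outer bins to -Inf and Inf so the full distribution is covered and p sums to 1:

# extend the first and last bins to cover the whole real line
br <- c(-Inf, h$breaks[2:(length(h$breaks) - 1)], Inf)
p2 <- diff(pnorm(br, mu, sig))
sum(p2)  # 1, up to floating-point rounding
# chisq.test(h$counts, p = p2) then needs no rescaling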

I tend not to trust a chi-squared test with only 30 data points. But here is how we can use chisq.test:

chisq.test(h$counts, p = p, rescale.p = TRUE)
#
#   Chi-squared test for given probabilities
#
#data:  h$counts
#X-squared = 3.1476, df = 8, p-value = 0.9248
#
#Warning message:
#In chisq.test(h$counts, p = p, rescale.p = TRUE) :
#  Chi-squared approximation may be incorrect
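
The warning appears because some of the expected counts are small (chisq.test warns whenever an expected cell count falls below 5). You can inspect them yourself, using the p and h from above:

# expected counts under the rescaled null probabilities; several are well below 5
sum(h$counts) * p / sum(p)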

Often you need not worry about the warning message. If you want to get rid of it, set simulate.p.value = TRUE:

chisq.test(h$counts, p = p, rescale.p = TRUE, simulate.p.value = TRUE)
#
#   Chi-squared test for given probabilities with simulated p-value (based
#   on 2000 replicates)
#
#data:  h$counts
#X-squared = 3.1476, df = NA, p-value = 0.942
Zheyuan Li
  • This is very good! Thank you very much! Since you said you would not trust the chi-squared test, would you mind suggesting a test for this situation? – rvbarreto Nov 20 '16 at 20:41

There are many statistical tests designed for testing departure from Normality of a specified set of data (e.g., see the nortest package). However, you should be aware that many statisticians feel that normality testing is "essentially useless": in particular (from an answer from the linked CrossValidated question):

The question scientists often expect the normality test to answer: Do the data deviate enough from the Gaussian ideal to "forbid" use of a test that assumes a Gaussian distribution? Scientists often want the normality test to be the referee that decides when to abandon conventional (ANOVA, etc.) tests and instead analyze transformed data or use a rank-based nonparametric test or a resampling or bootstrap approach. For this purpose, normality tests are not very useful.

However, going ahead and using the Shapiro-Wilk test from base R (according to the Wikipedia page, Shapiro-Wilk has good power, but note from the discussion above that high power is not necessarily what we really want in this case):

d <- c(0.164,0.045,0.069,0.100,0.050,0.080,0.043,0.036,0.057,0.154,
       0.133,0.193,0.129,0.121,0.081,0.178,0.041,0.040,0.116,0.078,
       0.104,0.095,0.116,0.038,0.141,0.100,0.104,0.078,0.121,0.104)
shapiro.test(d)
##  Shapiro-Wilk normality test
## 
## data:  d
## W = 0.9547, p-value = 0.2255
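
If you prefer one of the dedicated tests from the nortest package mentioned above, they are used the same way; a minimal sketch with the Anderson-Darling test (assuming the package is installed):

## install.packages("nortest")
library(nortest)
ad.test(d)  ## Anderson-Darling test of the null hypothesis of normality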

Graphical approach:

par(las=1,bty="l")
qqnorm(d)
qqline(d)

[Figure: normal Q-Q plot of the data with qqline reference line]

The points follow the line reasonably well. The largest deviations (the three smallest points in the data set) are actually larger than expected, which means the data set is slightly thin-tailed at the lower end; tests that assume normality will therefore typically be slightly conservative.

Ben Bolker