
I was watching the StatQuest video about independent filtering and I wanted to recreate his example.

https://www.youtube.com/watch?v=Gi0JdrxRq5s&t=208s

So basically he ran two tests:

In the first, he performed 1,000 t-tests, each comparing triplicates drawn from two different normal distributions. Most of the p-values were < 0.05 but, due to the 5% error rate, he showed that roughly 50 of them were > 0.05; those are the "false negatives".
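The first test can be sketched along these lines. The group means (0 vs 5) and sd (1) are assumptions on my part, picked so that most of the 1,000 tests reject; the video may use different values.

```r
# Sketch of the first test: triplicates from two different normal
# distributions. The means (0 vs 5) and sd (1) are assumed, not from the video.
set.seed(1)
pvals_alt <- replicate(
  1000,
  t.test(rnorm(3, mean = 0), rnorm(3, mean = 5), alternative = "two.sided")$p.value
)
false_neg <- sum(pvals_alt > 0.05)  # tests that miss the real difference
false_neg
```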

In the second, he performed 1,000 t-tests on triplicates drawn from the same normal distribution. Obviously, few p-values were < 0.05, and he obtained a roughly uniform distribution of p-values ranging from 0 to 1.

So I used the following code to create something similar for the second "test" and obtained this: [density plot of the p-values]

Here's the code (not perfect, it's a quick and dirty check):

# 10,000 two-sided t-tests on triplicates from the same standard normal
pvals <- replicate(10000, t.test(rnorm(3), rnorm(3), alternative = "two.sided", conf.level = 0.95)$p.value)

# Histogram of the p-values, with the bars below 0.05 in green
h <- hist(pvals, breaks = 25, plot = FALSE)
cuts <- cut(h$breaks, c(-Inf, 0.0499999999, Inf))
plot(h, col = c("green", "red")[cuts])

# How many p-values fall below 0.05?
sum(pvals < 0.05)

Clearly, the number of p-values < 0.05 is lower than expected. Statistically, I would expect around 500 p-values < 0.05 (5% of 10,000 tests), but that's not the case. I think my code is wrong, but I don't see where...

Thanks in advance for your help!

1 Answer


The issue comes from the small sample size of the test (with only 3 observations per group, R's default Welch t-test is conservative under the null). It is easy to show by simulation:

# n = 3 per group: consistently fewer than ~500 significant results per 10,000 tests
replicate(10, sum(replicate(10000, t.test(rnorm(3), rnorm(3), alternative = "two.sided", conf.level = 0.95)$p.value) < .05))
#332 370 397 368 398 339 362 334 325 350

# n = 30 per group: close to the nominal 500
replicate(10, sum(replicate(10000, t.test(rnorm(30), rnorm(30), alternative = "two.sided", conf.level = 0.95)$p.value) < .05))
#504 534 511 475 502 490 497 537 506 527
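One way to probe this (my addition, not part of the original answer): force the classical pooled-variance t-test with `var.equal = TRUE`. Under the null, the pooled t-statistic has an exact t distribution, so the p-values are uniform even at n = 3, and roughly 5% should land below 0.05.

```r
# Same null simulation, but with the pooled-variance Student t-test instead of
# the default Welch approximation; under the null its p-values are exactly uniform.
set.seed(42)
n_sig <- sum(replicate(
  10000,
  t.test(rnorm(3), rnorm(3), var.equal = TRUE)$p.value
) < 0.05)
n_sig  # should be close to the nominal 500
```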
Baraliuh
  • Hey, thanks! Yeah, I've seen that too, but in his video/example Josh was talking about a sample size of 3. Did he then use 3 just to illustrate, but more replicates in his actual code? :) – Christophe Vanhaver May 28 '21 at 17:58
  • No problem, I must admit I did not watch the video. If you decrease the sd of your normal distributions, you will get the same effect as increasing your sample size. Did he mention using a standard normal distribution? – Baraliuh May 28 '21 at 18:08
  • Hi, not really. He just showed density plots of two normal distributions, but I don't think they were actually generated in R; I think it was just a graphic to illustrate his point. – Christophe Vanhaver May 31 '21 at 07:50
  • OK, I think the issue comes from the fact that the probability of error is based on the true parameter values of the random variables, while the same might not hold for estimates of those parameters. If you generate random variables with good estimates (e.g., from a large sample size), then you will get 5% false positives (i.e., the % error). If you overestimate, e.g., the variance, then this will decrease. The same thing will happen for true positives. – Baraliuh May 31 '21 at 15:58