5

I found this code on the internet that compares a normal distribution to several Student's t distributions:

x <- seq(-4, 4, length=100)
hx <- dnorm(x)

degf <- c(1, 3, 8, 30)
colors <- c("red", "blue", "darkgreen", "gold", "black")
labels <- c("df=1", "df=3", "df=8", "df=30", "normal")

plot(x, hx, type="l", lty=2, xlab="x value",
  ylab="Density", main="Comparison of t Distributions")

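for (i in 1:4) {
  lines(x, dt(x, degf[i]), lwd=2, col=colors[i])
}

# Legend built from the labels and colors defined above (the normal is the dashed line)
legend("topright", inset=.05, title="Distributions",
  labels, lwd=2, lty=c(1, 1, 1, 1, 2), col=colors)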

I would like to adapt this to my situation, where I want to compare my data to a normal distribution. This is my data:

library(quantmod)
getSymbols("^NDX", src="yahoo", from='1997-06-01', to='2012-06-01')
daily <- allReturns(NDX)[, c('daily')]
dailySerieTemporel <- ts(data=daily)
ss <- na.omit(dailySerieTemporel)

The objective is to see whether my data is normal or not. Can someone help me out a bit with this? Thank you very much, I really appreciate it!

Alex Reynolds
jeremy.staub
  • Just keep in mind that there is no sure way to confirm that your data is normally distributed unless you can prove it analytically. It's like proving the all-swans-are-white hypothesis: even if you see a million white swans, it takes only one black swan to disprove it. This is why statistical tests are set up to reject the null hypothesis rather than to prove it. On the other hand, ignoring the million white swans you have seen because a black one might exist is ridiculously conservative. – Backlin Aug 06 '12 at 08:20

2 Answers

8

If you are only concerned with knowing whether your data is normally distributed, you can apply the Jarque-Bera test. Under its null hypothesis, your data is normally distributed; see details here. You can perform this test using the jarque.bera.test function from the tseries package.

 library(tseries)
 jarque.bera.test(ss)

    Jarque Bera Test

data:  ss 
X-squared = 4100.781, df = 2, p-value < 2.2e-16

Clearly, your data is not normally distributed, since the null hypothesis is rejected even at the 1% significance level.

To see why your data is not normally distributed, you can take a look at the descriptive statistics:

 library(fBasics)
 basicStats(ss)
                     ss
nobs        3776.000000
NAs            0.000000
Minimum       -0.105195
Maximum        0.187713
1. Quartile   -0.009417
3. Quartile    0.010220
Mean           0.000462
Median         0.001224
Sum            1.745798
SE Mean        0.000336
LCL Mean      -0.000197
UCL Mean       0.001122
Variance       0.000427
Stdev          0.020671
Skewness       0.322820
Kurtosis       5.060026

The last two rows show that ss has excess kurtosis (about 5.06, where a normal distribution has 0) and nonzero skewness (about 0.32). These two quantities are the basis of the Jarque-Bera test.
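To make the link between those two rows and the test explicit, here is a minimal sketch that recomputes the Jarque-Bera statistic by hand. It assumes simulated heavy-tailed data as a stand-in for ss (the variable `x` and the seed are illustrative, not from the question) and uses the usual formula JB = n/6 * (S^2 + K^2/4), where S is the sample skewness and K the sample excess kurtosis.

```r
# Stand-in for ss: simulated heavy-tailed "returns" (an assumption for illustration)
set.seed(1)
x <- 0.02 * rt(3776, df = 5)

n  <- length(x)
d  <- x - mean(x)
m2 <- mean(d^2)               # second central moment
S  <- mean(d^3) / m2^1.5      # sample skewness
K  <- mean(d^4) / m2^2 - 3    # sample excess kurtosis

JB <- n / 6 * (S^2 + K^2 / 4)                 # Jarque-Bera statistic
p  <- pchisq(JB, df = 2, lower.tail = FALSE)  # chi-squared with 2 df under the null

c(skewness = S, excess.kurtosis = K, JB = JB, p.value = p)
```

On the same input, the statistic should agree with the one reported by jarque.bera.test(x). A large JB value can come from the skewness term, the kurtosis term, or both, so printing S and K separately shows which one drives the rejection.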

But if you are interested in comparing the actual distribution of your data against a normally distributed random variable with the same mean and variance, you can first estimate the empirical density of your data with a kernel estimator and plot it. Then generate a normal random sample with the same mean and standard deviation as your data and overlay its estimated density, like this:

 plot(density(ss, kernel='epanechnikov'))
 set.seed(125)
 lines(density(rnorm(length(ss), mean(ss), sd(ss)), kernel='epanechnikov'), col=2)

[Figure: kernel density estimate of ss (black) with the density of the simulated normal sample overlaid (red)]

In the same fashion, you can overlay curves generated from other probability distributions.
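For example, a Student's t curve can be overlaid in the same way. A standard t distribution with 3 degrees of freedom has standard deviation sqrt(3), far wider than daily returns, so it needs a location and a scale before it lines up with the data. The sketch below is one way to do this, assuming simulated data as a stand-in for ss. The helper `dt_scaled` is a hypothetical name, and matching the sample standard deviation is just one possible choice of scale.

```r
set.seed(125)
ss_demo <- 0.02 * rt(3776, df = 4)   # stand-in for ss (assumption)

m  <- mean(ss_demo)
s  <- sd(ss_demo)
df <- 3

# Density of m + scale * T with T ~ t(df); scale is chosen so the sd equals s.
# Requires df > 2, since var(T) = df / (df - 2).
dt_scaled <- function(x, df, m, s) {
  scale <- s / sqrt(df / (df - 2))
  dt((x - m) / scale, df) / scale
}

grid <- seq(min(ss_demo), max(ss_demo), length.out = 400)
dens <- density(ss_demo, kernel = "epanechnikov")
yt   <- dt_scaled(grid, df, m, s)

# Widen the y-axis so the t curve is not clipped at its peak
plot(dens, ylim = c(0, max(dens$y, yt)), main = "Data vs. normal vs. t (df=3)")
lines(grid, dnorm(grid, m, s), col = 2, lty = 2)  # normal, same mean and sd
lines(grid, yt, col = 4, lwd = 2)                 # shifted and scaled t
legend("topright", c("data", "normal", "t, df=3"),
       col = c(1, 2, 4), lty = c(1, 2, 1), lwd = c(1, 1, 2))
```

Note that the third argument of rt() and dt() is a noncentrality parameter, not a mean, so the shift and scale have to be applied by hand as above. Using the analytic densities dnorm() and dt() also avoids the sampling noise of a simulated comparison curve.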

The tests suggested by @Alex Reynolds will help you if you want to know which distribution your data might have been drawn from. If that is your goal, take a look at the goodness-of-fit tests in any statistics textbook. Nevertheless, if you just want to know whether your variable is normally distributed, the Jarque-Bera test is good enough.

Jilber Urbina
  • A Kolmogorov-Smirnov test ([ks.test](http://stat.ethz.ch/R-manual/R-devel/library/stats/html/ks.test.html) in R) is another good option for testing normality. – David Robinson Aug 06 '12 at 00:07
  • Thank you very much! It's exactly what I wanted! I'll play around with it a bit now to get more familiar with all this. One more thing: how could I superimpose a Student's t distribution with, for example, df=3 on the graph you posted? – jeremy.staub Aug 06 '12 at 00:13
  • For some reason this doesn't work: lines(density(rt(length(ss), 3, mean(ss)), kernel='epanechnikov'), col=2) – jeremy.staub Aug 06 '12 at 00:27
  • 1
    @jeremy.staub FWIW, a substantial number of people in the [statistics community](http://stats.stackexchange.com/q/2492/5055) think that testing a distribution for deviations from normality is anywhere from completely to largely useless. – joran Aug 06 '12 at 01:19
4

Take a look at a Q-Q plot, or at the Shapiro-Wilk or K-S tests, to see if your data are normally distributed.
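A minimal sketch of these three checks in base R, assuming simulated heavy-tailed data as a stand-in for the returns (the variable `x` is illustrative). Note that shapiro.test() accepts between 3 and 5000 observations, and that the K-S p-value is only approximate when the mean and standard deviation are estimated from the same data.

```r
set.seed(42)
x <- 0.02 * rt(3776, df = 4)   # stand-in for the returns (assumption)

# Q-Q plot against the normal: heavy tails show up as an S-shaped departure
qqnorm(x)
qqline(x, col = 2)

# Shapiro-Wilk test (null hypothesis: the data are normal)
sw <- shapiro.test(x)

# Kolmogorov-Smirnov test against a normal with the sample mean and sd
ks <- ks.test(x, "pnorm", mean(x), sd(x))

c(shapiro.p = sw$p.value, ks.p = ks$p.value)
```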

Alex Reynolds
  • Thank you for your comment, but I actually already did a Q-Q plot and a Kolmogorov test and was still trying to get this graph done. I get something like an upside-down S on my Q-Q plot... does this mean that the kurtosis is a lot higher than in a normal model? – jeremy.staub Aug 05 '12 at 22:35
  • The (excess) kurtosis of a normal distribution is zero. So any deviation from this gets you away from a normal distribution. QQ is good for exploration, but perhaps use the KS and Shapiro-Wilk to get a numerical *p*-value for how far away your distributions are from a normal. – Alex Reynolds Aug 05 '12 at 22:52
  • In the fBasics package you will find 8 or 9 tests for normality. – Maciej Aug 05 '12 at 22:54
  • I know my data is not normal; I have already tested it. What I would like to do is illustrate it with a graph, and also potentially see visually which distribution my data comes from. – jeremy.staub Aug 05 '12 at 22:55
  • A QQ plot will show at a glance if a distribution is normal, so that would work. As far as what distribution your data come from, you could explore with Eureqa: http://creativemachines.cornell.edu/eureqa – Alex Reynolds Aug 05 '12 at 23:01
  • Depending on what software you're using, you can also perform QQ plots against other distributions. If the particular distribution is a good match, the data will fit linearly. For extra rigor, you will probably want *p*-values, though. – Alex Reynolds Aug 05 '12 at 23:07
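As a rough sketch of a Q-Q plot against a distribution other than the normal in base R, the following compares sample quantiles with Student's t quantiles. It assumes simulated data as a stand-in for the returns, and the candidate degrees of freedom (`df <- 4`) is an illustrative choice, not something estimated from the question's data.

```r
set.seed(7)
x <- 0.02 * rt(3776, df = 4)   # stand-in for the returns (assumption)

df   <- 4                      # candidate degrees of freedom (assumption)
p    <- ppoints(length(x))     # plotting positions
theo <- qt(p, df)              # theoretical t quantiles

# Sample quantiles vs. theoretical quantiles; a good match lies close to a line
qqplot(theo, x, xlab = "Theoretical t quantiles", ylab = "Sample quantiles")
qqline(x, distribution = function(q) qt(q, df), col = 2)

# Rough numerical summary of linearity (not a formal test)
r <- cor(theo, sort(x))
r
```

Repeating this for a few values of df and comparing how straight the points look is a quick way to explore candidate distributions before running a formal goodness-of-fit test.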