R: Sample Size Considerations in QQ-plots

Question

It is common to use graphics to assess the normality of a given sample. However, Q-Q plots require large sample sizes to reliably represent the population being sampled. Some texts say that a sample size of at least a thousand is desirable. Here is some sample R code that illustrates this:

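par(mfrow = c(2, 3))  # 2 x 3 grid, one panel per sample size
for (i in c(10, 100, 1e+3, 1e+4, 1e+5, 1e+6)) {
  data <- rnorm(i, mean = 0, sd = 1)  # standard normal sample of size i
  qqnorm(data, main = sprintf("Sample Size=%d", i))
  qqline(data, col = 'red')  # reference line through the sample quartiles
}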

The code produces the following:

[Figure: 2 x 3 grid of normal Q-Q plots, one panel per sample size]

Question 1: How large would my sample need to be to hit, say, a -/+6 sigma on the theoretical? In theory, a six sigma event (under a normal distribution) occurs 1 time in 506797346! What do you think?
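
As a quick check of the figure quoted above, the two-sided tail probability beyond 6 standard deviations can be computed in base R. The second half is only a sketch of one possible reading of "hit": it assumes the default plotting positions that `qqnorm()` takes from `ppoints()` (which are (i - 0.5)/n for n > 10) and asks how large n must be before the largest theoretical quantile reaches 6. The helper name `max_theoretical` is made up for illustration.

```r
# Two-sided probability of falling beyond +/- 6 standard deviations
p_six <- 2 * pnorm(-6)
p_six        # about 1.97e-09
1 / p_six    # about 5.07e+08, i.e. roughly 1 in 507 million

# Largest theoretical quantile on the x-axis of qqnorm() for a given n,
# assuming the default ppoints() positions (i - 0.5) / n, valid for n > 10.
# The upper-tail form avoids the rounding error of computing 1 - 0.5 / n.
max_theoretical <- function(n) qnorm(0.5 / n, lower.tail = FALSE)
sapply(c(1e3, 1e4, 1e5, 1e6), max_theoretical)

# Smallest n whose largest plotting position reaches 6:
# solve 0.5 / n = pnorm(-6) for n
n_needed <- 0.5 / pnorm(-6)
n_needed
```

Under that assumption the required n comes out at the same order of magnitude as the 1-in-506797346 figure, because the largest plotting position sits half an observation away from probability 1.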

Question 2: Regardless of the sample size, there are always a few points at the extremes that tail off the trend line. This seems to be "normal" and expected behavior. Could somebody explain the rationale behind it?

Thx, Riad

This question appears to be off-topic because it is about statistical theory and the poster has not defined a coding problem. — IRTFM, Apr 06 '14 at 23:33
You might want to look at the `qqPlot` function in the car package. It adds pointwise confidence envelopes to the plot by default. The envelope is adjusted for sample size so you're really looking to see if points fall outside the envelope. — Dason, Apr 06 '14 at 23:50
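
A minimal sketch of that suggestion, assuming the car package is installed (the guard falls back to a plain base R plot otherwise). Only the default arguments of `qqPlot` are used here, and the sample is made up for illustration.

```r
set.seed(7)      # arbitrary seed, only for reproducibility
x <- rnorm(100)  # illustrative sample

if (requireNamespace("car", quietly = TRUE)) {
  # Normal Q-Q plot with the default pointwise confidence envelope
  car::qqPlot(x)
} else {
  # Fallback without an envelope if car is not available
  qqnorm(x)
  qqline(x, col = "red")
}
```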

Nathaniel Payne · Answer 1 · 2014-04-08T14:33:04.677

As a general response to your questions, I would first refer you to an excellent post that covers the topic quite nicely here. The comments below summarize the work done by the authors there.

In general, with a Q-Q plot, the basic idea is to compute the theoretically expected value for each data point based on the distribution in question. If the data follows the selected distribution, then the points on the Q-Q plot should be approximately on the straight line.
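
To make that idea concrete, here is a small sketch that builds the normal Q-Q plot by hand in base R and checks it against what `qqnorm()` computes. The sample and the variable names (`theo`, `samp`) are illustrative only.

```r
set.seed(1)                # arbitrary seed, only for reproducibility
x <- rnorm(50)             # illustrative sample

p    <- ppoints(length(x)) # plotting positions used by qqnorm() by default
theo <- qnorm(p)           # theoretically expected standard normal quantiles
samp <- sort(x)            # ordered data

plot(theo, samp, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(x, col = "red")     # line through the first and third quartiles

# qqnorm() returns the same coordinates, in the original data order
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theo)
all.equal(sort(qq$y), samp)
```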

To help with interpreting the plots, here are some pointers. Note that there is a subjective element to some of the interpretation, which is reflected below:

If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line y = x.
If the theoretical and data distributions differ only in their location or scale, the points on the plot fall on or near the line y = ax + b. The slope a and intercept b are visual estimates of the scale and location parameters of the theoretical distribution.
Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters because the x-axis of a Q-Q plot is scaled linearly. On the other hand, probability plots are more convenient for estimating percentiles or probabilities.

SAS, which I use at work, has an excellent discussion of Q-Q plot interpretation. As they note, and I quote:

"In general, there are many reasons why the point pattern in a Q-Q plot may not be linear. Chambers et al. (1983) and Fowlkes (1987) discuss the interpretations of commonly encountered departures from linearity. They provide great places to start. Here is a little summary:

all but a few points fall on a line -> outliers in the data
left end of pattern is below the line; right end of pattern is above the line -> long tails at both ends of the data distribution
left end of pattern is above the line; right end of pattern is below the line -> short tails at both ends of the data distribution
curved pattern with slope increasing from left to right -> data distribution is skewed to the right
curved pattern with slope decreasing from left to right -> data distribution is skewed to the left
staircase pattern (plateaus and gaps) -> data have been rounded or are discrete"
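
One way to get a feel for these patterns is to simulate data with known departures from normality and look at the resulting plots. The distributions below are illustrative choices for each case, not taken from the quoted source.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
n <- 500

samples <- list(
  "Normal"                = rnorm(n),
  "Long tails (t, 3 df)"  = rt(n, df = 3),
  "Short tails (uniform)" = runif(n, -1, 1),
  "Right skew (exp)"      = rexp(n),
  "Left skew (-exp)"      = -rexp(n),
  "Rounded normal"        = round(rnorm(n))
)

par(mfrow = c(2, 3))
for (name in names(samples)) {
  qqnorm(samples[[name]], main = name)
  qqline(samples[[name]], col = "red")
}
```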

Finally, the sample size should be taken into account when judging how close the Q-Q plot is to the straight line. With a small n, you would expect some random chance deviations to show up at the ends of the line in the Q-Q plot.

Also, I wanted to add that, in terms of the sample size question, the Q-Q plot can be used to formally test the null hypothesis that the data are normal. This is done by computing the correlation coefficient of the n points in the q-q plot. Depending upon n, the null hypothesis is rejected if the correlation coefficient is less than a threshold. The threshold is already quite close to 0.95 for modest sample sizes. You should be able to work from that information to compute a sample size that will provide you with one six sigma event per sample. — Nathaniel Payne, Apr 06 '14 at 07:48
very nice answer, but better for CrossValidated than here ... — Ben Bolker, Apr 06 '14 at 23:35
You are probably well aware by now that it is not OK to copy and paste large chunks of text verbatim from other sources without attribution. Please be sure to update this answer accordingly, as these thoughts do not appear to be your own - http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_univariate_sect040.htm For help on quoting external sources, see http://stackoverflow.com/help/referencing — BoltClock, Apr 08 '14 at 10:00
Thanks @BoltClock. Have absolutely got a better handle on it now. Am making changes to the post as we speak! — Nathaniel Payne, Apr 08 '14 at 14:28
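
The correlation-based check described in the comment above can be sketched in base R. This is a rough illustration, not a reference implementation: the rejection threshold is estimated by Monte Carlo simulation rather than read from published tables, and the function names `qq_cor` and `qq_cor_threshold` are hypothetical.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

# Correlation between the data and their theoretical normal quantiles
qq_cor <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

# Monte Carlo estimate of the rejection threshold for sample size n:
# the alpha-quantile of the correlation when the data really are normal
qq_cor_threshold <- function(n, nsim = 2000, alpha = 0.05) {
  r <- replicate(nsim, qq_cor(rnorm(n)))
  unname(quantile(r, alpha))
}

thresholds <- sapply(c(10, 30, 100), qq_cor_threshold)
thresholds   # the threshold moves closer to 1 as n grows

# Example: a clearly non-normal sample of size 100
x <- rexp(100)
qq_cor(x) < thresholds[3]   # TRUE would mean "reject normality" at about 5%
```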

score 1 · Answer 2 · answered Apr 06 '14 at 23:32

I do not think the question is well formed, which is not a surprise to me because my experience with persons teaching the standard Six Sigma course is that they have adopted a religion rather than putting in the effort to learn real statistics. I'm not saying you are such a person, and this is an observation based on a sampling within the prevalent culture of one company (GE) about 10 years ago, so it is a small sample. The variability of the points at either extreme will follow the distributional parameters of extreme value theory.

All distributions have tail behavior that is characterized by a small number of distributions. If you think about what determines the extreme quantiles, say the 99.99th percentile, it is the sampling behavior of a very small number of points, even when the quartile boundaries are nailed down with high precision because they each have 25% of the points on one side and 75% on the other. If the sample size is 100, it won't make any sense to talk about the 99.5th percentile, and the same is true of the 99.95th percentile for a sample size of 1000; you can see the pattern emerging, I hope. Do a Google search on extreme value theory.
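
A small simulation can illustrate the point. The sketch below compares how much the upper quartile and the sample maximum vary across repeated normal samples; the sample size and number of replications are arbitrary choices.

```r
set.seed(2014)  # arbitrary seed, only for reproducibility
n    <- 1000    # sample size
nsim <- 2000    # number of simulated samples

# For each simulated sample, record the 75th percentile and the maximum
sims <- replicate(nsim, {
  x <- rnorm(n)
  c(q75 = unname(quantile(x, 0.75)), max = max(x))
})

# Spread of each statistic across simulations: the maximum varies far more
spread <- apply(sims, 1, sd)
spread

# Expected number of points beyond the 99.95th percentile when n = 1000
n * (1 - 0.9995)
```

With less than one expected observation beyond the 99.95th percentile, the last few points of a Q-Q plot for n = 1000 are driven by one or two draws, which is why they wander off the line.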

This is also the wrong forum. You should clarify what you mean by "hit a -/+6 sigma on the theoretical". What does the word "hit" actually mean? Once you have defined a meaning for "hit", you should repost the question on CrossValidated.com.

Thx a lot for your comments. Agreed, my post was inaccurate. I have formulated my question and moved it to Cross Validated, as suggested by many. Here it is: http://stats.stackexchange.com/questions/93971/extreme-value-simulation-monte-carlo. I hope this makes more sense to you. Thx in advance — Riad, Apr 16 '14 at 07:19
