2

I'd like to make a visual assessment of whether my data conform to a particular distribution function. To do this, I am using R to generate a quantile-quantile (Q-Q) plot. The distribution function is very specific and is not featured in the standard list of probability distributions, so I wrote my own R function to describe it. It is called 'DistFunc' in the code below, and consists of the ratio of two gamma functions.

In brief, what I do in my code is read my data in from a file, 'DistributionEstimate.txt', which contains two columns. Column 1 is the x values and column 2 is the y values. The variables 'a' and 'b' are best-fit parameters which I had determined previously in another program using a least-squares fit of this distribution function to the data. I then define DistFunc and attempt to plot the Q-Q plot with the qqmath function.

The problem occurs at this point. R proceeds to give me a lot of warnings saying that the DistFunc returns values out of range in 'gammafn', and fails to plot anything. That is fair enough, as I know that the function contains a pole close to the origin. As you can see in the code, I attempt to normalise DistFunc to attempt to convert it to a probability distribution (which, I think, is what is required in order to use qqmath?), however, this doesn't help.

Do any of you have any idea how to overcome this problem - for example, by using a different plotting function that doesn't require normalisation, or to convert it to a pseudo-probability distribution, without affecting the outcome too seriously?

I'd be very grateful for any helpful input!

install.packages('lattice')
library(lattice)
x<-read.table("C:/DistributionEstimate.txt", colClasses = c(rep("NULL",1),rep("numeric",1)), header = FALSE)
y<-read.table("C:/DistributionEstimate.txt", colClasses = c(rep("numeric",1),rep("NULL",1)), header = FALSE)
x<-sapply(x, as.numeric)
y<-sapply(y, as.numeric)
a<-16359727025.407821410;
b<-198838619.13262583836;
DistFunc <- function(k,ampl=a,stretch=b) {
    fdist<-ampl*gamma(k*stretch-1/2)/gamma(k*stretch+1)
    fnorm<-fdist/sum(fdist)
}
qqmath(DistFunc(x), y, col="blue", envelope=.95, xlab="Quantiles of the best-fit model", ylab="Quantiles of the data")
abline(0,1, col="red", lwd=2)
grid()
Mirella
  • 21
  • 1
  • 2
  • You're using lattice's `qqmath` function? I don't see an `envelope` parameter. And you seem to be passing `y` as the `data` parameter? Do you get a warning for that? The `distribution=` parameter of qqmath expects a quantile function. It should take probabilities (0,1) and convert to quantiles (0,1). We have no idea what's in `C:/DistributionEstimate.txt` so we can't test. But with such a large stretch value the values of `k` must be really, really small otherwise `gamma` can't return a value. Even `gamma(200)` returns an error. – MrFlick May 24 '14 at 15:59
  • I tried several plot functions, including `qqPlot` from the `car` library ([http://www.inside-r.org/packages/cran/car/docs/qqPlot](http://www.inside-r.org/packages/cran/car/docs/qqPlot)) which does accept the `envelope` option. I switched to `qqmath` last and forgot to take it out. However, R didn't complain about that. I don't get a warning for `y`, but I don't see why I should? I am still a beginner though and some things aren't very clear to me yet. I have uploaded the data here: [link](http://archive.org/details/DistributionEstimate), if that helps... – Mirella May 25 '14 at 13:48

1 Answers1

3

The idea behind a QQ plot is to compare observations believed to come form a certain distribution, against the values you would expect to see from that distribution in a sample of the same size.

So the first problem is that you have both x and y values. A QQ-plot is a univariate plot. You are matching one set of values against a distribution. The second dimension for plotting (x,y) pairs is calculated by the distribution function.

The distribution function qqmath expects is not a density function. It need a function that will turn quantiles into values from the distribution. This is the same as the q* family of distribution functions work in R, like qnrom, or qexp. The function must accept a number in the range 0-1 and convert that to a value in the domain of the distribution (-Inf,Inf) for qnorm or (0, Inf) for qexp. During plotting, qqmath will pass a list of quantiles to this function and will get a list of expected values back. It will then plot the list of expected values against the (sorted) observed values.

As an example, i'm just going to use the qexp function as a "custom" quantile function. Observe that

myDist<-function(x) {
    qexp(x, 5)
}

set.seed(15)
x <- rexp(100, 5)
qqmath(~x, distribution=myDist, main="qqmath")

And this is exactly the same as

exp.x <- myDist(ppoints(length(x)))
xyplot(sort(x)~exp.x, main="xyplot")

qqmath vc xyplot

I think one of the problems you have is that DistFunc looks more like a density then a quantile function. To go from a density function to probabilities, you have to integrate. Here'a helper function to try to make a q-like function for an arbitrary density function

getq <- function(density, from, to, steps=1000) {
    x <- seq(from=from, to=to, length.out=steps) 
    y <- mapply(function(a,b)integrate(density,a,b)$value, x[-steps], x[-1])
    approxfun(c(0,cumsum(y)),x)
}

The first parameter is a one-parameter density function. This will be used during integration. Then the from and to parameters specify where your values have non-zero probabilities. Then steps is the number of points where we will perform the integration. Then we use approxfun to do interpolation between the number of points we actually calculated and the point requested by the final q function. Let's see how this works with a standard density. Again we will use an exponential, rate 5, density

myq <- getq(function(x) dexp(x,5), 0, 4)

Note that we create an anonymous funciton to wrap the dexp with the rate parameter so our density only takes one parameter. Here we just go from 0 to 4 because by that point we are pretty much at a probability of 1.0. Now we can use this function like the standard qexp

> qexp(.5,5)
[1] 0.1386294
> myq(.5)
[1] 0.1386388

You see that we get very similar answers for .5. So that appears to be working. So this is one quick way to transform a density function into a quantile function if your quantile function doesn't have a nice, closed form.

And the last problem I see is that your a and b values are huge. Using them inside the gamma function will quickly lead to numbers that R can't deal with. Now you are dividing one gamma by another, so the hope is that they will cancel out somewhat, but you typically run into overflow using the standard versions. So in trick is to calculate large values is to do so on a log scale, and then exp() when you are all done to return to the natural scale. So you might change your function to

DistFunc <- function(k,ampl=a,stretch=b) {
    fdist <- exp(log(ampl) + lgamma(k*stretch-1/2) - lgamma(k*stretch+1))
    fnorm <- fdist/sum(fdist)
}

Note that lgamma is the log-scaled gamma function. But with your a and b values even that doesn't seem to be enough in most cases. I'm unsure how you can useable numbers from that function given your parameters. I'm also not sure what you think the range of your distribution is. I could not find a way to make that integrate to 1 like a good density function should.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • Thanks for the detailed answer. Yes, I think the trick would be to evaluate `DistFunc` discretely, as you suggested, or at least evaluate away from the pole near the origin. Thanks for all the tips. I will come back to the problem in a couple of days' time (unfortunately I have other priorities at the moment) and have another proper go. – Mirella May 27 '14 at 11:29