I fitted a mixture of two gamma distributions that could be a good fit to my data, but to be sure I need to run a Kolmogorov-Smirnov test as well as a chi-square goodness-of-fit test. (If you can suggest any other way to confirm that a mixture distribution is suitable for my data, let me know.)

I used gammamixEM from the mixtools package in R to estimate the parameters of the two gamma distributions. From the histogram, the fit looks promising.

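library(mixtools)

# Fit a gamma mixture to the inter-event times by EM
foo_CE <- gammamixEM(data_cexio$diff)
x1 <- seq(0, 70, 1)
hist(data_cexio$diff, freq = FALSE, col = "grey", border = "black")
# Kernel density estimate of the data (nonparametric, not the fitted mixture)
lines(density(data_cexio$diff), lwd = 2, col = "blue")
# Weighted component densities: gamma.pars[1, ] is passed as the shape
# and 1 / gamma.pars[2, ] as the rate
lines(x1, foo_CE$lambda[1] * dgamma(x1, foo_CE$gamma.pars[1, 1], 1 / foo_CE$gamma.pars[2, 1]),
      col = "orange", lwd = 3)
lines(x1, foo_CE$lambda[2] * dgamma(x1, foo_CE$gamma.pars[1, 2], 1 / foo_CE$gamma.pars[2, 2]),
      col = "magenta", lwd = 3)
legend("topright", c("0.20 * Gamma ~ (1.70,0.24)", "0.80 * Gamma ~ (6.40,0.22)"),
       fill = c("orange", "magenta"))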

For the Kolmogorov-Smirnov test, I found an approach here: https://stats.stackexchange.com/questions/28873/goodness-of-fit-test-for-a-mixture-in-r and modified it for the gamma distribution:

# CDF of mixture of two gammas
CDF_gamma <- function(x, shape, scale, p) {
  p[1]*pgamma(x,shape[1],1/scale[1]) + p[2]*pgamma(x,shape[2],1/scale[2])
}
test_CE <- ks.test(data_cexio$diff, CDF_gamma, shape=foo_CE$gamma.pars[1,], scale=foo_CE$gamma.pars[2,], p=foo_CE$lambda)

Unfortunately, the p-value is < 2.2e-16, so according to this test the mixture is not a good fit. However, I have not been able to find out how to perform a chi-square goodness-of-fit test to check whether this mixture distribution is a good fit.
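
One way such a chi-square check could be set up, as a minimal sketch rather than a definitive recipe: bin the data, compute the expected count in each bin from the mixture CDF, and compare observed and expected counts. The parameter values and the simulated data below are hypothetical stand-ins for `foo_CE$lambda`, `foo_CE$gamma.pars` and `data_cexio$diff`, and the second parameter of each component is assumed to be a scale, as in the code above.

```r
# Hypothetical stand-ins for the gammamixEM() output; replace with the
# fitted values (foo_CE$lambda, foo_CE$gamma.pars[1, ], foo_CE$gamma.pars[2, ]).
lambda <- c(0.2, 0.8)   # mixing weights
shape  <- c(1.7, 6.4)   # component shapes
scale  <- c(4.0, 4.5)   # component scales (assumed convention)

# Simulated data standing in for data_cexio$diff
set.seed(1)
n    <- 5000
comp <- sample(1:2, n, replace = TRUE, prob = lambda)
x    <- rgamma(n, shape = shape[comp], scale = scale[comp])

# CDF of the two-component gamma mixture
cdf_mix <- function(q, shape, scale, p) {
  p[1] * pgamma(q, shape = shape[1], scale = scale[1]) +
    p[2] * pgamma(q, shape = shape[2], scale = scale[2])
}

# Bins with roughly equal observed counts, covering (0, Inf)
n_bins   <- 20
breaks   <- c(0, quantile(x, probs = seq(1, n_bins - 1) / n_bins), Inf)
observed <- as.vector(table(cut(x, breaks = breaks, include.lowest = TRUE)))

# Expected counts under the mixture
probs    <- diff(cdf_mix(breaks, shape, scale, lambda))
expected <- n * probs

# Pearson chi-square statistic; degrees of freedom reduced by the number
# of estimated parameters (2 shapes + 2 scales + 1 free weight = 5)
chi_sq  <- sum((observed - expected)^2 / expected)
n_par   <- 5
df      <- n_bins - 1 - n_par
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

cat("chi-square =", chi_sq, " df =", df, " p-value =", p_value, "\n")
```

The number of bins is an arbitrary choice here, and the p-value should be treated as approximate because the parameters are estimated from the ungrouped data. With around one million observations, this test, like the KS test, will tend to reject for even very small departures from the fitted model.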

Saida
  • Can you provide the data that you used for the calibration? That would be useful to provide an answer. – Emmanuel Hamel Apr 02 '23 at 13:22
  • Hi Emmanuel, this is quite a big data set with 1 million values, but it would be enough for me to understand the concept of how to perform a chi-square goodness-of-fit test with a gamma mixture distribution. – Saida Apr 02 '23 at 14:10
  • I would suggest that, in your first snippet of code, you plot the gamma mixture density in place of the blue line generated by `lines(density(data_cexio$diff), lwd = 2, col = "blue")`. The latter is a nonparametric fit to the data, whereas we want to use the density plot to check visually whether the *gamma mixture* is a good fit for your data (not whether a *nonparametric fit* is a good fit, which it most likely is because of its nonparametric nature). So, just plot the combination `0.20*Gamma1 + 0.80*Gamma2` and visually check that density against the histogram. – mastropi Apr 02 '23 at 18:24
  • The KS test looks right, and it confirms that the data are highly unlikely to come from a 2-gamma mixture. It looks more like a 3- or 4-gamma mixture with modes at ~0, 21, and 37, but why a gamma mixture instead of simply using a kernel density estimate? Does the gamma distribution have some property that you need to use? – jblood94 Apr 03 '23 at 11:19
  • There is no particular reason why I chose two gamma distributions. Because a single distribution is definitely not enough, I tried a mixture distribution, and gamma was one of the options. From the density plot it looked promising. My aim is to create a model to predict the time between two events (this is the variable data_cexio$diff), and I am starting by finding a distribution that describes how the data are distributed. Could a kernel density estimate help me with that? – Saida Apr 03 '23 at 16:58
  • Fitting a parametric distribution (such as a mixture of Gamma) to the data, rather than a nonparametric distribution, is useful when you want to give structure to the problem you are dealing with and then make inferences using such structure. For instance, infer the average inter-event time: if you fit a nonparametric distribution you would estimate the mean as the sample average +/- a confidence interval for the true mean that is based on a Gaussian distribution (using the central limit theorem)... – mastropi Apr 03 '23 at 19:45
  • ...instead, if you fit a parametric distribution (or mixture of them) you could estimate the mean from the parameters of the fitted distribution and use the distribution itself to compute a confidence interval for the true mean (instead of using the Gaussian distribution as above). However, doing so may be difficult because it might not be so easy to derive the confidence interval of the mean depending on the parametric fitted distribution. And, if you have enough data (which seems to be the case in your case), the confidence interval using the Gaussian distribution is good enough. – mastropi Apr 03 '23 at 19:47
  • Also, bear in mind that when you write "[fitting a mixture distribution of Gamma]... from the density plot it was promising", as I mentioned in my first comment, the plot of the density you did is NOT the density of a mixture of Gammas, but simply the density of a nonparametric fit obtained with `density()`, so that conclusion is incorrect. – mastropi Apr 03 '23 at 19:48
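
The suggestion in the comments above, overlaying the density of the gamma mixture itself on the histogram, could be sketched as follows. The parameter values and simulated data are hypothetical stand-ins for `foo_CE$lambda`, `foo_CE$gamma.pars` and `data_cexio$diff`, and the second parameter of each component is assumed to be a scale. The last line computes the mean implied by the fitted parameters, which the comments mention as one use of a parametric fit.

```r
# Hypothetical stand-ins for the gammamixEM() output; replace with the
# fitted values from foo_CE.
lambda <- c(0.2, 0.8)   # mixing weights
shape  <- c(1.7, 6.4)   # component shapes
scale  <- c(4.0, 4.5)   # component scales (assumed convention)

# Simulated data standing in for data_cexio$diff
set.seed(1)
n    <- 5000
comp <- sample(1:2, n, replace = TRUE, prob = lambda)
x    <- rgamma(n, shape = shape[comp], scale = scale[comp])

# Density of the two-component gamma mixture
dens_mix <- function(x, shape, scale, p) {
  p[1] * dgamma(x, shape = shape[1], scale = scale[1]) +
    p[2] * dgamma(x, shape = shape[2], scale = scale[2])
}

# Histogram on the density scale with the mixture density on top
grid <- seq(0, max(x), length.out = 500)
hist(x, freq = FALSE, breaks = 50, col = "grey", border = "black",
     main = "Histogram with gamma mixture density", xlab = "diff")
lines(grid, dens_mix(grid, shape, scale, lambda), col = "red", lwd = 3)

# Mean implied by the mixture: weighted sum of the component means,
# where a gamma with shape a and scale b has mean a * b
mix_mean <- sum(lambda * shape * scale)
```

If the red curve misses visible features of the histogram, such as additional modes, that points to a mixture with too few components or to a different family.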

0 Answers