8

I want to know the distribution of my data points, so first I plotted the histogram of my data. My histogram looks like the following: my histogram

Second, in order to fit them to a distribution, here's the code I wrote:

size = 20000
x = scipy.arange(size)
# fit
param = scipy.stats.gamma.fit(y)
pdf_fitted = scipy.stats.gamma.pdf(x, *param[:-2], loc = param[-2], scale = param[-1]) * size
plt.plot(pdf_fitted, color = 'r')

# plot the histogram
plt.hist(y)

plt.xlim(0, 0.3)
plt.show()

The result is:

enter image description here

What am I doing wrong?

aloha
  • 4,554
  • 6
  • 32
  • 40

1 Answers1

13

Your data does not appear to be gamma-distributed, but assuming it is, you could fit it like this:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

gamma = stats.gamma
a, loc, scale = 3, 0, 2
size = 20000
y = gamma.rvs(a, loc, scale, size=size)

x = np.linspace(0, y.max(), 100)
# fit
param = gamma.fit(y, floc=0)
pdf_fitted = gamma.pdf(x, *param)
plt.plot(x, pdf_fitted, color='r')

# plot the histogram
plt.hist(y, normed=True, bins=30)

plt.show()

enter image description here

  • The area under the pdf (over the entire domain) equals 1. The area under the histogram equals 1 if you use normed=True.

  • x has length size (i.e. 20000), and pdf_fitted has the same shape as x. If we call plot and specify only the y-values, e.g. plt.plot(pdf_fitted), then values are plotted over the x-range [0, size]. That is much too large an x-range. Since the histogram is going to use an x-range of [min(y), max(y)], we much choose x to span a similar range: x = np.linspace(0, y.max()), and call plot with both the x- and y-values specified, e.g. plt.plot(x, pdf_fitted).

  • As Warren Weckesser points out in the comments, for most applications you know the gamma distribution's domain begins at 0. If that is the case, use floc=0 to hold the loc parameter to 0. Without floc=0, gamma.fit will try to find the best-fit value for the loc parameter too, which given the vagaries of data will generally not be exactly zero.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 4
    Note that typically, the `loc` parameter of the gamma distribution is not used (i.e. the PDF should not be shifted), and the value is fixed at 0. By default, the `fit` method treats `loc` as fitting parameter, so you might get a small nonzero shift--check the parameters returned by `fit`. You can tell `fit` to not include `loc` as a fitting parameter by using the argument `floc=0`. – Warren Weckesser Mar 23 '15 at 12:22
  • @WarrenWeckesser: Thanks very much; I didn't know about `floc`. – unutbu Mar 23 '15 at 12:34
  • @unutbu, thanks a lot for the detailed answer! Now I understand what I was doing wrong. Indeed, my data is not gamma-distributed. Any hints what might my distribution be? – aloha Mar 23 '15 at 20:49
  • 1
    @po6: Sorry, I don't recognize what distribution your data might be coming from. If there is a theoretical model for the system you are measuring, then that model would provide a guess for the distribution. Or, if there is no such model, then perhaps you need more samples to "flesh out" that thin tail. If there is no model and you can't obtain more samples, then perhaps you would have to use the data itself as the definition of a discrete [probability mass function](http://en.wikipedia.org/wiki/Probability_mass_function). – unutbu Mar 23 '15 at 21:26
  • @unutbu One question about this fitting using the gamma distribution: for the histogram, since you have used `normed=True` the area of all bars adds up to one, but what if we instead re-normalize such that the y-value of each bar is already the right probability, since all bins are same size, using `weights = np.ones_like(y)/float(len(y))`, that's fine so far, but how do I make my gamma fit match this new normalization? –  May 11 '16 at 12:39
  • I ask because, given the parameter results of the fitting, with `a=param[0]` the shape parameter, and `theta=param[2]` the scale, how can we renormalize them so that the obtained pdf is normed? (are there parameters we could include in `gamma.fit()` so that the resulting pdf is normed?) (to test that it is not normed currently, `sum(pdf_fitted)` doesn't give one) I fixed it by multiplying it by `np.diff(bins)[0].` –  May 11 '16 at 13:06
  • hello!! do you have any idea how to do the same fit, but instead of a gamma distribution, to use a custom distribution like `f(x)=C*exp(a*x+b*x^2+c*x^4+d*x^8)`? – CyberMathIdiot Sep 07 '21 at 06:56
  • Based on matplotlib version 3.5.0, we should be using `plt.hist(y, density=True, bins=30)` instead of `plt.hist(y, normed=True, bins=30)`. – Arun Das Apr 08 '22 at 01:45