Fit the gamma distribution only to a subset of the samples

Question

I have the histogram of my input data (in black) given in the following graph:

I'm trying to fit the Gamma distribution but not on the whole data but just to the first curve of the histogram (the first mode). The green plot in the previous graph corresponds to when I fitted the Gamma distribution on all the samples using the following python code which makes use of scipy.stats.gamma:

img = IO.read(input_file)
data = img.flatten() + abs(np.min(img)) + 1

# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1

# data histogram
n, bins, patches = plt.hist(data, 1000, normed=True)

# slice histogram here

# estimation of the parameters of the gamma distribution
fit_alpha, fit_loc, fit_beta = gamma.fit(data, floc=0)
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, fit_loc, fit_beta)
print '(alpha, beta): (%f, %f)' % (fit_alpha, fit_beta)

# plot estimated model
plt.plot(x, y, linewidth=2, color='g')
plt.show()

How can I restrict the fitting only to the interesting subset of this data?

Update1 (slicing):

I sliced the input data by keeping only values below the max of the previous histogram, but the results were not really convincing:

This was achieved by inserting the following code below the # slice histogram here comment in the previous code:

max_data = bins[np.argmax(n)]
data = data[data < max_data]

Update2 (scipy.optimize.minimize):

The code below shows how scipy.optimize.minimize() is used to minimize an energy function to find (alpha, beta):

import matplotlib.pyplot as plt
import numpy as np
from geotiff.io import IO
from scipy.stats import gamma
from scipy.optimize import minimize


def truncated_gamma(x, max_data, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x < max_data, gammapdf / norm, 0)


# read image
img = IO.read(input_file)

# calculate dB positive image
img_db = 10 * np.log10(img)
img_db_pos = img_db + abs(np.min(img_db))
data = img_db_pos.flatten() + 1

# data histogram
n, bins = np.histogram(data, 100, normed=True)

# using minimize on a slice data below max of histogram
max_data = bins[np.argmax(n)]
data = data[data < max_data]

data = np.random.choice(data, 1000)
energy = lambda p: -np.sum(np.log(truncated_gamma(data, max_data, *p)))
initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x

# plot data histogram and model
x = np.linspace(0, 100)
y = gamma.pdf(x, fit_alpha, 0, fit_beta)
plt.hist(data, 30, normed=True)
plt.plot(x, y, linewidth=2, color='g')
plt.show()

The algorithm above converged for a subset of data, and the output in o was:

x: array([ 16.66912781,   6.88105559])

But as can be seen on the screenshot below, the gamma plot doesn't fit the histogram:

Maybe spice the data to where the max occurs in the histogram? — Demetri Pananos, Dec 20 '16 at 17:19
@DemetriP Please see the updated question (update1). Although I don't think that a better result would have been expected, according to what I've added. — Hakim, Dec 20 '16 at 23:43
@Graipher Do you know a distribution whose the `pdf` graph looks like the previous histogram? — Hakim, Dec 20 '16 at 23:44
I would try a Gaussian as a first approximation, even though it is a lot flatter around the mode. Maybe try a [Argus function](https://books.google.de/books?id=YCA_CgAAQBAJ&lpg=PA34&ots=0XYBiGLQZR&dq=argus%20function&hl=de&pg=PP1#v=onepage&q&f=false) (page 34 in that book)? — Graipher, Dec 21 '16 at 00:09
In the end it depends on what your data actually is, which determines the function describing it. — Graipher, Dec 21 '16 at 00:10
@h4k1m Is KDE out of the question? Any reason why you need to fit the gamma? Can I get a sample of the data? — Demetri Pananos, Dec 21 '16 at 00:19
@DemetriP I'm trying to fit the `Gamma distribution` because the data in question [dump.dat](https://www.dropbox.com/s/6jaw35snnbu555s/dump.dat?dl=0) is a radar image, and in this case theoretically the water should follow a gamma distribution. Please open the attached file (numpy.ndarray) with ```with open(output_file, 'rb') as f: dump_img = cPickle.load(f)``` — Hakim, Dec 21 '16 at 10:58
@h4k1m I'll get to this later tonight. For the time being, have you thought about posting this to Cross Validated? See rules for cross posting. — Demetri Pananos, Dec 21 '16 at 15:30
@DemetriP I thought about posting my question there, but I think it's a programming-related question, so maybe stackoverflow is more adequate for it to be answered. — Hakim, Dec 21 '16 at 16:29

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

2

You can use a general optimization tool such as scipy.optimize.minimize to fit a truncated version of the desired function, resulting in a nice fit: Truncated fit

First, the modified function:

def truncated_gamma(x, alpha, beta):
    gammapdf = gamma.pdf(x, alpha, loc=0, scale=beta)
    norm = gamma.cdf(max_data, alpha, loc=0, scale=beta)
    return np.where(x<max_data, gammapdf/norm, 0)

This selects values from the gamma distribution where x < max_data, and zero elsewhere. The np.where part is not actually important here, because the data is exclusively to the left of max_data anyway. The key is normalization, because varying alpha and beta will change the area to the left of the truncation point in the original gamma.

The rest is just optimization technicalities.

It's common practise to work with logarithms, so I used what's sometimes called "energy", or the logarithm of the inverse of the probability density.

energy = lambda p: -np.sum(np.log(truncated_gamma(data, *p)))

Minimize:

initial_guess = [np.mean(data), 2.]
o = minimize(energy, initial_guess, method='SLSQP')
fit_alpha, fit_beta = o.x

My output is (alpha, beta): (11.595208, 824.712481). Like the original, it is a maximum likelihood estimate.

If you're not happy with the convergence rate, you may want to

Select a sample from your rather big dataset: data = np.random.choice(data, 10000)
Try different algorithms using the method keyword argument.

Some optimization routines output a representation of the inverse hessian, which is useful for uncertainty estimation. Enforcement of nonnegativity for the parameters may also be a good idea.

A log-scaled plot without truncation shows the entire distribution:

edited Jun 20 '20 at 09:12

Community

1
1

answered Dec 23 '16 at 20:34

periphreal

111
9

For me, first I couldn't get the same values of `(alpha, beta)` as you with the code you gave. And, even with the same values you obtained in my previous code it didn't fit. Am I missing something? – Hakim Dec 28 '16 at 19:57
Are you sure it didn't fit? I added a log-log plot to show how tiny the histogram part is at `(11.6, 825)`. What values did you get? @h4k1m – periphreal Dec 30 '16 at 08:30
Please see **update2** where the code used is shown. Actually it converged on a subset but the gamma plot doesn't fit the histogram as you showed on the results you obtained. Let me know if there are modifications to make in my code to get a result similar to yours @periphreal – Hakim Dec 30 '16 at 12:58
Are you sure it converged? Sometimes the convergence flag lies. Maybe the python or scipy versions behave differently. Did you try different algorithms in `minimize`? Some of them require you to add something like `bounds=[(1, None), (1, None)]`. – periphreal Dec 31 '16 at 14:57
According to `o.message = Optimization terminated successfully.`, I think this means that it converged, even though the curve didn't fit the histogram. I tried the other methods but some of them requires that the jacobian is given as a parameter. – Hakim Jan 01 '17 at 12:15
Yeah, that message isn't always accurate. If `minimize` returns the jacobian, it should be close to zero. Anyway, it doesn't look like your green gamma is normed similar to the histogram. Its integral *left* to `max_data` should be one. – periphreal Jan 02 '17 at 12:29
Any idea on how to normalize the plot similarly to the histogram? btw, when I run that piece of code I get a `RuntimeWarning: divide by zero encountered in log` in the `lambda` expression, which comes from a `RuntimeWarning: invalid value encountered in divide` in `truncated_gamma()`. – Hakim Jan 02 '17 at 13:09
the same way as in trusted gamma except you don't need to truncate with where. – periphreal Jan 03 '17 at 14:03
You're right. If I avoid to normalize the pdf the resulting plots fits better. Although it doesn't meet the histogram perfectly (especially for values around the mode), I'm quite happy with the results. – Hakim Jan 03 '17 at 17:58

score 1 · Accepted Answer · 2016-12-28T22:16:16.663

Here's another possible approach using a manually created dataset in excel that more or less matched the plot given.

Raw Data

Outline

Imported data into a Pandas dataframe.
Mask the indices after the max response index.
Create a mirror image of the remaining data.
Append the mirror image while leaving a buffer of empty space.
Fit the desired distribution to the modified data. Below I do a normal fit by the method of moments and adjust the amplitude and width.

Working Script

    # Import data to dataframe.
    df = pd.read_csv('sample.csv', header=0, index_col=0)
    # Mask indices after index at max Y.
    mask = df.index.values <= df.Y.argmax()
    df = df.loc[mask, :]
    scaled_y = 100*df.Y.values

    # Create new df with mirror image of Y appended.
    sep = 6
    app_zeroes = np.append(scaled_y, np.zeros(sep, dtype=np.float))
    mir_y = np.flipud(scaled_y)
    new_y = np.append(app_zeroes, mir_y)

    # Using Scipy-cookbook to fit a normal by method of moments.
    idxs = np.arange(new_y.size)  # idxs=[0, 1, 2,...,len(data)]
    mid_idxs = idxs.mean() # len(data)/2
    # idxs-mid_idxs is [-53.5, -52.5, ..., 52.5, len(data)/2]
    scaling_param = np.sqrt(np.abs(np.sum((idxs-mid_idxs)**2*new_y)/np.sum(new_y)))

    # adjust amplitude
    fmax = new_y.max()*1.2 # adjusted function max to 120% max y.
    # adjust width
    scaling_param = scaling_param*.7 # adjusted by 70%.
    # Fit normal.
    fit = lambda t: fmax*np.exp(-(t-mid_idxs)**2/(2*scaling_param**2))

    # Plot results.
    plt.plot(new_y, '.')
    plt.plot(fit(idxs), '--')
    plt.show()

Result

See the scipy-cookbook fitting data page for more on fitting a normal using method of moments.

Thanks for the link. But, what do `X, x, width` represent? – Hakim Dec 28 '16 at 18:31 — Hakim, Dec 28 '16 at 18:31

Fit the gamma distribution only to a subset of the samples

Update1 (slicing):

Update2 (scipy.optimize.minimize):

2 Answers2