1

Background

I am working on computing a series of best-fit gamma curves for a 2-D dataset in Numpy (ndarray), a prior question for the genesis of this can be found here.

Scipy was previously utilized (scipy.gamma.stats), however this library is not optimized for multi-dimensional arrays and a barebones function was written to meet the objective. I've successfully fit (albeit not as cleanly as Scipy) a curve to the dataset, which is provided below.

Current Issue

I want to obtain the percentile of a given value and vice versa along the calculated gamma distribution. However, I'm not obtaining expected values off the fitted curve. For example, providing the 50th percentile yields a value of 4.471, which does not match up with the curve fit shown below. What modifications or wholesale alterations can be made to yield both percentiles and values from supplied data?

Graph

Gamma Fitting Output

Code

import sys, os, math
import numpy as np
import scipy as sci
import matplotlib.pyplot as plt

data = np.array([0.00, 0.00, 11.26399994, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 17.06399918, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 8.33279991, 0.00, 7.54879951, 0.00, 0.00, 0.00, 4.58799982, 7.9776001, 0.00, 0.00, 0.00, 0.00, 11.45040035, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 18.73279953, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 8.94559956, 0.00, 7.73040009, 0.00, 0.00, 0.00, 5.03599977, 8.62639999, 0.00, 0.00, 0.00, 0.00, 11.11680031, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 14.37839985, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 8.16479969, 0.00, 7.30719948, 0.00, 0.00, 0.00, 3.41039991, 7.17280006, 0.00, 0.00, 0.00, 0.00, 10.0099199963, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 13.97839928, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 7.6855998, 0.00, 6.86559963, 0.00, 0.00, 0.00, 3.21600008, 7.93599987, 0.00, 0.00, 0.00, 0.00, 11.55999947, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 18.76399994, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 10.0033039951, 0.00, 8.10639954, 0.00, 0.00, 0.00, 4.76480007, 6.87679958, 0.00, 0.00, 0.00, 0.00, 11.42239952, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 19.42639732, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 10.0052400017, 0.00, 8.2567997, 0.00, 0.00, 0.00, 5.08239985, 7.9776001, 0.00, 0.00, 0.00, 0.00, 10.0099839973, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 11.5855999, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 7.88399982, 0.00, 5.96799994, 0.00, 0.00, 0.00, 3.07679987, 7.81360006, 0.00, 0.00, 0.00, 0.00, 11.51119995, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 20.0030959892, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 10.0050879955, 0.00, 8.20479965, 0.00, 0.00, 0.00, 5.51599979, 9.02879906, 0.00, 0.00])
 
def scigamma(data):
    param = sci.stats.gamma.fit(data)
    x = np.linspace(0, np.max(data), 250)
    cdf = sci.stats.gamma.cdf(x, *param)
    value = np.round((sci.stats.gamma.cdf(0.10, *param) * 100), 2)
    percentile = np.round((sci.stats.gamma.ppf(50.00, *param) * 100), 2)
    return cdf

scicdf = scigamma(data)

# Method of Moments estimation
mean = np.mean(data)
variance = np.var(data)
alpha = mean**2 / variance
beta = variance / mean

# Generate x-axis values for the curves
x = np.linspace(0, np.max(data), 250)
# Calculate the gamma distribution PDF values
pdf = (x ** (alpha - 1) * np.exp(-x / beta)) / (beta ** alpha * np.math.gamma(alpha))
# Calculate the gamma distribution CDF values
cdf = np.zeros_like(x)
cdf[x > 0] = np.cumsum(pdf[x > 0]) / np.sum(pdf[x > 0])
# Estimate the probability of zero values
num_zeros = np.count_nonzero(data == 0)
zero_probability = np.count_nonzero(data == 0) / len(data)
# Calculate the PDF and CDF values at zero
pdf_zero = zero_probability / (beta ** alpha * np.math.gamma(alpha))
cdf_zero = zero_probability

value = 2.50
percentile = 0.50
index = np.argmax(pdf >= value)
# Calculate the percentile using numerical integration
pct = np.trapz(pdf[:index+1], dx=1) + (value - pdf[index]) * (cdf[index] - cdf[index-1]) / (pdf[index-1] - pdf[index])
index = np.argmax(cdf >= percentile)
# Calculate the value using numerical integration
val = np.trapz(cdf[:index+1], dx=1) + (percentile - cdf[index-1]) * (pdf[index] - pdf[index-1]) / (cdf[index] - cdf[index-1])

# Plot the data histogram
plt.hist(data, bins=30, density=True, alpha=0.5, label='data')

# Plot the gamma distribution CDF curve
plt.plot(x, cdf, 'b', label='Gamma CDF | Custom Fit')
plt.plot(x, scicdf, 'k', label='Gamma CDF | SciPy Fit')

# Set plot labels and legend
plt.xlabel('data')
plt.ylabel('Probability')
plt.legend()
TornadoEric
  • 399
  • 3
  • 16

2 Answers2

1

I'm not clear on why you don't trust sci.stats. If it's because of performance, have you actually profiled? It's worth noting that neither your fit nor scipy's fit are particularly good, because your data are strongly skewed toward 0.

When initialising your second x, don't start it at 0 because your calculations are not well-defined for that point, which (a) dumps warnings on the console and (b) requires a bunch of indexing gymnastics later.

I'm also not clear on why you're attempting trapz. Why integrate when you already have a CDF? Just use the CDF; and also don't argmax; use searchsorted.

Finally, your plotting is slightly a disaster. For apples-to-apples comparison it's very important that you pass cumulative to hist.

import numpy as np
import scipy as sci
import matplotlib.pyplot as plt

data = np.array([
        0.        ,  0.        , 11.26399994,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       17.06399918,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  8.33279991,  0.        ,  7.54879951,
        0.        ,  0.        ,  0.        ,  4.58799982,  7.9776001 ,
        0.        ,  0.        ,  0.        ,  0.        , 11.45040035,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        , 18.73279953,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  8.94559956,
        0.        ,  7.73040009,  0.        ,  0.        ,  0.        ,
        5.03599977,  8.62639999,  0.        ,  0.        ,  0.        ,
        0.        , 11.11680031,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        , 14.37839985,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  8.16479969,  0.        ,  7.30719948,  0.        ,
        0.        ,  0.        ,  3.41039991,  7.17280006,  0.        ,
        0.        ,  0.        ,  0.        , 10.00992   ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        , 13.97839928,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  7.6855998 ,  0.        ,
        6.86559963,  0.        ,  0.        ,  0.        ,  3.21600008,
        7.93599987,  0.        ,  0.        ,  0.        ,  0.        ,
       11.55999947,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        , 18.76399994,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       10.003304  ,  0.        ,  8.10639954,  0.        ,  0.        ,
        0.        ,  4.76480007,  6.87679958,  0.        ,  0.        ,
        0.        ,  0.        , 11.42239952,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
       19.42639732,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        , 10.00524   ,  0.        ,  8.2567997 ,
        0.        ,  0.        ,  0.        ,  5.08239985,  7.9776001 ,
        0.        ,  0.        ,  0.        ,  0.        , 10.009984  ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        , 11.5855999 ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  7.88399982,
        0.        ,  5.96799994,  0.        ,  0.        ,  0.        ,
        3.07679987,  7.81360006,  0.        ,  0.        ,  0.        ,
        0.        , 11.51119995,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        , 20.00309599,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        , 10.005088  ,  0.        ,  8.20479965,  0.        ,
        0.        ,  0.        ,  5.51599979,  9.02879906,  0.        ,
        0.        ])


def scigamma(data: np.ndarray) -> np.ndarray:
    param = sci.stats.gamma.fit(data)
    x = np.linspace(0, np.max(data), 250)
    cdf = sci.stats.gamma.cdf(x, *param)
    return cdf


scicdf = scigamma(data)

# Method of Moments estimation
mean = np.mean(data)
variance = np.var(data)
alpha = mean**2 / variance
beta = variance / mean

# Generate x-axis values for the curves
# Do not start at 0
x = np.linspace(0.1, np.max(data), 250)

# Calculate the gamma distribution PDF values
pdf = x**(alpha - 1) * np.exp(-x / beta) / (beta**alpha * np.math.gamma(alpha))

# Calculate the gamma distribution CDF values
cdf = np.cumsum(pdf) / np.sum(pdf)


def plot() -> None:
    plt.hist(data, bins=50, density=True, cumulative=True, alpha=0.5, label='data')
    plt.plot(x, cdf, 'b', label='Gamma CDF | Custom Fit')
    plt.plot(x, scicdf, 'k', label='Gamma CDF | SciPy Fit')

    plt.xlabel('data')
    plt.ylabel('Probability')
    plt.legend()
    plt.show()


def check_percentiles() -> None:
    # Calculate the percentile using numerical integration
    value = 2.50
    index = np.searchsorted(x, value)
    pct = cdf[index]
    print(f'x={value} is at percentile {pct:.3f}')

    # Calculate the value using numerical integration
    pct = 0.50
    index = np.searchsorted(cdf, pct)
    val = cdf[index]
    print(f'percentile {pct} is x={val:.3f}')


check_percentiles()
plot()
x=2.5 is at percentile 0.680
percentile 0.5 is x=0.503

cdf comparison

Much more directly, though,

param = sci.stats.gamma.fit(data)

# Calculate the percentile
value = 2.50
pct = scipy.stats.gamma.cdf(value, *param)
print(f'x={value} is at percentile {pct:.3f}')

# Calculate the value
pct = 0.50
val = scipy.stats.gamma.ppf(pct, *param)
print(f'percentile {pct} is x={val:.3f}')

It's also worth noting that if you want scipy to produce a fit similar to yours, then pass method='MM' to fit():

MM method

As in the comments, a single gamma will not fit your data well. Consider instead a simple bimodal where the first CDF is a Heaviside step and the second CDF is still assumed gamma:

x = np.linspace(start=0, stop=20, num=250)

# first mode of the bimodal distribution produces a heaviside step in the CDF
nonzero_data = data[data.nonzero()]
heaviside = 1 - nonzero_data.size/data.size
cdf = np.full_like(x, fill_value=heaviside)

# second mode of the bimodal distribution assumed still gamma
params = scipy.stats.gamma.fit(nonzero_data)
print('Gamma params:', params)
cdf += scipy.stats.gamma.cdf(x, *params)*(1 - heaviside)

plt.hist(data, bins=50, density=True, cumulative=True, alpha=0.5, label='data')
plt.plot(x, cdf, label='fit')
plt.xlabel('data')
plt.ylabel('Probability')
plt.legend(loc='right')
plt.show()
Gamma params: (3.9359070026702017, 1.3587246910542263, 2.0440735483494272)

bimodal

Reinderien
  • 11,755
  • 5
  • 49
  • 77
  • Thanks for your detailed response! A follow-up question on calculating the value underneath the check_percentiles() function: placing a probability of 0.50 should yield an expected value somewhere just over 1.10 on the x-axis, as opposed to 0.503. Should that portion read `val = x[index]` instead to map the corresponding cdf[index] to the x value? – TornadoEric May 22 '23 at 20:00
  • 1
    Why should it read 1.1? Really, it should read neither 1.1 nor 0.503; it should read 0. Think about it: you have a roughly bimodal distribution with a massive spike at 0. A single gamma is not going to fit this distribution well. If you wanted more accurate PPF calculations, then you need to replace your model. – Reinderien May 22 '23 at 20:06
  • That's actually a fair assessment I glazed over while coding this/working with the data initially. Would you recommend say a quantile regression model over the use of a gamma distribution? – TornadoEric May 22 '23 at 20:13
  • 1
    Start with a bimodal normal. That will be easy and probably get you 90% of the way there. – Reinderien May 22 '23 at 20:51
  • 1
    @TornadoEric see edit for bimodal demo. – Reinderien May 22 '23 at 21:10
  • This is excellent, thank you! I'll accept your answer, hopeful this helps other folks as well; besides myself – TornadoEric May 22 '23 at 21:17
0

Using the variables below

x = np.linspace(0, np.max(data), 250)
cdf[x > 0] = np.cumsum(pdf[x > 0]) / np.sum(pdf[x > 0])

you could fit a curve using scipy's interpolate to fit functions to obtain percentile from values and vice versa:

from scipy import interpolate

f_x_to_percentile = interpolate.interp1d(x, cdf)
x_val = 2.5
print(f"Percentile of value {x_val}: {f_x_to_percentile(x_val)}")
# Percentile of value 2.5: 0.6864894891059873

f_percentile_to_x= interpolate.interp1d(cdf, x)
percentile_val = 0.5
print(f"Value at percentile {percentile_val*100}: {f_percentile_to_x(percentile_val)}")
# Value at percentile 50.0: 1.015695412520532

Update:

If you decide to continue using scipy to fit your distribution, you can directly use the fitted parameters in your param object to obtain the results using the .cdf() and .ppf() methods:

from scipy import stats

param = stats.gamma.fit(data)

x_val = 2.5
percentile = stats.gamma(*param).cdf(x_val)
print(f"Percentile of value {x_val}: {percentile}")
# Percentile of value 2.5: 0.7681790728985926

percentile_val = 0.5
value_at_percentile = stats.gamma(*param).ppf(0.5)
print(f"Value at percentile {percentile_val*100}: {value_at_percentile}")
# Value at percentile 50.0: 0.7536517552749796
Simon David
  • 663
  • 3
  • 13