
I have two lists. Both contain normalized percentages:

  • actual_population_distribution = [0.2,0.3,0.3,0.2]
  • sample_population_distribution = [0.1,0.4,0.2,0.3]

I want to fit each list to a gamma distribution and then compute the KL divergence between the two resulting lists.

I am already able to get a KL value.

This is the function I used for the gamma fit and sampling:

import random
import numpy as np

def gamma_random_sample(data_list):
    # Method-of-moments estimates of the gamma parameters
    mean = np.mean(data_list)
    var = np.var(data_list)
    g_alpha = mean * mean / var   # shape
    g_beta = mean / var           # rate
    # Draw one gamma sample per element of the input list
    for _ in range(len(data_list)):
        yield random.gammavariate(g_alpha, 1 / g_beta)

Fitting the two lists to gamma distributions:

actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]

This is the code I used to calculate KL:

kl = np.sum(scipy.special.kl_div(actual_grs, sample_grs))

The code above does not produce any errors.

But I suspect the way I fit the gamma distribution is wrong, because it uses np.mean/np.var to get the mean and variance.

Indeed, the numbers are different from those produced by:

mean, var, skew, kurt = gamma.stats(fit_alpha, loc = fit_loc, scale = fit_beta, moments = 'mvsk')

if I use that approach instead. With gamma.stats I get a KL value far larger than 1, so both ways seem invalid for getting a correct KL.

What do I miss?

Evgeny Minkevich
Jennifer

1 Answer


See this Cross Validated post on estimating gamma parameters from the sample mean and standard deviation: https://stats.stackexchange.com/questions/280459/estimating-gamma-distribution-parameters-using-sample-mean-and-std

I don't understand what you are trying to do with:

actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]

It doesn't look like you are fitting a gamma distribution. Rather, you are using the method-of-moments estimator to get the parameters of a gamma distribution, and then drawing a single random number for each element of your actual(sample)_population_distribution lists, given the distribution statistics of that list.

The gamma distribution is notoriously hard to fit. I hope your actual data is a longer list: 4 data points are hardly sufficient for estimating a two-parameter distribution. The estimates are essentially garbage until you have hundreds of elements or more; take a look at this document on the MLE estimator and the Fisher information of a gamma distribution: https://www.math.arizona.edu/~jwatkins/O3_mle.pdf
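As a quick illustration of that sample-size point, here is a sketch comparing a maximum-likelihood gamma fit on 4 points versus 1000 points; the true parameters, seed, and use of scipy.stats.gamma.fit with floc=0 are my own choices for the demo, not anything from the question:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_shape, true_scale = 2.0, 3.0

for n in (4, 1000):
    data = rng.gamma(true_shape, true_scale, size=n)
    # floc=0 pins the location parameter so only shape and scale are estimated
    shape_hat, loc_hat, scale_hat = stats.gamma.fit(data, floc=0)
    print(f"n={n}: shape={shape_hat:.3f}, scale={scale_hat:.3f}")
```

With n=4 the estimates swing wildly from seed to seed; with n=1000 they land close to the true values.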

I don't know what you are trying to do with the KL divergence either. Your actual population is already normalized to 1, and so is the sample distribution, so you can plug those elements directly into the KL divergence for a discrete score. What your code does instead is stretch your original list values and add gamma noise to them via your defined gamma function. After that gamma corruption of your original population data, you are likely to see a larger KL divergence.

I'm sorry, I just don't see what you are trying to accomplish here. If I were to guess your original intent, I'd say your problem is that you need hundreds of data points to guarantee convergence with any gamma-fitting program.

EDIT: I just wanted to add something regarding the KL divergence. If you intend to score your fitted gamma distributions with the KL divergence, it's better to use an analytical solution where the shape and scale parameters of your two gamma distributions are the inputs. Randomly sampling noisy data points won't be helpful unless you take something like 100,000 random samples, histogram them into 1,000 bins or so, and then normalize the histogram -- I'm just throwing those numbers out, but you will want to approximate the continuous distributions as well as you can, and that is hard because gamma distributions have long tails. This document has the analytical solution for a generalized distribution: https://arxiv.org/pdf/1401.6853.pdf . Just set that third parameter to 1, simplify, and then code up a function.
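A sketch of that two-parameter special case, assuming the closed form for KL between Gamma(shape α, rate β) distributions given in the Stack Exchange answer linked in the comments below (psi is the digamma function):

```python
import numpy as np
from scipy.special import gammaln, psi

def kl_gamma(alpha_p, beta_p, alpha_q, beta_q):
    """KL(P || Q) where P = Gamma(shape=alpha_p, rate=beta_p)
    and Q = Gamma(shape=alpha_q, rate=beta_q)."""
    return ((alpha_p - alpha_q) * psi(alpha_p)
            - gammaln(alpha_p) + gammaln(alpha_q)
            + alpha_q * (np.log(beta_p) - np.log(beta_q))
            + alpha_p * (beta_q - beta_p) / beta_p)
```

You would feed in the estimated shape and rate of each fitted distribution; identical parameters give a KL of exactly 0.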

  • Thanks Pedro. Is that possible to give the Python code for the analytical solution of the generalized distribution? The paper is quite hard for me. – Jennifer Aug 09 '19 at 06:24
  • Probably. Someone on Stack Exchange wrote out the equation for the KL divergence between two two-parameter gamma distributions: https://stats.stackexchange.com/questions/11646/kullback-leibler-divergence-between-two-gamma-distributions . You just need to code the equation into a function and then plug in your estimated parameters to get your KL divergence. – Pedro Relich Aug 13 '19 at 20:33