
I have a dataset with 683 samples and 9 features. For each column, I want to compare the KL divergence between the original dataset and a generated one.

import numpy as np
import scipy.stats as st

# histogram-based distribution for column i of the original data
originalAttribute = np.asarray(originalData[:, i]).reshape(row)
histOriginal = np.histogram(originalAttribute, bins=binSize)
hist_original_dist = st.rv_histogram(histOriginal)

# same for column i of the generated data
generatedAttribute = np.asarray(generatedData[:, i]).reshape(row)
histGenerated = np.histogram(generatedAttribute, bins=binSize)
hist_generated_dist = st.rv_histogram(histGenerated)

# evaluate both PDFs on a common grid and accumulate the KL divergence
x = np.linspace(-5, 5, 100)
summation += st.entropy(hist_original_dist.pdf(x), hist_generated_dist.pdf(x))

It returns infinity, but I think I did something wrong. Also, hist_original_dist.pdf(x) returns some values such as 2.65, which I didn't think a PDF could produce in Python.
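
For reference, here is a minimal, self-contained snippet (with made-up data, purely to illustrate the behaviour of rv_histogram) showing that pdf values above 1 can appear when the bins are narrow, since a density only has to integrate to 1:

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=0.1, size=683)   # tightly concentrated values

hist = np.histogram(sample, bins=20)
dist = st.rv_histogram(hist)

print(dist.pdf(0.0))      # roughly 3-4 here: a density above 1 is valid
print(dist.cdf(np.inf))   # 1.0 -- the density still integrates to 1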

user3104352
  • I don't know about the value of 2.65. It might have something to do with the fact that it's a histogram, and histograms represent density with respect to the bin size chosen. In any case though, with regard to KL, this is a sum of ratios. If the value in the denominator for even one of your bins is 0 (i.e. an empty bin with no data), then that particular element in the summation will be infinity, resulting in your whole KL divergence being infinity. – Tasos Papastylianou Jul 15 '17 at 00:10
  • also, see [this post](https://stats.stackexchange.com/questions/14127/how-to-compute-the-kullback-leibler-divergence-when-the-pmf-contains-0s) over at Cross-Validated – Tasos Papastylianou Jul 15 '17 at 00:19
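
Following up on the comments above, here is a sketch of one way to get finite per-column values: evaluate both histograms on the same bin edges and smooth empty bins with a small epsilon before calling st.entropy. The function name, the epsilon value, and the assumption that originalData and generatedData are (683, 9) NumPy arrays are illustrative, not from the original post.

import numpy as np
import scipy.stats as st

def kl_per_column(originalData, generatedData, binSize=20, eps=1e-10):
    kl = []
    for i in range(originalData.shape[1]):
        a = originalData[:, i]
        b = generatedData[:, i]
        # shared bin edges so both histograms partition the axis identically
        edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=binSize)
        p, _ = np.histogram(a, bins=edges, density=True)
        q, _ = np.histogram(b, bins=edges, density=True)
        # a tiny epsilon keeps empty bins in q from driving the ratio to infinity
        kl.append(st.entropy(p + eps, q + eps))
    return np.asarray(kl)

Calling kl_per_column(originalData, generatedData, binSize) would then give a length-9 array of finite divergences, one per feature.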

0 Answers