
Suppose I had two 2D sets of 1000 samples that look something like this:

[Scatter plot of the two 2D sample sets]

I'd like to have a metric for the amount of difference between the distributions and thought the KL divergence would be suitable.

I've been looking at sp.stats.entropy(); however, from this answer:

Interpreting scipy.stats.entropy values

it appears I need to convert the samples to a PDF first. How can one do this using four 1D arrays?

The example data above was generated as follows:

import numpy as np
import matplotlib.pyplot as plt

dist1_x = np.random.normal(0, 10, 1000)
dist1_y = np.random.normal(0, 5, 1000)

dist2_x = np.random.normal(3, 10, 1000)
dist2_y = np.random.normal(4, 5, 1000)

plt.scatter(dist1_x, dist1_y)
plt.scatter(dist2_x, dist2_y)
plt.show()

For my real data I only have the samples, not the distributions from which they came (although, if need be, one could calculate the mean and variance and assume Gaussian). Is it possible to calculate the KL divergence like this?
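(For reference: if one does go the Gaussian route, the KL divergence between two multivariate normals has a closed form, so something like the sketch below would work. The function name is mine, and fitting a Gaussian to each sample set is an assumption, not something the data guarantees.)

import numpy as np

def gaussian_kl(samples_p, samples_q):
    # Fit a multivariate Gaussian to each (n, 2) sample array and return
    # the closed-form KL(P || Q) between the two fitted Gaussians.
    mu_p, mu_q = samples_p.mean(axis=0), samples_q.mean(axis=0)
    cov_p = np.cov(samples_p, rowvar=False)
    cov_q = np.cov(samples_q, rowvar=False)
    d = samples_p.shape[1]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

# e.g. gaussian_kl(np.column_stack([dist1_x, dist1_y]),
#                  np.column_stack([dist2_x, dist2_y]))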

Dan Jackson
  • My take on this would be to first do a 2D histogram of each of your data sets and then normalize it. That way you get a direct, crude estimate of a PDF. Then use the KL divergence for the discrete case. – Swike Jul 28 '23 at 21:45
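A minimal sketch of that histogram-plus-discrete-KL suggestion might look like the following (the bin count, the shared bin edges, and the pseudocount eps are my assumptions):

import numpy as np
from scipy.stats import entropy

def histogram_kl(x1, y1, x2, y2, bins=20, eps=1e-9):
    # Bin both sample sets on a shared grid so the two histograms are comparable.
    x_edges = np.linspace(min(x1.min(), x2.min()), max(x1.max(), x2.max()), bins + 1)
    y_edges = np.linspace(min(y1.min(), y2.min()), max(y1.max(), y2.max()), bins + 1)
    p, _, _ = np.histogram2d(x1, y1, bins=[x_edges, y_edges])
    q, _, _ = np.histogram2d(x2, y2, bins=[x_edges, y_edges])
    # entropy(pk, qk) normalizes the counts and computes the discrete KL divergence;
    # the pseudocount eps keeps empty bins in q from producing infinities.
    return entropy(p.ravel() + eps, q.ravel() + eps)

# e.g. histogram_kl(dist1_x, dist1_y, dist2_x, dist2_y)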

1 Answer


There is a paper called "Kullback-Leibler Divergence Estimation of Continuous Distributions" (Pérez-Cruz, 2008), which estimates the KL divergence directly from the two sets of samples using nearest-neighbour distances, with no need to fit or bin a PDF first.

You can find an open-source implementation here: https://gist.github.com/atabakd/ed0f7581f8510c8587bc2f41a094b518
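In case the link goes stale, the 1-nearest-neighbour estimator from that paper works roughly like the sketch below (my own paraphrase using scipy's KD-tree, not the gist's exact code; the function name is mine):

import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(x, y):
    # Estimate KL(P || Q) from samples x ~ P with shape (n, d) and y ~ Q with shape (m, d).
    n, d = x.shape
    m = y.shape[0]
    # r: distance from each x_i to its nearest neighbour in x (k=2 skips the point itself)
    r = cKDTree(x).query(x, k=2)[0][:, 1]
    # s: distance from each x_i to its nearest neighbour in y
    s = cKDTree(y).query(x, k=1)[0]
    return d * np.mean(np.log(s / r)) + np.log(m / (n - 1))

# e.g. knn_kl_divergence(np.column_stack([dist1_x, dist1_y]),
#                        np.column_stack([dist2_x, dist2_y]))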

Chenghao