I have two datasets that contain 40000 samples. I want to calculate the Kullback-Leibler divergence between these two datasets in python. Is there any efficient way of doing this in python?
- Is [this](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html) what you are looking for? – Sam Craig Jul 13 '17 at 16:50
- OP, bear in mind that KL divergence is only defined for distributions -- if you have sample data, you will have to fit some distribution or distributions to the data and then compute KL divergence from that. – Robert Dodier Jul 13 '17 at 16:56
- How can I best fit a dataset to a distribution? Actually my problem is fitting a dataset to a distribution. – user3104352 Jul 13 '17 at 16:58
- Classic example of the [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) :D – Antimony Oct 11 '17 at 20:58
- Refer to my answer on this page: https://stackoverflow.com/a/63370136/8653046 – Amir Charkhi Aug 12 '20 at 05:12
1 Answer
Edit:
OK, I figured out that this doesn't work in the input space, so the old explanation below is probably wrong, but I'll keep it anyway.
Here are my new thoughts:
In my senior project, I'm using an algorithm called AugMix. In that algorithm, they calculate the Jensen-Shannon divergence between two augmented images, which is a symmetrized form of the KL divergence.
They used the model output as the probability distribution of the dataset. The idea is to fit a model to a dataset, then interpret the output of the model as the probability density function.
For example, suppose you fitted a model to a dataset without overfitting. Then (assuming this is a classification problem) you feed the logits (the output of the last layer) to the softmax function to get a probability for each class (sometimes the softmax is added as a layer at the end of the network, so be careful). The output of the softmax function (or layer) can be interpreted as P(Y|X_{1}), where X_{1} is the input sample and Y is the ground-truth class. Then you make a prediction for another sample X_{2}, giving P(Y|X_{2}), where X_{1} and X_{2} come from different datasets (say dataset_1 and dataset_2) and the model was not trained on either of them.
Then the KL divergence between dataset_1 and dataset_2 can be calculated as KL(dataset_1 || dataset_2) = P(Y|X_{1}) * log(P(Y|X_{1}) / P(Y|X_{2})), summed over the classes Y.
Make sure that X_{1} and X_{2} belong to the same class.
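A minimal sketch of this first idea, assuming a trained classifier `model` that returns logits for a sample; the model, the sample pair, and the SciPy helpers used here are assumptions for illustration, not part of the original answer:

```python
from scipy.special import softmax
from scipy.stats import entropy

def softmax_kl(logits_1, logits_2):
    """KL divergence between the softmax distributions of two logit vectors."""
    p = softmax(logits_1)  # P(Y | X_1)
    q = softmax(logits_2)  # P(Y | X_2)
    return entropy(p, q)   # sum over classes of p * log(p / q)

# Hypothetical usage: x_1 from dataset_1, x_2 from dataset_2, same class
# kl = softmax_kl(model(x_1), model(x_2))
```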
I'm not sure if this is the correct way. Alternatively, you can train two different models (model_1 and model_2) using different datasets (dataset_1 and dataset_2) and then calculate the KL divergence on the predictions of those two models using the samples of another dataset called dataset_3. In other words:
KL(dataset_1 || dataset_2) = sum x in dataset_3 model_1(x) * log(model_1(x) / model_2(x))
where model_1(x) is the softmax output of model_1 (trained on dataset_1 without overfitting) for the correct label.
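A rough sketch of this two-model variant; model_1, model_2, and dataset_3 are assumptions (two classifiers returning logits and an iterable of held-out samples):

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import entropy

def dataset_kl(model_1, model_2, dataset_3):
    """Average KL(model_1 || model_2) over the samples of dataset_3."""
    divergences = []
    for x in dataset_3:
        p = softmax(model_1(x))  # softmax output of the model trained on dataset_1
        q = softmax(model_2(x))  # softmax output of the model trained on dataset_2
        divergences.append(entropy(p, q))
    return float(np.mean(divergences))
```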
The latter sounds more reasonable to me, but I'm not sure about either of them. I could not find a proper answer on my own.
The things I'm going to explain below are adapted from Jason Brownlee's post on KL divergence at machinelearningmastery.com.
As far as I understood, you first have to convert your datasets into probability distributions so that you can calculate the probability of each sample from the union (or intersection?) of both datasets.
KL(P || Q) = sum x in X P(x) * log(P(x) / Q(x))
However, most of the time the intersection of the datasets is empty. For example, if you want to measure the divergence between CIFAR10 and ImageNet, there are no samples in common. The only way you can calculate this metric is to sample from the same dataset to create two different datasets, so that samples are present in both, and then calculate the KL divergence.
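For the original question (two datasets of 40000 numeric samples), a common practical workaround is to bin both sample sets into histograms over shared bin edges and compute the KL divergence between the binned distributions with scipy.stats.entropy. A minimal sketch; the bin count, the smoothing constant, and the synthetic data are my own choices:

```python
import numpy as np
from scipy.stats import entropy

def kl_from_samples(samples_p, samples_q, bins=100, eps=1e-10):
    """Histogram-based estimate of KL(P || Q) from two 1-D sample arrays."""
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    edges = np.linspace(lo, hi, bins + 1)  # shared bin edges for both datasets
    p, _ = np.histogram(samples_p, bins=edges)
    q, _ = np.histogram(samples_q, bins=edges)
    p = p / p.sum() + eps  # normalise and smooth so log(P/Q) stays finite
    q = q / q.sum() + eps
    return entropy(p, q)   # sum x in X P(x) * log(P(x) / Q(x))

# Example with two synthetic datasets of 40000 samples each:
# d1 = np.random.normal(0.0, 1.0, 40000)
# d2 = np.random.normal(0.5, 1.2, 40000)
# print(kl_from_samples(d1, d2))
```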
Lastly, you may want to check out the Wasserstein distance, which is used in GANs to compare the source distribution and the target distribution.
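As an aside, for 1-D samples SciPy provides scipy.stats.wasserstein_distance, which works directly on the raw samples without any binning or model; a small sketch with synthetic stand-in data:

```python
import numpy as np
from scipy.stats import wasserstein_distance

d1 = np.random.normal(0.0, 1.0, 40000)  # stand-ins for the two 40000-sample datasets
d2 = np.random.normal(0.5, 1.2, 40000)
print(wasserstein_distance(d1, d2))
```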

- How can I use scipy to get the generator of a probability distribution with min KL divergence in Python? – yishairasowsky May 18 '21 at 16:24
- Idk exactly, but you can calculate it using the new technique I proposed in the edit. It requires a well-trained classification model. – egirgin May 25 '21 at 21:54