I have two datasets with the same shape, (576, 450, 5), where 576 is the number of examples, 450 is the number of time points, and 5 is the number of channels.

I want to calculate the entropy of, and the KL divergence between, these two datasets. I know that entropy and KL divergence are defined on probability distributions, but my data are just numerical values, not probability distributions. So how can I calculate these quantities for my data? Should I convert the data to probability distributions? If so, how can I do that with 3D data? Thank you.

ali

1 Answer

You can bin each dataset, using quantiles as the cut points, to obtain an empirical distribution for it. Entropy, KL divergence, mutual information, or any other measure or distance defined on one or more probability distributions can then be computed from the binned distributions.

In TensorFlow Probability, the cut points can be obtained with tfp.stats.quantiles, for example tfp.stats.quantiles(x, num_quantiles=4, interpolation='nearest'), where x is a dataset and num_quantiles can be set to any reasonable number.

The crucial thing to be careful of here is that the cut points should be the same for the two datasets (i.e., both binned random variables must have the same support).

More generally, you need to fit a statistical model to each of the two datasets and then use those models to compute these metrics. In the approach above, the statistical model is a categorical distribution over the bins.

In sum, you can either:

  1. Call tfp.stats.quantiles with num_quantiles on one dataset and then reuse the resulting cut_points to bin the other dataset. To do so you will need tfp.stats.find_bins.

  2. Decide on the cut_points based on some other metric (equal partitions of the support of the data?) and then call tfp.stats.find_bins on both datasets.
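As a rough illustration of the two options, here is a minimal sketch. It uses NumPy rather than the tfp calls named above so that it runs without TensorFlow; the helper `bin_counts`, the synthetic arrays `a` and `b`, and the choice of four bins are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for two flattened datasets (not real data).
a = rng.normal(0.0, 1.0, size=1000)
b = rng.normal(0.5, 1.0, size=1000)

num_bins = 4

# Option 1: quantile cut points taken from `a`, then reused for `b`.
edges_q = np.quantile(a, np.linspace(0.0, 1.0, num_bins + 1))

# Option 2: equal-width cut points over the joint range of both datasets.
lo = min(a.min(), b.min())
hi = max(a.max(), b.max())
edges_w = np.linspace(lo, hi, num_bins + 1)

def bin_counts(x, edges):
    """Count how many values of x fall into each of the len(edges) - 1 bins."""
    # Only the interior edges are used, so values beyond the outer edges
    # are assigned to the first or last bin instead of being dropped.
    idx = np.searchsorted(edges[1:-1], x, side="right")
    return np.bincount(idx, minlength=len(edges) - 1)

counts_a_q = bin_counts(a, edges_q)
counts_b_q = bin_counts(b, edges_q)
counts_a_w = bin_counts(a, edges_w)
counts_b_w = bin_counts(b, edges_w)
```

With option 1 the bins are roughly equally populated for `a` but not for `b`; with option 2 neither dataset is guaranteed to fill every bin, so some bins may end up empty.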

The alternative I would favour is a variant of option 2: compute the quantiles of the two datasets concatenated together, and then use the resulting cut_points to bin both datasets.
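A minimal NumPy sketch of this pooled variant follows. The question's 3D layout is handled here by treating each channel separately and pooling examples and time points within a channel; that is one possible choice, not the only one. The function names, the synthetic data (smaller than the real (576, 450, 5) arrays), and num_quantiles=10 are assumptions for illustration.

```python
import numpy as np

def pooled_cut_points(a, b, num_quantiles):
    """Quantile edges of the concatenated samples (num_quantiles + 1 values)."""
    pooled = np.concatenate([a.ravel(), b.ravel()])
    return np.quantile(pooled, np.linspace(0.0, 1.0, num_quantiles + 1))

def bin_probabilities(x, cut_points):
    """Fraction of the values of x that fall in each bin defined by cut_points."""
    # Interior edges only, so the minimum and maximum land in the end bins.
    idx = np.searchsorted(cut_points[1:-1], x.ravel(), side="right")
    counts = np.bincount(idx, minlength=len(cut_points) - 1)
    return counts / counts.sum()

rng = np.random.default_rng(0)
# Synthetic stand-ins shaped (examples, time points, channels).
data_a = rng.normal(0.0, 1.0, size=(60, 45, 5))
data_b = rng.normal(0.5, 1.5, size=(60, 45, 5))

num_quantiles = 10
probs_a, probs_b = [], []
for c in range(data_a.shape[-1]):
    # Same cut points for both datasets, so both share the same support.
    edges = pooled_cut_points(data_a[..., c], data_b[..., c], num_quantiles)
    probs_a.append(bin_probabilities(data_a[..., c], edges))
    probs_b.append(bin_probabilities(data_b[..., c], edges))

probs_a = np.stack(probs_a)  # shape (channels, num_quantiles)
probs_b = np.stack(probs_b)
```

Each row of `probs_a` and `probs_b` is a categorical distribution over the same bins for one channel.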

Once both datasets are binned with the same cut points, the normalized bin counts give a categorical probability distribution for each dataset, and from there these measures/distances can be computed easily.
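For completeness, a minimal sketch of that last step, assuming the bin probabilities are already available as 1-D arrays that sum to 1. The vectors `p` and `q` are made-up examples, and the small `eps` smoothing of `q` is an assumption added so that an empty bin in `q` does not make the KL divergence infinite.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a categorical distribution p."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # terms with p = 0 contribute 0 by convention
    return float(-np.sum(p[nz] * np.log(p[nz])))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two categorical distributions on the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Assumption: smooth q slightly and renormalize to avoid division by zero.
    q = (q + eps) / np.sum(q + eps)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

# Made-up bin probabilities over four shared bins.
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

h_p = entropy(p)
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
```

Note that the KL divergence is not symmetric, so `kl_pq` and `kl_qp` generally differ; which direction to report depends on which dataset you treat as the reference.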