You can use quantiles to derive the empirical distribution of each dataset as if it were binned, and use that to compute the entropy, mutual information, etc. (any measure or distance that relates to one or more probability distributions) between the binned distributions.
In TensorFlow, this can be achieved with `tfp.stats.quantiles`, as follows: `tfp.stats.quantiles(x, num_quantiles=4, interpolation='nearest')`, where you replace `x` with a dataset and set `num_quantiles` to any reasonable number.
The crucial thing to be careful of here is that the cut points should be the same for the two datasets (i.e., both binned random variables must have the same support).
More generally, you need to train/estimate a statistical model of the two datasets and then use that model to compute these metrics. In the above, the statistical model is a categorical distribution.
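As a rough sketch of this idea in plain NumPy (NumPy stands in for TensorFlow Probability here, with `np.quantile` playing the role of `tfp.stats.quantiles`; the Gaussian sample and the choice of 4 bins are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # illustrative dataset

# Quantile cut points: 5 edges -> 4 equal-mass bins.
cut_points = np.quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])

# Bin the data and normalise the counts into a categorical distribution.
counts, _ = np.histogram(x, bins=cut_points)
p = counts / counts.sum()

# Entropy of the binned (categorical) distribution; for equal-mass
# bins this is close to log(4).
entropy = -np.sum(p * np.log(p))
```

The categorical distribution `p` is the "statistical model" referred to above: every downstream measure is computed from it rather than from the raw samples.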
In sum, you can either:

1. Call `tfp.stats.quantiles` with `num_quantiles` on one dataset and then re-use the resulting `cut_points` to bin the other dataset. For this you will need `tfp.stats.find_bins`.
2. Decide on the `cut_points` based on some other criterion (e.g., equal partitions of the support of the data) and then call `tfp.stats.find_bins` on both datasets.
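A sketch of option 1 in NumPy (with `np.quantile` and `np.digitize` as stand-ins for `tfp.stats.quantiles` and `tfp.stats.find_bins`; the two Gaussian samples are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=1000)           # dataset the cut points come from
b = rng.normal(loc=0.5, size=1000)  # dataset that re-uses them

# Quantile cut points estimated on dataset a only.
edges = np.quantile(a, np.linspace(0.0, 1.0, 5))

# Bin both datasets with the same interior edges, so the two binned
# variables share the same support (values of b outside a's range fall
# into the outer bins automatically).
bins_a = np.digitize(a, edges[1:-1])
bins_b = np.digitize(b, edges[1:-1])

# Categorical distributions over the 4 shared bins.
p_a = np.bincount(bins_a, minlength=4) / a.size
p_b = np.bincount(bins_b, minlength=4) / b.size
```

By construction `p_a` is close to uniform (the edges are a's own quantiles), while `p_b` is shifted towards the upper bins.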
The alternative I would favour is a variant of option 2: use `tfp.stats.quantiles` to get the `cut_points` that correspond to the two datasets concatenated together, and then use those `cut_points` to bin both datasets.
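A NumPy sketch of this favoured variant (`np.quantile` and `np.digitize` standing in for the TFP calls; the two Gaussian samples are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(size=1000)
b = rng.normal(loc=1.0, size=1000)

# Cut points from the concatenated data, so both binned variables
# share the same support by construction.
pooled = np.concatenate([a, b])
edges = np.quantile(pooled, np.linspace(0.0, 1.0, 5))

# Bin each dataset against the shared interior edges.
bins_a = np.digitize(a, edges[1:-1])
bins_b = np.digitize(b, edges[1:-1])

p_a = np.bincount(bins_a, minlength=4) / a.size
p_b = np.bincount(bins_b, minlength=4) / b.size
```

Because the edges are equal-mass quantiles of the pooled sample, the two distributions average out to uniform over the bins, while each one individually reflects how its dataset sits within the pooled support.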
Once you have the quantiles and/or the bins, you have a categorical probability distribution describing each dataset and from there these measures/distances can be computed easily.
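For example, once the two categorical distributions are in hand, something like a KL divergence between them is a few lines (the probability values below are illustrative, not from a real dataset):

```python
import numpy as np

# Categorical distributions obtained from binning two datasets.
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Entropy of each binned distribution.
h_p = -np.sum(p * np.log(p))
h_q = -np.sum(q * np.log(q))

# KL divergence KL(p || q); finite because both distributions are
# defined over the same bins and neither has an empty bin here.
kl = np.sum(p * np.log(p / q))
```

Note that shared support matters exactly here: if a bin had probability zero under `q` but not under `p`, the KL divergence would be infinite.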