
This link gives the definition of the total variation distance between two probability distributions.

I tried to calculate it in Python. I have two datasets, and I first estimated their probability distributions from histograms. Then I took the maximum difference between the two distributions, but it returns very small values. It seems I am making a mistake somewhere. Can you please help me fix it?

import numpy as np
import scipy.stats as st

# original is a numpy array of shape [45222, 1]
# synthetic is a numpy array of shape [45222, 1]
summation = 0
minOriginal = original.min()
minGenerated = synthetic.min()

maxOriginal = original.max()
maxGenerated = synthetic.max()

# common range so both histograms share the same bin edges
minHist = min(minOriginal, minGenerated)
maxHist = max(maxOriginal, maxGenerated)

originalHist = np.histogram(original, range=(minHist, maxHist))
hist_dist1 = st.rv_histogram(originalHist)

generatedHist = np.histogram(synthetic, range=(minHist, maxHist))
hist_dist2 = st.rv_histogram(generatedHist)

# largest pointwise difference between the two histogram densities
x = np.linspace(minHist, maxHist, 45000)
summation += max(abs(hist_dist1.pdf(x) - hist_dist2.pdf(x)))
user3104352
  • Some sample data and explanation on what `summation`, `hist_original_dist` and `hist_generated_dist` are will help. – Egal Aug 19 '17 at 21:07
  • I edited them. I made a mistake when copying. Thank you for fixing it. – user3104352 Aug 19 '17 at 21:10
  • It seems you're assuming the result is wrong without backing up the claim. I'd suggest starting with some smaller sample datasets and verifying that the result is mathematically incorrect. Post your findings so you can get help debugging it. – Egal Aug 19 '17 at 21:29
  • I think your problem is a misunderstanding of what total variation distance is. You think you find the total variation distance by finding the point where the difference in probability is largest. But really, it's about finding the _set_ of points where the difference in probability is largest. Computationally, there are a few ways to do that. As the Wikipedia article explains, one is just to sum up the absolute differences in all the probabilities, and divide by 2. – user54038 Jun 17 '18 at 19:43
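
The approach described in the last comment (sum the absolute differences of the probabilities and divide by 2) can be sketched as follows. This is only a minimal illustration, not a definitive implementation: the function name `total_variation_distance`, the default of 50 bins, and the randomly generated stand-in arrays are all assumptions that do not come from the question. Both samples are binned on the same edges, the counts are turned into probability masses (not densities), and half the L1 distance between the two mass vectors is returned.

```python
import numpy as np


def total_variation_distance(a, b, bins=50):
    """Histogram-based estimate of the total variation distance.

    Hypothetical helper for illustration; `bins` is an arbitrary choice.
    """
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()

    # Shared bin edges, so bin i covers the same interval for both samples.
    lo = min(a.min(), b.min())
    hi = max(a.max(), b.max())
    edges = np.linspace(lo, hi, bins + 1)

    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)

    # Convert raw counts to probability masses that each sum to 1.
    p = p / p.sum()
    q = q / q.sum()

    # Half the sum of absolute differences over all bins.
    return 0.5 * np.abs(p - q).sum()


# Stand-in data with the same shape as in the question (assumed, not the real data).
rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=(45222, 1))
synthetic = rng.normal(0.5, 1.0, size=(45222, 1))

tvd_same = total_variation_distance(original, original)
tvd_diff = total_variation_distance(original, synthetic)
tvd_disjoint = total_variation_distance([0, 0, 0], [1, 1, 1], bins=2)
print(tvd_same, tvd_diff, tvd_disjoint)
```

The result always lies between 0 (identical histograms) and 1 (no overlapping bins). Because it is computed from binned data, the estimate depends on the number of bins, so it is worth checking how stable the value is for a few different choices.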

0 Answers