
I am implementing online topic modeling as outlined in the paper "On-line Trend Analysis with Topic Models: #twitter trends detection topic model online". I need to find the Jensen-Shannon divergence between the word distribution of each topic t before and after an update, and classify a topic as novel if the measure exceeds a threshold. At each update the vocabulary is updated too, so the word distribution over the vocabulary has a different length after each update. How can I calculate the JS divergence between two distributions of unequal length?

tan
  • It's more of a conceptual statistics and probability question rather than a test of the code I have written. Only after the concept is clear can one venture into implementing the correct code. Maybe Stack Overflow is not the right forum to ask; I should have gone to Stack Exchange. Thanks @Tiger1 for your helpful remarks. – tan Nov 30 '13 at 17:14

2 Answers


The Jensen-Shannon divergence is a symmetrized, smoothed version of the Kullback-Leibler (KL) divergence between two probability distributions: it is the average of the KL divergence of each distribution from their mixture (their pointwise average).

You will need a good understanding of KL divergence before you can proceed. Here is a good starting point:

Given two probability distributions P and Q over the same support,

P = (p1, ..., pn), Q = (q1, ..., qn)

KL(P || Q) = sum(pi * log(pi / qi) for i = 1, ..., n)
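
A minimal sketch of this formula in Python, assuming the two distributions are already aligned index-by-index over the same support (the function name is illustrative, not from the paper):

import numpy as np

def kl_divergence(p, q):
    # KL(P || Q) = sum over i of p_i * log(p_i / q_i)
    # assumes q_i > 0 wherever p_i > 0
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # normalize to probabilities
    mask = p > 0                      # terms with p_i = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))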

KL is not symmetric, hence it is not a metric. To obtain a symmetric measure, the Jensen-Shannon divergence compares each distribution to their mixture, i.e. it averages the KL divergence of P and of Q from their pointwise average:

JSD(P || Q) = (KL(P || M) + KL(Q || M)) / 2

where M = (P + Q) / 2

In simple terms, the Jensen-Shannon divergence is the average KL divergence of the two distributions from their average distribution.
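
Under the same alignment assumption as above, this formula can be sketched directly on top of the kl_divergence helper:

def js_divergence(p, q):
    # JSD(P || Q) = (KL(P || M) + KL(Q || M)) / 2, where M is the pointwise average of P and Q
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = (p + q) / 2
    return (kl_divergence(p, m) + kl_divergence(q, m)) / 2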

I hope this helps.

Tiger1

Use np.random.choice to subsample the longer array so that the distributions p and q have the same length:

import numpy as np
import scipy.stats

def jsd(p, q, base=np.e):
    '''
    Pairwise Jensen-Shannon divergence, based on
    https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
    '''
    ## subsample the longer input (with replacement) so p and q have equal length
    if len(p) > len(q):
        p = np.random.choice(p, len(q))
    elif len(q) > len(p):
        q = np.random.choice(q, len(p))
    ## convert to np.array
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    ## normalize p, q to probabilities
    p, q = p / p.sum(), q / q.sum()
    ## mixture distribution
    m = (p + q) / 2.
    ## JSD = (KL(p||m) + KL(q||m)) / 2; scipy.stats.entropy(p, m) computes KL(p||m)
    return scipy.stats.entropy(p, m, base=base) / 2. + scipy.stats.entropy(q, m, base=base) / 2.
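
A quick usage example (the input values are made up for illustration; the result varies between runs because of the random subsampling):

p = [0.10, 0.40, 0.50]
q = [0.25, 0.25, 0.25, 0.25]   # deliberately a different length
print(jsd(p, q))               # non-negative; 0 only if the (subsampled) distributions coincide
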
Terence Yang