
I've used the following code to compute the Mutual Information and Chi-Squared values for feature selection in sentiment analysis.

import math

MI = ( (N11/N)*math.log((N*N11)/((N11+N10)*(N11+N01)), 2)
     + (N01/N)*math.log((N*N01)/((N01+N00)*(N11+N01)), 2)
     + (N10/N)*math.log((N*N10)/((N10+N11)*(N00+N10)), 2)
     + (N00/N)*math.log((N*N00)/((N10+N00)*(N01+N00)), 2) )

where N11, N10, N01 and N00 are the observed co-occurrence counts of the two features in my data set (N11: both occur, N10: only the first occurs, N01: only the second occurs, N00: neither occurs) and N = N11 + N10 + N01 + N00.
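As a quick cross-check (my addition, not part of the original computation, and assuming scikit-learn is available): sklearn.metrics.mutual_info_score accepts a precomputed contingency table and returns MI in nats, so dividing by ln(2) should reproduce the hand-computed value in bits.

import math
from sklearn.metrics import mutual_info_score

# counts reported further down in the question
N00, N01, N10, N11 = 312412, 276116, 51120, 68846

# mutual_info_score returns nats when given a contingency table;
# divide by ln(2) to convert to bits
mi_nats = mutual_info_score(None, None, contingency=[[N11, N10], [N01, N00]])
print(mi_nats / math.log(2))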

NOTE: I am trying to calculate the Mutual Information and Chi-Squared values between two features, not between a particular feature and a class. I'm doing this to find out whether the two features are related in any way.

The Chi-Squared formula I've used is:

E00 = N*((N00+N10)/N)*((N00+N01)/N)
E01 = N*((N01+N11)/N)*((N01+N00)/N)
E10 = N*((N10+N11)/N)*((N10+N00)/N)
E11 = N*((N11+N10)/N)*((N11+N01)/N)

chi = ((N11-E11)**2)/E11 + ((N00-E00)**2)/E00 + ((N01-E01)**2)/E01 + ((N10-E10)**2)/E10  

where E00, E01, E10 and E11 are the expected frequencies under independence.
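As a sanity check (my addition, assuming SciPy is available): scipy.stats.chi2_contingency computes the same Pearson statistic from the 2x2 table; correction=False disables Yates' continuity correction so the result matches the formula above.

from scipy.stats import chi2_contingency

# counts reported further down in the question
N00, N01, N10, N11 = 312412, 276116, 51120, 68846

# observed 2x2 contingency table for the two features
table = [[N11, N10],
         [N01, N00]]
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p, dof)  # chi2 should match the hand-rolled value; dof is 1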

By the definition of Mutual Information, a low value should mean that one feature gives little information about the other, and by the definition of Chi-Squared, a low Chi-Squared value means the data are consistent with the two features being independent.

But for a certain pair of features, I got a Mutual Information score of 0.00416 and a Chi-Squared value of 4373.9. This doesn't make sense to me: the Mutual Information score indicates that the features aren't closely related, but the Chi-Squared value seems high enough to indicate that they aren't independent either. I think I'm going wrong with my interpretation.

The values I got for the observed frequencies are:

N00 = 312412
N01 = 276116
N10 = 51120
N11 = 68846

which gives a total of N = 708494.
– Hyperboreus

Why does this question have a python tag? In fact, why is this question in stackoverflow? Wouldn't a math or statistics Q&A be more appropriate? – Warren Weckesser Jan 20 '13 at 14:12

3 Answers


MI and Pearson's chi-squared statistic are, under the usual large-sample conditions, directly proportional. This is quite well known. A proof is given in:

Morris, A.C. (2002), "An information theoretic measure of sequence recognition performance". It can be downloaded from this page:

https://sites.google.com/site/andrewcameronmorris/Home/publications

Therefore, unless there is some mistake in your calculations, if one is high (or low) the other must be high (or low) as well, once you account for the proportionality factor, which involves the sample size N.
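To make the proportionality concrete (my addition, using the counts from the question): the likelihood-ratio statistic satisfies G² = 2·N·ln(2)·MI when MI is measured in bits, and Pearson's X² approximates G² for large samples, so X² ≈ 2·N·ln(2)·MI.

import math

# observed counts from the question
N00, N01, N10, N11 = 312412, 276116, 51120, 68846
N = N00 + N01 + N10 + N11

def mi_bits(n00, n01, n10, n11):
    """Mutual information in bits between two binary features."""
    n = n00 + n01 + n10 + n11
    total = 0.0
    # (cell count, row marginal, column marginal) for each of the 4 cells
    for nij, row, col in [(n11, n11 + n10, n11 + n01),
                          (n10, n11 + n10, n10 + n00),
                          (n01, n01 + n00, n11 + n01),
                          (n00, n01 + n00, n10 + n00)]:
        total += (nij / n) * math.log2(n * nij / (row * col))
    return total

mi = mi_bits(N00, N01, N10, N11)
print(mi)                        # roughly the 0.004 reported in the question
print(2 * N * math.log(2) * mi)  # should land close to the chi-squared 4373.9

So the two numbers in the question are in fact consistent: MI is a per-observation quantity, while chi-squared carries the factor of N ≈ 700,000, which is what inflates it.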


The chi-squared independence test examines raw counts, while the mutual information score examines only the marginal and joint probability distributions. Hence chi-squared, unlike MI, also takes the sample size into account.

If the dependence between x and y is very subtle, then knowing one won't help very much in terms of predicting the other. However, as the size of the dataset increases we can become increasingly confident that some relationship exists.
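To make that concrete (an illustrative sketch of my own, not part of the original answer): scale every cell of the table by the same factor. The proportions, and hence the MI, are unchanged, but the chi-squared statistic grows linearly with the sample size.

import math

def mi_and_chi2(n00, n01, n10, n11):
    """Return (MI in bits, Pearson chi-squared) for a 2x2 table."""
    n = n00 + n01 + n10 + n11
    mi = chi2 = 0.0
    for nij, row, col in [(n11, n11 + n10, n11 + n01),
                          (n10, n11 + n10, n10 + n00),
                          (n01, n01 + n00, n11 + n01),
                          (n00, n01 + n00, n10 + n00)]:
        expected = row * col / n
        mi += (nij / n) * math.log2(n * nij / (row * col))
        chi2 += (nij - expected) ** 2 / expected
    return mi, chi2

counts = (312412, 276116, 51120, 68846)
for scale in (1, 10):
    mi, chi2 = mi_and_chi2(*(c * scale for c in counts))
    print(scale, mi, chi2)  # MI stays the same; chi2 grows by the scale factor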

– aes

You can try https://github.com/ranmoshe/Inference, which calculates both MI and a p-value using the chi-squared statistic.

It also knows how to calculate the degrees of freedom for each feature, including the case of a conditional group (where the degrees of freedom for a feature may differ between values).

– Ran Moshe