
I was wondering how one would calculate pointwise mutual information for text classification. More precisely, I want to classify tweets into categories. I have a dataset of annotated tweets, and, for each category, a dictionary of words that belong to that category. Given this information, how can I calculate the PMI for each category per tweet, in order to classify a tweet into one of these categories?

Olivier_s_j

1 Answer


PMI is a measure of association between a feature (in your case a word) and a class (category), not between a document (tweet) and a category. The formula is available on Wikipedia:

                  P(x, y)
pmi(x, y) = log -----------
                 P(x) P(y)

In that formula, X is the random variable that models the occurrence of a word, and Y models the occurrence of a class. For a given word x and a given class y, you can use PMI to decide whether a feature is informative, and you can do feature selection on that basis. Having fewer features often improves the performance of your classification algorithm and speeds it up considerably. The classification step, however, is separate: PMI only helps you select better features to feed into your learning algorithm.
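
To make that concrete, here is a minimal sketch of scoring every (word, class) pair on a set of labelled tweets. The function and variable names are mine, chosen purely for illustration; they do not come from any particular library:

    from collections import Counter
    from math import log

    def pmi_scores(documents, labels):
        """PMI between each word and each class, estimated from labelled documents.

        documents: list of token lists (one list of words per tweet)
        labels:    list of class labels, parallel to documents
        Returns a dict mapping (word, label) -> PMI score.
        """
        n_docs = len(documents)
        word_counts = Counter()          # documents containing each word
        class_counts = Counter(labels)   # documents per class
        joint_counts = Counter()         # documents containing the word AND having the class

        for tokens, label in zip(documents, labels):
            for word in set(tokens):     # count each word at most once per document
                word_counts[word] += 1
                joint_counts[(word, label)] += 1

        scores = {}
        for (word, label), joint in joint_counts.items():
            p_xy = joint / n_docs
            p_x = word_counts[word] / n_docs
            p_y = class_counts[label] / n_docs
            scores[(word, label)] = log(p_xy / (p_x * p_y))
        return scores

Keeping only the top-scoring words per class then gives a smaller feature set to feed into whatever classifier you train on the tweets.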


Edit: One thing I didn't mention in the original post is that PMI is sensitive to word frequencies. Let's rewrite the formula as

                  P(x, y)          P(x|y)
pmi(x, y) = log ----------- = log --------
                 P(x) P(y)          P(x)

When x and y are perfectly correlated, P(x|y) = P(y|x) = 1, so pmi(x, y) = log(1/P(x)) = -log P(x). Less frequent x-es (words) will therefore have a higher PMI score than frequent x-es, even if both are perfectly correlated with y.
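
A quick worked example (the numbers are made up purely for illustration): in a corpus of 100 tweets, suppose word A occurs in exactly the 2 tweets of class c1, and word B occurs in exactly the 50 tweets of class c2. Both words predict their class perfectly, yet

    pmi(A, c1) = log( 0.02 / (0.02 * 0.02) ) = log 50 ≈ 3.9
    pmi(B, c2) = log( 0.50 / (0.50 * 0.50) ) = log 2  ≈ 0.7

so the rare word gets a much higher score than the frequent one.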

mbatchkarov
  • `P(x)` is the probability of the word `x` (lowercase) occurring, which is the ratio between the number of documents that contain the word and the total number of documents. `P(y)` is the probability of class (category) `y`, which is calculated in a similar fashion. `P(x, y)` is the ratio between the number of documents that are *both* of label `y` and contain word `x`, and the total number of documents. – mbatchkarov Nov 21 '12 at 21:30
  • Do you really need to normalize the counts into probabilities by dividing by the number of documents? I know you get a different pmi() number, but the relative pmi() between different pairs of (X, Y) stays the same, and the actual value of the pmi doesn't mean anything anyway, right? I can only see the normalization being useful when comparing pmi's across different document sets (with different document counts). – kane Jun 26 '15 at 19:39
  • Gerlof Bouma wrote a paper titled "Normalized (Pointwise) Mutual Information in Collocation Extraction" that I believe addresses sensitivity to word frequencies. Basically just divide pmi by `-log(P(x, y))`. – Kevin Jin Dec 12 '15 at 04:49
  • @mbatchkarov Thanks for the detailed answer. I have a small follow-up question. What if P(x, y) is equal to zero? Log(0) is undefined, so what should I do? What does that tell me? Does it mean that x is always associated with some other class y', and is therefore very important, or should I ignore it? – zwlayer Mar 24 '20 at 10:56
  • `P(x, y) == 0` means the word `x` was never observed in a document of class `y`. You should use smoothing to work around that. Smoothing is a bit of an art in its own right, but the simplest thing you can do is to add a small positive number to all probabilities; look up `additive smoothing` (see the sketch after these comments). – mbatchkarov Mar 24 '20 at 16:25
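
Putting the comments above together (estimating the probabilities from document counts, smoothing away zero counts, and optionally normalizing as suggested via Bouma's paper), here is a rough sketch. The function name, the `alpha` parameter, and the exact smoothing scheme are my own illustration rather than anything prescribed by the answer:

    from collections import Counter
    from math import log

    def smoothed_pmi(documents, labels, alpha=0.5, normalize=False):
        """PMI over all (word, class) pairs with simple additive smoothing.

        alpha > 0 "pseudo-documents" are added to every count so that
        P(x, y) is never exactly zero, even for pairs that never co-occur.
        With normalize=True the score is divided by -log P(x, y)
        (normalized PMI), which damps the bias towards rare words.
        """
        n_docs = len(documents)
        word_counts = Counter()
        class_counts = Counter(labels)
        joint_counts = Counter()
        for tokens, label in zip(documents, labels):
            for word in set(tokens):
                word_counts[word] += 1
                joint_counts[(word, label)] += 1

        scores = {}
        for word in word_counts:
            for label in class_counts:
                p_xy = (joint_counts[(word, label)] + alpha) / (n_docs + alpha)
                p_x = (word_counts[word] + alpha) / (n_docs + alpha)
                p_y = (class_counts[label] + alpha) / (n_docs + alpha)
                score = log(p_xy / (p_x * p_y))
                if normalize:
                    score /= -log(p_xy)
                scores[(word, label)] = score
        return scores

This is the "simplest thing" variant from the last comment; in practice the choice of alpha and of the smoothing scheme matters, and normalized PMI is just one of several ways to correct for the frequency sensitivity discussed in the edit above.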