
Consider the usual example, which replicates Example 13.1 of An Introduction to Information Retrieval (https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf):

library("quanteda")
library("quanteda.textmodels")  # textmodel_nb() lives here in recent quanteda versions

txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")

trainingset <- dfm(tokens(txt), tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)

tmod1 <- textmodel_nb(trainingset, y = trainingclass, prior = "docfreq")

According to the docs, PcGw is the posterior class probability given the word. How is it computed? I thought what we cared about was the other way around, that is, P(word | class).

> tmod1$PcGw
       features
classes   Chinese   Beijing  Shanghai     Macao     Tokyo     Japan
      N 0.1473684 0.2058824 0.2058824 0.2058824 0.5090909 0.5090909
      Y 0.8526316 0.7941176 0.7941176 0.7941176 0.4909091 0.4909091

Thanks!

ℕʘʘḆḽḘ

1 Answer


The application is clearly explained in the book chapter you cite, but in essence the difference is that PcGw is the "probability of the class given the word", while PwGc is the "probability of the word given the class". The former is the posterior, and is what we need to compute the probability of class membership for a group of words using the joint probability (in quanteda, this is applied via the predict() function). The latter is simply the likelihood, which comes from the relative frequencies of the features in each class, smoothed by default by adding one to the counts within each class.
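
To make this concrete, the posterior can be reproduced by applying Bayes' rule feature by feature. Here is a minimal sketch in base R (not part of the quanteda API), assuming the model fitted above and the "docfreq" prior it implies, i.e. P(N) = 1/4 and P(Y) = 3/4 from the four labelled training documents:

# P(c|w) is proportional to P(w|c) * P(c), normalised over classes for each feature
prior <- c(N = 1/4, Y = 3/4)[rownames(tmod1$PwGc)]  # "docfreq" prior, aligned to row order
numerator <- tmod1$PwGc * prior                     # each class row times its prior
sweep(numerator, 2, colSums(numerator), "/")        # divide each column by its total
#        features
# classes   Chinese   Beijing  Shanghai     Macao     Tokyo     Japan
#       N 0.1473684 0.2058824 0.2058824 0.2058824 0.5090909 0.5090909
#       Y 0.8526316 0.7941176 0.7941176 0.7941176 0.4909091 0.4909091

which matches the tmod1$PcGw shown in the question.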

You can verify this yourself as follows. First, group the training documents by training class, and then smooth the counts.

trainingset_bygroup <- dfm_group(trainingset[1:4, ], trainingclass[-5]) %>%
    dfm_smooth(smoothing = 1)
trainingset_bygroup
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
#     features
# docs Chinese Beijing Shanghai Macao Tokyo Japan
#    N       2       1        1     1     2     2
#    Y       6       2        2     2     1     1

Then you can see that the (smoothed) word likelihoods are the same as PwGc.

trainingset_bygroup / rowSums(trainingset_bygroup)
# Document-feature matrix of: 2 documents, 6 features (0.0% sparse).
# 2 x 6 sparse Matrix of class "dfm"
#     features
# docs   Chinese   Beijing  Shanghai     Macao      Tokyo      Japan
#    N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
#    Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857

tmod1$PwGc
#        features
# classes   Chinese   Beijing  Shanghai     Macao      Tokyo      Japan
#       N 0.2222222 0.1111111 0.1111111 0.1111111 0.22222222 0.22222222
#       Y 0.4285714 0.1428571 0.1428571 0.1428571 0.07142857 0.07142857

But you probably care more about P(class|word), since that is what Bayes' formula is all about, and it incorporates the prior class probabilities P(c).
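
To see how the per-word quantities combine into a document-level posterior (which is what predict() reports), here is a rough hand calculation in base R for the held-out document d5, again assuming the "docfreq" prior P(N) = 1/4, P(Y) = 3/4; working on the log scale simply avoids underflow:

# log P(c|d5) = log P(c) + sum over words of count(w, d5) * log P(w|c), then normalise
counts_d5 <- as.numeric(as.matrix(trainingset["d5", ]))  # feature counts for d5
logpost <- log(c(N = 1/4, Y = 3/4)) +
    as.vector(log(tmod1$PwGc) %*% counts_d5)
exp(logpost) / sum(exp(logpost))
# roughly 0.31 for N and 0.69 for Y, i.e. the P(c|d5) of about 0.69 in the IIR worked example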

Ken Benoit
  • Thanks, but I am still confused. Normally the posterior we care about is P(class / sentence), which can be broken down into the prior times the product of the (smoothed) likelihoods of each word, P(word / class), which are returned by `PwGc`. So which part of the formula does `PcGw` refer to? Do you see my point? I would expect `PcGw` to only return a `Y/N` probability for each `sentence`. Instead, `PcGw` returns something at the word level. – ℕʘʘḆḽḘ Feb 07 '19 at 14:40
  • In other words, in the book there are `P(word / class)`, `P(class / document)`, `P(word)` and `P(class)`, but there is no `P(class / word)`, which is what you seem to return in `PcGw`. – ℕʘʘḆḽḘ Feb 07 '19 at 16:06
  • The PcGw is an intermediate quantity that you can choose to focus on or not. Aggregated to a text, this involves multiplying the (naively assumed) independent probabilities based on its words. If you think of "d5" as a single word from the training set, however, then PcGw is equivalent to the P(c|d5) on p261 of the IIR text. – Ken Benoit Feb 08 '19 at 00:12