I trained n-gram language models (unigram and bigram) on an English corpus, and I'm trying to compute the probabilities of sentences from a disjoint test corpus.
For example, the training corpus consists of the 3 sentences:
1: I, am, Sam
2: Sam, I, am
3: I, do, not, like, green, eggs, and, ham
N = 14 (total number of tokens in the training corpus)
For unigram, I end up with probabilities:
Pr("i") = #("i") / N = 3/14, Pr("am") = 2/14, Pr("like") = 1/14, and so forth...
For bigram, I end up with probabilities:
Pr("am"|"i") = 2/3, Pr("do"|"i") = 1/3, and so forth...
Now I'm trying to compute the probability of the following sentence, where not all of the n-grams (unigrams or bigrams) appear in the training corpus:
I, ate, a, burrito
For unigram, I need the following probability estimates:
Pr("i"), Pr("ate"), Pr("a"), and Pr("burrito")
and for bigram, I need the following probability estimates:
Pr("ate"|"i"), Pr("a"|"ate"), Pr("burrito"|"a")
Clearly, not all of the unigrams ("ate", "a", "burrito") and bigrams (like ("i", "ate")) appear in the training corpus.
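Concretely, using the helper functions from the sketch above, the sentence probabilities I want would be computed roughly like this, and this is exactly where things break:

```python
test = ["i", "ate", "a", "burrito"]

# unigram sentence probability: product of the word probabilities
p_uni = 1.0
for w in test:
    p_uni *= p_unigram(w)        # becomes 0.0 as soon as an unseen word like "ate" shows up

# bigram sentence probability: product of the conditional probabilities
p_bi = 1.0
for prev, w in zip(test, test[1:]):
    p_bi *= p_bigram(w, prev)    # ZeroDivisionError when the history itself ("ate") was never seen
```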
I understand that you can do smoothing (like add-one smoothing) to deal with these cases:
For example, the vocabulary of the training corpus is
i, am, sam, do, not, like, green, eggs, and, ham
and you can expand the vocabulary by including new words from the new sentence:
ate, a, burrito
So the size of the expanded vocabulary would be V = 13
So for unigram, the original probability estimates Pr(w_i) = #(w_i)/N would be turned into (#(w_i) + 1) / (N + V)
So Pr("i") = 4/27, Pr("am") = 3/27, Pr("sam") = 3/27, Pr("do") = 2/27, Pr("not") = 2/27, Pr("like") = 2/27, Pr("green") = 2/27, Pr("eggs") = 2/27, Pr("and") = 2/27, Pr("ham") = 2/27
And for the 3 new words: Pr("ate") = 1/27, Pr("a") = 1/27, Pr("burrito") = 1/27
And these probabilities would still sum to 1.0.
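Continuing the sketch above (and assuming I already know the test sentence when I build the expanded vocabulary), the add-one unigram estimates would be:

```python
vocab = set(unigram_counts) | set(test)           # expanded vocabulary, V == 13
V = len(vocab)

def p_unigram_add1(w):
    # add-one (Laplace) smoothed unigram estimate
    return (unigram_counts[w] + 1) / (N + V)      # 4/27 for "i", 1/27 for "burrito"

# sanity check: the smoothed estimates still sum to 1.0 over the expanded vocabulary
assert abs(sum(p_unigram_add1(w) for w in vocab) - 1.0) < 1e-12
```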
Though this can handle the cases where some n-grams were not in the original training set, you have to know the set of "new" words when you estimate the probabilities with (#(w_i) + 1) / (N + V), where V is the size of the combined vocabulary of the original training set (10 words) and the test corpus (3 new words). I think this is equivalent to assuming that every new unigram or bigram in the test corpus occurs exactly once, no matter how many times it actually occurs.
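For the bigrams, I assume the add-one estimate would look like this (again just a sketch; V is the same expanded vocabulary size as above):

```python
def p_bigram_add1(w, prev):
    # add-one smoothed conditional: (#(prev, w) + 1) / (#(prev) + V)
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

# e.g. p_bigram_add1("ate", "i") == (0 + 1) / (3 + 13) == 1/16
```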
My question is: is this the way out-of-vocabulary tokens are typically handled when computing the probability of a sentence?
The NLTK module nltk.model.NgramModel seems to have been removed because of bugs, so I would have to implement this myself. Another question: are there Python modules other than NLTK that implement n-gram training and computing the probability of a sentence?
Thanks in advance!