
I trained n-gram language models (unigram and bigram) on an English corpus, and I'm trying to compute the probabilities of sentences from a disjoint corpus.

For example, the training corpus consists of the 3 sentences:

1: I, am, Sam

2: Sam, I, am

3: I, do, not, like, green, eggs, and, ham

N = 14 (total number of tokens in the training corpus)

For unigram, I end up with probabilities:

Pr("i") = #("i") / N = 3/14, Pr("am") = 2/14, Pr("like") = 1/14, and so forth...

For bigram, I end up with probabilities:

Pr("am"|"i") = 2/3, Pr("do"|"i") = 1/3, and so forth...

Now, I'm trying to compute the probability of the following sentence, where not all n-grams (uni or bi) appear in the training corpus:

I, ate, a, burrito

For unigram, I need the following probability estimates:

Pr("i"), Pr("ate"), Pr("a"), and Pr("burrito")

and for bigram, I need the following probability estimates:

Pr("ate"|"i"), Pr("a"|"ate"), Pr("burrito"|"a")

Apparently not all unigrams ("ate", "burrito") and bigrams (like ("i", "ate")) appear in the training corpus.

I understand that you can do smoothing (like add-one smoothing) to deal with these cases:

For example, the vocabulary of the training corpus is

i, am, sam, do, not, like, green, eggs, and, ham

and you can expand the vocabulary by including new words from the new sentence:

ate, a, burrito

So the size of the expanded vocabulary would be V = 13

So for unigram, the original probability estimates Pr(w_i) = #(w_i)/N would be turned into (#(w_i) + 1) / (N + V)

So Pr("i") = 4/27, Pr("am") = 3/27, Pr("sam") = 3/27, Pr("do") = 2/27, Pr("not") = 2/27, Pr("like") = 2/27, Pr("green") = 2/27, Pr("eggs") = 2/27, Pr("and") = 2/27, Pr("ham") = 2/27

And for the 3 new words: Pr("ate") = 1/27, Pr("a") = 1/27, Pr("burrito") = 1/27

And these probabilities would still sum to 1.0
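Reusing the counts from the first sketch, the add-one estimates above can be checked numerically (the vocabulary expansion to V = 13 is the one just described):

```python
# Add-one (Laplace) smoothed unigram estimate with the expanded vocabulary
vocab = set(unigram_counts) | {"ate", "a", "burrito"}
V = len(vocab)                                   # 10 original words + 3 new = 13

def p_unigram_laplace(w):
    return (unigram_counts[w] + 1) / (N + V)     # (#(w) + 1) / (N + V)

print(p_unigram_laplace("i"))                    # 4/27
print(p_unigram_laplace("burrito"))              # 1/27
print(sum(p_unigram_laplace(w) for w in vocab))  # ~1.0
```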

Though this can handle the cases where some n-grams were not in the original training set, you would have to know the set of "new" words in advance when estimating the probabilities with (#(w_i) + 1) / (N + V), where V is the vocabulary size of the original training set (10) plus that of the test corpus (3). I think this is equivalent to assuming that every new unigram or bigram in the test corpus occurs only once, no matter how many times it actually occurs.

My question is: is this the way out-of-vocabulary tokens are typically handled when computing the probability of a sentence?

The NLTK module nltk.module.NGramModel seems to have been removed due to bugs (see nltk ngram model), so I have to implement it on my own. Another question: are there Python modules other than NLTK that implement n-gram training and can compute the probability of a sentence?

Thanks in advance!

cccjjj
  • Yes, that is a common way to deal with new vocabulary: count them once, now that you know they've occurred once. As for other Python packages, I can certainly recommend Google's TensorFlow. – Prune Oct 24 '16 at 21:28

1 Answer


My answer is based on a solution in "Speech and Language Processing" by Jurafsky & Martin, for the scenario where you build your vocabulary from your training data (you start with an empty dictionary).

In this case, you treat the first instance of each new, out-of-vocabulary (OOV) word as an unknown token <UNK>.

This way, all rare words are collapsed into one token, just like unseen words. The reasoning is that a single occurrence is not enough evidence for the model to base an estimate on. The unknown token thus also helps your accuracy on seen tokens.
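Here is a rough sketch of that idea (my own illustration, with a hypothetical min_count threshold, not code from the book): words seen fewer than min_count times in training are mapped to <UNK>, and OOV words at test time are mapped to <UNK> as well.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    # Keep only words seen at least min_count times; everything else becomes <UNK>
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count} | {"<UNK>"}

def replace_oov(sentence, vocab):
    return [w if w in vocab else "<UNK>" for w in sentence]

train = [["i", "am", "sam"], ["sam", "i", "am"],
         ["i", "do", "not", "like", "green", "eggs", "and", "ham"]]
vocab = build_vocab(train)
print(replace_oov(["i", "ate", "a", "burrito"], vocab))
# ['i', '<UNK>', '<UNK>', '<UNK>']
```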

I found this pdf version: https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf

About your second question: I think that with some tweaking and preprocessing of your text you can use CountVectorizer in scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
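For example, something along these lines extracts unigram and bigram counts (assuming a scikit-learn version recent enough to have get_feature_names_out); the smoothing and the sentence-probability computation still have to be implemented on top of it:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["i am sam", "sam i am", "i do not like green eggs and ham"]

# Default token_pattern drops one-character tokens like "i", so override it
vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(docs)

# Total count of each unigram/bigram across the corpus
counts = dict(zip(vectorizer.get_feature_names_out(), X.sum(axis=0).A1))
print(counts["i am"])  # 2
```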

Mehdi