There's a nice tutorial here: https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
Consider an n-gram language model (here, a bigram model) without smoothing:
p(w_i | w_i-1) = c(w_i-1 w_i) / c(w_i-1)
p(w_1, w_2 ... w_n) = product_i=1_to_n( p(w_i | w_i-1) )
In code:
from collections import Counter
from functools import reduce
from operator import mul
from nltk import ngrams

def prob_product(prob_list):
    # Multiply a list of probabilities together.
    return reduce(mul, prob_list, 1)

text = [['<s>', 'John', 'read', 'Moby', 'Dick', '</s>'],
        ['<s>', 'Mary', 'read', 'a', 'different', 'book', '</s>'],
        ['<s>', 'She', 'read', 'a', 'book', 'by', 'Cher', '</s>']]

# Count all bigrams and unigrams in the corpus.
bigram_counts = sum([Counter(ngrams(t, 2)) for t in text], Counter())
unigram_counts = sum([Counter(ngrams(t, 1)) for t in text], Counter())

# e.g. c(<s> John) and c(<s>)
count_S_John = bigram_counts[('<s>', 'John')]
count_S = unigram_counts[('<s>',)]

sentence = '<s> John read a book </s>'.split()

# p(sentence) = product of c(w_i-1 w_i) / c(w_i-1) over its bigrams
prob_S_John_read_a_book = prob_product([bigram_counts[bg] / unigram_counts[bg[:-1]]
                                        for bg in ngrams(sentence, 2)])
print(prob_S_John_read_a_book)  # 1/18 ≈ 0.0556

for bg in ngrams(sentence, 2):
    print(bg, bigram_counts[bg], unigram_counts[bg[:-1]])
[out]:
0.05555555555555555
('<s>', 'John') 1 3
('John', 'read') 1 1
('read', 'a') 2 3
('a', 'book') 1 2
('book', '</s>') 1 2
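Multiplying the bigram probabilities out: 1/3 * 1/1 * 2/3 * 1/2 * 1/2 = 2/36 = 1/18 ≈ 0.0556, which matches the printed value.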
With add-one smoothing, aka Laplace smoothing:
p(w_i | w_i-1) = (1 + c(w_i-1 w_i)) / (|V| + c(w_i-1))
where |V| is the vocabulary size, i.e. the number of unique word types (usually counted without <s> and </s>).
So in code:
# |V| = len(unigram_counts) - 2, the vocabulary size excluding <s> and </s>
laplace_prob_S_John_read_a_book = prob_product(
    [(1 + bigram_counts[bg]) / (len(unigram_counts) - 2 + unigram_counts[bg[:-1]])
     for bg in ngrams(sentence, 2)])
print(laplace_prob_S_John_read_a_book)

for bg in ngrams(sentence, 2):
    print(bg, 1 + bigram_counts[bg], len(unigram_counts) - 2 + unigram_counts[bg[:-1]])
[out]:
0.00012075836251660427
('<s>', 'John') 2 14
('John', 'read') 2 12
('read', 'a') 3 14
('a', 'book') 2 13
('book', '</s>') 2 13
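That is, 2/14 * 2/12 * 3/14 * 2/13 * 2/13 = 48/397488 ≈ 0.000121, matching the printed value.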
Note: len(unigram_counts) - 2 accounts for removing <s> and </s> from the vocabulary size.
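If you want to build that vocabulary explicitly, here is a minimal sketch (the set name vocab is just for illustration):
vocab = {word for sent in text for word in sent} - {'<s>', '</s>'}
assert len(vocab) == len(unigram_counts) - 2  # |V| = 11 for this toy corpus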
The above is the how.
Q: Why doesn't |V| take into account <s> and </s>?
A: One possible reason is that we never consider empty sentences in language models, so <s> and </s> can't stand on their own, and the vocabulary |V| excludes them.
Q: Is it okay to include them in |V|?
A: Actually, if |V| is sufficiently large, adding 2 for <s> and </s> makes little difference. As long as |V| is fixed, used consistently in all the computations, and sufficiently large, the probability of any sentence relative to another sentence under the same language model shouldn't be too different.
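To make that concrete, here is a minimal sketch reusing prob_product, the counts, and sentence from above (the helper laplace_prob and the large vocabulary size 10000 are just for illustration, not part of the original code):
def laplace_prob(sent, V):
    # Add-one smoothed sentence probability for an arbitrary vocabulary size V.
    return prob_product([(1 + bigram_counts[bg]) / (V + unigram_counts[bg[:-1]])
                         for bg in ngrams(sent, 2)])

# With the toy vocabulary, counting <s> and </s> (|V| = 13 instead of 11)
# still shifts the absolute probability noticeably...
print(laplace_prob(sentence, 13) / laplace_prob(sentence, 11))        # ≈ 0.49
# ...but with a realistically large vocabulary the +2 is negligible.
print(laplace_prob(sentence, 10002) / laplace_prob(sentence, 10000))  # ≈ 0.999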