There's a nice tutorial here: https://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
Consider an n-gram language model (here, a bigram model) without smoothing:
p(w_i | w_i-1) = c(w_i-1 w_i) / c(w_i-1)
p(w_1, w_2 ... w_n) = product_i=1_to_n( p(w_i | w_i-1) )
In code:
from collections import Counter
from functools import reduce
from operator import mul
from nltk import ngrams

def prob_product(prob_list):
    # Multiply a list of probabilities together.
    return reduce(mul, prob_list, 1)

text = [['<s>', 'John', 'read', 'Moby', 'Dick', '</s>'],
        ['<s>', 'Mary', 'read', 'a', 'different', 'book', '</s>'],
        ['<s>', 'She', 'read', 'a', 'book', 'by', 'Cher', '</s>']]

# Count all bigrams and unigrams in the corpus.
bigram_counts = sum([Counter(ngrams(t, 2)) for t in text], Counter())
unigram_counts = sum([Counter(ngrams(t, 1)) for t in text], Counter())

# e.g. c(<s> John) and c(<s>)
count_S_John = bigram_counts[('<s>', 'John')]
count_S = unigram_counts[('<s>',)]

sentence = '<s> John read a book </s>'.split()

# p(sentence) = product of c(w_i-1 w_i) / c(w_i-1) over its bigrams
prob_S_John_read_a_book = prob_product([bigram_counts[bg] / unigram_counts[bg[:-1]]
                                        for bg in ngrams(sentence, 2)])
print(prob_S_John_read_a_book)  # 1/18 ≈ 0.0556

for bg in ngrams(sentence, 2):
    print(bg, bigram_counts[bg], unigram_counts[bg[:-1]])
[out]:
0.05555555555555555
('<s>', 'John') 1 3
('John', 'read') 1 1
('read', 'a') 2 3
('a', 'book') 1 2
('book', '</s>') 1 2
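Multiplying the bigram probabilities out: 1/3 * 1/1 * 2/3 * 1/2 * 1/2 = 2/36 = 1/18 ≈ 0.0556, which matches the printed value.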
With add-one smoothing, aka Laplace smoothing:
p(w_i | w_i-1) = (1 + c(w_i-1 w_i)) / (|V| + c(w_i-1))
where |V| is the vocabulary size, i.e. the number of unique word types (usually counted without <s> and </s>).
So in code:
# |V| = len(unigram_counts) - 2, the vocabulary size excluding <s> and </s>
laplace_prob_S_John_read_a_book = prob_product(
    [(1 + bigram_counts[bg]) / (len(unigram_counts) - 2 + unigram_counts[bg[:-1]])
     for bg in ngrams(sentence, 2)])
print(laplace_prob_S_John_read_a_book)

for bg in ngrams(sentence, 2):
    print(bg, 1 + bigram_counts[bg], len(unigram_counts) - 2 + unigram_counts[bg[:-1]])
[out]:
0.00012075836251660427
('<s>', 'John') 2 14
('John', 'read') 2 12
('read', 'a') 3 14
('a', 'book') 2 13
('book', '</s>') 2 13
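That is, 2/14 * 2/12 * 3/14 * 2/13 * 2/13 = 48/397488 ≈ 0.000121, matching the printed value.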
Note: len(unigram_counts) - 2 accounts for removing <s> and </s> from the vocabulary size.
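If you want to build that vocabulary explicitly, here is a minimal sketch (the set name vocab is just for illustration):
vocab = {word for sent in text for word in sent} - {'<s>', '</s>'}
assert len(vocab) == len(unigram_counts) - 2  # |V| = 11 for this toy corpus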
The above is the how.
Q: Why doesn't |V| take into account <s> and </s>?
A: One possible reason is that we never consider empty sentences in language models, so <s> and </s> can't stand on their own, and the vocabulary |V| excludes them.
Q: Is it okay to include them in |V|?
A: Actually, if |V| is sufficiently large, adding 2 for <s> and </s> makes little difference. As long as |V| is fixed, used consistently in all the computations, and sufficiently large, the probability of any sentence relative to another sentence under the same language model shouldn't be too different.
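To make that concrete, here is a minimal sketch reusing prob_product, the counts, and sentence from above (the helper laplace_prob and the large vocabulary size 10000 are just for illustration, not part of the original code):
def laplace_prob(sent, V):
    # Add-one smoothed sentence probability for an arbitrary vocabulary size V.
    return prob_product([(1 + bigram_counts[bg]) / (V + unigram_counts[bg[:-1]])
                         for bg in ngrams(sent, 2)])

# With the toy vocabulary, counting <s> and </s> (|V| = 13 instead of 11)
# still shifts the absolute probability noticeably...
print(laplace_prob(sentence, 13) / laplace_prob(sentence, 11))        # ≈ 0.49
# ...but with a realistically large vocabulary the +2 is negligible.
print(laplace_prob(sentence, 10002) / laplace_prob(sentence, 10000))  # ≈ 0.999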