
In NLP, when we use the Laplace (add-one) smoothing technique, we assume that every word is seen one more time than its actual count, and the formula looks like this:

$$P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{\sum_{w} c(w_{i-1}, w) + |V|}$$

where V is the size of the vocabulary. My question is: why do we add V when we are only considering the count of the previous word?

I only have a rough idea that since every word's count is incremented by one, we have to normalize by V, but I still don't understand it properly. As I said, we are only considering the count of the previous word, right? So why not just add 1 to it?

I also saw that if we add V, then the probabilities of all the bigrams sum to 1, which is what they should do. But is there any other explanation of why V?

heyharshal
  • Good question; it might be a better fit on datascience.stackexchange.com, but I think the question deserves an answer =) – alvas Aug 12 '23 at 11:37
  • BTW, there's a missing summation over the `c(...)` terms in both the numerator and denominator of the formula in your question. It's key to understanding the puzzle of why `+ |V|`. – alvas Aug 12 '23 at 11:43

1 Answer


The |V| term that we see in the denominator of the additive smoothing function is not actually a direct part of the definition of the probabilistic estimate of the n-gram. It is derived from:

$$P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{\sum_{w \in V} \big(c(w_{i-1}, w) + 1\big)} = \frac{c(w_{i-1}, w_i) + 1}{\sum_{w \in V} c(w_{i-1}, w) + |V|}$$

First, we start with the naive idea that if we add 1 to the count in the numerator, we should also add 1 to each count in the denominator's sum, so that the estimate stays consistent and we avoid division-by-zero errors.

But instead of adding +1 to every term of the sum over the vocabulary, we can simply add the size of the vocabulary once. That is why you see sum(c(wi-1)) + |V| in the denominator, instead of sum(c(wi-1) + 1); note the scope of the "sum" function.
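
As a quick numeric illustration (toy counts I made up, not taken from the question), adding 1 inside the sum over the vocabulary is the same as adding |V| once outside it:

# Hypothetical follow-word counts for some context word; here |V| = 4.
counts = {"the": 3, "a": 1, "cat": 0, "sat": 0}

sum(c + 1 for c in counts.values())   # +1 inside the sum  -> 8
sum(counts.values()) + len(counts)    # +|V| outside the sum -> 8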


More details

Sometimes I find it easier to see the math in code. Consider this bigram model without Laplace smoothing:

# From https://docs.python.org/3/library/itertools.html#itertools.pairwise

try:
    from itertools import pairwise
except ImportError:  # Python < 3.10 has no itertools.pairwise
    from itertools import tee
    def pairwise(iterable):
        # pairwise('ABCDEFG') --> AB BC CD DE EF FG
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)
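
# Quick sanity check (my addition, not from the original answer): pairwise
# yields the consecutive token pairs, i.e. the bigrams of a sentence.
assert list(pairwise(["a", "b", "c"])) == [("a", "b"), ("b", "c")]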

    
def count_numerator(ngrams, sentences):
    # c(w_{i-1}, w_i): how many times this exact bigram occurs across all sentences.
    return sum([list(pairwise(s)).count(ngrams) for s in sentences])

def count_denominator(word, sentences):
    # c(w_{i-1}): how many times `word` occurs as the first element of a bigram.
    return sum([sum([int(ng[0] == word) for ng in pairwise(s)]) for s in sentences])


s1 = ["<s>", "this", "is", "foo", "bar", '</s>']
s2 = ["<s>", "foo", "bar", "is", "good", '</s>']
s3 = ["<s>", "fruit", "bar", "is", "better", '</s>']
s4 = ["<s>", "this", "is", "good", "</s>"]
s5 = ["<s>", "i", "like", "this", "thing", '</s>']

sentences = [s1, s2, s3, s4, s5]


prob_bos_this = count_numerator(('<s>', 'this'), sentences) / count_denominator("<s>", sentences) 

prob_this_is = count_numerator(('this', 'is'), sentences) / count_denominator("this", sentences)

prob_is_good = count_numerator(('is', 'good'), sentences) / count_denominator("is", sentences)

prob_good_eos = count_numerator(('good', '</s>'), sentences) / count_denominator("good", sentences)

prob_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos

prob_this_is_good # Outputs: 0.1333
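
Note that this unsmoothed model assigns probability 0 to any bigram it has never seen, and a single zero wipes out the probability of the whole sentence; that is the problem Laplace smoothing addresses. For example, with the same sentences, ('is', 'bar') never occurs:

prob_is_bar = count_numerator(('is', 'bar'), sentences) / count_denominator("is", sentences)

prob_is_bar # Outputs: 0.0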

Now consider Laplace smoothing, with the incremental +1 on both the numerator and the denominator:

def count_numerator_laplace(ngrams, sentences):
    # Bigram count, plus 1 added once per sentence it is counted over.
    return sum([list(pairwise(s)).count(ngrams) + 1 for s in sentences])

def count_denominator_laplace(word, sentences):
    # Context count, plus 1 added for every bigram in every sentence.
    return sum([sum([int(ng[0] == word) + 1 for ng in pairwise(s)]) for s in sentences])


prob_bos_this = count_numerator_laplace(('<s>', 'this'), sentences) / count_denominator_laplace("<s>", sentences) 

prob_this_is = count_numerator_laplace(('this', 'is'), sentences) / count_denominator_laplace("this", sentences)

prob_is_good = count_numerator_laplace(('is', 'good'), sentences) / count_denominator_laplace("is", sentences)

prob_good_eos = count_numerator_laplace(('good', '</s>'), sentences) / count_denominator_laplace("good", sentences)

prob_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
prob_this_is_good  # 0.004212103350034384

Note that the +1 in the numerator simply adds 1 to each c(wi-1, wi) count. The +1 in the denominator is added once for every bigram in the corpus, not only the bigrams of the form (wi-1, *).


Now, since we are adding +1 for every bigram in the same fashion, we can carry the +1 out of the inner sum and just add the number of bigrams that exist, e.g.

def count_numerator_laplace(ngrams, sentences):
    return sum([list(pairwise(s)).count(ngrams) + 1 for s in sentences])

def count_denominator_laplace(word, sentences):
    # Same counts as before, but the +1s are carried out of the inner sum
    # and added once as the number of bigrams in the sentence.
    return sum([sum([int(ng[0] == word) for ng in pairwise(s)]) + len(list(pairwise(s))) for s in sentences])


prob_bos_this = count_numerator_laplace(('<s>', 'this'), sentences) / count_denominator_laplace("<s>", sentences) 

prob_this_is = count_numerator_laplace(('this', 'is'), sentences) / count_denominator_laplace("this", sentences)

prob_is_good = count_numerator_laplace(('is', 'good'), sentences) / count_denominator_laplace("is", sentences)

prob_good_eos = count_numerator_laplace(('good', '</s>'), sentences) / count_denominator_laplace("good", sentences)

prob_this_is_good = prob_bos_this * prob_this_is * prob_is_good * prob_good_eos
prob_this_is_good  # 0.004212103350034384

This is a good proof of concept for how the +1 inside the summation becomes a +|V| when you carry the +1 out of the summation, e.g.

my_list = [1,2,3]
sum(i+1 for i in my_list) == sum(i for i in my_list) + len(my_list)
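
For completeness, here is a minimal sketch of the standard textbook formulation, with +|V| computed over word types rather than the per-sentence counting above (the names bigram_counts, context_counts and prob_laplace are mine, not from any library). It shows directly why the +|V| makes the smoothed probabilities sum to 1, as noted in the question:

from collections import Counter

# Sketch of the standard add-one formula, reusing `sentences` and `pairwise`
# from above: P(w | prev) = (c(prev, w) + 1) / (c(prev) + |V|)
bigram_counts = Counter(ng for s in sentences for ng in pairwise(s))
context_counts = Counter(w for s in sentences for w in s[:-1])  # words starting a bigram
vocab = {w for s in sentences for w in s}

def prob_laplace(prev, word):
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

sum(prob_laplace("is", w) for w in vocab) # Outputs: 1.0, a proper distribution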
alvas
  • **c(wi-1)** is just the count of one particular word type used before our current word, right? But now, as per your answer, there is a summation. So are we also considering other word types? P.S. The summation is also mentioned in my reference (I edited my question based on that). – heyharshal Aug 13 '23 at 04:15
  • Take a look at the code examples to see how they explain the difference you're asking about. There are other formulations of the denominator, but the example I use should be one of the easiest to count. – alvas Aug 14 '23 at 01:51