
I started working on a problem related to language modelling, but one of the calculations is not clear to me. For example, consider the following simple text:

I am Sam Sam I am I do not like green eggs and ham

I used BerkeleyLM to compute the n-gram probabilities and generate an ARPA file. Here is the generated ARPA file:

\data\
ngram 1=12
ngram 2=14
ngram 3=14
ngram 4=13
ngram 5=12
ngram 6=11
ngram 7=10
ngram 8=0
ngram 9=0

\1-grams:
-1.146128   am  -0.062148
-1.146128   like    -0.062148
-1.146128   not -0.062148
-99.000000  <s> -0.062148
-1.146128   green   -0.062148
-1.146128   and -0.062148
-0.669007   I   -0.238239
-0.845098   Sam -0.062148
-1.146128   </s>
-1.146128   ham -0.062148
-1.146128   eggs    -0.062148
-1.146128   do  -0.062148

\2-grams:
-0.720159   am Sam
-0.597943   Sam I
-0.709435   and ham
-0.709435   not like
-0.709435   like green
-0.720159   Sam Sam
-0.709435   ham </s>
-0.709435   green eggs
-0.496144   <s> I
-0.377737   I am
-0.597943   am I
-0.709435   do not
-0.709435   eggs and
-1.066947   I do

\3-grams:
-0.597943   Sam Sam I
-0.377737   <s> I am
-0.709435   do not like
-0.720159   I am Sam
-1.066947   am I do
-0.377737   Sam I am
-0.709435   green eggs and
-0.709435   like green eggs
-0.597943   I am I
-0.709435   eggs and ham
-0.709435   and ham </s>
-0.709435   I do not
-0.709435   not like green
-0.720159   am Sam Sam

The probabilities for the 1-grams are clear to me, but it is not clear how the 2-gram and 3-gram entries are computed. The text contains 13 bigrams in total, and the bigram "I am" appears twice, so I would expect its 2-gram log probability to be log(2/13), or about -0.81291, yet the generated file shows -0.377737. My counting is sketched below.
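
For reference, this is the counting behind my expected number (a minimal Python sketch of my own reasoning, assuming whitespace tokenization and no <s>/</s> markers; it is not how BerkeleyLM computes the value):

import math
from collections import Counter

text = "I am Sam Sam I am I do not like green eggs and ham"
tokens = text.split()                    # 14 tokens
bigrams = list(zip(tokens, tokens[1:]))  # 13 bigrams
counts = Counter(bigrams)

# "I am" occurs twice among the 13 bigrams
p = counts[("I", "am")] / len(bigrams)
print(math.log10(p))                     # -0.81291..., not the -0.377737 in the ARPA file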

I might be missing something because of my lack of experience, but I would appreciate a worked example of the calculation.

Thanks.

Stefanus

1 Answer


What you are probably missing is the smoothing technique used when computing the log probabilities. Smoothing takes some probability mass away from observed n-grams and transfers it to unseen n-grams, so that a bigram like "I Sam" does not get zero probability just because it was never seen; instead it gets a small probability that takes the unigram probabilities of "I" and "Sam" into account.
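
To make that concrete, here is a minimal Python sketch of how a standard ARPA backoff model assigns a probability to an unseen bigram, using the numbers from your own file (this illustrates ARPA backoff in general, not BerkeleyLM's internal code):

# log10 probabilities and backoff weights copied from the ARPA file in the question
unigram_logprob = {"I": -0.669007, "Sam": -0.845098, "am": -1.146128}
unigram_backoff = {"I": -0.238239, "Sam": -0.062148, "am": -0.062148}
bigram_logprob  = {("I", "am"): -0.377737, ("am", "Sam"): -0.720159}

def bigram_score(w1, w2):
    # Standard ARPA backoff: use the bigram entry if present,
    # otherwise fall back to backoff(w1) + unigram(w2) (all in log10).
    if (w1, w2) in bigram_logprob:
        return bigram_logprob[(w1, w2)]
    return unigram_backoff[w1] + unigram_logprob[w2]

print(bigram_score("I", "am"))   # -0.377737: seen bigram, read directly from the 2-gram section
print(bigram_score("I", "Sam"))  # -1.083337: unseen bigram, still gets a (small) probability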

From what I've seen in the BerkeleyLM documentation, it uses modified Kneser-Ney (KN) smoothing, which is the most popular method among LM tools. You can read about smoothing in general here, and see the exact calculations for the different smoothing methods in SRILM's man page.
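
For concreteness, the standard interpolated Kneser-Ney estimate for a bigram has the following textbook form (a sketch only; modified KN refines it by using separate discounts $D_1, D_2, D_{3+}$ depending on the bigram count, and the exact values come from the tool's own implementation):

$$
P_{\mathrm{KN}}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1} w_i) - D,\ 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i)
$$

$$
\lambda(w_{i-1}) = \frac{D}{c(w_{i-1})}\,\bigl|\{ w : c(w_{i-1} w) > 0 \}\bigr|,
\qquad
P_{\mathrm{cont}}(w_i) = \frac{\bigl|\{ w' : c(w' w_i) > 0 \}\bigr|}{\bigl|\{ (w', w) : c(w' w) > 0 \}\bigr|}
$$

The discounted numerator and the interpolation term are why the log probabilities in the ARPA file do not match the raw relative frequencies you computed.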

Beka