
The KenLM paper seems good for language modeling, but only minimal documentation is provided, and I had difficulty understanding it.

So, as part of understanding KenLM, I need to understand the output format of querying the model. Please provide some detail on it.

I couldn't tag this correctly: lm and kenlm are not available as tags.

Details:

Executed:

bin/query trainingdata.binary < temp.txt

Output:

city=274 2 -3.71333 </s>=2 1 -0.914832  Total: -4.62817 OOV: 0

new=1037 2 -2.64194 york=2124 2 -2.27023    </s>=2 1 -0.867251  Total: -5.77943 OOV: 0

samsung=3 2 -2.39176    galaxy=4 3 -0.193832    s5=5 4 -0.536524    </s>=2 5 -0.595418  Total: -3.71753 OOV: 0

fingers=6 2 -4.25789    crossed=7 3 -1.00535    samsung=3 4 -0.766757   </s>=2 5 -0.757035  Total: -6.78703 OOV: 0

jessica=8 2 -3.77437    simpson=9 3 -0.45866    collection=10 4 -1.24209    </s>=2 5 -0.144034  Total: -5.61916 OOV: 0

plexus=11 2 -4.46277    slim=12 3 -0.804323 </s>=2 4 -0.606899  Total: -5.87399 OOV: 0

under=13 2 -3.23437 armour=14 3 -0.575785   outlet=15 4 -1.32109    </s>=2 5 -0.18898   Total: -5.32022 OOV: 0

amazon=16 2 -2.05178    seller=17 3 -2.5683 central=18 4 -0.94366   </s>=2 5 -0.643415  Total: -6.20716 OOV: 0

garcinia=19 2 -2.6464   cambogia=20 3 -0.101819 reviews=21 4 -1.86317   </s>=2 5 -0.0987858 Total: -4.71017 OOV: 0

womens=22 2 -3.10627    organic=23 3 -1.64262   lube.=24 4 -1.12123 </s>=2 5 -0.505745  Total: -6.37587 OOV: 0

doc=25 2 -3.00747   mcstuffins=26 3 -0.130808   </s>=2 4 -0.485077  Total: -3.62336 OOV: 0
</s>=2 1 -0.975736  Total: -0.975736 OOV: 0

Perplexity including OOVs:  30.9347

Perplexity excluding OOVs:  30.9347
OOVs:   0

Total time including destruction:

Name:query  VmPeak:30664 kB VmRSS:1748 kB   RSSMax:3132 kB  user:0.000999   sys:0   CPU:0.000999    real:0.000817598

2 Answers


The output format is a sequence of words in the format

word=ID LENGTH LOG_PROB

where ID is the internal ID of the word in the language model, LENGTH is the length of the n-gram that was matched, and LOG_PROB is the log10 probability of that word given its context.
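
For example, one way to pull those three fields out of a bin/query output line is the following sketch (my own helper, not part of KenLM; it assumes words contain no whitespace):

import re

# Parse one line of bin/query output into (word, id, length, log10_prob) tuples.
WORD_RE = re.compile(r'(\S+)=(\d+)\s+(\d+)\s+(-?[\d.]+)')

def parse_query_line(line):
    body = line.split('Total:')[0]  # drop the 'Total: ... OOV: ...' tail
    return [(w, int(i), int(n), float(p))
            for (w, i, n, p) in WORD_RE.findall(body)]

print(parse_query_line('samsung=3 2 -2.39176 galaxy=4 3 -0.193832 '
                       's5=5 4 -0.536524 </s>=2 5 -0.595418 Total: -3.71753 OOV: 0'))
# [('samsung', 3, 2, -2.39176), ('galaxy', 4, 3, -0.193832),
#  ('s5', 5, 4, -0.536524), ('</s>', 2, 5, -0.595418)]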

matt
  • Thanks for the reply. Please help me understand `samsung=3 2 -2.39176 galaxy=4 3 -0.193832 s5=5 4 -0.536524 </s>=2 5 -0.595418 Total: -3.71753 OOV: 0` as well. Questions: 1. Does the 2 in `samsung=3 2 -2.39176` mean samsung was matched as part of a 2-gram? 2. What is this `</s>=2 5 -0.595418`? Thanks. – Venkatarao N Jun 25 '14 at 20:22

Matt's answer is correct and to the point as far as I can see, but for the benefit of beginners let me explain it in a bit more detail.

The output for each word has the format:

word=ID LENGTH LOG_PROB

where ID is the ID of the word in the trained language model, LENGTH is the length of the longest n-gram (the word together with its preceding neighbors) that the model matched, and LOG_PROB is the log10 probability of the word appearing in that context according to the trained language model.
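
If you build the kenlm Python module from the same repository, you can get the same three fields per word without parsing text, via Model.full_scores (a minimal sketch; the model path is a placeholder, and the trailing entry it yields corresponds to the end-of-sentence token </s> seen in the query output):

import kenlm

model = kenlm.Model('trainingdata.binary')  # placeholder path to your binary model

# full_scores yields (log10_prob, ngram_length, is_oov) for each word,
# including the implicit end-of-sentence token </s>.
for log10_prob, ngram_length, is_oov in model.full_scores('samsung galaxy s5'):
    print(log10_prob, ngram_length, is_oov)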

At the sentence level we see:

w1=ID1 len1 log_prob1 w2=ID2 len2 log_prob2 ...... Total: t

Here Total is the log10 probability of the whole sentence, i.e. the sum of the per-word log probabilities. Since it is a base-10 logarithm, you recover the actual probability by raising 10 to the power of that number t.
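
As a quick check against the output in the question (assuming, as KenLM does, base-10 logs):

# Total for "samsung galaxy s5" from the query output above
prob = 10 ** -3.71753
print(prob)  # ~1.92e-04

# The reported perplexity comes from the same log10 totals:
# perplexity = 10 ** (-(sum of all Totals) / (number of scored words, </s> included))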

SilentFlame