
The KenLM paper seems good for language modeling, but only minimal documentation is provided, and I had difficulty understanding it.

So, as part of understanding KenLM, I need to understand the output format of querying the model. Please provide some detail on it.

I couldn't tag this correctly: lm and kenlm are not available as tags.

Details:

Executed:

bin/query trainingdata.binary < temp.txt

Output:

city=274 2 -3.71333 </s>=2 1 -0.914832  Total: -4.62817 OOV: 0

new=1037 2 -2.64194 york=2124 2 -2.27023    </s>=2 1 -0.867251  Total: -5.77943 OOV: 0

samsung=3 2 -2.39176    galaxy=4 3 -0.193832    s5=5 4 -0.536524    </s>=2 5 -0.595418  Total: -3.71753 OOV: 0

fingers=6 2 -4.25789    crossed=7 3 -1.00535    samsung=3 4 -0.766757   </s>=2 5 -0.757035  Total: -6.78703 OOV: 0

jessica=8 2 -3.77437    simpson=9 3 -0.45866    collection=10 4 -1.24209    </s>=2 5 -0.144034  Total: -5.61916 OOV: 0

plexus=11 2 -4.46277    slim=12 3 -0.804323 </s>=2 4 -0.606899  Total: -5.87399 OOV: 0

under=13 2 -3.23437 armour=14 3 -0.575785   outlet=15 4 -1.32109    </s>=2 5 -0.18898   Total: -5.32022 OOV: 0

amazon=16 2 -2.05178    seller=17 3 -2.5683 central=18 4 -0.94366   </s>=2 5 -0.643415  Total: -6.20716 OOV: 0

garcinia=19 2 -2.6464   cambogia=20 3 -0.101819 reviews=21 4 -1.86317   </s>=2 5 -0.0987858 Total: -4.71017 OOV: 0

womens=22 2 -3.10627    organic=23 3 -1.64262   lube.=24 4 -1.12123 </s>=2 5 -0.505745  Total: -6.37587 OOV: 0

doc=25 2 -3.00747   mcstuffins=26 3 -0.130808   </s>=2 4 -0.485077  Total: -3.62336 OOV: 0
</s>=2 1 -0.975736  Total: -0.975736 OOV: 0

Perplexity including OOVs:  30.9347

Perplexity excluding OOVs:  30.9347
OOVs:   0

Total time including destruction:

Name:query  VmPeak:30664 kB VmRSS:1748 kB   RSSMax:3132 kB  user:0.000999   sys:0   CPU:0.000999    real:0.000817598

2 Answers


The output format is a sequence of words in the format

word=ID LENGTH LOG_PROB

where ID is the internal ID of the word in the language model, LENGTH is the length of the n-gram that was matched, and LOG_PROB is the log10 probability of that word given its context.
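
For example, one way to pull those three fields out of a bin/query output line is the following sketch (my own helper, not part of KenLM; it assumes words contain no whitespace):

import re

# Parse one line of bin/query output into (word, id, length, log10_prob) tuples.
WORD_RE = re.compile(r'(\S+)=(\d+)\s+(\d+)\s+(-?[\d.]+)')

def parse_query_line(line):
    body = line.split('Total:')[0]  # drop the 'Total: ... OOV: ...' tail
    return [(w, int(i), int(n), float(p))
            for (w, i, n, p) in WORD_RE.findall(body)]

print(parse_query_line('samsung=3 2 -2.39176 galaxy=4 3 -0.193832 '
                       's5=5 4 -0.536524 </s>=2 5 -0.595418 Total: -3.71753 OOV: 0'))
# [('samsung', 3, 2, -2.39176), ('galaxy', 4, 3, -0.193832),
#  ('s5', 5, 4, -0.536524), ('</s>', 2, 5, -0.595418)]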

matt
  • Thanks for the reply. Please help me understand `samsung=3 2 -2.39176 galaxy=4 3 -0.193832 s5=5 4 -0.536524 </s>=2 5 -0.595418 Total: -3.71753 OOV: 0` as well. Questions: 1. Does the 2 in `samsung=3 2 -2.39176` mean samsung was matched as part of a 2-gram? 2. What is this `</s>=2 5 -0.595418`? Thanks. – Venkatarao N Jun 25 '14 at 20:22

Matt's answer is correct and to the point as far as I can see, but for the benefit of beginners let me explain it in a bit more detail.

The output for each word has the format:

word=ID LENGTH LOG_PROB

where ID is the ID of the word in the trained language model, LENGTH is the length of the longest n-gram (the word together with its preceding neighbors) that the model matched, and LOG_PROB is the log10 probability of the word appearing in that context according to the trained language model.
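
If you build the kenlm Python module from the same repository, you can get the same three fields per word without parsing text, via Model.full_scores (a minimal sketch; the model path is a placeholder, and the trailing entry it yields corresponds to the end-of-sentence token </s> seen in the query output):

import kenlm

model = kenlm.Model('trainingdata.binary')  # placeholder path to your binary model

# full_scores yields (log10_prob, ngram_length, is_oov) for each word,
# including the implicit end-of-sentence token </s>.
for log10_prob, ngram_length, is_oov in model.full_scores('samsung galaxy s5'):
    print(log10_prob, ngram_length, is_oov)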

At the sentence level we see:

w1=ID1 len1 log_prob1 w2=ID2 len2 log_prob2 ...... Total: t

Here Total is the log10 probability of the whole sentence, i.e. the sum of the per-word log probabilities. Since it is a base-10 logarithm, you recover the actual probability by raising 10 to the power of that number t.
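
As a quick check against the output in the question (assuming, as KenLM does, base-10 logs):

# Total for "samsung galaxy s5" from the query output above
prob = 10 ** -3.71753
print(prob)  # ~1.92e-04

# The reported perplexity comes from the same log10 totals:
# perplexity = 10 ** (-(sum of all Totals) / (number of scored words, </s> included))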

SilentFlame