I found tutorials where a class-based LM is implemented using Brown clustering, passing just the number of classes you want, but I want to implement a class-based model where I give the class assignments initially. I tried this: http://projects.csail.mit.edu/cgi-bin/wiki/view/SLS/SriLM. But this gives -99 to all n-grams in the LM. There is very little documentation regarding this. Can anyone help me out?
I've done this before but it was several years ago. Let me see if I can retrace the steps for you.
The first step is to create the file that specifies the classes. It should have three columns. First is the class id, then the probability of that word given the class, and lastly the word.
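A toy example of such a file (the class names, member words, and probabilities here are purely illustrative; the probabilities within each class should sum to 1):

NOUN 0.4 dog
NOUN 0.3 cat
NOUN 0.3 house
VERB 0.5 run
VERB 0.5 eat

If you don't want to pick the probabilities by hand, SRILM's uniform-classes script (used in the comment below) will add uniform ones to a file that lists only classes and words.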
The next step is to replace all the words in the training data with their class ids. You can use the SRILM replace-words-with-classes script, or you can write your own script to do it.
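A minimal sketch of the invocation (the file names are placeholders; addone=, normalize=, and outfile= are optional and control smoothing and renormalization of the within-class probabilities, as in the comment below):

replace-words-with-classes addone=1 normalize=1 outfile=class.counts classes=class_definition_file.txt train.txt > train_replaced.txt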
Now you train a language model using ngram-count
just like you would for a regular non-class n-gram model.
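For instance (the order and file names are just illustrative; you may also need -vocab with a word list that includes the class ids, as in the comment below):

ngram-count -order 3 -text train_replaced.txt -lm class.lm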
For evaluation, you just specify the language model and also the class file:
ngram -ppl test_data.txt -lm class.lm -classes class_definition_file.txt

Aaron
- I am doing these steps:
  uniform-classes amiClasses.txt > amiClassesWithProbs.txt
  replace-words-with-classes addone=1 normalize=1 outfile=ami.counts classes=amiClassesWithProbs.txt ami-train.txt > amiTrainReplaced.txt
  ngram-count -vocab amiVocabWithClass.txt -order 3 -text amiTrainReplaced.txt -lm amiClassBased.srilm
  ngram -lm amiClassBased.srilm -ppl ami-dev.txt -classes ami.counts
  I'm using the part of speech of each token as its class, and when I tested this LM the perplexity increased compared to a simple 3-gram LM without classes. Do you have any idea why? – Ranjeet Singh May 10 '18 at 11:18
- It's normal for the perplexity to increase with a class LM. Usually, to get improved perplexity, you have to interpolate the class LM with a regular n-gram LM. – Aaron May 10 '18 at 22:01
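  (A minimal sketch of such an interpolation, using ngram's -mix-lm and -lambda options; the 0.5 weight and the word-LM file name are placeholders:

  ngram -ppl ami-dev.txt -lm amiClassBased.srilm -classes ami.counts -mix-lm word3gram.lm -lambda 0.5

  Here -lambda is the weight given to the main -lm model; the remaining mass goes to the -mix-lm model.)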
- Oh, thanks a lot. – Ranjeet Singh May 11 '18 at 10:33