Mallet: Tokenization by N-grams (1,2)

Question

Is it possible to have Mallet tokenize text into n-grams of size 1 and 2, i.e. both unigrams and bigrams?

This is the code that I have used so far:

bin\mallet import-dir --input sample-data\web\en --output sample.txt --keep-sequence-bigrams --remove-stopwords
bin\mallet train-topics --input sample.txt --num-topics 20 --optimize-interval 10 --output-doc-topics sample_composition.txt --output-topic-keys sample_keys.txt

Thank you in advance.

score 0 · Answer 1 · answered Sep 22 '21 at 15:23

The topic model trainer doesn't use the bigrams feature; supporting it would make the code much more complicated. One way to add bigrams is to modify the input data file before importing it, so that

the cat sat

would become

the cat sat the_cat cat_sat
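
As a rough illustration of that preprocessing step, here is a minimal Python sketch (it is not part of Mallet). The function names add_bigrams and augment_dir, the \w+ tokenizer, and the underscore joiner are my own assumptions; it also assumes one plain-text UTF-8 document per .txt file in the input directory.

```python
import re
from pathlib import Path

# Assumption: a "word" is any run of letters, digits, or underscores.
TOKEN_RE = re.compile(r"\w+")


def add_bigrams(text, joiner="_"):
    """Return the lowercased words of `text`, followed by every pair of
    adjacent words joined with `joiner`."""
    tokens = TOKEN_RE.findall(text.lower())
    bigrams = [a + joiner + b for a, b in zip(tokens, tokens[1:])]
    return " ".join(tokens + bigrams)


def augment_dir(src, dst):
    """Write a bigram-augmented copy of every .txt file in `src` to `dst`
    (hypothetical helper; both directory names are up to the caller)."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        (dst / path.name).write_text(add_bigrams(text), encoding="utf-8")


if __name__ == "__main__":
    print(add_bigrams("the cat sat"))
```

For example, augment_dir("sample-data/web/en", "en-bigrams") (the output directory name is hypothetical) would write augmented copies that can then be given to import-dir via --input. Note that this sketch pairs words across sentence boundaries and before stopword removal, so you may want to filter stopwords first. It is also worth checking that your token pattern (--token-regex) keeps the underscore-joined tokens intact.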

Alternatively, pass --xml-topic-phrase-report FILENAME to train-topics to create a post-hoc report that identifies pairs of words that frequently occur together and are assigned to the same topic.
