We would like to build a topic model with bigrams. What is the recommended way to implement this in Java?

Currently, we use the Mallet Java API, specifically ParallelTopicModel, and pass the tokens as a string in the data parameter of each Instance object.

Thank you.

Esther

1 Answer

The easiest and most reliable way to account for n-grams is to modify the input. For example, you might replace "new york" with "new_york", and then tokenize using a pattern that accepts "_" as a letter character. Mallet allows you to specify a file with strings to treat as single tokens when you import documents:

bin/mallet import-file --help
A tool for creating instance lists of feature vectors from comma-separated-values
...
--replacement-files FILE [FILE ...]
  files containing string replacements, one per line:
    'A B [tab] C' replaces A B with C,
    'A B' replaces A B with A_B
  Default is (null)
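
Since the question is about the Java API rather than the command line, the same replacement can be done on the strings before they go into an Instance. The following is only a minimal sketch in plain Java, not Mallet code: the class name NgramJoiner, its methods, and the phrase list are hypothetical, and it assumes the text is already lowercased and the phrases are separated by single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Mallet): joins known phrases with '_'
// and tokenizes with a pattern that treats '_' as a letter character.
public class NgramJoiner {

    // One or more Unicode letters or underscores form a token.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}_]+");

    // Replace each listed phrase (e.g. "new york") with its joined form ("new_york").
    // Assumes lowercased text and single spaces inside each phrase.
    static String joinPhrases(String text, List<String> phrases) {
        String result = text;
        for (String phrase : phrases) {
            String joined = phrase.trim().replaceAll("\\s+", "_");
            // \b keeps the match on whole words; quote() treats the phrase literally.
            Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase.trim()) + "\\b");
            result = p.matcher(result).replaceAll(Matcher.quoteReplacement(joined));
        }
        return result;
    }

    // Split text into tokens using the underscore-aware pattern.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

The string returned by joinPhrases could then be passed as the Instance data as before. If your pipeline tokenizes with a regular expression (I believe the CharSequence2TokenSequence pipe accepts a Pattern, but check the version you use), that pattern needs to accept the underscore as well, along the lines of TOKEN above.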

This mode of use requires you to identify the specific n-grams in advance. You could also modify the input file to include all bigrams, so "to be or not to be" would become "to_be be_or or_not not_to to_be". I don't know whether that would produce anything useful.
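
If you want to try the all-bigrams variant from Java, the transformation is a short loop over adjacent words. Again, this is only a sketch under assumptions: the class AllBigrams and the method toBigrams are hypothetical names, and words are taken to be whitespace-separated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Mallet): rewrites a document as the
// sequence of its adjacent word pairs, each joined with '_'.
public class AllBigrams {

    static String toBigrams(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        List<String> bigrams = new ArrayList<>();
        // Pair each word with its successor; a single-word text yields no bigrams.
        for (int i = 0; i + 1 < words.length; i++) {
            bigrams.add(words[i] + "_" + words[i + 1]);
        }
        return String.join(" ", bigrams);
    }
}
```

Calling AllBigrams.toBigrams("to be or not to be") gives "to_be be_or or_not not_to to_be", matching the example above. As with the phrase replacement, the tokenizer pattern has to accept "_" inside a token.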

There are also topic model variants that "natively" support n-gram identification, but at a significant cost in training time and model quality. I would not recommend using any of them.

David Mimno