I am using Mallet for topic modeling. Many words in my input text contain both letters and digits, e.g., A54 and D892. I just noticed that Mallet automatically removes the digits and keeps only the letters in these words. I am not even using the --remove-stopwords option when importing my text file. Does anyone know how I can fix this problem?
1 Answer
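bin/mallet import-dir has an option, --token-regex, which determines what to accept as part of a word; bin/mallet import-file, which imports a single text file, takes the same option. One of the following two choices may suit your needs: [\p{L}\p{N}]+ accepts any combination of letters and digits; \p{L}[\p{L}\p{N}]* accepts alphanumeric strings starting with a letter. Put the expression in single quotes on the command line so that the shell does not interpret the brackets and backslashes.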
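To see how the two choices differ before re-importing a corpus, the patterns can be tried in plain Java, since Mallet is written in Java and uses the same regex syntax. The following is only a minimal sketch: the class name TokenRegexDemo, the helper tokens, and the sample sentence are made up for illustration, the letters-only pattern \p{L}+ is a stand-in and not necessarily Mallet's actual default, \p{N} is used here as the Unicode class for numeric characters, and the sketch assumes tokens are extracted by repeatedly finding matches, as Matcher.find does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {

    // Hypothetical helper: collect every non-overlapping match of the regex, in order.
    static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample text containing mixed letter/digit words.
        String text = "sample A54 and D892 plus 42nd";

        // Letters only (illustrative stand-in): digits are dropped from each word.
        System.out.println(tokens("\\p{L}+", text));
        // prints [sample, A, and, D, plus, nd]

        // Any combination of letters and numeric characters.
        System.out.println(tokens("[\\p{L}\\p{N}]+", text));
        // prints [sample, A54, and, D892, plus, 42nd]

        // Must start with a letter, so "42nd" is reduced to "nd".
        System.out.println(tokens("\\p{L}[\\p{L}\\p{N}]*", text));
        // prints [sample, A54, and, D892, plus, nd]
    }
}
```

The second pattern keeps A54 and D892 intact but also cuts leading digits off words such as 42nd, so the first pattern is probably the safer choice if purely numeric or digit-initial tokens matter in your data.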

Sir Cornflakes