I am using Mallet for topic modeling. Many of the words in my input text contain both letters and digits, e.g., A54, D892. I just noticed that Mallet automatically removes the digits and keeps only the letters in these words. This happens even though I do not use the --remove-stopwords option when importing my text file. Does anyone know how I can fix this?

SM.

1 Answer

bin/mallet import-dir (and likewise bin/mallet import-file) has an option --token-regex which determines what is accepted as a token. One of the following two patterns may suit your needs: [\p{L}\p{N}]+ accepts any combination of letters and digits; \p{L}[\p{L}\p{N}]* accepts alphanumeric strings starting with a letter. Put the pattern in single quotes on the command line so the shell does not interpret the backslashes and brackets.
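To see what each pattern would keep before re-importing, here is a minimal sketch in plain Java (Mallet hands the pattern to Java's regex engine, so the syntax is the same). It writes the digit class as \p{N}, the Unicode number category. The class name TokenRegexDemo, the tokenize helper, and the sample text are made up for illustration, and the pattern labelled "assumed default" is what I believe Mallet uses when --token-regex is not given; please check that against your version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Mallet.
class TokenRegexDemo {

    // Collect every non-overlapping match of the regex, scanning left to right.
    static List<String> tokenize(String regex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A54 D892 42nd topic"; // made-up sample input

        // Assumption: Mallet's default token pattern (letters and punctuation only).
        String assumedDefault = "\\p{L}[\\p{L}\\p{P}]+\\p{L}";
        // Any run of letters and digits.
        String lettersAndDigits = "[\\p{L}\\p{N}]+";
        // Letters and digits, but the token must start with a letter.
        String startsWithLetter = "\\p{L}[\\p{L}\\p{N}]*";

        System.out.println(tokenize(assumedDefault, text));
        System.out.println(tokenize(lettersAndDigits, text));
        System.out.println(tokenize(startsWithLetter, text));
    }
}
```

Note that the second pattern does not skip a word like 42nd entirely; it picks up the trailing letters (nd) as a token of their own, so the first pattern is probably the safer choice if codes can also begin with a digit.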

Sir Cornflakes