According to the Mallet website, the default token in Java Mallet is one or more characters in [A-Za-z]. However, when I have a text such as:

lower(location select testing) top

Mallet treats "lower(location" as a single word, even though the default token is supposed to contain letters only. How can I deal with this situation?

1 Answer

The documentation had not been updated for the most recent version of Mallet; thank you for pointing this out. Here is the current version:

As of version 2.0.8, the default token expression is '\p{L}[\p{L}\p{P}]+\p{L}', which is valid for all Unicode letters, and supports typical English non-letter patterns such as hyphens, apostrophes, and acronyms. Note that this expression also implicitly drops one- and two-letter words. Other options include:

For non-English text, a good choice is --token-regex '[\p{L}\p{M}]+', which means Unicode letters and marks (required for Indic scripts). MALLET currently does not support Chinese or Japanese word segmentation.

To include short words, use \p{L}+ (letters only) or '\p{L}[\p{L}\p{P}]*\p{L}|\p{L}' (letters possibly including punctuation).
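To see how these expressions behave on the text from the question, here is a minimal sketch that applies each one with plain java.util.regex. It does not call Mallet itself: the class name TokenRegexDemo and the tokenize helper are hypothetical, and the sketch assumes a token is simply each successive match of the expression in the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Mallet.
class TokenRegexDemo {
    // Collect every non-overlapping match of the regex, scanning left to right.
    static List<String> tokenize(String regex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "lower(location select testing) top";
        List<String> regexes = Arrays.asList(
            "\\p{L}[\\p{L}\\p{P}]+\\p{L}",          // 2.0.8 default
            "[\\p{L}\\p{M}]+",                      // letters and marks
            "\\p{L}+",                              // letters only
            "\\p{L}[\\p{L}\\p{P}]*\\p{L}|\\p{L}");  // punctuation allowed, short words kept
        for (String regex : regexes) {
            System.out.println(regex + " -> " + tokenize(regex, text));
        }
    }
}
```

Under that assumption, the default expression and the last one keep "lower(location" together, because "(" is in \p{P}, while \p{L}+ and [\p{L}\p{M}]+ split it into "lower" and "location". The trailing ")" in "testing)" is dropped in every case, since each expression must end on a letter or mark.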

David Mimno
  • I fixed it by using --token-regex '\p{L}+' because I only need real words; something like "abc(def" should be treated as two words. I think '\p{L}+' is better than '\p{L}[\p{L}\p{P}]*\p{L}|\p{L}' for handling both "abc(def)" and "abc-def", combined with 2-grams. I noticed that you updated the online documentation. Thanks for doing that! –  Mar 01 '18 at 20:53