The documentation had not been updated for the most recent version of MALLET; thank you for pointing this out. Here's a current version:
As of version 2.0.8, the default token expression is '\p{L}[\p{L}\p{P}]+\p{L}', which accepts any Unicode letter and supports typical English non-letter patterns such as hyphens, apostrophes, and acronyms. Note that this expression also implicitly drops one- and two-letter words, because every match must be at least three characters long. Other options include:
For non-English text, a good choice is --token-regex '[\p{L}\p{M}]+', which matches Unicode letters and marks (marks are required for Indic scripts). MALLET currently does not support Chinese or Japanese word segmentation.
To include short words, use '\p{L}+' (letters only) or '\p{L}[\p{L}\p{P}]*\p{L}|\p{L}' (letters, possibly with punctuation between them).
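If you want to see how these expressions behave before running an import, here is a minimal sketch using plain java.util.regex. It assumes tokens are simply the successive non-overlapping matches of the pattern; the class name TokenRegexDemo and the helper tokens are made up for illustration and are not part of MALLET.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: TokenRegexDemo and tokens() are hypothetical names, not MALLET API.
public class TokenRegexDemo {
    // Collect every non-overlapping match of the regex, scanning left to right.
    public static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "It's a well-known U.S. idea";

        // Default expression: the one-letter word "a" is dropped.
        System.out.println(tokens("\\p{L}[\\p{L}\\p{P}]+\\p{L}", text));
        // prints [It's, well-known, U.S, idea]

        // Letters only: punctuation splits words apart.
        System.out.println(tokens("\\p{L}+", text));
        // prints [It, s, a, well, known, U, S, idea]

        // Letters with internal punctuation, short words kept.
        System.out.println(tokens("\\p{L}[\\p{L}\\p{P}]*\\p{L}|\\p{L}", text));
        // prints [It's, a, well-known, U.S, idea]

        // A Devanagari word written with escapes; it contains combining marks.
        String hindi = "\u0939\u093F\u0928\u094D\u0926\u0940";
        System.out.println(tokens("[\\p{L}\\p{M}]+", hindi).size()); // prints 1
        System.out.println(tokens("\\p{L}+", hindi).size());         // prints 3
    }
}
```

The last two lines show why marks matter: without \p{M}, the vowel signs and virama break a single Devanagari word into three fragments. In Java source the backslashes are doubled; on the command line, keep the single quotes so the shell passes the expression through unchanged.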