I want to truncate every token in a corpus to a maximum length of 5 characters. Can the --token-regex import option in MALLET be set to accomplish this? The command I'm currently using to import documents is:
mallet-2.0.7/bin/mallet import-dir --input mallet-2.0.7/data/journals/ --output mallet-2.0.7/tmp/topic-input-journals.mallet --keep-sequence --remove-stopwords --stoplist-file mallet-2.0.7/stoplists/tr.txt --token-regex '\p{L}[\p{L}\p{P}]*\p{L}'
If this is not possible in the MALLET import command, I'd appreciate suggestions on how to do the same in R.
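To make the goal concrete, here is a minimal Python sketch of the truncation I have in mind, applied to the raw text before it is imported. It is only an illustration under a few assumptions: `truncate_tokens` and `LETTER_RUN` are names made up for this example, and a "token" is treated as a plain run of letters, which is simpler than the --token-regex shown above (it ignores token-internal punctuation).

```python
import re

# Rough stand-in for \p{L}: any word character that is not a decimal digit
# or an underscore. Python's built-in `re` module has no \p{L} class.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def truncate_tokens(text, max_len=5):
    """Cut every run of letters in `text` down to at most `max_len` characters."""
    return LETTER_RUN.sub(lambda m: m.group(0)[:max_len], text)


print(truncate_tokens("kitaplardan okuyorum, ev"))  # prints "kitap okuyo, ev"
```

Words of 5 letters or fewer are left untouched, and everything that is not a letter (spaces, punctuation, digits) passes through unchanged. An equivalent in MALLET or R is what I'm looking for.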