I want to truncate all tokens in a corpus to have a maximum length of 5 characters. Is there a way to set the --token-regex import option in MALLET to accomplish this? The code I'm currently using to import documents is this:

mallet-2.0.7/bin/mallet import-dir --input mallet-2.0.7/data/journals/ --output mallet-2.0.7/tmp/topic-input-journals.mallet --keep-sequence --remove-stopwords --stoplist-file mallet-2.0.7/stoplists/tr.txt --token-regex '\p{L}[\p{L}\p{P}]*\p{L}'

If this is not possible in the MALLET import command, I’d appreciate suggestions on how to do the same in R.

Jim
  • **"I want to truncate all tokens to have a maximum length of 5 characters."** The previous sentence is the bottom-line of your question. You'd have a higher chance for responses to your question if you took out the rest of the verbiage (as well as removed some of the tags). And welcome to SO. – Sabuncu Sep 11 '14 at 16:51
  • Thanks for the tip, and the welcome. I've edited the question. – Jim Sep 12 '14 at 10:26

1 Answer

Yes, you can modify the token-regex so that it matches only words of at most 5 (or, more generally, n) characters, using this regular expression:

\b\w{1,5}\b

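where \b is a word boundary, \w matches a single word character, and {1,5} sets the minimum (1) and maximum (5) number of repetitions. Note that because of the trailing \b, a word longer than 5 characters is not matched at all, so it is dropped rather than truncated. If you want to keep the first 5 characters of longer words instead, removing the trailing \b (that is, \b\w{1,5}) should do that.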

Your command line should be:

mallet-2.0.7/bin/mallet import-dir --input mallet-2.0.7/data/journals/ --output mallet-2.0.7/tmp/topic-input-journals.mallet --keep-sequence --remove-stopwords --stoplist-file mallet-2.0.7/stoplists/tr.txt --token-regex '\b\w{1,5}\b'

In Java:

pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\b\\w{1,5}\\b")));
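To see what these patterns actually pick out before running a full import, here is a minimal sketch that uses only java.util.regex, not MALLET itself. The class name TokenLengthDemo and the sample text are made up for illustration, and the sketch assumes the tokenizer collects successive regex matches the way Matcher.find() does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of MALLET.
public class TokenLengthDemo {

    // Collect every successive match of the regex in the text.
    public static List<String> tokens(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "topic models for journals";

        // Boundary on both sides: only whole words of 1 to 5 characters match.
        System.out.println(tokens("\\b\\w{1,5}\\b", text)); // [topic, for]

        // Boundary at the start only: longer words are cut to their first 5 characters.
        System.out.println(tokens("\\b\\w{1,5}", text));    // [topic, model, for, journ]
    }
}
```

By default, Java's \w covers only ASCII letters, digits, and the underscore, so a corpus with non-ASCII letters may need a \p{L}-based pattern like the one in the question; that variant is not tested here.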

Hope this helps.

c-chavez