I'm trying to perform topic modeling on a dataset stored in a whitespace-delimited file with no labels, but I can't get MALLET to load all the tokens. I'm using version 2.0.8 on Linux and macOS.
To reproduce the issue, I created a file, testData, containing a single line:
1 2 3 4 5
Then I ran:
mallet import-file --token-regex [0-9]+ --keep-sequence true --label 0 --input testData --output testLoaded
mallet train-topics --input testLoaded
I should get 4 tokens, but I only get 3:
Data loaded. max tokens: 3 total tokens: 3
It gets even worse if I try to use the --data flag (the result is the same whether I combine --data 2 with --label 0 or use --data 2 on its own):
mallet import-file --token-regex [0-9]+ --keep-sequence true --label 0 --data 2 --input testData --output testLoaded2
mallet train-topics --input testLoaded2
Data loaded. max tokens: 1 total tokens: 1
So either I lose the first token, or I only get the first token (2 shows up in the output later on, so I know the rest of the line is not being loaded as a single token in the latter case).
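A minimal Python sketch of where the two counts above could come from. It rests on one assumption that should be checked against the MALLET documentation: that import-file splits each line with a default --line-regex of ^(\S*)[\s,]*(\S*)[\s,]*(.*)$, using group 1 as the name, group 2 as the label, and group 3 as the data. The names DEFAULT_LINE_REGEX and tokens_for are hypothetical and exist only for this illustration; the script does not call MALLET itself.

```python
import re

# Assumption: import-file's default --line-regex, where group 1 is the
# instance name, group 2 the label, and group 3 the data field.
DEFAULT_LINE_REGEX = re.compile(r"^(\S*)[\s,]*(\S*)[\s,]*(.*)$")

# Same pattern as the --token-regex used in the commands above.
TOKEN_REGEX = re.compile(r"[0-9]+")


def tokens_for(line, data_group):
    """Return the tokens found in the capture group treated as the data field."""
    match = DEFAULT_LINE_REGEX.match(line)
    return TOKEN_REGEX.findall(match.group(data_group))


line = "1 2 3 4 5"
print(tokens_for(line, 3))  # data = group 3 (assumed default): ['3', '4', '5']
print(tokens_for(line, 2))  # data = group 2, as with --data 2: ['2']
```

If that assumption holds, --label 0 would only tell MALLET to ignore the second group rather than change how the line is split, which would match the 3-token and 1-token results reported above.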