I've been trying to use the Mallet SimpleTagger (http://mallet.cs.umass.edu/sequences.php) to train a CRF model for POS tagging.
I am now getting worried and confused, because my computer has been training this one model for over a week. It does not seem to have hung, as it still gives me output of the form:
...
Punkte NN->Puppenkönig NN(Puppenkönig NN) Punkte NN,Puppenkönig NN
Punkte NN->Obere NN(Obere NN) Punkte NN,Obere NN
Punkte NN->Entfernung NN(Entfernung NN) Punkte NN,Entfernung NN
...
So I wanted to ask: is it normal for Mallet to take this long, or did something go wrong?
I used the command specified on the webpage:
hough@gobur:~/tagger-test$ java -cp
"/home/hough/mallet/class:/home/hough/mallet/lib/mallet-deps.jar"
cc.mallet.fst.SimpleTagger
--train true --model-file nouncrf sample
The training data contains 96903 tokens.
Edit:
We suspect it might have something to do with the format of the input. The website specifies this format:
Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
The documentation for SimpleTagger (http://mallet.cs.umass.edu/api/) states that each instance should be a separate block, with blocks separated by blank lines. I'm not sure what is meant by an instance, but I assumed the expected format is something like this:
word pos
word pos
. $.
word pos
word pos
word pos
. $.
word pos
word pos
. $.
...
Is this the right format? Does anyone have an example file that shows what the format should look like?
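
In case "instance" means one sentence per block, here is a minimal sketch of how I could add the blank lines to my data. It assumes that the label is the last whitespace-separated field on each line and that the tag `$.` marks the end of a sentence in my corpus; the function name `split_into_sentences` is made up for illustration and is not part of Mallet.

```python
def split_into_sentences(lines, end_tag="$."):
    """Return the token lines with a blank line after each sentence.

    Assumption: a sentence ends at any line whose last field equals end_tag.
    """
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # drop existing blank lines so blocks are not doubled
        out.append(line)
        # the label is taken to be the last whitespace-separated field
        if line.split()[-1] == end_tag:
            out.append("")
    # remove a trailing blank line, if any
    if out and out[-1] == "":
        out.pop()
    return out


if __name__ == "__main__":
    sample = ["word pos", "word pos", ". $.", "word pos", ". $."]
    print("\n".join(split_into_sentences(sample)))
```

Whether this is really the layout SimpleTagger expects is exactly what I am unsure about, so please correct me if the assumption is wrong.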