I've been trying to use the Mallet SimpleTagger (http://mallet.cs.umass.edu/sequences.php) to train a CRF model for POS tagging.

I'm starting to get worried, because my computer has been training this one model for over a week. It does not seem to be hung, as it still gives me output of the form:

...  
Punkte  NN->Puppenkönig NN(Puppenkönig  NN) Punkte  NN,Puppenkönig  NN  
Punkte  NN->Obere   NN(Obere    NN) Punkte  NN,Obere    NN  
Punkte  NN->Entfernung  NN(Entfernung   NN) Punkte  NN,Entfernung   NN  
...

So I wanted to ask: is it normal for Mallet to take this long, or did something go wrong?

I used the command specified on the webpage:

hough@gobur:~/tagger-test$ java -cp  
 "/home/hough/mallet/class:/home/hough/mallet/lib/mallet-deps.jar"
 cc.mallet.fst.SimpleTagger
 --train true --model-file nouncrf  sample

The training data contains 96,903 tokens.

Edit:
We suspect it might have something to do with the format of the input. The website specifies this format:

Bill CAPITALIZED noun  
slept non-noun   
here LOWERCASE STOPWORD non-noun

The documentation for SimpleTagger (http://mallet.cs.umass.edu/api/) states that each instance should be a separate block, with blocks separated by blank lines. I'm not sure what is meant by "instance", but I assumed the expected format is something like this:

word pos  
word pos  
. $.  

word pos  
word pos  
word pos  
. $.  

word pos  
word pos    
. $.  

...

Is this the right format? Does anyone have an example file that shows what the format should look like?
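
One way to check how a training file is actually split into blocks is a small script like the one below. It is only a sketch in Python and not part of Mallet: the function name `summarize_tagger_file` is made up for illustration, and it assumes that an "instance" is one blank-line-separated block (for POS tagging, one sentence) and that every token line has at least two whitespace-separated fields, with the label last.

```python
def summarize_tagger_file(lines):
    """Count blank-line-separated blocks and token lines.

    Assumption: one block = one instance, one non-blank line = one token,
    and each token line is "feature ... label" (at least two fields).
    """
    sequences = 0   # number of blocks seen
    tokens = 0      # number of non-blank lines
    longest = 0     # token count of the longest block
    current = 0     # token count of the block being read
    bad_lines = []  # 1-based numbers of lines with fewer than two fields

    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            # A blank line closes the current block, if one is open.
            if current:
                sequences += 1
                longest = max(longest, current)
                current = 0
            continue
        if len(fields) < 2:
            bad_lines.append(number)
        tokens += 1
        current += 1

    # Close the final block if the input does not end with a blank line.
    if current:
        sequences += 1
        longest = max(longest, current)

    return {
        "sequences": sequences,
        "tokens": tokens,
        "longest": longest,
        "bad_lines": bad_lines,
    }


sample = """word pos
word pos
. $.

word pos
word pos
word pos
. $.
""".splitlines()

print(summarize_tagger_file(sample))
```

To run it on a real file, pass the open file object instead of `sample`, for example `summarize_tagger_file(open("sample", encoding="utf-8"))`. If `sequences` comes back as 1 and `longest` equals the total token count, the file contains no blank lines and the whole corpus is being read as a single block.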

Kai

1 Answer

A week for a 100k token corpus seems much too long. I would expect on the order of a half hour at most.

David Mimno