I was trying the SimpleTagger tutorial provided here. I ran exactly the commands given on the page, i.e.

java -cp "class:lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample

and

java -cp "class:lib/mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file nouncrf stest

Here are my sample and stest files.

$ cat sample

Bill CAPITALIZED noun  
slept non-noun  
here LOWERCASE STOPWORD non-noun

$ cat stest

CAPITAL Al  
        slept  
        here

However, my output differs from the one shown on their page. This is the output I get.

Number of predicates: 9  
noun   
non-noun   
non-noun 

My questions are

  1. What does the "number of predicates" denote?
  2. Why do I get 9 predicates, whereas the official source reports 5 predicates for the same input files?

I'm using Mallet 2.0.8, if that matters.

iamwhoiam

1 Answer

When you start training, the first message that SimpleTagger gives you is:

Number of features in training data: x
Number of predicates: y

The number of predicates, y, is the number of distinct tokens (or lines) that your training data contains.

When you label a file using the model from the previous training run (which had y predicates), you get the message:

Number of predicates: z

This z is the sum of y and the number of distinct tokens (or lines) contained in the file you want to label. That is why z is always greater than or equal to y. If, for example, you label an empty text file with a model that had y predicates, you will get y predicates again, that is y + 0 = y, because your empty file contains 0 tokens.
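
To make the arithmetic concrete, here is a minimal sketch in plain Java. It is not Mallet code: the class `PredicateCountSketch` and the method `addTokens` are hypothetical names, and the alphabet is modelled as an ordinary set, on the assumption that a token already seen during training is not counted a second time. The token lists are taken from the question's files with a simple whitespace split and the labels left out, so the printed numbers come from this simplification and are not meant to reproduce what Mallet reports.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PredicateCountSketch {

    // Adds every token to the alphabet and returns the alphabet's new size.
    // Tokens that are already present do not increase the size.
    static int addTokens(Set<String> alphabet, List<String> tokens) {
        alphabet.addAll(tokens);
        return alphabet.size();
    }

    public static void main(String[] args) {
        Set<String> alphabet = new LinkedHashSet<>();

        // Feature tokens of the training file "sample" (labels omitted).
        List<String> train = Arrays.asList(
                "Bill", "CAPITALIZED", "slept", "here", "LOWERCASE", "STOPWORD");
        int y = addTokens(alphabet, train);

        // Tokens of the file to label, "stest"; "slept" and "here" are already known.
        List<String> test = Arrays.asList("CAPITAL", "Al", "slept", "here");
        int z = addTokens(alphabet, test);

        // Labeling an empty file adds nothing, so the count stays the same.
        int zEmpty = addTokens(alphabet, Collections.<String>emptyList());

        System.out.println("y = " + y + ", z = " + z + ", after empty file = " + zEmpty);
    }
}
```

In this toy model the count can only grow or stay the same when a new file is processed, which is the z >= y behaviour described above.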

Dawny33
pebox11