I am trying to train a CRF sequence model using the Mallet library, but I am missing some important information. I found an example in the library itself at https://github.com/mimno/Mallet/blob/master/src/cc/mallet/examples/TrainCRF.java; however, the example does not state the format of the input training data, so I do not know how to recreate it.
Mallet does have a data import example at http://mallet.cs.umass.edu/import-devel.php, but that example seems to be for document classification rather than CRF sequence models, which is my use case.
I tried putting the input training data in the form used at http://mallet.cs.umass.edu/sequences.php i.e.
Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
and test data in the form
CAPITAL Al
slept
here
however, based on the output logs, it does not seem to be the correct format. For example, one line in the log is INFO: testing label slept P � R 0 F1 � but slept is not a label; the labels should be noun or non-noun.
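My guess, which I have not confirmed against the Mallet source, is that the example reads the last whitespace-separated token on every line as the label, for the test file as well as the training file. The sketch below is plain Java with no Mallet calls; labelOf and featuresOf are hypothetical helpers I wrote only to illustrate that assumed reading rule, not Mallet API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LastTokenLabel {
    // Hypothetical helper (not Mallet API): the last token on the line is taken as the label.
    static String labelOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return tokens[tokens.length - 1];
    }

    // Hypothetical helper (not Mallet API): every token before the last is taken as a feature.
    static List<String> featuresOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return new ArrayList<>(Arrays.asList(tokens).subList(0, tokens.length - 1));
    }

    public static void main(String[] args) {
        String[] train = {"Bill CAPITALIZED noun", "slept non-noun", "here LOWERCASE STOPWORD non-noun"};
        String[] test = {"CAPITAL Al", "slept", "here"};

        System.out.println("training lines:");
        for (String line : train) {
            System.out.println("  features=" + featuresOf(line) + " label=" + labelOf(line));
        }

        // Under the assumed rule, an unlabeled test line loses its word to the label slot.
        System.out.println("test lines:");
        for (String line : test) {
            System.out.println("  features=" + featuresOf(line) + " label=" + labelOf(line));
        }
    }
}
```

If that assumption holds, the unlabeled test line slept would be read as a line with no features and the label slept, which would match the log line above.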
If someone could tell me what format the training data should be in, that would be great.