I am trying to use the MALLET machine-learning library in a project for word sense disambiguation. My feature vectors consist of a fixed-size token window of x tokens to the left and right of the target token. The MALLET training instances are created like this:
// Create training list
Pipe pipe = new TokenSequenceLowercase();
InstanceList instanceList = new InstanceList(pipe);
Instance trainingInstance = new Instance(data, senseID, instanceID, text);
instanceList.add(trainingInstance);
...
// Training
ClassifierTrainer classifierTrainer = new NaiveBayesTrainer();
Classifier classifier = classifierTrainer.train(instanceList);
where
- "data" is an ArrayList<String> with the feature tokens
- "senseID" is the class label of the respective word sense
- "instanceID" is just a String to identify the training instance
- "text" is the original source text
I would have expected the dataAlphabet and targetAlphabet of the InstanceList to be built on the fly as training instances are added, but this is not the case. Consequently, my code fails in the last line above with a NullPointerException, because the targetAlphabet of the NaiveBayesTrainer is null.
Looking at the MALLET source (thanks to it being open source), I can see that the root cause of the alphabets not being constructed is that my data and labels don't implement the AlphabetCarrying interface. As a result, null is returned in the Instance class here:
public Alphabet getDataAlphabet() {
    if (data instanceof AlphabetCarrying)
        return ((AlphabetCarrying) data).getAlphabet();
    else
        return null;
}
I find this rather confusing, because the documentation says that data and labels can be of any object type. The error above, however, seems to indicate the opposite: that I need to construct specific data/label classes that implement AlphabetCarrying.
I feel like I am missing something important on the conceptual level regarding these Alphabets. I am also not clear on whether the data alphabet should be derived from all the training instances together or from each one individually. Can someone explain the error here?
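For reference, my current guess is that I am supposed to chain pipes so the alphabets get built as instances pass through, rather than implementing AlphabetCarrying myself. Below is a minimal sketch of what I mean, using classes from cc.mallet.pipe and cc.mallet.types; the toy window, the "bank/RIVER" label, and the pipe ordering are my own assumptions, and whether this is the intended usage is exactly my question:

```java
import cc.mallet.classify.Classifier;
import cc.mallet.classify.NaiveBayesTrainer;
import cc.mallet.pipe.*;
import cc.mallet.types.*;

import java.util.Arrays;

public class WsdSketch {
    public static void main(String[] args) {
        // Chain pipes so tokens are mapped into the data alphabet and
        // labels into the target alphabet as instances pass through.
        Pipe pipe = new SerialPipes(Arrays.asList(
                new TokenSequenceLowercase(),
                new TokenSequence2FeatureSequence(), // grows the data alphabet
                new FeatureSequence2FeatureVector(),
                new Target2Label()                   // grows the target alphabet
        ));
        InstanceList instanceList = new InstanceList(pipe);

        // Hypothetical context window: data as a TokenSequence
        // rather than the ArrayList<String> I am currently using.
        TokenSequence window = new TokenSequence();
        for (String t : new String[] {"the", "river", "bank", "was", "muddy"})
            window.add(new Token(t));

        // addThruPipe pushes the instance through the pipe chain,
        // populating the shared alphabets as a side effect.
        instanceList.addThruPipe(
                new Instance(window, "bank/RIVER", "inst-1", null));

        Classifier classifier = new NaiveBayesTrainer().train(instanceList);
        System.out.println(instanceList.getDataAlphabet().size());
    }
}
```

If this is right, I assume the alphabets are shared by all instances in the list, which would answer my second question, but I would appreciate confirmation.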
Cheers,
Martin