I am trying to use the MALLET machine-learning library in a project for word sense disambiguation. My feature vectors consist of a fixed-size token window of x tokens to the left and right of the target token. The MALLET training instances are created like this:

// Create training list
Pipe pipe = new TokenSequenceLowercase();
InstanceList instanceList = new InstanceList(pipe);
Instance trainingInstance = new Instance(data, senseID, instanceID, text);
instanceList.add(trainingInstance);
...
// Training
ClassifierTrainer classifierTrainer = new NaiveBayesTrainer();
Classifier classifier = classifierTrainer.train(instanceList);

where

  • "data" is an ArrayList<String> with the feature tokens
  • "senseID" is the class label of the respective word sense
  • "instanceID" is just a String to identify the training instance
  • "text" is the original source text

I would have expected that the dataAlphabet and targetAlphabet properties of the InstanceList are built on the fly as training instances are being added, but this is not the case. Consequently, my code fails in the last line above with an NPE, since the targetAlphabet property of the NB trainer is NULL.

Looking at the MALLET code (thanks to open-source), I can see that the root-cause for the non-construction of the Alphabets is that my data and labels don't implement the AlphabetCarrying interface. Therefore, NULL is returned in the Instance class here:

public Alphabet getDataAlphabet() {
    if (data instanceof AlphabetCarrying)
        return ((AlphabetCarrying)data).getAlphabet();
    else
        return null;
}

I find this rather confusing, because the documentation says that data and labels can be of any object type. The error above, however, seems to indicate the contrary: that I need to construct a specific data / label class that implements AlphabetCarrying.

I feel like I am missing something important on the conceptual level regarding these Alphabets. I am also not clear whether the data alphabet should be derived from all the training instances or just one. Can someone explain the error here?

Cheers,

Martin

martin_wun

1 Answer

Answering my own question here: the solution was to add some pipes, specifically a TokenSequence2FeatureSequence pipe to build the data alphabet and a Target2Label pipe to build the label alphabet. Also, the training instances need to be added using instanceList.addThruPipe(trainingInstance).

This is based on answers from the Mallet mailing list.
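
To illustrate the concept (and the question of whether the alphabet comes from one instance or all of them), here is a rough plain-Java sketch. It is not MALLET code: ToyAlphabet and ToyInstanceList are hypothetical stand-ins I made up, and the tokens and sense labels are invented. It only mimics the idea that one shared alphabet per list maps each new token or label to the next free index as instances are passed through the pipe, so the alphabets end up covering all training instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the idea behind an alphabet (NOT the MALLET class):
// a growing, bidirectional mapping between entries and integer indices.
class ToyAlphabet {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    // Return the index of an entry, registering it first if it is new.
    int lookupIndex(String entry) {
        Integer idx = indexOf.get(entry);
        if (idx == null) {
            idx = entries.size();
            indexOf.put(entry, idx);
            entries.add(entry);
        }
        return idx;
    }

    int size() {
        return entries.size();
    }
}

// Hypothetical stand-in for an instance list with a pipe attached.
// Both alphabets are shared by every instance added to the list.
class ToyInstanceList {
    final ToyAlphabet dataAlphabet = new ToyAlphabet();
    final ToyAlphabet targetAlphabet = new ToyAlphabet();
    final List<int[]> features = new ArrayList<>();
    final List<Integer> labels = new ArrayList<>();

    // Loosely mimics adding "through the pipe": lowercase the tokens,
    // turn them into feature indices, and turn the label into a label index.
    void addThruPipe(List<String> tokens, String senseId) {
        int[] indices = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            indices[i] = dataAlphabet.lookupIndex(tokens.get(i).toLowerCase());
        }
        features.add(indices);
        labels.add(targetAlphabet.lookupIndex(senseId));
    }
}

class AlphabetSketch {
    public static void main(String[] args) {
        ToyInstanceList list = new ToyInstanceList();
        list.addThruPipe(Arrays.asList("The", "river", "bank"), "shore");
        list.addThruPipe(Arrays.asList("the", "bank", "loan"), "finance");

        // Both instances contributed to the same alphabets.
        System.out.println(list.dataAlphabet.size());   // prints 4
        System.out.println(list.targetAlphabet.size()); // prints 2
    }
}
```

In the real library the pipes named above do this work; adding an instance without passing it through the pipe skips that step, which matches the null alphabets I was seeing.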

martin_wun
  • The problem you might have had was importing the data, which this guide should help with: http://mallet.cs.umass.edu/import-devel.php – c-chavez Jun 18 '17 at 09:43