I try to implement a document classifier with Mallet in Java. I already have a file that essential contains feature values. So I don't want to run through an entire raw text
processing pipeline.
A line in my feature file looks like this at the moment (2 features, ID and NrOfToken, document label is "A")
ID=3 NrofTokens=279.0 A
I try to read in this file and put it into a classifier like this:
Pipe instancePipe = new SerialPipes(new Pipe[] {
new CharSequence2TokenSequence(),
new TokenSequence2FeatureSequence(),
new Target2Label(),
});
InstanceList trainData = new InstanceList(instancePipe);
InstanceList testData = new InstanceList(instancePipe);
Reader trainFileReader = new InputStreamReader(new FileInputStream(fileTrain), "UTF-8");
trainData.addThruPipe(new LineGroupIterator(trainFileReader, Pattern.compile("^\\s*$"), true));
Reader testFileReader = new InputStreamReader(new FileInputStream(fileTest), "UTF-8");
testData.addThruPipe(new LineGroupIterator(testFileReader, Pattern.compile("^\\s*$"), true));
// Create a classifier trainer, and use it to create a classifier
@SuppressWarnings("rawtypes")
ClassifierTrainer naiveBayesTrainer = new NaiveBayesTrainer();
Classifier classifier = naiveBayesTrainer.train(trainData);
At the moment I get this exception:
java.lang.IllegalArgumentException: Alphabets don't match: Instance: [6, null], InstanceList: [6, 0]
at cc.mallet.types.InstanceList.add(InstanceList.java:335)
at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
at
Anyone an idea why the Alphabet is breaking?