I am trying to implement a document classifier with Mallet in Java. I already have a file that essentially contains the feature values, so I don't want to run an entire raw-text processing pipeline.

A line in my feature file currently looks like this (two features, ID and NrofTokens; the document label is "A"):

ID=3 NrofTokens=279.0 A

I try to read this file and feed it to a classifier like this:

Pipe instancePipe = new SerialPipes(new Pipe[] {
        new CharSequence2TokenSequence(),
        new TokenSequence2FeatureSequence(),
        new Target2Label(),
});

InstanceList trainData = new InstanceList(instancePipe);
InstanceList testData = new InstanceList(instancePipe);

Reader trainFileReader = new InputStreamReader(new FileInputStream(fileTrain), "UTF-8");
trainData.addThruPipe(new LineGroupIterator(trainFileReader, Pattern.compile("^\\s*$"), true));

Reader testFileReader = new InputStreamReader(new FileInputStream(fileTest), "UTF-8");
testData.addThruPipe(new LineGroupIterator(testFileReader, Pattern.compile("^\\s*$"), true));

// Create a classifier trainer, and use it to create a classifier
@SuppressWarnings("rawtypes")
ClassifierTrainer naiveBayesTrainer = new NaiveBayesTrainer();
Classifier classifier = naiveBayesTrainer.train(trainData);

At the moment I get this exception:

java.lang.IllegalArgumentException: Alphabets don't match: Instance: [6, null], InstanceList: [6, 0]
    at cc.mallet.types.InstanceList.add(InstanceList.java:335)
    at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
    at 

Does anyone have an idea why the alphabets don't match?

toobee
  • Could someone please help with this question? I am also facing this issue while running the topic modelling code from http://mallet.cs.umass.edu/. – Neethu Prem Dec 26 '16 at 12:12
  • Is anyone facing the same issue in 2020 with Mallet version 3.0.8? I found that it only happens in rare scenarios. I have also looked for official documentation, but I haven't found any solution. – Urmay Shah Feb 14 '21 at 12:07

2 Answers

This is not really an answer, but so far I have found the exceptions in Mallet not very informative. I also got this error; changing the regex that parses the data lines and removing an empty line at the end of the file made it go away.

That is, the regex in this part:

CsvIterator reader = new CsvIterator(new FileReader(tempTrainPath), "(\\w+)\\s+(\\S+)\\s+(.*)", 3, 2, 1);
testInstances.addThruPipe(reader);

At the end of a whole day of debugging, I was too annoyed to check which of the two changes was the actual culprit. But maybe this information helps other people.
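One way to narrow this down is to check, outside Mallet, what a line regex actually captures before passing it to CsvIterator. The sketch below uses only java.util.regex. The class name, the method name and the pattern are assumptions made for illustration: the pattern targets lines where the label comes last, as in the question's "ID=3 NrofTokens=279.0 A", and is not the regex from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeatureLineRegexDemo {
    // Hypothetical pattern for lines whose label is the last token:
    // group 1 = feature data, group 2 = label.
    static final Pattern LABEL_LAST = Pattern.compile("^(.*\\S)\\s+(\\S+)\\s*$");

    /** Returns {data, label}, or null if the line does not match (e.g. an empty line). */
    static String[] split(String line) {
        Matcher m = LABEL_LAST.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("ID=3 NrofTokens=279.0 A");
        System.out.println("data  = " + parts[0]);
        System.out.println("label = " + parts[1]);
        // An empty trailing line does not match the pattern at all.
        System.out.println("empty line matches: " + (split("") != null));
    }
}
```

For the sample line this yields the data string "ID=3 NrofTokens=279.0" and the label "A", while an empty line yields no match. If I read the CsvIterator constructor correctly, the three integers in the call above are the data, label and name group numbers, so a mismatch between the regex and the file layout would show up here before any alphabet is involved.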

Igor

I had the same error when trying to evaluate a classifier from the command line. Adding the --use-pipe-from train_input.mallet option, as described at https://mallet-dev.cs.umass.narkive.com/NFtumW1r/mallet-2-0-7-ge-maxent-alphabets-don-t-match, solved the problem.