1

I am trying to instantiate a naive Bayes classifier to classify text blocks (with a pre-defined classification). The example below just tries to do it with male/female. I have tried loading data from file (CSVloader) and by creating instances below. The problem is the trainer.train() method throws a null pointer exception. It seems to be because the targetDictionary is null. The data dictionary is populated. How do I force the targetDictionary to be populated on an instance?

My actual goal is to classify paper abstracts I have in a database as being 'Science, Politics, Law, Health etc. It appears the Bayes classifier is the right choice for this.

I've iterated over the loaded instanceList and it seems to be populated correctly, and the dataDictionary is populated, but TargetDictionary is null.

Using Mallet 2.0.8 on windows

public TestMallet() throws IOException {

ArrayList<Pipe> pipelist = new ArrayList<Pipe>();

    pipelist.add (new CharSequenceLowercase() ) ;
    pipelist.add (new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) ) ;

    pipelist.add (new TokenSequenceRemoveStopwords (new File ("c:\\test\\config\\stopwords_en.txt"), "UTF-8", false, false, false) ) ;
    pipelist.add (new TokenSequence2FeatureSequence()) ;
    pipelist.add (new FeatureSequence2FeatureVector()) ; // Added but doesnt make any difference

    InstanceList instances = new InstanceList (new SerialPipes(pipelist)) ;

    Instance instance0 = new Instance("Hello World I am here and i am male my name is roger",   "Male",   "roger", "test") ;
    Instance instance1 = new Instance("Hello World I am here and i am male my name is phil",    "Male",   "phil",  "test") ;
    Instance instance2 = new Instance("Hello World I am here and i am male my name is joe",     "Male",   "joe",   "test") ;
    Instance instance3 = new Instance("Hello World I am here and i am female my name is vira",  "Female", "vira",  "test") ;
    Instance instance4 = new Instance("Hello World I am here and i am female my name is josie", "Female", "josie", "test") ;

    instances.addThruPipe (instance0) ;
    instances.addThruPipe (instance1) ;
    instances.addThruPipe (instance2) ;
    instances.addThruPipe (instance3) ;
    instances.addThruPipe (instance4) ;

    // Using Instance List to train
    // ----------------------------

    ClassifierTrainer trainer = new NaiveBayesTrainer();
    trainer.train(instances); 

// Null pointer exception here ( debugging, it looks like TargetDictionary is null) 

}

Expecting trainer to analyse correctly.

G_real
  • 1,137
  • 1
  • 18
  • 28

1 Answers1

0

A classifier learns to predict an output based on input features. In both cases we usually need to convert strings into numeric representations. You're telling Mallet how to do this transformation for the input features, but not the output label.

Adding a Target2Label() pipe should do it, see the Csv2Vectors class for an example.

David Mimno
  • 1,836
  • 7
  • 7