I am using a ClearTK (v2.0) simple pipeline to develop a binary classifier for individual sentences in a CAS. Training data is generated, but the classifier does not pick it up during training (see the console output below).

I am working off of this example, specifically this code snippet:

AnalysisEngineFactory.createPrimitiveDescription(
    <name-of-your-cleartk-annotator>.class,
    CleartkAnnotator.PARAM_IS_TRAINING, true,
    DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
    <your-output-directory-file>,
    DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME,
    <name-of-your-selected-classifier's-data-writer>.class);

So my initialization code looks like this:

AnalysisEngine trainClassifier = AnalysisEngineFactory.createPrimitive(MyClassifier.class, 
        CleartkAnnotator.PARAM_IS_TRAINING, true,
        DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY, "target/classifier-data/",
        DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME, MalletCrfStringOutcomeDataWriter.class.getName());

When I run my pipeline, the data is written to target/classifier-data/training-data.malletcrf, where each line is a feature vector whose entries have the format <featurename>_<value>, followed by my boolean target attribute. I can open the file in a text editor and inspect it.

I am using a String outcome classifier because my target variable annotator inherits from CleartkSequenceAnnotator and, as I understand from prior answers on the ClearTK list, there does not seem to be a boolean classifier that can handle multiple classification tasks per CAS.

My rough classifier code:

public class MyClassifier extends CleartkSequenceAnnotator<String> {

@Override
public void process(JCas jCas) throws AnalysisEngineProcessException {

    // retrieve sentences in the cas
    for (Sentence sentence : sentences) {
        // apply feature extractors here to add features
        // add target variable
    }

    if (this.isTraining()) {

        // write the features and outcomes as training instances
        this.dataWriter.write(Instances.toInstances(targets, featureLists));

        try {
            System.out.println("training the classifier ... ");
            Train.main("target/classifier-data/");
            System.out.println("done training classifier");
        } catch (Exception e) {
            System.out.println("ERROR while training the classifier.");
            e.printStackTrace();
        }

    } else /* Classification */ {...}
}

}

Here is the pipeline code:

SimplePipeline.runPipeline(reader,
        trainClassifier,
        XmiWriter);

When I run the pipeline, even though the training data has been written, I get the following console output:

... reader initialization ...
Couldn't open cc.mallet.util.MalletLogger resources/logging.properties file.
Perhaps the 'resources' directories weren't copied into the 'class' directory.
Continuing.
starting pipeline
training the classifier ... 
Okt 02, 2014 11:19:48 PM cc.mallet.fst.SimpleTagger main
INFORMATION: Number of features in training data: 0
Okt 02, 2014 11:19:48 PM cc.mallet.fst.SimpleTagger main
INFORMATION: Number of predicates: 0
Okt 02, 2014 11:19:48 PM cc.mallet.fst.SimpleTagger main
INFORMATION: Labels: O
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRF addOrderNStates
INFORMATION: Preparing O
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRF addOrderNStates
INFORMATION: O->O(O) O,O
State #0 "O"
initialWeight=0.0, finalWeight=0.0
#destinations=1
-> O
Okt 02, 2014 11:19:48 PM cc.mallet.fst.SimpleTagger train
INFORMATION: Training on 0 instances
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRF setWeightsDimensionAsIn
INFORMATION: CRF weights[O,O] num features = 0
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRF setWeightsDimensionAsIn
INFORMATION: Number of weights = 1
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRFTrainerByLabelLikelihood train
INFORMATION: CRF about to train with 1 iterations
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRFOptimizableByLabelLikelihood getValue
INFORMATION: getValue() (loglikelihood, optimizable by label likelihood) = 0.0
Okt 02, 2014 11:19:48 PM cc.mallet.optimize.LimitedMemoryBFGS optimize
INFORMATION: L-BFGS initial gradient is zero; saying converged
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRFTrainerByLabelLikelihood train
INFORMATION: CRF finished one iteration of maximizer, i=0
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRFTrainerByLabelLikelihood train
INFORMATION: CRF training has converged, i=0
done training classifier

This suggests to me that the classifier is somehow not picking up the training data from the file.

What am I doing wrong? Thanks in advance!
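
One way to narrow this down is to check how much of the training file is actually visible on disk at the moment `Train.main` is called, rather than after the pipeline has finished. The sketch below is a hypothetical helper (`TrainingDataCheck` is not part of ClearTK or Mallet) that counts the non-blank lines in the file. It assumes the file is UTF-8 text at the path mentioned above. Whether the data writer has flushed its output by that point is an assumption to be tested, not something the output above confirms.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical diagnostic helper; not part of ClearTK or Mallet.
class TrainingDataCheck {

    // Counts the non-blank lines in the file, or returns 0 if the file does not exist.
    static long countNonBlankLines(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return 0;
        }
        // Assumption: the training file is UTF-8 encoded text.
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        long count = 0;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question; adjust if the output directory differs.
        Path file = Paths.get("target/classifier-data/training-data.malletcrf");
        System.out.println("non-blank lines visible on disk: " + countNonBlankLines(file));
    }
}
```

Calling `TrainingDataCheck.countNonBlankLines(...)` immediately before `Train.main(...)` would show whether the trainer sees an empty file even though the file has content once the run is over.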

  • Could you post exactly how you initialize your engine? Why do you use `MyClassifier.class`? Have you tried setting a debug breakpoint (or adding a few System.out.println calls) within the `process` method? I suspect that the `if (this.isTraining())` branch might never be reached. – Renaud Oct 03 '14 at 07:28
  • I do have debug output, but omitted it here to keep it short. I edited the question to include the debug print statements. They do show up in the console, so I assume the method gets called. `MyClassifier` is not the actual name in my code; I replaced the real name with it to make the code easier to read. However, I had not replaced it everywhere, so I corrected that. I also added the pipeline code. Finally, I changed the initialization code to `CleartkAnnotator.PARAM_IS_TRAINING, true`. I still get the same output when running it. – Matthias Grabmair Oct 03 '14 at 09:24
  • Ok, sounds good. Just to make sure, add `public void initialize(UimaContext context) throws ResourceInitializationException { super.initialize(context); }` – Renaud Oct 03 '14 at 13:36
  • It's difficult to help, since I don't have the whole code. If I were you, I would continue to step-debug until you find what's wrong – Renaud Oct 03 '14 at 13:39

1 Answer

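My guess would be that you imported the wrong Sentence class. If the imported class is not the type that your upstream annotators actually add to the CAS, the loop would find no sentences and no training instances would be collected. You can easily find out whether I'm right by debugging the for loop in the process method of MyClassifier.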
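
To illustrate the effect, here is a minimal plain-Java sketch with no UIMA dependency. `SentenceFromLibraryA`, `SentenceFromLibraryB`, and `selectByType` are hypothetical names. They stand in for two unrelated classes that are both called Sentence in different packages, and for whatever type-based lookup your code uses to retrieve sentences from the CAS. The question does not show that lookup, so treat it as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for two unrelated classes that are both named
// "Sentence" but live in different packages.
class SentenceFromLibraryA { }
class SentenceFromLibraryB { }

class WrongSentenceImportDemo {

    // Returns only the elements that are instances of the requested class,
    // mimicking a type-based lookup of annotations.
    static <T> List<T> selectByType(List<Object> annotations, Class<T> type) {
        List<T> selected = new ArrayList<>();
        for (Object annotation : annotations) {
            if (type.isInstance(annotation)) {
                selected.add(type.cast(annotation));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Assumption: the upstream annotator only ever creates library A sentences.
        List<Object> cas = Arrays.<Object>asList(
                new SentenceFromLibraryA(), new SentenceFromLibraryA());

        // Looping over the wrong class compiles fine but never enters the body.
        int iterations = 0;
        for (SentenceFromLibraryB sentence : selectByType(cas, SentenceFromLibraryB.class)) {
            iterations++;
        }
        System.out.println("loop iterations with the wrong class: " + iterations);
        System.out.println("sentences found with the right class: "
                + selectByType(cas, SentenceFromLibraryA.class).size());
    }
}
```

If a breakpoint or counter inside your own for loop shows that it never runs, checking the import statement for Sentence is the next step.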

Bram Vandewalle