
The classifier frequently fails with an OutOfMemoryError. Any suggestions would be appreciated.

We have a UIMA pipeline which invokes 5 model jars (based on Mallet CRF), around 30 MB each. -Xms is set to 2G and -Xmx to 4G.

Are there any guidelines or benchmarks for setting the heap space? Please also point out any guidelines for a multi-threaded environment.

I tried applying the patch from https://code.google.com/p/cleartk/issues/detail?id=408, but it did not resolve the issue.

A heap dump shows that 42% of the heap is char[] and 15% is String.
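(For reference, a dump like this can be captured automatically at the moment of failure with standard HotSpot flags; a minimal example invocation, where pipeline.jar and the dump path are placeholders:

    java -Xms2g -Xmx4g \
         -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/tmp/pipeline.hprof \
         -jar pipeline.jar
)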

java.lang.OutOfMemoryError: Java heap space
    at cc.mallet.types.IndexedSparseVector.setIndex2Location(IndexedSparseVector.java:109)
    at cc.mallet.types.IndexedSparseVector.dotProduct(IndexedSparseVector.java:157)
    at cc.mallet.fst.CRF$TransitionIterator.<init>(CRF.java:1856)
    at cc.mallet.fst.CRF$TransitionIterator.<init>(CRF.java:1835)
    at cc.mallet.fst.CRF$State.transitionIterator(CRF.java:1776)
    at cc.mallet.fst.MaxLatticeDefault.<init>(MaxLatticeDefault.java:252)
    at cc.mallet.fst.MaxLatticeDefault.<init>(MaxLatticeDefault.java:197)
    at cc.mallet.fst.MaxLatticeDefault$Factory.newMaxLattice(MaxLatticeDefault.java:494)
    at cc.mallet.fst.MaxLatticeFactory.newMaxLattice(MaxLatticeFactory.java:11)
    at cc.mallet.fst.Transducer.transduce(Transducer.java:124)
    at org.cleartk.ml.mallet.MalletCrfStringOutcomeClassifier.classify(MalletCrfStringOutcomeClassifier.java:90)

The model is created with MalletCrfStringOutcomeDataWriter:

AnalysisEngineFactory.createEngineDescription(DataChunkAnnotator.class,
        CleartkSequenceAnnotator.PARAM_IS_TRAINING, true,
        DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY, options.getModelsDirectory(),
        DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME,
        MalletCrfStringOutcomeDataWriter.class)
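
(For completeness: at classification time the pipeline loads the trained model jar instead. A rough sketch of that wiring, assuming ClearTK's GenericJarClassifierFactory and the same models directory; adapt the names to your setup:

    AnalysisEngineFactory.createEngineDescription(DataChunkAnnotator.class,
            CleartkSequenceAnnotator.PARAM_IS_TRAINING, false,
            GenericJarClassifierFactory.PARAM_CLASSIFIER_JAR_PATH,
            JarClassifierBuilder.getModelJarFile(options.getModelsDirectory()))
)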

The annotator code looks as follows.

if (this.isTraining()) {
  List<DataAnnotation> namedEntityMentions =
      JCasUtil.selectCovered(jCas, DataAnnotation.class, sentence);
  List<String> outcomes = this.chunking.createOutcomes(jCas, tokens, namedEntityMentions);
  this.dataWriter.write(Instances.toInstances(outcomes, featureLists));
} else {
  List<String> outcomes = this.classifier.classify(featureLists);
  this.chunking.createChunks(jCas, tokens, outcomes);
}

Thanks


1 Answer


You can either try to:

  1. increase -Xmx;
  2. dive deeper into analyzing the heap: all Strings are backed by char[], so numbers like 42% and 15% are not helpful by themselves - you should investigate which part of your program allocates these strings;
  3. since the error is triggered at the line
     List<String> outcomes = this.classifier.classify(featureLists);
     you can start from there: figure out what is in featureLists and how large it is, then look at what the classify method does and whether you can "help" it become more memory-efficient - for example, reduce the use of String and replace it with StringBuilder and append (just an example). See the sketch after this list.
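
As a starting point for item 3, here is a minimal, hypothetical instrumentation of the call site from the question; it only assumes featureLists has the usual ClearTK sequence type List<List<Feature>> (Feature being org.cleartk.ml.Feature):

    // Hypothetical: log the size of the sequence before classifying,
    // to correlate OutOfMemoryError with unusually long inputs.
    int tokens = featureLists.size();
    int totalFeatures = 0;
    for (List<Feature> tokenFeatures : featureLists) {
      totalFeatures += tokenFeatures.size();
    }
    System.err.printf("classify: %d tokens, %d features total%n", tokens, totalFeatures);
    List<String> outcomes = this.classifier.classify(featureLists);

If very long sentences correlate with the failures, splitting or capping input length before classification may be more effective than raising -Xmx.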