
I have an input file (~31 GB) containing consumer reviews of some products, which I'm trying to lemmatize and then count the resulting lemmas. The approach is similar to the WordCount example shipped with Hadoop. I have 4 classes in all to carry out the processing: StanfordLemmatizer (wraps the lemmatization utilities from Stanford's CoreNLP package v3.3.0), WordCount (the driver), WordCountMapper (the mapper), and WordCountReducer (the reducer).

I've tested the program on a subset (a few MB) of the original dataset and it runs fine. Unfortunately, when I run the job on the complete ~31 GB dataset, the job fails. I checked the syslog for the job and it contained this:

java.lang.OutOfMemoryError: Java heap space at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:109) [...]

Any suggestions on how to handle this?

Note: I'm using Yahoo's VM, which comes pre-configured with hadoop-0.18.0. I've also tried assigning more heap as suggested in this thread: out of Memory Error in Hadoop
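
For reference, assigning more heap boils down to something like this in the driver (a sketch assuming the old JobConf API; the -Xmx value is only illustrative):

import org.apache.hadoop.mapred.JobConf;

// Sketch: raise the heap of the child-task JVMs that run the map/reduce code.
// The -Xmx value below is illustrative, not a recommendation.
JobConf conf = new JobConf(WordCount.class);
conf.set("mapred.child.java.opts", "-Xmx2048m");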

WordCountMapper code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();
  private final StanfordLemmatizer slem = new StanfordLemmatizer();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

    String line = value.toString();

    // Only process lines that hold a review's summary or text.
    if (line.matches("^review/(summary|text).*")) {
      // Strip the "review/summary:" or "review/text:" prefix, lowercase, then lemmatize.
      for (String lemma : slem.lemmatize(line.replaceAll("^review/(summary|text):.", "").toLowerCase())) {
        word.set(lemma);
        output.collect(word, one);
      }
    }
  }
}
Aditya

2 Answers


You need to make the size of the individual units that you are processing (i.e., each document handed to CoreNLP within a map task) reasonable. The first unit is the size of the document that you provide to StanfordCoreNLP's annotate() call. The whole piece of text that you provide here will be tokenized and processed in memory, and in tokenized and processed form it is over an order of magnitude larger than its size on disk. So, the document size needs to be reasonable. E.g., you might pass in one consumer review at a time (and not a 31GB file of text!).

Secondly, one level down, the POS tagger (which precedes the lemmatization) annotates a sentence at a time, and it uses large temporary dynamic programming data structures for tagging a sentence, which might be 3 orders of magnitude larger in size than the sentence. So, the length of individual sentences also needs to be reasonable. If there are long stretches of text or junk which don't divide into sentences, then you may also have problems at this level. One simple way to fix that is to use the pos.maxlen property to avoid POS tagging super long sentences.

P.S. And of course you shouldn't run annotators that you're not using, like parse or dcoref, if you only need the lemmatizer.
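
To make that concrete, here is a minimal sketch of what a lemmatizer wrapper could look like when it runs only the annotators needed for lemmas and caps sentence length via pos.maxlen (the class name and the 100-token limit are illustrative, not prescribed):

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class ReviewLemmatizer {   // hypothetical name; stands in for the asker's StanfordLemmatizer

  private final StanfordCoreNLP pipeline;

  public ReviewLemmatizer() {
    Properties props = new Properties();
    // Only the annotators needed to get lemmas; no parse, no dcoref.
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    // Don't POS-tag absurdly long "sentences"; 100 tokens is an arbitrary example cap.
    props.setProperty("pos.maxlen", "100");
    this.pipeline = new StanfordCoreNLP(props);
  }

  // Lemmatize one review at a time, never a whole file.
  public List<String> lemmatize(String reviewText) {
    List<String> lemmas = new ArrayList<String>();
    Annotation doc = new Annotation(reviewText);
    pipeline.annotate(doc);
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        lemmas.add(token.get(CoreAnnotations.LemmaAnnotation.class));
      }
    }
    return lemmas;
  }
}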

Christopher Manning
  • Thank you Prof. Manning for the detailed explanation and suggestions. Will try them out and see if I can manage some workaround :) – Aditya Nov 28 '13 at 01:34

Configuring the Hadoop heap space might not help you if your StanfordLemmatizer is not part of the MapReduce job. Can you provide the code of the job? For now, I believe that what limits you is Java heap space in general.

Before considering configuring it, check this first:

I had a look at the code of edu.stanford.nlp.sequences.ExactBestSequenceFinder (you should have a look at it too).

I don't know which version of stanford.nlp you use and I am not familiar with it, but it seems to perform some operations based on the SequenceModel you pass in as input. It starts like this:

private int[] bestSequenceNew(SequenceModel ts) {
    // Set up tag options
    int length = ts.length();
    int leftWindow = ts.leftWindow();
    int rightWindow = ts.rightWindow();
    int padLength = length + leftWindow + rightWindow;
    int[][] tags = new int[padLength][];  //operations based on the length of ts
    int[] tagNum = new int[padLength];   //this is the guilty line 109 according to grepcode

So either the output of ts.length() is pretty huge, or there is no more Java heap space left for this array. Can you make it smaller?

Edit

So obviously the String

 line.replaceAll("^review/(summary|text):.", "").toLowerCase()

is too much for the Java heap. Can you check whether this is really the string you want? Can you print its length? Maybe you should consider reorganising your 31GB dataset so that it has more, much smaller lines than now (if that is possible) for your job. It might be that one line is too big by mistake, and that is the cause of the problem.

If this cannot be done, please post the full stack trace of the exception.
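
Even without reorganising the file, a defensive check inside your existing map() method could work around one oversized line. A minimal sketch (MAX_REVIEW_CHARS is a hypothetical cut-off; tune it to whatever your heap tolerates):

// Sketch: a guard inside WordCountMapper, applied before calling slem.lemmatize().
private static final int MAX_REVIEW_CHARS = 100000;  // hypothetical threshold

// ... inside map(), after matching a review/summary or review/text line:
String review = line.replaceAll("^review/(summary|text):.", "").toLowerCase();
if (review.length() > MAX_REVIEW_CHARS) {
  return;  // skip suspiciously long lines instead of blowing the heap
}
for (String lemma : slem.lemmatize(review)) {
  word.set(lemma);
  output.collect(word, one);
}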

Artem Tsikiridis
  • Thanks Artem, I'm using v3.3.0 of the Stanford CoreNLP package. Just added the code of my mapper class in the question itself if you want to have a look. Rather than tinkering with CoreNLP's source code, I would prefer tweaking my own program since it would be far less complicated for me :) – Aditya Nov 27 '13 at 16:48
  • Thanks Artem, that makes perfect sense. I'll try and see if I can pre-process the dataset before passing it on for execution by Hadoop. I've tried to search for another workaround in the meantime, but no luck. Since I felt it was a separate question in itself, I've asked it here if you want to have a look: http://stackoverflow.com/questions/20256197/use-wget-with-hadoop – Aditya Nov 28 '13 at 01:39
  • @Aditya You're welcome again :) If you believe that my answer is the solution to your problem can you please accept it ? Ok, I'll take a look. cheers. – Artem Tsikiridis Nov 28 '13 at 01:47