I have an input file (~31 GB) containing consumer reviews of various products, which I'm trying to lemmatize so I can count the resulting lemmas. The approach is broadly similar to the WordCount example that ships with Hadoop. I have four classes in all to carry out the processing: StanfordLemmatizer (wraps the lemmatization utilities from Stanford's CoreNLP package v3.3.0), WordCount (the driver), WordCountMapper (the mapper), and WordCountReducer (the reducer).
I've tested the program on a small subset (a few MB) of the original dataset and it runs fine. Unfortunately, when I run the job against the complete ~31 GB dataset, it fails. I checked the job's syslog and it contained this:
java.lang.OutOfMemoryError: Java heap space at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:109) [...]
Any suggestions on how to handle this?
Note: I'm using the Yahoo VM, which comes pre-configured with hadoop-0.18.0. I've also tried assigning more heap, as suggested in this thread: out of Memory Error in Hadoop
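For what it's worth, the heap change I tried amounts to setting mapred.child.java.opts in the driver; the 2 GB value below is just what I experimented with, not a recommendation:

    import org.apache.hadoop.mapred.JobConf;

    // excerpt from the WordCount driver, before submitting the job:
    // raise the max heap of every map/reduce child JVM
    JobConf conf = new JobConf(WordCount.class);
    conf.set("mapred.child.java.opts", "-Xmx2048m");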
WordCountMapper code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    // one CoreNLP pipeline per mapper JVM; it is expensive to construct
    private final StanfordLemmatizer slem = new StanfordLemmatizer();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        // only lines holding a review's summary or text are worth lemmatizing
        if (line.matches("^review/(summary|text):.*")) {
            // strip the "review/summary:" / "review/text:" prefix (plus the
            // following space), lowercase the rest, and emit one count per lemma
            for (String lemma : slem.lemmatize(line.replaceAll("^review/(summary|text):.", "").toLowerCase())) {
                word.set(lemma);
                output.collect(word, one);
            }
        }
    }
}
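For reference, StanfordLemmatizer is essentially the standard CoreNLP lemmatization wrapper; here is a simplified sketch of what my class does (the real one has some extra error handling):

    import java.util.LinkedList;
    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class StanfordLemmatizer {

        private final StanfordCoreNLP pipeline;

        public StanfordLemmatizer() {
            // lemmatization needs tokenization, sentence splitting,
            // and POS tagging to run first
            Properties props = new Properties();
            props.put("annotators", "tokenize, ssplit, pos, lemma");
            this.pipeline = new StanfordCoreNLP(props);
        }

        public List<String> lemmatize(String text) {
            List<String> lemmas = new LinkedList<String>();
            Annotation document = new Annotation(text);
            pipeline.annotate(document);
            // collect the lemma of every token in every sentence
            for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    lemmas.add(token.get(CoreAnnotations.LemmaAnnotation.class));
                }
            }
            return lemmas;
        }
    }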