I have a set of data that is essentially the map output of a simple word count (text files of tab-delimited word/count pairs), and I need to reduce it. There's about 160 GB of data, compressed into bz2 files.
When I run the job on Amazon Elastic MapReduce (AWS EMR), I use 10 cc2.8xlarge slaves and an m1.xlarge master. The job ends up with 1,200 map tasks and 54 reduce tasks. Exactly half of the reduce tasks finish immediately after the map tasks do, and each of them produces 0 bytes of output. I'm assuming their input is 0 bytes as well, but I haven't dug through the logs enough to confirm that. The other 27 reduce tasks finish eventually, and their output sizes are consistent (2.3 GB each). Of the output files (part-r-00000 through part-r-00053), the even-numbered ones are the 0-byte files.
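From what I understand, the default HashPartitioner sends a key to reducer (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so one thing I've been spot-checking is whether my keys could plausibly all be landing on odd partitions. A minimal standalone sketch of that check (the sample words below are placeholders, not my real data):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionCheck {
    public static void main(String[] args) {
        // Same default partitioner the job uses when none is set explicitly.
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<Text, IntWritable>();
        int numReducers = 54;

        // Placeholder sample keys; substitute a handful of words from the real data.
        String[] samples = {"the", "quick", "brown", "fox", "jumps"};
        for (String word : samples) {
            int partition = partitioner.getPartition(new Text(word), new IntWritable(1), numReducers);
            System.out.println(word + " -> part-r-" + String.format("%05d", partition));
        }
    }
}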
When I run this locally on a very small sample with 2 reducers, each reducer's output contains data.
My mapper and reducer are as follows (Java, with the boilerplate stripped out):
// ...
// (keyOut, valOut, and result are reusable Writable fields declared in the stripped-out parts.)
public void map(LongWritable key, Text val, Context context) throws IOException, InterruptedException {
    // Each input line is "word<TAB>count"; re-emit it as (word, count).
    String[] parts = val.toString().split("\t");
    if (parts.length > 1) {
        keyOut.set(parts[0]);
        valOut.set(Integer.parseInt(parts[1]));
        context.write(keyOut, valOut);
    }
}
// ...
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // Sum the partial counts for this word.
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}
// ...
Has anyone else experienced this? Any idea why this happened, or how I can debug it further? I have EMR debugging turned on, in case that suggests something to look for in the logs. Thanks.
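In the meantime, one concrete thing I'm planning to try is a counter in the reducer, so the task counters show whether the empty reducers receive any input groups at all. A minimal sketch (the counter group and name are arbitrary):

// Inside the Reducer class: increment a counter per reduce() call so the
// job/task counters reveal whether this reducer received any groups.
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    context.getCounter("Debug", "ReduceGroups").increment(1);
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}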
Edit: I should note that I'm reading my input from and writing my output to S3.
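For reference, this is roughly how the part-file sizes can be listed directly through the Hadoop FileSystem API rather than the S3 console (a minimal sketch; the bucket and prefix are placeholders, not my real location, and it assumes S3 credentials are configured for the s3n filesystem):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputSizeCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder output location; substitute the real bucket/prefix.
        String output = "s3n://my-bucket/wordcount-output/";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(output), conf);
        // Print the size of every file under the output directory (part-r-*).
        for (FileStatus status : fs.listStatus(new Path(output))) {
            System.out.println(status.getPath().getName() + "\t" + status.getLen() + " bytes");
        }
    }
}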
Edit 2: I ran this same job once before, saw the 0-byte files, assumed I had a bug in my Reducer, and canceled the job, so I know it isn't a one-off event. That run was on the same cluster. I originally compiled my Java classes against the Cloudera 4 (CDH4) libraries, which are based on Hadoop 2.0, so I thought that might be the issue. For the second run I used classes compiled against the Cloudera 3 (CDH3) libraries with Hadoop 0.20, basically the same version AWS runs. I've also compiled against CDH3 in the past without seeing this behavior.
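To rule the compile-time/runtime version mismatch in or out, one simple check is to log the Hadoop version the cluster is actually running, e.g. with org.apache.hadoop.util.VersionInfo (a minimal sketch):

import org.apache.hadoop.util.VersionInfo;

public class VersionCheck {
    public static void main(String[] args) {
        // Prints the version of the Hadoop jars actually on the classpath,
        // for comparison against the libraries the job classes were compiled with.
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        System.out.println("Build info:     " + VersionInfo.getBuildVersion());
    }
}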