I have a set of data that is essentially the map output of a simple word count (text files of tab-delimited word/count pairs), and I need to reduce it. There's about 160 GB of data, compressed into bz2 files.
When I run the job on Amazon Elastic MapReduce (AWS EMR), I use 10 cc2.8xlarge slaves and an m1.xlarge master. The job ends up with 1,200 map tasks and 54 reduce tasks. Exactly half of the reduce tasks finish immediately after the map tasks do, and each of them produces 0 bytes of output. I'm assuming their input is 0 bytes as well, but I haven't dug through the logs enough to confirm that. The other 27 reduce tasks finish eventually, and their output sizes are consistent (2.3 GB each). Of the output files (part-r-00000 through part-r-00053), the even-numbered ones are the 0-byte files.
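From what I understand, the default HashPartitioner sends a key to reducer (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so one thing I've been spot-checking is whether my keys could plausibly all be landing on odd partitions. A minimal standalone sketch of that check (the sample words below are placeholders, not my real data):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class PartitionCheck {
    public static void main(String[] args) {
        // Same default partitioner the job uses when none is set explicitly.
        HashPartitioner<Text, IntWritable> partitioner = new HashPartitioner<Text, IntWritable>();
        int numReducers = 54;

        // Placeholder sample keys; substitute a handful of words from the real data.
        String[] samples = {"the", "quick", "brown", "fox", "jumps"};
        for (String word : samples) {
            int partition = partitioner.getPartition(new Text(word), new IntWritable(1), numReducers);
            System.out.println(word + " -> part-r-" + String.format("%05d", partition));
        }
    }
}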
When I run this locally on a very small sample with 2 reducers, each reducer's output contains data.
My mapper and reducer are as follows (Java, with the boilerplate stripped out):
// ...
// (keyOut, valOut, and result are reusable Writable fields declared in the stripped-out parts.)
public void map(LongWritable key, Text val, Context context) throws IOException, InterruptedException {
    // Each input line is "word<TAB>count"; re-emit it as (word, count).
    String[] parts = val.toString().split("\t");
    if (parts.length > 1) {
        keyOut.set(parts[0]);
        valOut.set(Integer.parseInt(parts[1]));
        context.write(keyOut, valOut);
    }
}
// ...
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // Sum the partial counts for this word.
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}
// ...
Has anyone else experienced this? Any idea why this happened, or how I can debug it further? I have EMR debugging turned on, in case that suggests something to look for in the logs. Thanks.
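In the meantime, one concrete thing I'm planning to try is a counter in the reducer, so the task counters show whether the empty reducers receive any input groups at all. A minimal sketch (the counter group and name are arbitrary):

// Inside the Reducer class: increment a counter per reduce() call so the
// job/task counters reveal whether this reducer received any groups.
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    context.getCounter("Debug", "ReduceGroups").increment(1);
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
}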
Edit: I should note that I'm reading my input from and writing my output to S3.
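For reference, this is roughly how the part-file sizes can be listed directly through the Hadoop FileSystem API rather than the S3 console (a minimal sketch; the bucket and prefix are placeholders, not my real location, and it assumes S3 credentials are configured for the s3n filesystem):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputSizeCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder output location; substitute the real bucket/prefix.
        String output = "s3n://my-bucket/wordcount-output/";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(output), conf);
        // Print the size of every file under the output directory (part-r-*).
        for (FileStatus status : fs.listStatus(new Path(output))) {
            System.out.println(status.getPath().getName() + "\t" + status.getLen() + " bytes");
        }
    }
}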
Edit 2: I ran this same job once before, saw the 0-byte files, assumed I had a bug in my Reducer, and canceled the job, so I know it isn't a one-off event. That run was on the same cluster. I originally compiled my Java classes against the Cloudera 4 (CDH4) libraries, which are based on Hadoop 2.0, so I thought that might be the issue. For the second run I used classes compiled against the Cloudera 3 (CDH3) libraries with Hadoop 0.20, basically the same version AWS runs. I've also compiled against CDH3 in the past without seeing this behavior.
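To rule the compile-time/runtime version mismatch in or out, one simple check is to log the Hadoop version the cluster is actually running, e.g. with org.apache.hadoop.util.VersionInfo (a minimal sketch):

import org.apache.hadoop.util.VersionInfo;

public class VersionCheck {
    public static void main(String[] args) {
        // Prints the version of the Hadoop jars actually on the classpath,
        // for comparison against the libraries the job classes were compiled with.
        System.out.println("Hadoop version: " + VersionInfo.getVersion());
        System.out.println("Build info:     " + VersionInfo.getBuildVersion());
    }
}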