
I have a load of JobControls running at the same time, all with the same set of ControlledJobs. Each JobControl is dealing with a different set of input / output files, by date range, but they are all of the same type. The problem that I am observing is that the reduce steps are receiving data intended for a reducer handling a different date range. The date range is set by the Job, used to determine the input and output, and read from the context within the reducer.

The problem goes away if I submit the JobControls sequentially, but that's no good. Is this something I should be solving with a custom partitioner? How would I even determine the correct reducer for a key if I don't know which reducer is dealing with my current date range? And why would the instantiated reducers not be locked to their own JobControl?

I have written all the JobControls, Jobs, Maps and Reduces against their base implementations in Java.

I'm using Hadoop 2.0.3-alpha with YARN. Could that have anything to do with it?
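
(For illustration, here is a minimal sketch of running several JobControls in parallel. It is an assumed driver shape, not the actual code: JobControl implements Runnable, so each one can be started on its own thread and polled until its jobs finish.)

import java.util.List;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Assumed driver shape, not the actual code: each JobControl runs on its own
// thread (JobControl implements Runnable) and is polled until its jobs finish.
void runAllInParallel(List<JobControl> controls) throws InterruptedException {
  for (JobControl control : controls) {
    new Thread(control).start();
  }
  for (JobControl control : controls) {
    while (!control.allFinished()) {
      Thread.sleep(500);   // JobControl has no blocking wait, so poll
    }
    control.stop();        // lets the monitor thread exit its run() loop
  }
}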

I have to be a little careful sharing the code, but here's a sanitised mapper:

protected void map(LongWritable key, ProtobufWritable<Model> value, Context context) 
    throws IOException, InterruptedException {
  context.write(new Text(value.get().getSessionId()), 
                new ProtobufModelWritable(value.get()));
}

And Reducer:

protected void reduce(Text sessionId, Iterable<ProtobufModelWritable> models, Context context) 
     throws IOException, InterruptedException {
  Interval interval = getIntervalFromConfig(context);
  Model2 model2 = collapseModels(Iterables.transform(models, TO_MODEL));

  Preconditions.checkArgument(interval.contains(model2.getTimeStamp()), 
      "model2: " + model2 + " does not belong in " + interval);
}

private Interval getIntervalFromConfig(Context context) {
  String i = context.getConfiguration().get(INTERVAL_KEY);
  return Utils.interval(i);
}
Ben Smith
  • Can you elaborate a bit more? What kind of key type are you using and can you show us your mapper code? – Thomas Jungblut Mar 19 '13 at 19:15
  • Hi Thomas, have added the code. Hopefully that helps. – Ben Smith Mar 20 '13 at 08:50
  • First of all, kudos for good readable code +1! However, you set the interval for the whole job, so if you have multiple reducers they all see the same interval in their reduce tasks. If you want different intervals in different reducers, you have to override the partitioner with your date-range splitting logic. – Thomas Jungblut Mar 20 '13 at 09:25
  • The thing is, I have a separate Job for each interval. The Job and Reduce classes are the same but they have different Configurations created for each instantiation. This means that there are multiple reducers of the same class running at the same time, but for different jobs. – Ben Smith Mar 20 '13 at 13:14
  • I'd go with the approach @ThomasJungblut suggested - override the partitioner and split according to the interval_key. In your map task, read the list of desired intervals from the configuration and emit each record `N` times, once per interval. With this approach you need only one job, and the reducers will compute the desired results in parallel (see the sketch after these comments). – harpun Mar 20 '13 at 19:52
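
For illustration, a minimal sketch of the partitioner approach suggested in the comments above (the key layout and class name are assumptions, not code from the question): the idea is that the mapper emits each record once per interval, prefixing the key with that interval's index, and the partitioner routes each prefixed key to the reducer that owns that interval.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch only: assumes the mapper prefixes each key with the index of the
// interval it was emitted for, e.g. "3|sessionId", so that every interval
// ends up in its own reduce partition. ProtobufModelWritable is the map
// output value type from the question.
public class IntervalPartitioner extends Partitioner<Text, ProtobufModelWritable> {

  @Override
  public int getPartition(Text key, ProtobufModelWritable value, int numPartitions) {
    int intervalIndex = Integer.parseInt(key.toString().split("\\|", 2)[0]);
    return intervalIndex % numPartitions;
  }
}

The driver would then register it with job.setPartitionerClass(IntervalPartitioner.class) and set job.setNumReduceTasks(...) to the number of intervals, so that each reducer handles exactly one interval.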

1 Answer


For reference, I fixed this with two things. The most important problem was that, although I was creating separate Jobs for each interval, I was giving them all the same name. By appending the serialised interval to the job name, Hadoop knew which reducers to send the map results to.

Additionally, I started creating individual Configuration objects for each job, rather than copying an initial Configuration. This is probably unnecessary, but at least I know I can't make a mistake and start sharing the same Configuration object.
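
For illustration, a sketch of the per-interval job setup described above (the loop, the variable names, the job-name prefix and the serialisation format are assumptions, not the original code):

// Sketch only: one Job per interval, each with its own Configuration and a
// job name that includes the interval.
for (Interval interval : intervals) {
  Configuration conf = new Configuration();          // fresh per job, never shared
  conf.set(INTERVAL_KEY, interval.toString());       // assumes Utils.interval() can parse this
  Job job = Job.getInstance(conf, "collapse-models-" + interval);
  // ... set mapper, reducer, input and output paths for this interval's files ...
  jobControl.addJob(new ControlledJob(job, null));   // null: no job dependencies
}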

Ben Smith