
I've been trying to run a Hadoop 0.20.205.0 MapReduce job (single-threaded, locally) which was exhibiting all kinds of weird and unexpected behavior, and I finally figured out why. This looks to me like a bug in Hadoop, but maybe there's something I don't understand. Could someone give me some advice? The class I pass to setMapOutputKeyClass implements Configurable. Its readFields method can't deserialize properly unless setConf is called first (I believe that's the point of the Configurable interface). But looking at the code for WritableComparator, I see that when the framework sorts keys, it instantiates its internal key objects with:

70      key1 = newKey();
71      key2 = newKey();

And newKey() uses a null Configuration to construct the keys:

83  public WritableComparable newKey() {
84    return ReflectionUtils.newInstance(keyClass, null);
85  }

Indeed, when I run under the debugger I find that at

91      key1.readFields(buffer);

conf inside key1 is null, so setConf has not been called.
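
For reference, ReflectionUtils only calls setConf when it is handed a non-null Configuration; paraphrasing the relevant logic (a sketch from memory, not verbatim Hadoop source):

public static <T> T newInstance(Class<T> theClass, Configuration conf) {
    T result;
    try {
        result = theClass.newInstance(); // the real code caches constructors
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
    if (conf != null && result instanceof Configurable) {
        ((Configurable) result).setConf(conf); // skipped when conf == null
    }
    return result;
}

So the null passed by newKey() guarantees that setConf is never invoked on the framework's internal key instances.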

Is this a bug in Hadoop, or am I supposed to be using something other than Configurable to configure the keys? And if this is a bug, does anybody know a workaround?

EDIT: Here's a short (somewhat contrived) example of a job which fails for this reason:

// example/WrapperKey.java

package example;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * This class wraps a WritableComparable class to add one extra possible value
 * (namely null) to the range of values available for that class.
 */
public class WrapperKey<T extends WritableComparable> implements
        WritableComparable<WrapperKey<T>>, Configurable {
    private T myInstance;
    private boolean isNull;
    private Configuration conf;

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        Class<T> heldClass = (Class<T>) conf.getClass("example.held.class",
                null, WritableComparable.class);
        myInstance = ReflectionUtils.newInstance(heldClass, conf);
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeBoolean(isNull);
        if (!isNull)
            myInstance.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        isNull = in.readBoolean();
        if (!isNull)
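            // if setConf was never called, myInstance is still null here and
            // the next line throws a NullPointerException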
            myInstance.readFields(in);
    }

    @Override
    public int compareTo(WrapperKey<T> o) {
        if (isNull) {
            if (o.isNull)
                return 0;
            else
                return -1;
        } else if (o.isNull)
            return 1;
        else
            return myInstance.compareTo(o.myInstance);
    }

    public void clear() {
        isNull = true;
    }

    public T get() {
        return myInstance;
    }

    /**
     * Should sort the KV pairs (5,0), (3,0), and (null,0) to [(null,0), (3,0), (5,0)], but instead fails
     * with a NullPointerException because WritableComparator's internal keys
     * are not properly configured
     */
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        conf.setClass("example.held.class", ByteWritable.class,
                WritableComparable.class);
        Path p = new Path("input");
        Path startFile = new Path(p, "inputFile");
        SequenceFile.Writer writer = new SequenceFile.Writer(
                p.getFileSystem(conf), conf, startFile, WrapperKey.class,
                ByteWritable.class);
        WrapperKey<ByteWritable> key = new WrapperKey<ByteWritable>();
        key.setConf(conf);
        ByteWritable value = new ByteWritable((byte) 0);
        key.get().set((byte) 5);
        writer.append(key, value);
        key.get().set((byte) 3);
        writer.append(key, value);
        key.clear();
        writer.append(key, value);
        writer.close();

        Job j = new Job(conf, "Example job");
        j.setInputFormatClass(SequenceFileInputFormat.class);
        j.setOutputKeyClass(WrapperKey.class);
        j.setOutputValueClass(ByteWritable.class);
        j.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(j, p);
        FileOutputFormat.setOutputPath(j, new Path("output"));
        boolean completed = j.waitForCompletion(true);
        if (completed) {
            System.out
                    .println("Successfully sorted byte-pairs by key (putting all null pairs first)");
        } else {
            throw new RuntimeException("Failed to sort");
        }
    }
}
dspyz
    Why does it implement Configurable? – Joseph Shraibman Dec 21 '11 at 04:21
  • The keys are board states for a game. I need to specify the width and height of the board being solved; the number of bytes read is then width*height (one for each cell on the board), as in the sketch after these comments. I realize I could just serialize the width and height with each key, but that isn't a general solution. For instance, suppose my keys have a generic type, and the class they contain an instance of depends on some configuration parameter. Then there's no way to efficiently read in and parse the class name on every single call to readFields; I should only have to learn that once per instance. – dspyz Dec 21 '11 at 19:26
  • Keys need to be WritableComparable so they can be written to HDFS and sorted by Hadoop for input into the reduce phase. Hadoop will use the WritableComparable methods when doing its work. It will create new instances of them but has no reason to see if they are Configurable and call setConf(). Configurable is for job conf classes, not for any arbitrary class you use in your code. – Joseph Shraibman Dec 28 '11 at 03:40
  • Hey, can you tell me how you solved it? I would be really thankful. – rakesh Jan 08 '15 at 17:50
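
A minimal sketch of the kind of board-state key described in the comments above (BoardState and the "example.board.width"/"example.board.height" property names are hypothetical, made up for illustration):

// example/BoardState.java (hypothetical illustration)

package example;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * A key whose serialized size depends on the configuration rather than on
 * anything in the byte stream, so readFields only works after setConf.
 */
public class BoardState implements WritableComparable<BoardState>, Configurable {
    private Configuration conf;
    private byte[] cells; // one byte per board cell

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        int width = conf.getInt("example.board.width", 8);
        int height = conf.getInt("example.board.height", 8);
        cells = new byte[width * height];
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.write(cells); // the stream itself carries no length information
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        in.readFully(cells); // fails if setConf was never called
    }

    @Override
    public int compareTo(BoardState o) {
        return WritableComparator.compareBytes(cells, 0, cells.length,
                o.cells, 0, o.cells.length);
    }
}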

1 Answer


WrapperKey implements Configurable and provides setConf, but simply implementing an interface doesn't mean some other class is going to call it. The Hadoop framework may well not be calling setConf on the keys.

I don't think this is a bug. All the key types I have seen implement WritableComparable only, not Configurable. I'm not sure of a workaround; you may have to use concrete types in the key.
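
One possible workaround (an untested sketch, not from the original thread): register a sort comparator that is itself Configurable and builds its own configured key instances, sidestepping WritableComparator's null-conf newKey(). If I'm reading the framework correctly, a custom comparator is instantiated through ReflectionUtils.newInstance with the job configuration, so its setConf does get invoked:

// example/ConfiguredWrapperKeyComparator.java (hypothetical, untested)

package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Deserializes and compares WrapperKey instances that it creates and
 * configures itself, instead of relying on WritableComparator's internal
 * keys (which are built with a null Configuration).
 */
public class ConfiguredWrapperKeyComparator extends WritableComparator
        implements Configurable {
    private Configuration conf;
    private WrapperKey<?> k1;
    private WrapperKey<?> k2;
    private final DataInputBuffer buffer = new DataInputBuffer();

    public ConfiguredWrapperKeyComparator() {
        super(WrapperKey.class); // we never touch the parent's key instances
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // ReflectionUtils.newInstance calls setConf on Configurable objects
        // when it is handed a non-null Configuration
        k1 = ReflectionUtils.newInstance(WrapperKey.class, conf);
        k2 = ReflectionUtils.newInstance(WrapperKey.class, conf);
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            buffer.reset(b1, s1, l1);
            k1.readFields(buffer);
            buffer.reset(b2, s2, l2);
            k2.readFields(buffer);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return compare(k1, k2); // delegates to WrapperKey.compareTo
    }
}

The job would then register it before submission, e.g. j.setSortComparatorClass(ConfiguredWrapperKeyComparator.class); in the main method above.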

Praveen Sripati
  • I was under the impression that Configurable is hadoop's way of letting you pass configuration options into the key as if in a constructor (since they are instantiated by reflection). That's why ReflectionUtils.newInstance takes the configuration as an argument – dspyz Dec 22 '11 at 09:02
  • Can you point me to the code where the keys are instantiated using ReflectionUtils.newInstance()? – Praveen Sripati Dec 22 '11 at 10:06
  • It's in ReflectionUtils.newInstance(Class, Configuration) – Thomas Jungblut Dec 23 '11 at 06:55