
Hadoop Version: 0.20.2 (On Amazon EMR)

Problem: I have a custom key that I write during the map phase, shown below. During the reduce call, I do some simple aggregation on the values for a given key. The issue I am facing is that during the iteration of values in the reduce call, my key changes and I get the values of that new key.

My key type:

class MyKey implements WritableComparable<MyKey>, Serializable {
  private MyEnum type; //MyEnum is a simple enumeration.
  private TreeMap<String, String> subKeys = new TreeMap<String, String>();

  MyKey() {} //no-arg constructor for Hadoop

  public MyKey(MyEnum t, Map<String, String> sK) {
    type = t;
    subKeys = new TreeMap<String, String>(sK);
  }

  public void readFields(DataInput in) throws IOException {
    Text typeT = new Text();
    typeT.readFields(in);
    this.type = MyEnum.valueOf(typeT.toString());

    subKeys.clear();
    int i = WritableUtils.readVInt(in);
    while (0 != i--) {
      Text keyText = new Text();
      keyText.readFields(in);

      Text valueText = new Text();
      valueText.readFields(in);

      subKeys.put(keyText.toString(), valueText.toString());
    }
  }

  public void write(DataOutput out) throws IOException {
    new Text(type.name()).write(out);

    WritableUtils.writeVInt(out, subKeys.size());
    for (Entry<String, String> each : subKeys.entrySet()) {
      new Text(each.getKey()).write(out);
      new Text(each.getValue()).write(out);
    }
  }

  public int compareTo(MyKey o) {
    if (o == null) {
      return 1;
    }

    int typeComparison = this.type.compareTo(o.type);
    if (typeComparison == 0) {
      if (this.subKeys.equals(o.subKeys)) {
        return 0;
      }
      int x = this.subKeys.hashCode() - o.subKeys.hashCode();
      return (x != 0 ? x : -1);
    }
    return typeComparison;
  }
}
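For reference, the hashCode-based fallback in compareTo above is not a consistent ordering: two distinct maps can collide on hashCode (and the int subtraction can also overflow), in which case both a.compareTo(b) and b.compareTo(a) return -1. A minimal plain-Java sketch of just that fallback logic (the class and method names here are made up for illustration, not part of the original code):

```java
import java.util.TreeMap;

public class CompareDemo {
    // Replicates only the fallback logic from MyKey.compareTo for two subKey maps.
    static int compareSubKeys(TreeMap<String, String> a, TreeMap<String, String> b) {
        if (a.equals(b)) return 0;
        int x = a.hashCode() - b.hashCode();
        return (x != 0 ? x : -1);
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are distinct strings with the same hashCode (2112),
        // so these two unequal maps also collide on hashCode.
        TreeMap<String, String> m1 = new TreeMap<String, String>();
        m1.put("Aa", "v");
        TreeMap<String, String> m2 = new TreeMap<String, String>();
        m2.put("BB", "v");

        // Antisymmetry is violated: both orderings claim "less than".
        System.out.println(compareSubKeys(m1, m2)); // -1
        System.out.println(compareSubKeys(m2, m1)); // -1
    }
}
```

An ordering like this violates the Comparable contract, so the framework's sort and grouping of keys cannot be trusted to behave predictably.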

Is there anything wrong with this implementation of the key? Following is the code where I am facing the mixup of keys in the reduce call:

public void reduce(MyKey k, Iterable<MyValue> values, Context context) {
   Iterator<MyValue> iterator = values.iterator();
   int sum = 0;
   while (iterator.hasNext()) {
        MyValue value = iterator.next();
        //when I get here in the 2nd iteration, printing k shows a different key than in iteration 1.
        sum += value.getResult();
   }
   //write sum to context
}

Any help in this would be greatly appreciated.

Bhargava
    It might sound strange, but I think if you remove the hashcode part of your compareTo method and the return -1 then it should work well. – Thomas Jungblut May 23 '11 at 14:50
  • Thank you Thomas. It did sound strange but I was so desperate I tried it anyway. First, the above code works. I am shamefully admitting that the code I pasted was not exactly the code I was running. I was simply doing `return this.subKeys.hashCode() - o.subKeys.hashCode()`, which I knew was wrong, but instead of fixing the actual code, I just tried comparing the hashCodes of two keys which seemed to be colliding. I made a mistake in testing that, assumed something else was wrong, and fixed the code here. – Bhargava May 23 '11 at 17:51
  • So how did you solve the problem? Was hashCode() the reason, or? – Serob_b Mar 14 '19 at 00:27

1 Answer


This is expected behavior (with the new API at least).

When the next method of the underlying iterator of the values Iterable is called, the next key/value pair is read from the sorted mapper / combiner output, and the framework checks whether that key is still part of the same group as the previous key.

Because Hadoop re-uses the objects passed to the reduce method (it just calls the readFields method on the same instance), the underlying contents of the key parameter k will change with each iteration of the values Iterable.
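The effect can be reproduced without Hadoop. Below is a plain-Java sketch of the reuse pattern; Box and ReusingIterator are invented stand-ins for a Writable value and the reduce-side iterator, not real Hadoop classes:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Stand-in for a mutable Writable-style value.
    static class Box { int result; }

    // Hypothetical iterator mimicking the reduce-side behavior:
    // next() always returns the SAME Box instance, refilled in place.
    static class ReusingIterator implements Iterator<Box> {
        private final Iterator<Integer> data;
        private final Box shared = new Box(); // one object, reused
        ReusingIterator(List<Integer> values) { data = values.iterator(); }
        public boolean hasNext() { return data.hasNext(); }
        public Box next() { shared.result = data.next(); return shared; }
    }

    public static void main(String[] args) {
        // Consuming each value as you go is safe, even with reuse:
        int sum = 0;
        ReusingIterator it = new ReusingIterator(Arrays.asList(1, 2, 3));
        while (it.hasNext()) sum += it.next().result;
        System.out.println(sum); // 6

        // Holding on to the returned references is NOT safe: every
        // reference points at the same reused object, which ends up
        // holding the last value seen.
        Box first = null;
        it = new ReusingIterator(Arrays.asList(1, 2, 3));
        while (it.hasNext()) { Box b = it.next(); if (first == null) first = b; }
        System.out.println(first.result); // 3, not 1
    }
}
```

If you need to keep a key or a value beyond the current iteration in a real reducer, make a copy first; Hadoop's WritableUtils.clone is one way to deep-copy a Writable.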

Chris White
  • I have such kind of situation on 2.7 version. How to avoid this behavior? I just want to count my objects (just like standard wordCounter example, the only difference is that I use my custom type instead of hadoop Text), but in the end I get sum of all values for just one object. reduce() is called once form run(), instead of being called for every key separately. – Serob_b Mar 14 '19 at 00:23