4

Why should a Writable datatype be Mutable? What are the advantages of using Text (vs String) as a datatype for Key/Value in Map, Combine, Shuffle or Reduce processes?

Thanks & Regards, Raja

Thomas Jungblut
  • 20,854
  • 6
  • 68
  • 91
Raja
  • 63
  • 5

2 Answers2

4

You can't choose, these datatypes must be mutable.

The reason is the serialization mechanism. Let's look at the code:

// version 1.x MapRunner#run()
K1 key = input.createKey();
V1 value = input.createValue();

while (input.next(key, value)) {
   // map pair to output
   mapper.map(key, value, output, reporter);
   ...

So we are reusing the same instance of key/value pairs all over again. Why? I don't know about the design decisions back then, but I assume it was to reduce the amount of garbage objects. Note that Hadoop is quite old and back then the garbage collectors were not as efficient as they are today, however even today it makes a big difference in runtime if you would map billions of objects and directly throw them away as garbage.

The real reason why you can't make the Writable type really immutable is that you can't declare fields as final. Let's make a simple example with the IntWritable:

public class IntWritable implements WritableComparable {
  private int value;

  public IntWritable() {}

  public IntWritable(int value) { set(value); }
...

If you would make it immutable it would certainly not work with the serialization process anymore, because you would need to define value final. This can't work, because the keys and values are instantiated at runtime via reflection. This requires a default constructor and thus the InputFormat can't guess the parameter that would be necessary to fill the final data fields. So the whole concept of reusing instances obviously contradicts the concept of immutability.

However, you should ask yourself what kind of benefit an immutable key/value should have in Map/Reduce. In Joshua Bloch's- Effective Java, Item 15 he states that immutable classes are easier to design, implement and use. And he is right, because Hadoop's reducer is the worst possible example for mutability:

void reduce(IntWritable key, Iterable<Text> values, Context context) ...

Every value in the iterable refers to the same shared object. Thus many people are confused if they buffer their values into a normal collection and ask themselves why it always retains the same values.

In the end it boils down to the trade-off of performance (cpu and memory- imagine many billions of value objects for a single key must reside in RAM) vs. simplicity.

Thomas Jungblut
  • 20,854
  • 6
  • 68
  • 91
  • Thanks Thomas. That clarifies me on why the Writable data types were designed to be mutable. Can you elaborate on the Reduce example you've provided. Does it mean that Reduce will always use only one instance of the collection object and it's value is never modified? – Raja Oct 30 '13 at 07:30
  • @Raja there is no collection object, the `Iterable` is a direct proxy to the data on disk that will be serialized and constantly refill only one value object. – Thomas Jungblut Oct 30 '13 at 09:21
2

Simply put, the reason Writable cannot be Immutable is the readFields(DataInput) method in Writable. The way the Hadoop deserializes instances in to create an instance using the default (no argument) constructor and calling readFields to parse the values. Since the values are not being assigned in the construction, the object must be mutable.

John B
  • 32,493
  • 6
  • 77
  • 98
  • Thanks John. It clarifies on the need for a datatype to be Mutable in order to participate in Serialization/De-Serialisation process. At what level will Hadoop create an instance (of object) during de-serilasation process. One object per Task run? If the Mapper is multithreaded, can there be multiple objects created to read fields from the same InputSplit in parallel? – Raja Oct 30 '13 at 07:40
  • I do believe it is one per task run. It creates an instance in order to do a Comparable lookup prior to Deserialization. As far as I know, Hadoop does not multithread Mappers within a single JVM. If this is incorrect, please let me know. – John B Oct 30 '13 at 10:34