
I am fairly new to the Big Data world and am trying to learn Hadoop. One thing surprises me: Big Data systems, and Hadoop in particular, favor immutability by default, since we want to write the data once and read it many times, and immutability is the best option in the world of distributed storage and processing. At the same time, I read that all data types in Hadoop that implement the Writable interface are mutable, to support serialization in the framework. I am confused: if all the data types are mutable, how is immutability supported in Hadoop as a whole? Aren't these two things contradictory?

Thanks in advance for answering my question.

2 Answers

Hadoop immutable

With Hadoop, all records written are immutable because Hadoop doesn't support random writes. Sometimes this can be a real pain, but it scales really well. You'll even find that more and more languages are bringing back this concept of immutable objects. Why? Because mutable objects present several problems. For one, a mutable object must deal with concurrency. This alone requires additional programming to ensure an object is only updated by a single source at a time. And when you're updating a mutable object that has been written to disk, you need to rewrite the entire file below the change, which can be costly.

Ref: https://streever.atlassian.net/wiki/display/HADOOP/2014/03/06/Managing+Mutable+Data+in+an+Immutable+Big+Data+World

Data type mutable

The reason is the serialization mechanism. Let's look at the code:

// version 1.x MapRunner#run()
K1 key = input.createKey();
V1 value = input.createValue();

while (input.next(key, value)) {
   // map pair to output
   mapper.map(key, value, output, reporter);

...

So we are reusing the same key/value instances over and over again. Why? I don't know the design decisions made back then, but I assume it was to reduce the number of garbage objects. Note that Hadoop is quite old, and back then garbage collectors were not as efficient as they are today. Even today, however, it makes a big difference in runtime if you map billions of objects and immediately throw them away as garbage.
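To make the reuse pattern concrete, here is a minimal sketch that runs without Hadoop on the classpath. `MutableInt` and `Reader` are hypothetical stand-ins I made up for a Writable and a record reader; they are not Hadoop classes. The loop mirrors the shape of the snippet above and counts how many distinct objects the "mapper" actually sees.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class ReuseDemo {
    /** Hypothetical stand-in for a mutable Writable such as IntWritable. */
    static final class MutableInt {
        private int value;
        void set(int value) { this.value = value; }
        int get() { return value; }
    }

    /** Hypothetical stand-in for a record reader that fills a caller-supplied object. */
    static final class Reader {
        private final int[] data;
        private int pos = 0;
        Reader(int... data) { this.data = data; }
        MutableInt createValue() { return new MutableInt(); }
        boolean next(MutableInt value) {
            if (pos >= data.length) {
                return false;
            }
            value.set(data[pos++]); // overwrite the same object in place
            return true;
        }
    }

    /** Runs the reuse loop and returns how many distinct objects were handed out. */
    static int distinctInstancesSeen(int... data) {
        Reader input = new Reader(data);
        MutableInt value = input.createValue(); // allocated exactly once
        // Identity-based set: compares references, not contents.
        Set<MutableInt> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        while (input.next(value)) {
            seen.add(value);
        }
        return seen.size();
    }

    /** Same loop, but consuming each record immediately, which is the safe usage. */
    static long sum(int... data) {
        Reader input = new Reader(data);
        MutableInt value = input.createValue();
        long total = 0;
        while (input.next(value)) {
            total += value.get();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(distinctInstancesSeen(10, 20, 30)); // prints 1
        System.out.println(sum(10, 20, 30));                   // prints 60
    }
}
```

However many records pass through, only one value object is ever allocated, which is the garbage-saving effect I am guessing at above.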

The real reason you can't make a Writable type truly immutable is that you can't declare its fields as final. Let's take IntWritable as a simple example:

public class IntWritable implements WritableComparable {
  private int value;

  public IntWritable() {}

  public IntWritable(int value) { set(value); }

...

If you made it immutable, it would no longer work with the serialization process, because you would need to declare value as final. That can't work, because the keys and values are instantiated at runtime via reflection. This requires a default constructor, so the InputFormat has no way to guess the arguments that would be needed to fill the final fields. So the whole concept of reusing instances obviously contradicts the concept of immutability.
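As a rough illustration of why a no-argument constructor and a non-final field go together, here is a self-contained sketch. `MiniWritable` and `MiniInt` are hypothetical types of my own; the interface only mirrors the two methods of Hadoop's `Writable` (`write(DataOutput)` and `readFields(DataInput)`), and the `deserialize` helper is an assumption about the general shape of the process, not Hadoop's actual code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class ReflectionDemo {
    /** Hypothetical interface mirroring the two methods of Hadoop's Writable. */
    interface MiniWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical IntWritable look-alike. */
    static final class MiniInt implements MiniWritable {
        // Cannot be final: readFields assigns it after the object already exists.
        private int value;

        public MiniInt() {}                       // needed for reflective creation
        public MiniInt(int value) { this.value = value; }
        public int get() { return value; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(value);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            value = in.readInt();
        }
    }

    static byte[] serialize(MiniWritable writable) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writable.write(new DataOutputStream(buffer));
        return buffer.toByteArray();
    }

    /** Creates an empty instance via reflection, then fills it from the bytes. */
    static <T extends MiniWritable> T deserialize(Class<T> type, byte[] bytes) throws Exception {
        // The framework only knows the class, not any constructor arguments.
        T instance = type.getDeclaredConstructor().newInstance();
        instance.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return instance;
    }

    /** Convenience wrapper that hides the checked exceptions. */
    static int roundTrip(int original) {
        try {
            return deserialize(MiniInt.class, serialize(new MiniInt(original))).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42)); // prints 42
    }
}
```

If `value` were declared final, `readFields` would not compile, and the reflective no-argument construction would have nothing to initialize it with.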

However, you should ask yourself what benefit an immutable key/value would have in Map/Reduce. In Effective Java, Item 15, Joshua Bloch states that immutable classes are easier to design, implement, and use. And he is right, because Hadoop's reducer is the worst possible example of mutability:

void reduce(IntWritable key, Iterable<Text> values, Context context) ...

Every value in the iterable refers to the same shared object. As a result, many people get confused when they buffer the values into a normal collection and then wonder why every element holds the same value.
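The buffering pitfall can be reproduced without Hadoop. In this sketch, `MutableText` and `reusingIterable` are hypothetical stand-ins I wrote to imitate an iterable that refills one shared object; they are not Hadoop's implementation. The first method buffers references and gets the last value repeated, the second buffers copies.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReducerReuseDemo {
    /** Hypothetical stand-in for a mutable Writable such as Text. */
    static final class MutableText {
        private String value = "";
        MutableText() {}
        MutableText(MutableText other) { this.value = other.value; } // copy constructor
        void set(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    /** Hypothetical iterable that hands out ONE shared object, refilled on each step. */
    static Iterable<MutableText> reusingIterable(String... raw) {
        MutableText shared = new MutableText();
        return () -> new Iterator<MutableText>() {
            private int index = 0;

            @Override
            public boolean hasNext() { return index < raw.length; }

            @Override
            public MutableText next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.set(raw[index++]);
                return shared;
            }
        };
    }

    /** Pitfall: every list slot points at the same object. */
    static List<String> bufferReferences(String... raw) {
        List<MutableText> buffered = new ArrayList<>();
        for (MutableText value : reusingIterable(raw)) {
            buffered.add(value);
        }
        return render(buffered);
    }

    /** Fix: copy each value before the iterator overwrites it. */
    static List<String> bufferCopies(String... raw) {
        List<MutableText> buffered = new ArrayList<>();
        for (MutableText value : reusingIterable(raw)) {
            buffered.add(new MutableText(value));
        }
        return render(buffered);
    }

    private static List<String> render(List<MutableText> values) {
        List<String> out = new ArrayList<>();
        for (MutableText value : values) {
            out.add(value.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bufferReferences("a", "b", "c")); // prints [c, c, c]
        System.out.println(bufferCopies("a", "b", "c"));     // prints [a, b, c]
    }
}
```

With real Hadoop types the analogous fix is to copy before buffering, for example with `new Text(value)`.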

In the end it boils down to a trade-off between performance (CPU and memory: imagine that many billions of value objects for a single key had to reside in RAM) and simplicity.

Ref: Why should a Writable datatype be Mutable?

Kishore

I think you may be confusing HDFS, i.e. stored files, which are typically written once and do not support arbitrary overwriting, with in-memory objects (Writables). The latter can be modified because they are not committed to disk, and it would be too expensive to create a new Writable for every operation (think of the GC costs).

Jedi