
I am learning the Hadoop MapReduce framework. I am struggling to understand why we can't use Java primitive data types in MapReduce.

rraghuva
  • Most likely because of the way that data is passed around. In many places you need Objects (and would need special handling for primitives). But: does it matter? – Thilo Nov 24 '15 at 11:23

2 Answers


In Hadoop, interprocess communication is built on remote procedure calls (RPC). The RPC protocol uses serialization to render a message into a binary stream at the sender, and the binary stream is deserialized back into the original message at the receiver.

For Hadoop to be effective, the serialization/deserialization process must be optimized, because a huge number of remote calls happen between the nodes in the cluster. So the serialization format should be fast, compact, extensible and interoperable. For this reason, the Hadoop framework provides its own IO classes to replace the Java primitive data types, e.g. IntWritable for int, LongWritable for long, Text for String, etc.
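
For illustration, here is a minimal word-count-style mapper (a sketch, not part of the original answer; the class name TokenMapper is made up) showing that every key and value crossing the framework boundary is a Writable wrapper rather than a Java primitive or String:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count-style mapper: the framework serializes Writables, not primitives.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1); // wraps the int 1
    private final Text word = new Text();                      // wraps a String

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            word.set(token);          // reuse the same Writable object per record
            context.write(word, ONE); // emitted key/value must both be Writables
        }
    }
}
```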

Refer to Hadoop: The Definitive Guide, 4th edition, for more details.

From the Apache website, the purpose of Writable is described as:

A serializable object which implements a simple, efficient, serialization protocol, based on DataInput and DataOutput.
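
As a rough sketch of that protocol (the class PointWritable below is hypothetical, chosen just for this example), a custom Writable only needs to implement write(DataOutput) and readFields(DataInput); note that no class metadata goes into the stream, only the raw field bytes:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Minimal custom Writable: the whole protocol is these two methods.
public class PointWritable implements Writable {

    private int x;
    private int y;

    public PointWritable() { }          // no-arg constructor needed for deserialization

    public PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(x);                // serialize: only the raw bytes of the fields
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readInt();               // deserialize in the same order they were written
        y = in.readInt();
    }
}
```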

Ravindra babu

Java serialization prefixes class metadata (the class name and a hash of the class) before each instance of the object in the serialized stream. Because of this, you do not need to specify the class when reading an object back, but it adds overhead to every read, since each object could be an instance of a different class.

With Hadoop serialization, we specify the class when retrieving the data (for example via the InputFormat), so there is no need for a per-object prefix: we already know what we are retrieving. This improves speed and performance during RPCs.
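
As a hedged illustration of that size difference (the class name SerializationSizeDemo and the value 163 are made up for this sketch), compare what actually goes into the stream for a single int under the two schemes:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

import org.apache.hadoop.io.IntWritable;

// Rough comparison of the two serialization formats for a single int value.
public class SerializationSizeDemo {

    public static void main(String[] args) throws IOException {
        // Java serialization: the stream carries class metadata
        // (class name, serialVersionUID, field descriptors) plus the value.
        ByteArrayOutputStream javaBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBytes)) {
            oos.writeObject(Integer.valueOf(163));
        }

        // Hadoop Writable: the reader already knows the class,
        // so only the four bytes of the int go into the stream.
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(writableBytes)) {
            new IntWritable(163).write(out);
        }

        System.out.println("java.io.Serializable: " + javaBytes.size() + " bytes");
        System.out.println("IntWritable:          " + writableBytes.size() + " bytes");
    }
}
```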

Tanveer Dayan