The splitting logic of HDFS?

Question

What is the significance of the isSplitable() method of the FileInputFormat class? http://hadoop.apache.org/docs/r2.2.0/api/index.html

score 2 · Answer 1 · answered Feb 26 '14 at 07:58


When isSplitable() returns false, a single mapper processes the entire file instead of the file being divided among several mappers.

You can subclass FileInputFormat and override isSplitable() to return true or false depending on your needs.
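To make the effect concrete, here is a minimal sketch in plain Java of how the flag changes the number of input splits, and so the number of mapper tasks. It does not call Hadoop: `SplitSketch` and `computeSplits` are hypothetical names, and the model is deliberately simplified (the real getSplits logic also takes configured minimum and maximum split sizes into account, which are not modelled here).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of input-split computation; not Hadoop's actual code.
class SplitSketch {

    /** Returns {offset, length} pairs describing the splits of one file. */
    static List<long[]> computeSplits(long fileLength, long splitSize, boolean splittable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength and splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        if (!splittable) {
            // Not splittable: one split covers the whole file, so one mapper task reads it all.
            splits.add(new long[] {0, fileLength});
            return splits;
        }
        // Splittable: cut the file into chunks of splitSize; the last chunk holds the remainder.
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            splits.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300 MB file with a 128 MB split size.
        System.out.println(computeSplits(300 * mb, 128 * mb, true).size());  // prints 3
        System.out.println(computeSplits(300 * mb, 128 * mb, false).size()); // prints 1
    }
}
```

In this model the splittable file is handed to three mapper tasks and the non-splittable one to a single task, regardless of how many HDFS blocks the file occupies.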


Jasper


But if my file is larger than the block size, e.g. 129 MB with a 128 MB block size, then the file will still be split into two blocks even if isSplitable() returns false. What is the use of this method, then? Also, by a single mapper, you mean a single machine/core, right? – Sugandha Feb 26 '14 at 09:26

Here, splittable does not refer to the HDFS storage level (where the block size applies); it refers to how the input is split before it is passed to your mappers. When isSplitable() returns false, one mapper gets the entire file, whatever its size. By a single mapper I do not mean a single machine/core; I mean a single mapper task. Please see: http://wiki.apache.org/hadoop/HadoopMapReduce – Jasper Feb 26 '14 at 09:36

score 1 · Answer 2 · answered Feb 26 '14 at 10:53

If the files are stream-compressed (tar.gz or zip files, for example) and your records span a variable number of lines, part of a record may end up in one block and the rest of that record in another. A program written to read whole records might then crash.

In scenarios like these, one would override isSplitable() to return false.
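The failure mode can be sketched in plain Java without Hadoop. The record format below is made up for illustration (a record is a group of lines terminated by a blank line), and `RecordBoundarySketch` and `parseRecords` are hypothetical names. The point is only that parsing two chunks independently, after cutting the input at an arbitrary offset, does not give the same records as parsing the whole input.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a multi-line record cut in two by a split boundary.
class RecordBoundarySketch {

    /** Parses a chunk into records, assuming records are separated by a blank line. */
    static List<String> parseRecords(String chunk) {
        return Arrays.asList(chunk.split("\n\n"));
    }

    public static void main(String[] args) {
        // Two records, each made of two lines and ended by a blank line.
        String data = "id=1\nname=a\n\nid=2\nname=b\n\n";

        // Reading the whole input as one unit (the non-splittable case).
        System.out.println(parseRecords(data).size()); // prints 2

        // Cutting the input in the middle of the second record (a naive split).
        int cut = data.indexOf("name=b");
        List<String> first = parseRecords(data.substring(0, cut));
        List<String> second = parseRecords(data.substring(cut));

        // The second record now shows up as two incomplete fragments.
        System.out.println(first.size() + second.size()); // prints 3
    }
}
```

Each reader sees only half of the second record, which is the situation a reader written for whole records is not prepared for.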
