
Suppose the data size of a file XYZ is 68MB and the default block size is 64MB. Then the blocks will be A - 64MB and B - 4MB. In block B, the remaining space is occupied by data from another file.

So when the XYZ data file is processed, the data in blocks A and B will be processed. Since block B contains data from another file too, how does Hadoop know which part of block B is to be processed?

user4221591

1 Answer


If you have a file (XYZ) of 68 MB and your block size is 64MB, then the data will be split into 2 blocks. Block-A will store 64MB of data, and Block-B will store the remaining 4MB, after which the block is closed (there is no wastage of space here); no other file's data will be put into Block-B.
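
You can see this layout for yourself by asking the NameNode for the file's block locations. Below is a minimal sketch using the standard HDFS `FileSystem`/`BlockLocation` API; the path `/data/XYZ` is just a placeholder for wherever the file actually lives. For a 68 MB file with a 64 MB block size it would print two entries, one of length 67108864 and one of length 4194304, both belonging to this file only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path: point this at the actual file in HDFS.
        Path file = new Path("/data/XYZ");

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each entry covers a byte range of this file only.
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```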

So while processing, MapReduce knows exactly which blocks to process for a specific file. Of course, there are other constructs such as input splits, which MapReduce takes into consideration while processing the blocks to figure out record boundaries.
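
To make the input-split point concrete, here is a simplified sketch (not Hadoop's actual code) of how splits are computed per file: the framework walks each file independently and cuts it into byte ranges of roughly one block each, so a split can never mix bytes from two different files. The real FileInputFormat additionally allows some slack on the final split and leaves record-boundary handling to the RecordReader.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, per-file split computation (illustrative only).
public class SplitSketch {

    static class Split {
        final String file;
        final long offset;
        final long length;

        Split(String file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + " [offset=" + offset + ", length=" + length + "]";
        }
    }

    // Cuts a single file into splits of at most splitSize bytes.
    static List<Split> splitsFor(String file, long fileLen, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileLen) {
            long length = Math.min(splitSize, fileLen - offset);
            splits.add(new Split(file, offset, length));
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 68 MB file, 64 MB split size -> [0, 64MB) and [64MB, 68MB)
        splitsFor("XYZ", 68 * mb, 64 * mb).forEach(System.out::println);
    }
}
```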

Ashrith
  • You mean to say that, in the case of Block B, it will again be divided into two chunks: one of 4MB and the remaining space for other data. Right? – user4221591 Nov 06 '14 at 07:28
  • No, HDFS blocks are just logical abstractions on top of the physical Linux file system, so the second block is stored as a 4 MB logical file in HDFS plus one metadata entry in the NameNode for that block. Under the hood, though, the block is technically stored as ~1,000 Linux blocks = 4MB (assuming a `4KB` ext4 block size); see the sketch below. Take a look at this [question](http://stackoverflow.com/questions/15062457/hdfs-block-size-vs-actual-file-size) for more info. – Ashrith Nov 06 '14 at 07:43
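
A quick way to check the arithmetic in that last comment: the final HDFS block only holds 4 MB of real data, so on the DataNode's local file system it consumes roughly 4 MB worth of local blocks, not a full 64 MB. The figures below are a sketch assuming a 4 KB ext4 block size, as in the comment.

```java
public class LastBlockFootprint {
    public static void main(String[] args) {
        long hdfsBlockSize = 64L * 1024 * 1024; // 64 MB HDFS block size
        long fileSize      = 68L * 1024 * 1024; // 68 MB file
        long localFsBlock  = 4L * 1024;         // assumed 4 KB ext4 block size

        long lastBlockBytes  = fileSize % hdfsBlockSize;                     // 4 MB of real data
        long localBlocksUsed = (lastBlockBytes + localFsBlock - 1) / localFsBlock;

        System.out.println("Last HDFS block holds " + lastBlockBytes + " bytes");
        System.out.println("Approx. local FS blocks consumed: " + localBlocksUsed); // ~1,024
    }
}
```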