
Suppose the data size of a file XYZ is 68MB and the default block size is 64MB. Then the blocks will be A - 64MB and B - 4MB. In block B, the remaining space is occupied by data from another file.

So when the XYZ data file is processed, the data in blocks A and B will be processed. Since block B contains data from another file too, how does Hadoop know which part of block B is to be processed?

user4221591

1 Answer


If you have a file (XYZ) of 68 MB and your block size is 64MB, then the data will be split into 2 blocks. Block-A will store 64MB of data, and Block-B will store the remaining 4MB, after which the block is closed (there is no wastage of space here); no other file's data will be put into Block-B.
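
You can see this layout for yourself by asking the NameNode for the file's block locations. Below is a minimal sketch using the standard HDFS `FileSystem`/`BlockLocation` API; the path `/data/XYZ` is just a placeholder for wherever the file actually lives. For a 68 MB file with a 64 MB block size it would print two entries, one of length 67108864 and one of length 4194304, both belonging to this file only.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path: point this at the actual file in HDFS.
        Path file = new Path("/data/XYZ");

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each entry covers a byte range of this file only.
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```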

So while processing, MapReduce knows exactly which blocks to process for a specific file. Of course, there are other constructs such as input splits, which MapReduce takes into consideration while processing the blocks to figure out record boundaries.
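
To make the input-split point concrete, here is a simplified sketch (not Hadoop's actual code) of how splits are computed per file: the framework walks each file independently and cuts it into byte ranges of roughly one block each, so a split can never mix bytes from two different files. The real FileInputFormat additionally allows some slack on the final split and leaves record-boundary handling to the RecordReader.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, per-file split computation (illustrative only).
public class SplitSketch {

    static class Split {
        final String file;
        final long offset;
        final long length;

        Split(String file, long offset, long length) {
            this.file = file;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return file + " [offset=" + offset + ", length=" + length + "]";
        }
    }

    // Cuts a single file into splits of at most splitSize bytes.
    static List<Split> splitsFor(String file, long fileLen, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileLen) {
            long length = Math.min(splitSize, fileLen - offset);
            splits.add(new Split(file, offset, length));
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 68 MB file, 64 MB split size -> [0, 64MB) and [64MB, 68MB)
        splitsFor("XYZ", 68 * mb, 64 * mb).forEach(System.out::println);
    }
}
```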

Ashrith
  • You mean to say that, in the case of Block B, it will again be divided into two chunks: one of 4MB and the remaining space for other data. Right? – user4221591 Nov 06 '14 at 07:28
  • No, HDFS blocks are just logical abstractions on top of the physical Linux file system, so the second block is stored as a 4 MB logical file in HDFS plus one metadata entry in the NameNode for that block. Under the hood, though, the block is technically stored as ~1,000 Linux blocks = 4MB (assuming a `4KB` ext4 block size); see the sketch below. Take a look at this [question](http://stackoverflow.com/questions/15062457/hdfs-block-size-vs-actual-file-size) for more info. – Ashrith Nov 06 '14 at 07:43
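
A quick way to check the arithmetic in that last comment: the final HDFS block only holds 4 MB of real data, so on the DataNode's local file system it consumes roughly 4 MB worth of local blocks, not a full 64 MB. The figures below are a sketch assuming a 4 KB ext4 block size, as in the comment.

```java
public class LastBlockFootprint {
    public static void main(String[] args) {
        long hdfsBlockSize = 64L * 1024 * 1024; // 64 MB HDFS block size
        long fileSize      = 68L * 1024 * 1024; // 68 MB file
        long localFsBlock  = 4L * 1024;         // assumed 4 KB ext4 block size

        long lastBlockBytes  = fileSize % hdfsBlockSize;                     // 4 MB of real data
        long localBlocksUsed = (lastBlockBytes + localFsBlock - 1) / localFsBlock;

        System.out.println("Last HDFS block holds " + lastBlockBytes + " bytes");
        System.out.println("Approx. local FS blocks consumed: " + localBlocksUsed); // ~1,024
    }
}
```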