I have just started to learn Hadoop and MapReduce concepts, and I have the following few questions that I would like to clear up before moving forward:
From what I understand:
Hadoop is specifically used when there is a huge amount of data involved. When we store a file in HDFS, the file is split into blocks (the block size is typically 64 MB or 128 MB, or whatever is configured for the cluster). Once the big file is split into blocks, these blocks are stored across the cluster. This is handled internally by the Hadoop framework.
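For example, I assume the block size that applies to a particular file can be checked through the FileSystem API, roughly like the sketch below (the path /data/hugefile.txt and the class name BlockSizeCheck are just placeholders I made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/hugefile.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(file);

        // Block size that was recorded for this particular file when it was written
        System.out.println("Block size of file: " + status.getBlockSize());

        // Default block size the cluster would use for new files at this path
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));
    }
}
```

Please correct me if my understanding of how the block size is decided is wrong.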
The background for the question is:
Let us say there are multiple such huge files stored in the system. Blocks of these different files may be stored on the same data node, A (there are three data nodes: A, B, and C). In addition, multiple blocks of the same file may also end up on the same data node, A.
Scenario 1:
If a client request comes in that requires access to multiple blocks of the same file on the same data node, what will happen? Will multiple mappers be assigned to these different blocks, or will the same mapper process all of the blocks?
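To make this concrete, here is a minimal, map-only driver sketch of the kind of job I have in mind (the paths /data/hugefile.txt and /data/out and the class name BlockMapperQuestion are just placeholders): if /data/hugefile.txt has, say, four blocks and two of them sit on data node A, how many map tasks would run, and would the two co-located blocks share a mapper?

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BlockMapperQuestion {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "block-to-mapper-question");
        job.setJarByClass(BlockMapperQuestion.class);

        // Identity mapper, map-only job: I only want to observe how many map tasks run
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Placeholder input and output paths
        FileInputFormat.addInputPath(job, new Path("/data/hugefile.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```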
Another part of the same question: how does the client know which blocks, or let's say which part of the file, will be required for processing? Since the client doesn't know how the files are stored, how does it ask the NameNode for block locations and so on? Or are ALL the blocks of the respective file processed for every such job? What I mean to ask is: what metadata is stored on the NameNode?
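For instance, I assume a client can ask for block locations through something like the following sketch (again, /data/hugefile.txt and the class name BlockLocationCheck are placeholders), and that it is the NameNode that answers this call from its metadata:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/hugefile.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(file);

        // Ask for the locations of every block that makes up the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```

Is this roughly how a MapReduce job figures out which blocks to read, or does it go through a different path entirely?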
Scenario 2:
If there are two different requests to access blocks of different files on the same data node, what will happen? In this case, other data nodes will be left with no work to do, so won't that create a bottleneck at a single data node?