I have just started to learn Hadoop and MapReduce concepts, and I have the following few questions that I would like to clear up before moving forward:
From what I understand:
Hadoop is specifically used when there is a huge amount of data involved. When we store a file in HDFS, the file is split into blocks (the block size is typically 64 MB or 128 MB, or whatever is configured for the cluster). Once the big file is split into blocks, these blocks are stored across the cluster. This is handled internally by the Hadoop framework.
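For example, I assume the block size that applies to a particular file can be checked through the FileSystem API, roughly like the sketch below (the path /data/hugefile.txt and the class name BlockSizeCheck are just placeholders I made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/hugefile.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(file);

        // Block size that was recorded for this particular file when it was written
        System.out.println("Block size of file: " + status.getBlockSize());

        // Default block size the cluster would use for new files at this path
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));
    }
}
```

Please correct me if my understanding of how the block size is decided is wrong.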
The background for the question is:
Let us say there are multiple such huge files stored in the system. Blocks of these different files may be stored on the same data node, A (there are three data nodes: A, B, and C). In addition, multiple blocks of the same file may also end up on the same data node, A.
Scenario 1:
If a client request comes in that requires access to multiple blocks of the same file on the same data node, what will happen? Will multiple mappers be assigned to these different blocks, or will the same mapper process all of the blocks?
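To make this concrete, here is a minimal, map-only driver sketch of the kind of job I have in mind (the paths /data/hugefile.txt and /data/out and the class name BlockMapperQuestion are just placeholders): if /data/hugefile.txt has, say, four blocks and two of them sit on data node A, how many map tasks would run, and would the two co-located blocks share a mapper?

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BlockMapperQuestion {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "block-to-mapper-question");
        job.setJarByClass(BlockMapperQuestion.class);

        // Identity mapper, map-only job: I only want to observe how many map tasks run
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Placeholder input and output paths
        FileInputFormat.addInputPath(job, new Path("/data/hugefile.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```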
Another part of the same question: how does the client know which blocks, or let's say which part of the file, will be required for processing? Since the client doesn't know how the files are stored, how does it ask the NameNode for block locations and so on? Or are ALL the blocks of the respective file processed for every such job? What I mean to ask is: what metadata is stored on the NameNode?
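For instance, I assume a client can ask for block locations through something like the following sketch (again, /data/hugefile.txt and the class name BlockLocationCheck are placeholders), and that it is the NameNode that answers this call from its metadata:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/hugefile.txt");   // placeholder path
        FileStatus status = fs.getFileStatus(file);

        // Ask for the locations of every block that makes up the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```

Is this roughly how a MapReduce job figures out which blocks to read, or does it go through a different path entirely?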
Scenario 2:
If there are two different requests to access blocks of different files on the same data node, what will happen? In this case, other data nodes will be left with no work to do, so won't that create a bottleneck at a single data node?