
I have a multi-node Hadoop cluster consisting of two machines. The first machine (configured as master and slave) runs a NameNode and a DataNode, and the second machine (configured as slave) runs a DataNode.

I want to upload data and have it distributed between the two machines almost equally.

I have two scenarios:

First: suppose I have a file, file1, 500 MB in size, and I upload it to the first machine using:

hadoop fs -put file1 hdfspath

Will it be divided between both DataNodes, or stored only on the first machine?

When will the distribution happen: only after the block size is exceeded on the first machine, or is there another criterion?

Will it be divided equally, 250 MB for each DataNode?


Second: suppose I have 250 files, each 2 MB in size, and I upload the folder dir1 containing them to the first machine using:

hadoop fs -put dir1 hdfspath

Same question: will the data be distributed across both machines or stored only on the first machine? Also, when and how will the distribution occur?

Thank you.

Mosab Shaheen

1 Answer


When we write a file to HDFS, it is split into chunks called data blocks, and the block size is controlled by the parameter dfs.block.size (dfs.blocksize in newer releases) in hdfs-site.xml (normally 128 MB). Each block is stored on one or more nodes, controlled by the parameter dfs.replication in the same file (the default is 3). Each copy of a block on a node is called a replica.
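
As a quick illustration (hedged: assuming the hdfs and hadoop commands are on your PATH), you can read the effective values with hdfs getconf and override them for a single upload via the generic -D options:

hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication
hadoop fs -D dfs.blocksize=134217728 -D dfs.replication=2 -put file1 hdfspath

Note that with only two DataNodes, the default replication factor of 3 cannot be fully satisfied: HDFS keeps at most one replica of a given block per node, so each block gets two replicas and is reported as under-replicated.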

How it is done:

When writing data to an HDFS file, the data is first buffered locally at the client. When the buffer reaches a certain threshold (the block size, 128 MB by default), the client requests and retrieves a list of DataNodes from the NameNode (which maintains the metadata). This list contains the DataNodes that have space and can hold a replica of that block; the number of DataNodes in the list is determined by the replication factor. The client then forms a pipeline through these DataNodes to flush the data. The first DataNode starts receiving the data in small chunks (the underlying io.file.buffer.size, 4 KB by default, is what Hadoop uses for I/O operations), writes the buffered data to its local directory, and transfers the same buffered data to the second DataNode in the list. The second DataNode in turn receives the buffered data of the block, writes it to its local directory, and flushes the same data to the third DataNode. Finally, the third DataNode writes the data to its local directory.
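
To see how this played out for a given file, you can list its blocks and the DataNodes holding each replica (a hedged example; /user/you/file1 is a hypothetical HDFS path):

hdfs fsck /user/you/file1 -files -blocks -locations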

When the first block is filled, the client requests a new set of DataNodes from the NameNode to host replicas of the next block. This flow continues until the last block of the file, and the DataNodes chosen may differ from block to block.
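
Applied to your question (a hedged illustration, assuming the default 128 MB block size): the 500 MB file1 becomes four blocks of 128 + 128 + 128 + 116 MB, and each 2 MB file in dir1 occupies its own single small block. Each block is placed independently, so an exact 250 MB per DataNode is not guaranteed. Note also that when the client runs on a machine that is itself a DataNode (as when uploading from your first machine), the default placement policy puts the first replica of every block on that local DataNode, which is why such uploads tend to pile up locally and may need rebalancing.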

Deepan Ram
  • Thanks for replying. I think the data should be re-balanced, otherwise it will all be stored in one DataNode. I read that we should use hdfs balancer. By the way, how can I see the data on each node in the Hadoop web interface, i.e. what is the URL for that? – Mosab Shaheen Mar 24 '17 at 14:45
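
For reference, a minimal sketch of both (hedged: host names and ports depend on your setup; namenode-host is a placeholder). The balancer is run as:

hdfs balancer -threshold 10

where -threshold is the allowed deviation, in percent, of each DataNode's utilization from the cluster average. The NameNode web UI, which shows per-DataNode usage under its Datanodes tab, is served at http://namenode-host:50070 in Hadoop 2.x (http://namenode-host:9870 in Hadoop 3.x).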