
I have a 2-node Hadoop setup (one node acts as both master and slave, the other as a slave only) and 4 input files, each 1 GB in size. When I set dfs.replication to 2, the entire data set is copied to both nodes, which is understandable. My question is: how do I get improved performance (almost twice as fast) over a single-node setup, if in the 2-node case MapReduce still runs over the complete data set on both systems, with the added overhead of channeling the inputs from two mappers to the reducers?

Also, when I set the replication factor to 1, the entire data set exists only on the master node, which is also understandable since it avoids Ethernet overhead. But even in this case I see a performance improvement over the single-node setup, which I find confusing: since MapReduce runs on local data sets, this scenario should essentially behave like a single-node setup, with one MapReduce job running on the master node over the entire data set, shouldn't it?
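For reference, this is roughly how I set the replication factor in hdfs-site.xml (property name as in Hadoop 1.x; shown here only to make the setup concrete):

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>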

Can someone help me understand what I am missing here?

Thanks, Pawan

1 Answer


Pawan,

In the two-node case the MapReduce job does not run over the entire data set on each node. MapReduce operates on HDFS blocks, which are 64 MB or larger depending on your configuration. Each 1 GB file is split into blocks and distributed across the cluster nodes; some of those blocks are processed on node 1 and the others on node 2, with no duplication. The replication factor only increases the availability of the data and the tolerance to node failures; it does not duplicate the tasks.
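If you want to verify this, you can inspect how the blocks of your input files are laid out across the two nodes with fsck; something like the following (the input path is just a placeholder for wherever your files live in HDFS) lists each block and the node(s) holding its replicas:

    hadoop fsck /user/pawan/input -files -blocks -locations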

So, from a processing perspective, the data is split between node 1 and node 2 and processed in parallel. This means that if you are utilizing your processing power fully and correctly, you are theoretically doubling your speed.
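To make that concrete: with the default 64 MB block size, your 4 x 1 GB input comes to roughly 64 blocks, so each node handles map tasks for about 32 blocks instead of all 64, which is where the near-2x improvement comes from.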

Cheers, Rags
