I have a very small 2-node Hadoop-HBase cluster on which I run MapReduce jobs. I use Hadoop 2.5.2. Each node has 64GB of memory, of which 32GB is available to MapReduce, with the following configuration in yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>32768</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>15</value>
</property>
Each mapper or reducer that gets executed requires 2GB of memory. I have configured this in mapred-site.xml.
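For reference, the relevant part of mapred-site.xml looks roughly like the sketch below. The property names are the standard Hadoop 2.x per-task memory settings; the exact file is not reproduced here, so treat the fragment as an assumption about how the 2GB requirement is expressed rather than a verbatim copy.

```xml
<!-- Sketch only: assumed mapred-site.xml entries for a 2GB container per task -->
<property>
<!-- memory requested from YARN for each map task container, in MB -->
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<!-- memory requested from YARN for each reduce task container, in MB -->
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
</property>
```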
Given this configuration, the cluster has a total of about 64GB of memory and 30 vcores available to YARN, and I see about 31 mappers or 31 reducers executing in parallel.
While all this is fine, there is one part that I am trying to figure out. The number of mappers or reducers executing in parallel is not the same on both nodes: one node runs more tasks than the other. Why does this happen? Can it be controlled? If so, how?
I suppose YARN does not treat these as the resources of a node but rather as the resources of the cluster, and spawns tasks wherever it can in the cluster. Is this understanding correct? If not, what is the correct explanation for this behaviour during an MR execution?