I'm running Giraph on our small CDH4 Hadoop cluster of five hosts (four compute nodes and a head node; call them 0-3 and 'w') - see versions below. All five hosts run the MapReduce TaskTracker service, and 'w' also runs the JobTracker. Resources are tight for my particular Giraph application (a kind of path-finding), and I've discovered that some automatically-scheduled task-to-host assignments work better than others.
More specifically, my Giraph command (see below) requests four Giraph workers, and at execution time Hadoop (ZooKeeper, actually, IIUC) creates five map tasks that I can see in the JobTracker web UI: one master and four slaves. When it puts three or more of those tasks on 'w' (e.g., 01www or 1wwww), that host maxes out RAM, CPU, and swap, and the job hangs. But when the system spreads the work out so that 'w' gets two or fewer tasks (e.g., 123ww or 0321w), the job finishes fine.
My questions: 1) what program decides the task-to-host assignment, and 2) how do I control it?
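For context, the only knob I'm aware of is the per-TaskTracker slot cap in mapred-site.xml, which is a blunt instrument (it's a real MR1/CDH4 property, but whether it's the right way to solve this is exactly what I'm asking; treat the snippet as a sketch):

```xml
<!-- mapred-site.xml on host 'w' only: cap concurrent map tasks so the
     scheduler can't pile most of the Giraph workers onto the head node.
     MR1 property; takes effect after a TaskTracker restart. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
```

This would limit 'w' to two map slots cluster-wide, though, not just for this job, so I'd prefer something finer-grained if it exists.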
Thanks very much!
Versions
- CDH: 4.7.3
- Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha" (CHANGELOG starts with: Release 1.0.0 - 2013-04-15)
- Zookeeper Client environment: zookeeper.version=3.4.5-cdh4.4.0--1, built on 09/04/2013 01:46 GMT
Giraph command
hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=wright.cs.umass.edu:2181 \
-libjars ${LIBJARS} \
relpath.RelPathVertex \
-wc relpath.RelPathWorkerContext \
-mc relpath.RelPathMasterCompute \
-vif relpath.JsonAdjacencyListVertexInputFormat \
-vip $REL_PATH_INPUT \
-of relpath.JsonAdjacencyListTextOutputFormat \
-op $REL_PATH_OUTPUT \
-ca RelPathVertex.path=$REL_PATH_PATH \
-w 4