I'm running Giraph on our small CDH4 Hadoop cluster of five hosts: four compute nodes and a head node (call them 0-3 and 'w'); versions are listed below. All five hosts run MapReduce TaskTracker services, and 'w' also runs the JobTracker. Resources are tight for my particular Giraph application (a kind of path-finding), and I've discovered that some of the automatically scheduled task-to-host assignments work better than others.

More specifically, my Giraph command (see below) specifies four Giraph workers, and when it executes, Hadoop (ZooKeeper actually, IIUC) creates five map tasks that I can see in the JobTracker web UI: one master and four slaves. When three or more of those tasks land on 'w' (e.g., 01www or 1wwww), that host maxes out RAM, CPU, and swap, and the job hangs. However, when the work is spread more evenly so that 'w' has two or fewer tasks (e.g., 123ww or 0321w), the job finishes fine.
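As a rough illustration of the pattern above (not anything Hadoop or Giraph provides), the placement strings can be checked with a small shell sketch. The function names `tasks_on_host` and `is_risky` are hypothetical, and the threshold of two tasks on 'w' is only what these runs suggested:

```shell
# Count how many tasks in a placement string (one character per task,
# e.g. "01www") landed on a given single-character host label.
tasks_on_host() {
  # $1 = placement string, $2 = host label
  printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d ' '
}

# Assumption from the observed runs: more than two tasks on 'w' led to a hang.
is_risky() {
  [ "$(tasks_on_host "$1" w)" -gt 2 ]
}

# Classify the four placements mentioned in the question.
for p in 01www 1wwww 123ww 0321w; do
  if is_risky "$p"; then
    echo "$p: risky"
  else
    echo "$p: ok"
  fi
done
```

This only labels a placement after the fact; it does not influence where the scheduler puts the tasks.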

My questions are: 1) what program decides the task-to-host assignment, and 2) how do I control it?

Thanks very much!

Versions

  • CDH: 4.7.3
  • Giraph: Compiled as "giraph-1.0.0-for-hadoop-2.0.0-alpha" (CHANGELOG starts with: Release 1.0.0 - 2013-04-15)
  • Zookeeper Client environment: zookeeper.version=3.4.5-cdh4.4.0--1, built on 09/04/2013 01:46 GMT

Giraph command

hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=wright.cs.umass.edu:2181 \
-libjars ${LIBJARS} \
relpath.RelPathVertex \
-wc relpath.RelPathWorkerContext \
-mc relpath.RelPathMasterCompute \
-vif relpath.JsonAdjacencyListVertexInputFormat \
-vip $REL_PATH_INPUT \
-of relpath.JsonAdjacencyListTextOutputFormat \
-op $REL_PATH_OUTPUT \
-ca RelPathVertex.path=$REL_PATH_PATH \
-w 4
Matthew Cornell
  • Hi - So 'w' is the head node and it's not a compute node, right? Can you tell me why you are running a tasktracker there? – SSaikia_JtheRocker Sep 30 '14 at 17:46
  • @SSaikia_JtheRocker Yes, 'w' is the head. In Cloudera we have it set up running 25 roles including JobTracker and TaskTracker, where 0-3 are only running TaskTrackers. Is there a better way to set this up? When we bought the cluster we opted for a 'big' head node to run postgres on, which I believe is a different use of the head from other clusters where the head is thin compared to compute nodes. Any advice would be great. – Matthew Cornell Sep 30 '14 at 21:52
  • I can understand that 'w' is a fairly powerful node (used for postgres and so on, as you mentioned), but I think it won't help performance if you don't intend to use it as a datanode and still run a tasktracker there. Check whether your cluster performs better if you remove the tasktracker service from it (might be a tasktracker(w) to datanode(0/1/2/3) communication issue? Not sure though). Anyway, it's just my thought as to why your '1wwww' hangs and '0321w' succeeds. – SSaikia_JtheRocker Oct 03 '14 at 18:51
  • I'll try removing "w"'s tasktracker - thanks for the idea. – Matthew Cornell Oct 06 '14 at 13:37

0 Answers