
I'm using an EC2 Hadoop cluster made up of 20 c3.8xlarge machines, each with 60 GB of RAM and 32 virtual CPUs. On every machine I configured the YARN and MapReduce settings as documented at https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html, i.e. as shown below:

c3.8xlarge

Configuration Option                    Default Value
mapreduce.map.java.opts                 -Xmx1331m
mapreduce.reduce.java.opts              -Xmx2662m
mapreduce.map.memory.mb                 1664
mapreduce.reduce.memory.mb              3328
yarn.app.mapreduce.am.resource.mb       3328
yarn.scheduler.minimum-allocation-mb    32
yarn.scheduler.maximum-allocation-mb    53248
yarn.nodemanager.resource.memory-mb     53248

Now, what criteria should I use to determine the most appropriate number of Giraph workers, i.e. what value should I pass to the -w argument? Are those criteria related to the settings above?

Francesco Sclano

1 Answer


There's no single optimal number, but you can roughly calculate the maximum number of parallel workers as follows (see the worked sketch after these steps).

Every NodeManager has 53248 MB available; multiply that by your slave node count.

Subtract one yarn.app.mapreduce.am.resource.mb amount from that, since the job needs an ApplicationMaster.

Then divide that by the larger of your mapper or reducer memory to get the total number of MapReduce tasks that can run at once.
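
For the numbers in the question, a minimal sketch of that arithmetic in Java (assuming all 20 nodes run a NodeManager with the settings shown) would be:

    // Rough upper bound on concurrent YARN containers, using the
    // c3.8xlarge defaults and the 20-node count from the question.
    public class MaxParallelTasks {
        public static void main(String[] args) {
            int nodeManagerMb = 53248; // yarn.nodemanager.resource.memory-mb
            int slaveNodes    = 20;    // cluster size from the question
            int amMb          = 3328;  // yarn.app.mapreduce.am.resource.mb
            int mapMb         = 1664;  // mapreduce.map.memory.mb
            int reduceMb      = 3328;  // mapreduce.reduce.memory.mb

            // Total memory YARN can allocate across the cluster.
            long clusterMb = (long) nodeManagerMb * slaveNodes;      // 1,064,960 MB

            // Reserve one container for the job's ApplicationMaster.
            long availableMb = clusterMb - amMb;                     // 1,061,632 MB

            // Divide by the larger container size for a safe bound.
            long maxTasks = availableMb / Math.max(mapMb, reduceMb); // 319

            System.out.println("Max parallel tasks: " + maxTasks);
        }
    }

With these defaults that works out to (53248 * 20 - 3328) / 3328 = 319 concurrent tasks, which is effectively the ceiling for the -w argument.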

OneCricketeer