
I am running a sample Hadoop job over ~500 documents on S3. When run locally it takes under 15 minutes to complete, but on EMR the same job ran for over two hours and still hadn't finished the reduce step, so I terminated it. Is there a particular reason a MapReduce job would take so much longer on EMR?

Also, along the same lines, what is the best way to profile an EMR job to see where the bottleneck is? I can't seem to get the log files from the reducers until they complete, and they are taking far too long to complete.

Jin
    You can name a bucket to put the EMR logs in; that also lets you check them after you kill the cluster. – Guy May 03 '13 at 10:58
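
A minimal sketch of that suggestion, assuming the cluster is launched with the AWS SDK for Java rather than the console: the log bucket is the LogUri you pass when the job flow is created, and the logs stay in S3 after the cluster is terminated. The bucket name, credentials, and instance settings below are placeholders.

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

    public class LaunchWithLogs {
        public static void main(String[] args) {
            // Placeholder credentials; use your own keys or a credentials provider.
            AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

            RunJobFlowRequest request = new RunJobFlowRequest()
                    .withName("sample-job")
                    // Step and task logs are pushed to this bucket periodically
                    // and remain there after the cluster is terminated.
                    .withLogUri("s3://my-log-bucket/emr-logs/")   // placeholder bucket
                    .withInstances(new JobFlowInstancesConfig()
                            .withInstanceCount(3)
                            .withMasterInstanceType("m1.large")
                            .withSlaveInstanceType("m1.large")
                            .withKeepJobFlowAliveWhenNoSteps(false));

            RunJobFlowResult result = emr.runJobFlow(request);
            System.out.println("Started job flow " + result.getJobFlowId());
        }
    }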

1 Answer


From my experience with AWS EMR, I've found that the memory settings play a large role in performance: how much you allocate to the map and reduce tasks, the overall RAM available to the job, and the heap size configured for the child JVMs. The link below contains some of that information, and a Google search should turn up the rest.

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration.html
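For example, here is a minimal sketch (not taken from that page) of overriding a few of the Hadoop 1.x mapred.* memory properties on the job configuration before submitting it; the property names are the classic ones EMR documented at the time, and the values are placeholders you would tune for your instance type.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TunedJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Heap given to each map/reduce child JVM (placeholder value).
            conf.set("mapred.child.java.opts", "-Xmx1024m");

            // Buffer used for sorting map output, in MB (placeholder value;
            // must fit inside the child heap above).
            conf.setInt("io.sort.mb", 200);

            // Reuse JVMs across tasks to cut per-task startup cost (-1 = no limit).
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

            Job job = new Job(conf, "tuned-sample-job");
            // ... set mapper, reducer, and input/output paths as in the original job ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Cluster-wide limits such as mapred.tasktracker.map.tasks.maximum (and its reduce counterpart) are TaskTracker settings, so on EMR they have to be set when the cluster is launched, for example via the configure-hadoop bootstrap action, rather than per job.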

Saul