I've prepared a streaming boto job flow on AWS EMR that runs perfectly well locally using the familiar test pipe:

 sed -n '0~10000p'  Big.csv | ./map.py | sort -t$'\t' -k1 | ./reduce.py

The boto EMR job also runs well as I increase the size of the input data, up to some threshold at which jobs fail with a Python broken pipe error:

 Traceback (most recent call last):
   File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201504151813_0001/attempt_201504151813_0001_r_000002_0/work/./reduce.py", line 18, in <module>
     json.dump( { "cid":cur_key , "promo_hx":kc } , sys.stdout )
   File "/usr/lib/python2.6/json/__init__.py", line 181, in dump
     fp.write(chunk)
 IOError: [Errno 32] Broken pipe

and the following Java error:

  org.apache.hadoop.streaming.PipeMapRed (Thread-38): java.lang.OutOfMemoryError: Java heap space

I'm assuming the memory error occurs first, leading to the broken pipe.

Mapping tasks all complete for any input data size; the error occurs at the reducer stage. My reducer is the usual streaming reducer (I am using AMI 3.2.3 with the json package built into Python 2.6.9):

 for line in sys.stdin:
     line = line.strip()
     key, value = line.split('\t')
     ...
     print json.dumps({"cid": cur_key, "promo_hx": kc}, sort_keys=True, separators=(',', ': '))

Any idea what is going on? Thanks.

user2105469

1 Answer

It appears you need to increase the memory available to the reducers. You can do this by choosing a larger instance type (see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html for the defaults by instance type) or by adjusting the mapreduce.reduce.* properties, either at the job level or at the cluster level (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop).
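
For the cluster-level route, a rough boto sketch follows. It is only an illustration under assumptions: it uses the legacy boto 2.x EMR API and the predefined configure-hadoop bootstrap action for AMI 3.x, and every S3 path, key pair name, instance count, and memory value below is a made-up placeholder, not a tuned setting.

 # Sketch: raise reducer memory cluster-wide via the configure-hadoop
 # bootstrap action (-m writes key=value pairs into mapred-site.xml).
 from boto.emr import connect_to_region
 from boto.emr.bootstrap_action import BootstrapAction
 from boto.emr.step import StreamingStep

 conn = connect_to_region('us-east-1')

 bigger_reducers = BootstrapAction(
     'Increase reducer memory',
     's3://elasticmapreduce/bootstrap-actions/configure-hadoop',
     ['-m', 'mapreduce.reduce.memory.mb=2048',
      '-m', 'mapreduce.reduce.java.opts=-Xmx1638m'],
 )

 step = StreamingStep(
     name='promo history aggregation',
     mapper='s3://mybucket/scripts/map.py',      # placeholder paths
     reducer='s3://mybucket/scripts/reduce.py',
     input='s3://mybucket/input/Big.csv',
     output='s3://mybucket/output/',
 )

 jobid = conn.run_jobflow(
     name='streaming job with larger reducers',
     ami_version='3.2.3',
     ec2_keyname='my-key-pair',                  # placeholder key pair
     master_instance_type='m1.medium',
     slave_instance_type='m1.medium',
     num_instances=3,
     bootstrap_actions=[bigger_reducers],
     steps=[step],
     log_uri='s3://mybucket/logs/',
 )

A bootstrap action like this changes the defaults for every job run on the cluster; keep mapreduce.reduce.memory.mb within the memory YARN allots per node on the chosen instance type, and leave the -Xmx heap in java.opts somewhat below the container size.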

ChristopherB
  • This sounds like the right solution path. I'm looking into how to change `mapreduce.reduce.memory.mb` from within `boto` (a job-level sketch follows these comments). Otherwise the solution is to select the next larger EC2 instance type. Right now I'm using m1.medium, and this should really get the job done. I'll follow up. – user2105469 Apr 16 '15 at 15:00
  • Accessing hadoop parameters within boto: http://stackoverflow.com/questions/12071436/setting-hadoop-parameters-with-boto – user2105469 Apr 16 '15 at 15:12
  • The link below also helped: Hadoop parameters are passed via a bootstrap action, and it's important to use the right `BootstrapAction` (a class from `boto.emr.bootstrap_action`): https://groups.google.com/forum/#!topic/boto-users/cEUJvertqWY – user2105469 Apr 16 '15 at 15:41
  • List of Hadoop task parameters (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html) and AWS's own task configuration parameters (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html) – user2105469 Apr 16 '15 at 15:43
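
Following up on the comment about changing `mapreduce.reduce.memory.mb` from within boto for a single job: a job-level variant of the same idea is sketched below, passing hadoop-streaming's generic -D options through StreamingStep's step_args. This is an untested sketch with placeholder paths and values; hadoop-streaming requires -D options to precede the streaming-specific options, so check that the boto version in use emits step_args before them.

 # Sketch: per-job override of the reducer memory properties via -D generic
 # options; S3 paths and memory values are placeholders.
 from boto.emr.step import StreamingStep

 step = StreamingStep(
     name='promo history aggregation (bigger reducers)',
     mapper='s3://mybucket/scripts/map.py',
     reducer='s3://mybucket/scripts/reduce.py',
     input='s3://mybucket/input/Big.csv',
     output='s3://mybucket/output/',
     step_args=[
         '-D', 'mapreduce.reduce.memory.mb=2048',
         '-D', 'mapreduce.reduce.java.opts=-Xmx1638m',
     ],
 )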