I'm trying to do some data analysis on Amazon Elastic MapReduce. The mapper step is a Python script that calls a compiled C++ binary, "./formatData". For example:

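# myMapper.py
import sys
from subprocess import Popen, PIPE

inputData = sys.stdin.readline()
# ...
# universal_newlines=True makes the pipes carry text rather than bytes
p1 = Popen('./formatData', stdin=PIPE, stdout=PIPE, universal_newlines=True)
# communicate() returns a (stdout, stderr) tuple
p1Output, _ = p1.communicate(input=inputData)
result = ... # manipulate the formatted data
print("%s\t%s" % (result, 1))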

Can I call a binary executable like this on Amazon EMR? If so, where should I store the binary (in S3?), what platform should I compile it for, and how do I ensure my mapper script has access to it? Ideally it would be in the current working directory.

Thanks!

tba

1 Answer

Yes, you can call the binary that way, provided it gets copied to the worker nodes correctly.

This thread explains how to use the distributed cache to make binary files accessible on the worker nodes:

https://forums.aws.amazon.com/thread.jspa?threadID=35158
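
On the mapper side, a minimal sketch might look like the following. It assumes the distributed cache has already placed the binary in the task's working directory (for Hadoop streaming this is typically requested with an option of the form `-cacheFile <s3-or-hdfs-uri>#formatData`; the exact URI is up to you). The helper name `format_with_binary` is hypothetical, and the `chmod` step is a precaution on the assumption that the copied file may arrive without its execute bit.

```python
import os
import stat
import subprocess


def format_with_binary(binary_path, input_data):
    """Pipe input_data (bytes) through the executable at binary_path.

    Returns the bytes the executable wrote to stdout.
    """
    # Assumption: files shipped to worker nodes may lack the execute bit,
    # so add it for the owner when it is missing.
    if not os.access(binary_path, os.X_OK):
        mode = os.stat(binary_path).st_mode
        os.chmod(binary_path, mode | stat.S_IXUSR)

    proc = subprocess.Popen(
        [binary_path], stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    # communicate() returns a (stdout, stderr) tuple; stderr is None here
    # because it was not redirected.
    stdout, _ = proc.communicate(input=input_data)
    if proc.returncode != 0:
        raise RuntimeError(
            "%s exited with status %d" % (binary_path, proc.returncode)
        )
    return stdout
```

In the mapper this could be called as `format_with_binary('./formatData', inputData)`, with `inputData` read from stdin as bytes. Raising on a non-zero exit status makes a missing or incompatible binary fail the task visibly instead of silently emitting empty output.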

gwt
  • Thanks. Also, what platform should I compile for? – tba Feb 07 '12 at 01:23
  • Linux (Elastic MapReduce runs on Amazon Linux). – gwt Feb 07 '12 at 01:52
  • Building on any old Linux box won't work, so I found this guide to building on EMR: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/usingemr_buildingmodules.html – tba Feb 07 '12 at 05:10