
I have two Python scripts that are intended to run on Amazon Elastic MapReduce - one as a mapper and one as a reducer. I've recently expanded the mapper script to require a couple more local modules that I've created, both of which live in a package called SentimentAnalysis. What's the right way to have a Python script import from a local Python package on S3? I tried creating S3 keys that mimic my file system in hopes that the relative paths would work, but that didn't help. Here's what I see in the log files on S3 after the step failed:

Traceback (most recent call last):
File "/mnt/var/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201407250000_0001/attempt_201407250000_0001_m_000000_0/work/./sa_mapper.py", line 15, in <module>
from SentimentAnalysis import NB, LR
ImportError: No module named SentimentAnalysis

The relevant file structure is like this:

sa_mapper.py
sa_reducer.py
SentimentAnalysis/NB.py
SentimentAnalysis/LR.py

And sa_mapper.py has:

from SentimentAnalysis import NB, LR

I tried to mirror the file structure in S3, but that doesn't seem to work.

What's the best way to setup S3 or EMR so that sa_mapper.py can import NB.py and LR.py? Is there some special trick to doing this?

Sean Azlin

2 Answers


Do you have an __init__.py file in the SentimentAnalysis folder?
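
For reference, this is the layout that makes the import resolvable locally (assuming an empty __init__.py is all that's needed here; the file name is the standard Python package marker, not something specific to this project):

sa_mapper.py
sa_reducer.py
SentimentAnalysis/__init__.py
SentimentAnalysis/NB.py
SentimentAnalysis/LR.py

An empty __init__.py marks the directory as a package, so from SentimentAnalysis import NB, LR can resolve when the directory sits next to the mapper.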

canufeel

What is the command that you are running?
The only way to do it is to pass additional fields when you submit the step. For example, if you are using the boto package to run tasks on EMR, there is the class StreamingStep.

In it you have these parameters (in version 2.43):

cache_files (list(str)) – A list of cache files to be bundled with the job
cache_archives (list(str)) – A list of jar archives to be bundled with the job

This means you need to pass the paths of the files or folders that you want copied from S3 onto your cluster. The syntax is:

s3://{s3 bucket path}/EMR_Config.py#EMR_Config.py

where the hashtag is the separator: the part before the (#) is the location in S3, and the part after is the name the file will have on the cluster. It ends up in the same directory as the task you are running.
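
As a rough sketch (the bucket name and paths below are placeholders, and the rest is an ordinary boto 2.x streaming step):

from boto.emr.step import StreamingStep

step = StreamingStep(
    name='Sentiment analysis streaming step',
    mapper='s3://my-bucket/sa_mapper.py',
    reducer='s3://my-bucket/sa_reducer.py',
    input='s3://my-bucket/input/',
    output='s3://my-bucket/output/',
    # ship the extra module from S3 into the task's working directory
    cache_files=['s3://my-bucket/EMR_Config.py#EMR_Config.py'],
)

You would then pass this step to run_jobflow or add_jobflow_steps on your EMR connection as usual.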

Once you have them in your cluster, you can't simply do a regular import. What worked is:

import sys

# we added a file named EMR_Config.py via cache_files
sys.path.append(".")

# load the module this way because of the EMR file system
module_name = 'EMR_Config'
__import__(module_name)
Config = sys.modules[module_name]

# now you can access the methods in the file, for example:
topic_name = Config.clean_key(row.get("Topic"))
ohad edelstain