
I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job.
However, my script doesn't seem to be able to find the modules in the cache. I archived the files into a tarball called helper_classes.tar and uploaded the tarball to an Amazon S3 bucket. When creating my MapReduce job on the console, I specified the argument as:

cacheArchive s3://folder1/folder2/helper_classes.tar#helper_classes

At the beginning of my Python mapper script, I included the following code to import the library:

import sys
sys.path.append('./helper_classes')
import geoip.database

When I run the MapReduce job, it fails with an ImportError: No module named geoip.database. (geoip is a folder in the top level of helper_classes.tar and database is the module I'm trying to import.)
Any ideas what I could be doing wrong?
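In case it helps to narrow things down, this is roughly the debugging code I'd put at the top of the mapper to confirm the archive was actually extracted into the working directory (a minimal sketch; the `helper_classes` name matches the fragment after `#` in the cacheArchive argument, and the stderr writes are only there to show up in the task logs):

```python
import os
import sys

# The cacheArchive fragment '#helper_classes' should produce a directory
# with that name in the task's working directory.
archive_dir = os.path.join(os.getcwd(), 'helper_classes')

# Dump what the task actually sees, so it appears in the stderr logs.
sys.stderr.write('cwd contents: %s\n' % os.listdir('.'))
if os.path.isdir(archive_dir):
    sys.stderr.write('archive contents: %s\n' % os.listdir(archive_dir))

sys.path.append(archive_dir)

try:
    import geoip.database
except ImportError as e:
    # Note: a missing geoip/__init__.py inside the tarball would also
    # produce "No module named geoip.database" even if the path is right.
    sys.stderr.write('import failed: %s\n' % e)
```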

Jerry YY Rain
user296554

1 Answer


This might be late to the topic.

The reason is that the geoip.database module is not installed on all of the Hadoop nodes. You can either avoid uncommon imports in your map/reduce code, or install the needed modules on every Hadoop node.
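If you take the second route but want tasks to degrade gracefully on nodes where the library hasn't landed yet, a minimal sketch of a guarded import (`have_module` and `HAVE_GEOIP` are hypothetical names for illustration, not part of any library):

```python
import importlib

def have_module(name):
    """Return True if the named module can be imported on this node."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

# Check once at startup; the mapper can then skip the GeoIP lookup step
# on nodes where the library is missing instead of failing the whole task.
HAVE_GEOIP = have_module('geoip.database')
```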

John Knight