I've created a Python UDF to convert datetimes into different timezones. The script uses pytz, which doesn't ship with Python (or jython). I've tried a few things:

  1. Bootstrapping PIG to install its own jython and including pytz in that jython installation. I can't get PIG to use the newly installed jython; it keeps reverting to Amazon's jython.
  2. Setting PYTHONPATH to a local directory where the new modules have been installed
  3. Setting HADOOP_CLASSPATH/PIG_CLASSPATH to the new installation of jython

Each of these ends up with "ImportError: No module named pytz" when I try to load the UDF script. The script loads fine if I remove pytz, so it's definitely the external module that's causing the problem.

Edit: Originally put this as a comment but I thought I'd just make it an edit:

I've tried every way I know of to get PIG to recognize another jython jar. That hasn't worked. Amazon's jython is here: /home/hadoop/.versions/pig-0.9.2/lib/pig/jython.jar, which is recognizing this sys.path: /home/hadoop/lib/Lib. I can't figure out how to build external libraries against this jar.

John Rotenstein
Bob Briski
  • http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job/6811775#6811775 may help you (they are trying to load a different module, but the method should be the same) – Chris White Jun 05 '12 at 10:43
  • Or http://stackoverflow.com/questions/8129543/hadoop-streaming-importing-modules-on-emr – Chris White Jun 05 '12 at 10:44
  • Yes, I've tried to bootstrap the package to each slave. It worked but the problem is that I can't get PIG to use the jython jar that I've installed. Instead it always picks Amazon's jython jar which doesn't have any external libraries installed. – Bob Briski Jun 06 '12 at 19:07
  • I guess the runtime resolved classpath has their jython jar ahead of yours - are you able to amend the hadoop-env.sh file? (i haven't worked with EMR, sorry) – Chris White Jun 06 '12 at 19:09
  • I haven't tried that yet but I have directly assigned the HADOOP_CLASSPATH and PIG_CLASSPATH on the line calling the pig executable like so: http://stackoverflow.com/questions/9300509/ – Bob Briski Jun 06 '12 at 20:06
  • what does `hadoop classpath` show as the order? Also i had it wrong, you probably need to amend the hadoop script in the bin folder rather than hadoop-env.sh in the conf folder – Chris White Jun 06 '12 at 20:13
  • To avoid further clutter of the comment chain: https://raw.github.com/gist/2884719/ba2aceb88e9049c6f49454fba991589e4c49654b/gistfile1.txt – Bob Briski Jun 06 '12 at 20:58
  • Well jython isn't in that list, is it somewhere in with the JRE: `/usr/lib/jvm/java-6-sun/`? – Chris White Jun 06 '12 at 23:39
  • is there a version of the jython jar in PIG_HOME/lib? – Chris White Jun 07 '12 at 00:00

1 Answer

Could you manually modify sys.path inside your jython script?
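
A minimal sketch of that idea, assuming pytz has already been copied to a plain directory on every node. The path `/home/hadoop/pylibs` and the helper `add_module_dir` are hypothetical names used for illustration, not anything from the question. The directory is put on sys.path at the top of the UDF script, before the import that currently fails:

```python
import sys

# Hypothetical directory where pytz was unpacked on every node
# (for example by a bootstrap action); adjust to your cluster.
EXTRA_MODULE_DIR = "/home/hadoop/pylibs"


def add_module_dir(path, search_path=None):
    """Put `path` at the front of the module search path if it is missing.

    Returns True if the path was added, False if it was already present.
    """
    if search_path is None:
        search_path = sys.path
    if path in search_path:
        return False
    search_path.insert(0, path)
    return True


add_module_dir(EXTRA_MODULE_DIR)

# The import must come after the sys.path change. It is guarded here only so
# that this sketch still loads when pytz is absent from that directory.
try:
    import pytz
except ImportError:
    pytz = None
```

pytz is a pure-Python package, so in principle it does not need to be built against a particular jython jar; it only has to sit in a directory that the interpreter actually searches.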

mhawthorne
  • I can try but I think the jython jar is loaded by Hadoop. Since I'd be switching the path to a library that wasn't built against the loaded jython jar, I'm not sure if that would work. I'll try it out though. – Bob Briski Jun 06 '12 at 21:32
  • I appended the new jython jar path to the sys.path list but it still complains that it can't find the module. – Bob Briski Jun 06 '12 at 21:48
  • are you able to print sys.path from your UDF script? I wonder if you hack PYTHONPATH to include a directory containing pytz, if that directory actually makes it into sys.path, or if hadoop is overwriting it. if it's being overwritten, then maybe you can manually add a directory to sys.path from your script, instead of adding your special jython jar. – mhawthorne Jun 07 '12 at 00:50
  • This is promising. I found this path (/home/hadoop/.versions/0.20.205/libexec/../share/hadoop/lib/jython.jar/Lib) when printing sys.path from the loaded UDF. I'm going to try and bootstrap against this jython. – Bob Briski Jun 07 '12 at 14:44
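
The diagnostic suggested in the comments above could look roughly like this. It is only a sketch: the helper name `find_module_dirs` is hypothetical, and it checks plain filesystem directories only, so an entry that points inside a jar (such as the `jython.jar/Lib` path above) will not be reported as containing pytz even if the jar does bundle it.

```python
import os
import sys


def find_module_dirs(module_name, search_path=None):
    """Return the search-path entries that contain a `module_name` directory.

    Only real directories on disk are checked; entries inside jar or zip
    files are skipped because os.path.isdir() reports False for them.
    """
    if search_path is None:
        search_path = sys.path
    return [entry for entry in search_path
            if os.path.isdir(os.path.join(entry, module_name))]


# Write to stderr so the output does not get mixed into the job's data.
for entry in sys.path:
    sys.stderr.write("sys.path entry: %s\n" % entry)

sys.stderr.write("entries containing pytz: %r\n" % (find_module_dirs("pytz"),))
```

If the directory set through PYTHONPATH never appears in this output, the variable is not reaching the interpreter that runs the UDF, and adding the directory to sys.path from inside the script is the more direct route.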