
I am writing a Python UDF for Hadoop/Pig and need to use some Python libraries, such as "requests", which I installed locally with pip while testing the UDF on my own machine. How can I deploy the pip package on the Hadoop cluster so that it is available automatically, no matter which node my Python UDF runs on?

jonrsharpe
Lin Ma
  • This question is close to http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job, which provides methods for distributing packages with jobs. However, common sense dictates starting with a uniform installation of Python on all nodes, including commonly required packages. – Aug 27 '15 at 19:28
  • @TrisNefzger, thanks for sharing; I would like to try the zipimport option. I want to check that my usage is correct. For example, to import the requests package (https://pypi.python.org/pypi/requests#downloads), should I download the requests-2.7.0.tar.gz source package and zipimport the file "requests-2.7.0.tar.gz"? Thanks. – Lin Ma Aug 27 '15 at 22:34
  • From looking at https://docs.python.org/2/library/zipimport.html and at my sys.path, which includes 'C:\\Anaconda3\\envs\\python2\\python27.zip', I think you need a zip file, not a tar.gz file. I suggest experimenting with zipimport first to see how it works. – Aug 27 '15 at 22:57
  • @TrisNefzger, thanks for the information. Which zip file do you mean? In my case, the official requests site does not offer a zip download. :) – Lin Ma Aug 27 '15 at 22:59

1 Answer


Information about the zip file format can be found at Zip (file format). Practically speaking, it is a compressed archive format, roughly like tar (an archive format) combined with gzip (a file compression format). The Java jar (Java ARchive) format is compatible with zip.

On Linux and Unix platforms, a directory dir can be zipped with 'zip -r dir dir' to create a dir.zip file. On Windows, 7-Zip is a convenient tool for creating and unpacking zip files, and it can also unpack and browse other compression and archive formats, including tar and gzip.

Given a file dir.tar.gz, it can be unpacked and zipped interactively with the 7-Zip GUI on Windows, while on Linux and Unix systems the following commands do the same thing:

tar zxf dir.tar.gz # creates directory dir by extraction and decompression
zip -r dir dir # creates dir.zip by bundling without removing dir
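
For anyone who prefers to script this step in Python rather than the shell, the two commands above can be sketched with the standard library's tarfile and shutil modules. This is only an illustration: the function names targz_to_zip and make_sample_targz, and the sample hello.txt file, are made up for the example and are not part of any package discussed here.

```python
import os
import shutil
import tarfile
import tempfile
import zipfile


def targz_to_zip(targz_path, dir_name, work_dir):
    """Extract targz_path into work_dir, then bundle work_dir/dir_name as dir_name.zip."""
    # Equivalent of: tar zxf dir.tar.gz
    with tarfile.open(targz_path, "r:gz") as tar:
        tar.extractall(work_dir)  # only extract archives you trust
    # Equivalent of: zip -r dir dir (the extracted directory is kept)
    return shutil.make_archive(
        os.path.join(work_dir, dir_name), "zip", root_dir=work_dir, base_dir=dir_name
    )


def make_sample_targz(tmp, dir_name="dir"):
    """Build a small dir.tar.gz to try the function on (illustrative data only)."""
    src = os.path.join(tmp, "src", dir_name)
    os.makedirs(src)
    with open(os.path.join(src, "hello.txt"), "w") as fh:
        fh.write("hello\n")
    targz = os.path.join(tmp, dir_name + ".tar.gz")
    with tarfile.open(targz, "w:gz") as tar:
        tar.add(src, arcname=dir_name)
    return targz


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "out")
        os.makedirs(out)
        zip_path = targz_to_zip(make_sample_targz(tmp), "dir", out)
        with zipfile.ZipFile(zip_path) as zf:
            print(zf.namelist())
```

Passing root_dir and base_dir keeps dir/ as the top-level entry inside the archive, which matches what 'zip -r dir dir' produces.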
  • Thanks for the details. What I mean is: which tar.gz should be extracted in my case, for the requests pip package? Is it the source code package, requests-2.7.0.tar.gz (https://pypi.python.org/pypi/requests#downloads)? – Lin Ma Aug 27 '15 at 23:34
  • If you download source packages, they probably need to be built and are not usable as is. The usable packages should be somewhere in your Python installation directory, typically in Lib/site-packages for packages installed with tools like pip after the main Python installation. For example, my importable requests package is at C:\Anaconda3\Lib\site-packages\requests. – Aug 27 '15 at 23:40
  • Thanks Tris. What do you mean by build? requests should be implemented in pure Python, so no compile/build step should be needed? Thanks. – Lin Ma Aug 29 '15 at 05:44
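
To experiment with the zipimport route discussed in the comments, a minimal sketch follows. It zips an installed-style package directory so that the package sits at the root of the archive, then imports it by putting the zip file on sys.path. Everything named here is hypothetical: demo_pkg is a throwaway pure-Python package created for the example, and deps.zip is an arbitrary file name. Shipping the zip to the cluster nodes is not shown, and whether a real package such as requests works from a zip is not tested here; packages that read data files from disk may not.

```python
import os
import sys
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Write package_dir into zip_path so the package sits at the archive root."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the parent, e.g. demo_pkg/__init__.py
                zf.write(full, os.path.relpath(full, parent))
    return zip_path


def make_demo_package(tmp):
    """Create a tiny pure-Python package (hypothetical stand-in for a real one)."""
    pkg = os.path.join(tmp, "site-packages", "demo_pkg")
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as fh:
        fh.write("def greet():\n    return 'hello from zip'\n")
    return pkg


def import_demo_from_zip():
    """Zip the demo package and import it through the zip file on sys.path."""
    tmp = tempfile.mkdtemp()
    zip_path = zip_package(make_demo_package(tmp), os.path.join(tmp, "deps.zip"))
    # Python's zipimport machinery handles .zip entries on sys.path.
    sys.path.insert(0, zip_path)
    import demo_pkg
    return demo_pkg


if __name__ == "__main__":
    module = import_demo_from_zip()
    print(module.greet(), module.__file__)
```

The directory being zipped is the importable package directory (the kind found under site-packages), not the downloaded source tarball, which is the distinction the comments above are getting at.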