
Does anyone know of a way to embed Pig in a CPython script, similar to what is available for an RDBMS? I searched, but had no luck.

I'd rather not use Jython because I'm trying to work with the data using various CPython libraries that aren't available in Jython.

dave
  • It appears that you can run Python scripts containing Pig commands with Pig: http://techblug.wordpress.com/2011/07/29/pagerank-implementation-in-pig/ – Nick ODell Aug 22 '12 at 16:53
  • @NickODell, thanks for the pointer. Unfortunately that uses Jython, which doesn't let me use the CPython libraries (scipy, etc.). – dave Aug 22 '12 at 18:30

3 Answers

Jython seems to be the most popular option, as shown here, here and here, and you might also find this thread helpful, although it too focuses on Jython. The effort around Python UDFs is decidedly centered on Jython, so unless you absolutely require CPython libraries, you may want to bite the bullet and go with that instead. Another point to consider is that Jython is approaching a mature 2.7 release (source), although that may not be practical for your needs.
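
For reference, a Jython UDF registered with Pig looks roughly like the sketch below (the function name and schema are placeholders, not taken from the linked posts); the catch is that only Jython-compatible code can run inside it, so CPython-only libraries such as scipy are out.

# to_upper.py -- a Jython UDF; Pig makes the outputSchema decorator
# available to scripts registered with "using jython".
@outputSchema('word:chararray')
def to_upper(word):
    # Plain Python, but executed by Jython inside the Pig process.
    return word.upper() if word is not None else None

# In the Pig script:
#   REGISTER 'to_upper.py' USING jython AS myfuncs;
#   b = FOREACH a GENERATE myfuncs.to_upper(word);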

jaypb

Support for CPython was recently added in Pig 0.12: http://blog.mortardata.com/post/62334142398/hadoop-python-pig-trunk
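
If I'm reading that post correctly, the new support lets you write UDFs in regular CPython and register them with streaming_python; a rough sketch (the UDF name and schema here are made up for illustration):

# scale.py -- a CPython UDF for Pig 0.12+; it runs in an external CPython
# process, so libraries like scipy or numpy can be imported normally.
from pig_util import outputSchema

@outputSchema('result:double')
def scale(value):
    return float(value) * 2.0

# In the Pig script:
#   REGISTER 'scale.py' USING streaming_python AS my_udfs;
#   b = FOREACH a GENERATE my_udfs.scale(x);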

Ian Stevens

If by "similar to what is available for RDBMS" you mean an API, you could build out an object model using subprocess. I have used something like the following in the past.

import subprocess

def execute(command):
    # Echo the command, run it through the shell, and capture both streams.
    print command
    p = subprocess.Popen(command, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, shell=True)
    stdout, stderr = p.communicate()
    print stdout
    if stderr:
        print stderr
    return p.returncode

# input and output are the HDFS paths substituted into the Pig script's
# $input and $output parameters.
command = "pig.9 -p input=" + input + "/* -p output=" + output + " -f my.pig"
execute(command)
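
If you want something with more of an API feel, the execute helper can be wrapped in a small class; this is just a sketch (PigJob and its parameters are names I made up, not an existing library):

class PigJob(object):
    """Minimal wrapper that builds and runs a Pig command line."""

    def __init__(self, script, params=None, pig_path="pig"):
        self.script = script        # path to the .pig script
        self.params = params or {}  # passed to Pig as -p name=value
        self.pig_path = pig_path    # pig executable to invoke

    def command(self):
        parts = [self.pig_path]
        for name, value in self.params.items():
            parts.append("-p %s=%s" % (name, value))
        parts.append("-f " + self.script)
        return " ".join(parts)

    def run(self):
        return execute(self.command())

# job = PigJob("my.pig", {"input": "/data/in/*", "output": "/data/out"})
# job.run()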
wdckwrth