
Apologies if this question is poorly worded. I am embarking on a large-scale machine learning project and would rather not program in Java; I much prefer writing Python. I have heard good things about Pig. Could someone clarify how usable Pig is in combination with Python for mathematical work? Also, if I write "streaming Python code", does Jython come into the picture, and is it more efficient if it does?

Thanks

P.S. For several reasons I would prefer not to use Mahout's code as is. I might want to use a few of its data structures, though, so it would be useful to know whether that is possible.

dvk

3 Answers


Another option to use Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or using streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script as where the data processing pipeline is defined. Jython is used as the Python interpreter, and the MapReduce framework for the stream operations is Cascading. The joins, groupings, etc. work similarly to Pig in spirit, so there is no surprise there if you already know Pig.

A word counting example looks like this:

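# Assumption: PyCascading's helpers module exports the names used below
# (map, Flow, Hfs, TextLine, GroupBy, Count).
from pycascading.helpers import *

@map(produces=['word'])
def split_words(tuple):
    # This is called for each line of text; field 1 holds the line contents
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    output = flow.tsv_sink('output')

    # This is the processing pipeline
    input | split_words | GroupBy('word') | Count() | output

    flow.run()
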
Gabor Szabo

When you use streaming in Pig, it doesn't matter what language you use: all Pig does is execute a command in a shell (for example, via bash). You can use Python, just as you can use grep or a C program.
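
As a rough sketch of what such a streamed command could look like, assuming Pig hands each record to the script as one tab-separated line on stdin and reads tab-separated lines back from stdout: the script name `normalize.py` and the transform itself are hypothetical, chosen only for illustration. Because the script runs under whatever interpreter the command names, it can use ordinary CPython libraries.

```python
import sys


def normalize(fields):
    """Scale the numeric fields of one record so they sum to 1 (hypothetical transform)."""
    values = [float(f) for f in fields]
    total = sum(values)
    if total == 0:
        return values
    return [v / total for v in values]


def process(lines, out):
    """Read tab-separated records from `lines` and write transformed records to `out`."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty input lines
        fields = line.split("\t")
        out.write("\t".join(repr(v) for v in normalize(fields)) + "\n")


# Only consume stdin when input is actually piped in, as it is under streaming.
if __name__ == "__main__" and not sys.stdin.isatty():
    process(sys.stdin, sys.stdout)
```

On the Pig side this would be wired up with something along the lines of DEFINE norm `python normalize.py` SHIP('normalize.py'); followed by B = STREAM A THROUGH norm AS (x:double, y:double); where the alias and field names are again placeholders.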

You can now define Pig UDFs in Python natively. These UDFs will be called via Jython when they are being executed.
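
A minimal sketch of such a UDF, assuming Pig's Jython script engine makes the `outputSchema` decorator available when it loads the file: the file name `udfs.py` and the function `l2_norm` are hypothetical. The fallback definition exists only so the file can also be imported and tested outside Pig. Note that Jython cannot load C-extension modules such as NumPy, which is one practical difference from the streaming route.

```python
# udfs.py (hypothetical file name)

try:
    outputSchema  # assumed to be provided by Pig when the script is registered
except NameError:
    # Stand-in so the module also loads outside Pig; it only records the schema string.
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("norm:double")
def l2_norm(x, y):
    """Euclidean norm of two numeric fields; a null input yields a null output."""
    if x is None or y is None:
        return None
    return (float(x) ** 2 + float(y) ** 2) ** 0.5
```

In a Pig script this would be used roughly as REGISTER 'udfs.py' USING jython AS myudfs; and then B = FOREACH A GENERATE myudfs.l2_norm(x, y); with relation and field names as placeholders.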

Donald Miner
  • Are there any pros and cons of either approach? (Apart from the obvious differences between Jython and CPython that I would have to live with.) – dvk Jul 08 '11 at 17:15
  • I don't think you'll notice any significant slowdown using either. – Donald Miner Jul 08 '11 at 18:16
  • Apart from speed, would there be any design limitations between the two approaches? – dvk Jul 08 '11 at 23:35

The Programming Pig book discusses using UDFs, and it is indispensable in general. On a recent project, we used Python UDFs and occasionally had issues with Float vs. Double type mismatches, so be warned. My impression is that the support for Python UDFs may not be as solid as the support for Java UDFs, but overall it works pretty well.
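
The answer does not say what exactly went wrong. One plausible contributor, assumed here purely for illustration, is that a Python float is a 64-bit double while Pig's float type is 32 bits wide, so a value that passes through a 32-bit field no longer compares equal to the original. The helper `as_float32` is a hypothetical name used only to show the effect in plain Python:

```python
import struct


def as_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float and widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 0.1
narrowed = as_float32(value)

# 0.1 is not exactly representable at either width, and the two roundings differ.
print(value == narrowed)             # False
print(abs(value - narrowed) < 1e-8)  # True: the discrepancy is small but nonzero
```

Under that assumption, declaring UDF outputs as double and casting explicitly on the Pig side, rather than mixing the two widths, is one way to sidestep the discrepancy.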

Dean Wampler
  • I briefly read the section in the book on UDFs, and I am not really clear about something: why would one use UDFs if one could do Python/ and embed whatever libraries are needed as part of the code? Apologies if this question has a really obvious answer, but I have not really looked at streaming yet. – dvk Jul 08 '11 at 22:19