
I have user access to a Hadoop server/cluster containing data that is stored solely in partitioned Hive tables (Avro files). I was wondering whether I can run MapReduce jobs with Python's mrjob on these tables. So far I have been testing mrjob locally on text files stored on CDH5, and I am impressed by the ease of development.

After some research I discovered a library called HCatalog, but as far as I know it is only available for Java, not Python. Unfortunately, I do not have much time to learn Java and would like to stick to Python.

Do you know of any way to run mrjob on data stored in Hive?

If this is impossible, is there a way to stream Python-written MapReduce code to Hive? (I would rather not upload the MapReduce Python files to Hive.)

Tomasz Sosiński
  • mrjob doesn't currently work with Avro files. If you want to use mrjob, you could de-Avro the data first. Michael Noll has a good blog post on Avro Tools: http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/ – Alex Woolford Oct 14 '14 at 05:39
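The de-Avro step suggested in the comment above can be done with Avro Tools from the command line; a minimal sketch (the jar version and file names are assumptions, not from the original thread):

```shell
# Convert an Avro data file to newline-delimited JSON so that
# plain-text tooling such as mrjob can consume it.
# avro-tools-1.8.2.jar and part-00000.avro are placeholder names.
java -jar avro-tools-1.8.2.jar tojson part-00000.avro > part-00000.json
```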

1 Answer


As Alex stated, mrjob does not currently work with Avro-formatted files. However, there is a way to run Python code on Hive tables directly (no mrjob needed, though unfortunately with some loss of flexibility). Eventually I managed to add the Python file as a resource to Hive by executing "ADD FILE mapper.py" and running a SELECT with TRANSFORM ... USING ..., storing the mapper's results in a separate table. Example Hive query:

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
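For completeness, the mapper referenced by the query could look like this minimal sketch (the column handling and the tab delimiter are assumptions based on the query above; the original script in the linked example may differ):

```python
# weekday_mapper.py -- sketch of a Hive TRANSFORM mapper.
# Hive streams rows to stdin as tab-separated text and reads the
# transformed rows back from stdout.
import sys
from datetime import datetime, timezone


def transform_line(line):
    """Map one row (userid, movieid, rating, unixtime) to a row with
    the timestamp replaced by its ISO weekday (1 = Monday)."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    weekday = datetime.fromtimestamp(float(unixtime), tz=timezone.utc).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])


if __name__ == "__main__":
    for line in sys.stdin:
        print(transform_line(line))
```

Because the script only reads stdin and writes stdout, it can be tested locally with `cat sample.tsv | python weekday_mapper.py` before adding it to Hive.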

Full example is available here (at the bottom): link

Tomasz Sosiński