
I have user access to a Hadoop server/cluster containing data that is stored solely in partitioned Hive tables (Avro files). I was wondering whether I can run MapReduce jobs with Python's mrjob on these tables. So far I have been testing mrjob locally on text files stored on CDH5, and I am impressed by the ease of development.

After some research I discovered a library called HCatalog, but as far as I know it is only available for Java, not Python. Unfortunately, I do not have much time to learn Java and would like to stick to Python.

Do you know of any way to run mrjob on data stored in Hive?

If this is impossible, is there a way to stream Python-written MapReduce code to Hive? (I would rather not upload the MapReduce Python files to Hive.)

Tomasz Sosiński
  • mrjob doesn't currently work with Avro files. If you want to use mrjob, you could de-Avro the data first. Michael Noll has a good blog post on Avro Tools: http://www.michael-noll.com/blog/2013/03/17/reading-and-writing-avro-files-from-the-command-line/ – Alex Woolford Oct 14 '14 at 05:39
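The de-Avro step suggested in the comment above can be done with Avro Tools from the command line; a minimal sketch (the jar version and file names are assumptions, not from the original thread):

```shell
# Convert an Avro data file to newline-delimited JSON so that
# plain-text tooling such as mrjob can consume it.
# avro-tools-1.8.2.jar and part-00000.avro are placeholder names.
java -jar avro-tools-1.8.2.jar tojson part-00000.avro > part-00000.json
```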

1 Answer


As Alex stated, mrjob does not currently work with Avro-formatted files. However, there is a way to run Python code on Hive tables directly (no mrjob needed, though unfortunately with some loss of flexibility). Eventually I managed to add the Python file as a resource to Hive by executing "ADD FILE mapper.py" and running a SELECT with TRANSFORM ... USING ..., storing the mapper's results in a separate table. Example Hive query:

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
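For completeness, the mapper referenced by the query could look like this minimal sketch (the column handling and the tab delimiter are assumptions based on the query above; the original script in the linked example may differ):

```python
# weekday_mapper.py -- sketch of a Hive TRANSFORM mapper.
# Hive streams rows to stdin as tab-separated text and reads the
# transformed rows back from stdout.
import sys
from datetime import datetime, timezone


def transform_line(line):
    """Map one row (userid, movieid, rating, unixtime) to a row with
    the timestamp replaced by its ISO weekday (1 = Monday)."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    weekday = datetime.fromtimestamp(float(unixtime), tz=timezone.utc).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])


if __name__ == "__main__":
    for line in sys.stdin:
        print(transform_line(line))
```

Because the script only reads stdin and writes stdout, it can be tested locally with `cat sample.tsv | python weekday_mapper.py` before adding it to Hive.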

Full example is available here (at the bottom): link

Tomasz Sosiński