
I need to take avro files as input to a mrjob hadoop job. I can't find any documentation on how to do that unless I pass extra commands to the hadoop streaming jar. This will complicate development though because I've been using the inline runner to test locally.

Is it possible to use the inline runner to read avro files with MRJob?

jbrown

2 Answers


You need to tell Hadoop which input format your job should use:

hadoop jar hadoop-streaming.jar \
  -inputformat org.apache.avro.mapred.AvroAsTextInputFormat \
  ...  # other streaming params (-input, -output, -mapper, -reducer) go here

But I'm not sure how you run your MRJob jobs; the solution above works if you are invoking plain Hadoop Streaming directly.

Chiron
  • Thanks I found that yesterday. I guess I'll just have to install a local hadoop instance and develop against that instead unless there are any other answers... – jbrown Mar 13 '14 at 10:47

As Chiron explained, you need to specify the Hadoop input format. This can be done by setting the HADOOP_INPUT_FORMAT option in your MRJob class:

from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol


class MRAvro(MRJob):
    # AvroAsTextInputFormat converts each Avro record into one JSON record per line
    HADOOP_INPUT_FORMAT = 'org.apache.avro.mapred.AvroAsTextInputFormat'
    # Parses each JSON line into Python objects
    INPUT_PROTOCOL = JSONProtocol

    def mapper(self, avro_record, _):
        # TODO: emit (key, value) pairs derived from the Avro record
        pass

    def reducer(self, key, values):
        # TODO: combine the values for each key
        pass

In your configuration you need to make sure that the .jar file containing AvroAsTextInputFormat is available on the cluster; as of v0.5.3 you can use --libjar at the command line, or configure libjars in the mrjob configuration file (at the time of writing, v0.5.3 is not yet released; see the discussion of --libjar in the feature request).
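For example, a hypothetical invocation (the job file name and the jar path are placeholders for wherever your job script and the avro-mapred jar actually live) might look like:

python mr_avro.py -r hadoop --libjar /path/to/avro-mapred.jar hdfs:///path/to/input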

I am not aware of an easy way to integrate local testing with Avro (HADOOP_INPUT_FORMAT is ignored by the local runners). One solution is to convert your test data with the tojson tool of Apache avro-tools:

java -jar avro-tools-1.8.1.jar tojson test_data.avro > test_data.json

Otherwise you could write your own function in Python using the avro or fastavro libraries to prepare the data for local execution.
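A minimal sketch of that approach with fastavro (the file names and the avro_to_json_lines helper are just illustrative, not part of mrjob) could look like:

import json

from fastavro import reader


def avro_to_json_lines(avro_path, json_path):
    # Dump each Avro record as one JSON line, mimicking what
    # AvroAsTextInputFormat would feed the job on the cluster.
    with open(avro_path, 'rb') as avro_file, open(json_path, 'w') as json_file:
        for record in reader(avro_file):
            json_file.write(json.dumps(record) + '\n')


if __name__ == '__main__':
    avro_to_json_lines('test_data.avro', 'test_data.json')

The resulting test_data.json can then be fed to the inline runner for local development.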

Skeptric