As Chiron explained, you need to specify the Hadoop input format.
This can be done by setting the HADOOP_INPUT_FORMAT
option on your MRJob class:
from mrjob.job import MRJob
from mrjob.protocol import JSONProtocol


class MRAvro(MRJob):
    # Converts each Avro record into one JSON record per line
    HADOOP_INPUT_FORMAT = 'org.apache.avro.mapred.AvroAsTextInputFormat'
    # Reads each JSON-encoded line back into Python objects
    INPUT_PROTOCOL = JSONProtocol

    def mapper(self, avro_record, _):
        # TODO
        pass

    def reducer(self, key, values):
        # TODO
        pass


if __name__ == '__main__':
    MRAvro.run()
In your configuration you need to make sure that the .jar file for AvroAsTextInputFormat
is available on the cluster; as of v0.5.3 you can pass --libjar
at the command line, or configure libjars in the mrjob configuration file (at the time of writing, v0.5.3 has not been released; see the discussion of --libjar
in the feature request).
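For example, a launch command might look like the following; the script name mr_avro.py, the jar path, and the input path are placeholders for your own setup:
python mr_avro.py -r hadoop --libjar /path/to/avro-mapred-1.8.1-hadoop2.jar hdfs:///data/input.avro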
I am not aware of an easy way to integrate local testing with Avro (HADOOP_INPUT_FORMAT
is ignored by local runners). One solution is to convert your test data to JSON with the tojson command of Apache avro-tools:
java -jar avro-tools-1.8.1.jar tojson test_data.avro > test_data.json
Alternatively, you could write your own function in Python using the avro or fastavro libraries to prepare the data for local execution.
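A minimal sketch using fastavro (the file names are just placeholders for your own test data); fastavro.reader yields each Avro record as a Python dict, which is then written out as one JSON object per line:
import json

from fastavro import reader

with open('test_data.avro', 'rb') as avro_file, \
        open('test_data.json', 'w') as json_file:
    for record in reader(avro_file):
        json_file.write(json.dumps(record) + '\n')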