
I’m using Hadoop to convert JSON files into CSV files so I can access them with Hive.

At the moment the Mapper parses the JSONs with JSON-Smart and fills a custom data structure. The Reducer then reads that object and writes it out to a file, with the fields separated by commas. To speed this up, I have already implemented the Writable interface in the data structure...
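Simplified, the data structure currently looks something like this (the field names are just placeholders):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Simplified version of the data structure the Mapper fills from the parsed JSON
public class RecordData implements Writable {
    private String name;
    private long value;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeLong(value);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        value = in.readLong();
    }
}
```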

Now I want to use Avro for the data structure object to gain more flexibility and better performance. How would I have to change my classes so that they exchange an Avro object instead of a Writable?

Tim Bittersohl

1 Answer


Hadoop offers a pluggable serialization mechanism via the SerializationFactory.

By default, Hadoop uses the WritableSerialization class to handle the (de)serialization of classes that implement the Writable interface, but you can register custom serializers that implement the Serialization interface by setting the Hadoop configuration property io.serializations (a comma-separated list of classes that implement the Serialization interface).
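As a rough sketch of what that registration looks like (the custom class name here is a placeholder for your own Serialization implementation):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.serializer.WritableSerialization;

public class RegisterCustomSerialization {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Keep the default Writable support and append a custom Serialization class.
        // "com.example.MySerialization" is a placeholder for a class implementing
        // org.apache.hadoop.io.serializer.Serialization.
        conf.setStrings("io.serializations",
                WritableSerialization.class.getName(),
                "com.example.MySerialization");
        System.out.println(conf.get("io.serializations"));
    }
}
```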

Avro has an implementation of the Serialization interface in the AvroSerialization class - so this would be the class you configure in the io.serializations property.
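Assuming the avro-mapred artifact is on the classpath, wiring that up by hand could look roughly like this; AvroSerialization also needs to know the reader/writer schemas, which it picks up from the configuration (the schema below is just a placeholder):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.hadoop.io.AvroSerialization;
import org.apache.hadoop.conf.Configuration;

public class RegisterAvroSerialization {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Placeholder schema for the record the Mapper builds from the parsed JSON
        Schema schema = SchemaBuilder.record("Record")
                .fields()
                .requiredString("name")
                .requiredLong("value")
                .endRecord();

        // Appends AvroSerialization to the io.serializations property ...
        AvroSerialization.addToConfiguration(conf);
        // ... and tells it which schemas to use for the (map output) key
        AvroSerialization.setKeyWriterSchema(conf, schema);
        AvroSerialization.setKeyReaderSchema(conf, schema);

        System.out.println(conf.get("io.serializations"));
    }
}
```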

Avro also ships a whole bunch of helper classes to help you write MapReduce jobs that use Avro as input/output - there are some examples in the source (Git copy).
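For illustration, here is a minimal driver sketch using the helpers from the newer org.apache.avro.mapreduce API (the schema and the commented-out mapper/reducer are placeholders); declaring the schemas via AvroJob should also set up the io.serializations entry for you:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class JsonToAvroDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "json-to-avro");
        job.setJarByClass(JsonToAvroDriver.class);

        // Placeholder schema for the record built from the parsed JSON
        Schema schema = SchemaBuilder.record("Record")
                .fields()
                .requiredString("name")
                .requiredLong("value")
                .endRecord();

        // Declare the Avro schemas used between map and reduce and for the job output
        AvroJob.setMapOutputKeySchema(job, schema);
        AvroJob.setOutputKeySchema(job, schema);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);

        // job.setMapperClass(...);   // Mapper emitting AvroKey<GenericRecord>, NullWritable
        // job.setReducerClass(...);  // Reducer consuming and re-emitting AvroKey<GenericRecord>

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```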

I can't seem to find any good documentation for Avro and MapReduce at the moment, but I'm sure there are some other good examples out there.

Chris White