
I am trying to read data from Avro files into an RDD using Kryo. My code compiles fine, but at runtime I'm getting a ClassCastException. Here is what my code does:

SparkConf conf = new SparkConf()...
conf.set("spark.serializer", KryoSerializer.class.getCanonicalName());
conf.set("spark.kryo.registrator", MyKryoRegistrator.class.getName());
JavaSparkContext sc = new JavaSparkContext(conf);

Where MyKryoRegistrator registers a Serializer for MyCustomClass:

public void registerClasses(Kryo kryo) {
    kryo.register(MyCustomClass.class, new MyCustomClassSerializer());
}

Then, I read my datafile:

JavaPairRDD<MyCustomClass, NullWritable> records =
                sc.newAPIHadoopFile("file:/path/to/datafile.avro",
                AvroKeyInputFormat.class, MyCustomClass.class, NullWritable.class,
                sc.hadoopConfiguration());
Tuple2<MyCustomClass, NullWritable> first = records.first();

This seems to work fine, but using a debugger I can see that while the RDD has a kClassTag of my.package.containing.MyCustomClass, the variable first contains a Tuple2<AvroKey, NullWritable>, not Tuple2<MyCustomClass, NullWritable>! And indeed, when the following line executes:

System.out.println("Got a result, custom field is: " + first._1.getSomeCustomField());

I get an exception:

java.lang.ClassCastException: org.apache.avro.mapred.AvroKey cannot be cast to my.package.containing.MyCustomClass

Am I doing something wrong? And even if I am, shouldn't I get a compilation error rather than a runtime error?

Nira
  • Did you see [this](http://stackoverflow.com/questions/34999783/read-avro-with-spark-in-java) question? – Yuval Itzchakov Jan 24 '17 at 20:09
  • @YuvalItzchakov yes, but this is in scala. I tried my best to translate it to java but can't get it to compile :-/. Do you know how to do the same in java? – Nira Jan 25 '17 at 12:40
  • @YuvalItzchakov I actually managed to run this in Java, but I think this doesn't work with NullWritable. I'm getting a runtime exception: `org.apache.avro.AvroTypeException: Found Root, expecting org.apache.avro.mapreduce.KeyValuePair, missing required field key`. I gave it an empty schema because NullWritable has no fields: `SchemaBuilder.record("NullWritable").namespace("org.apache.hadoop.io").endRecord()` – Nira Jan 27 '17 at 12:31

1 Answer


*************EDIT**************

I managed to load custom objects from Avro files and created a GitHub repository with the code. However, if the Avro library fails to load the data into the custom class, it returns GenericData$Record objects instead. In that case the Spark Java API doesn't check the assignment to the custom class, which is why you only get a ClassCastException when you try to access the datum of the AvroKey. This violates the type-safety guarantee.
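
To illustrate where the hidden cast actually happens (this snippet is not in the repository, it's only an illustration): if the RDD is declared with the types the input format really produces, the same read runs without any exception and you can inspect the runtime class of the datum:

// Declared with the raw AvroKey type that AvroKeyInputFormat actually returns,
// so the compiler never inserts a cast to MyCustomClass.
JavaPairRDD<AvroKey, NullWritable> rawRecords =
                sc.newAPIHadoopFile("file:/path/to/datafile.avro",
                AvroKeyInputFormat.class, AvroKey.class, NullWritable.class,
                sc.hadoopConfiguration());
Tuple2<AvroKey, NullWritable> first = rawRecords.first();
Object datum = first._1().datum();   // in my case: a GenericData$Record
System.out.println("Datum runtime class: " + datum.getClass());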

*************EDIT**************

For anybody else trying to do this: I have a hack to get around the problem, but it can't be the right solution. I created a class for reading GenericData.Record objects from Avro files:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyRecordReader;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class GenericRecordFileInputFormat extends FileInputFormat<GenericData.Record, NullWritable> {
    private static final Logger LOG = LoggerFactory.getLogger(GenericRecordFileInputFormat.class);

    /**
     * {@inheritDoc}
     */
    @Override
    public RecordReader<GenericData.Record, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        Schema readerSchema = AvroJob.getInputKeySchema(context.getConfiguration());
        if (null == readerSchema) {
            LOG.warn("Reader schema was not set. Use AvroJob.setInputKeySchema() if desired.");
            LOG.info("Using a reader schema equal to the writer schema.");
        }
        return new GenericDataRecordReader(readerSchema);
    }


    public static class GenericDataRecordReader extends RecordReader<GenericData.Record, NullWritable> {

        AvroKeyRecordReader<GenericData.Record> avroReader;

        public GenericDataRecordReader(Schema readerSchema) {
            super();
            avroReader = new AvroKeyRecordReader<>(readerSchema);
        }

        @Override
        public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
            avroReader.initialize(inputSplit, taskAttemptContext);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return avroReader.nextKeyValue();
        }

        @Override
        public GenericData.Record getCurrentKey() throws IOException, InterruptedException {
            AvroKey<GenericData.Record> currentKey = avroReader.getCurrentKey();
            return currentKey.datum();
        }

        @Override
        public NullWritable getCurrentValue() throws IOException, InterruptedException {
            return avroReader.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return avroReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            avroReader.close();
        }
    }
}

Then I load the records:

JavaRDD<GenericData.Record> records = sc.newAPIHadoopFile("file:/path/to/datafile.avro",
                GenericRecordFileInputFormat.class, GenericData.Record.class, NullWritable.class,
                sc.hadoopConfiguration()).keys();
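
If you want the reader to use an explicit reader schema rather than fall back to the writer schema (see the warning in createRecordReader above), you can set it through AvroJob before the call. This is only a sketch: mySchema stands for whatever org.apache.avro.Schema object you have for these records, and Job is org.apache.hadoop.mapreduce.Job:

// Sketch: pin the reader schema before loading the file.
Job job = Job.getInstance(sc.hadoopConfiguration());
AvroJob.setInputKeySchema(job, mySchema);   // mySchema: the Avro schema of the records

JavaRDD<GenericData.Record> recordsWithSchema = sc.newAPIHadoopFile("file:/path/to/datafile.avro",
                GenericRecordFileInputFormat.class, GenericData.Record.class, NullWritable.class,
                job.getConfiguration()).keys();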

Then I convert the records to my custom class using a constructor that accepts GenericData.Record.
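
For example (a sketch; the MyCustomClass(GenericData.Record) constructor is the one I mention above, and getSomeCustomField() is the getter from the question):

// Sketch: map each GenericData.Record to the custom class via its constructor.
JavaRDD<MyCustomClass> converted = records.map(record -> new MyCustomClass(record));
MyCustomClass firstRecord = converted.first();
System.out.println("Got a result, custom field is: " + firstRecord.getSomeCustomField());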

Again, not pretty, but it works.

Nira