
I'm consuming Avro-serialized messages from Kafka using the "automatic" deserializer, like:

props.put(
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
    "io.confluent.kafka.serializers.KafkaAvroDeserializer"
);
props.put("schema.registry.url", "https://example.com");

This works brilliantly, and is right out of the docs at https://docs.confluent.io/current/schema-registry/serializer-formatter.html#serializer.

The problem I'm facing is that I actually just want to forward these messages, but to do the routing I need some metadata from inside them. Some technical constraints mean that I can't feasibly compile in generated class files and use KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG => true, so I am using a regular decoder without being tied to Kafka; specifically, I just read the value as an Array[Byte] and pass it to a manually constructed deserializer:

import scala.collection.JavaConverters._ // for .asJava

val maxSchemasToCache = 1000
val schemaRegistryURL = "https://example.com/"
val specificDeserializerProps = Map(
  "schema.registry.url"
      -> schemaRegistryURL,
  KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG
      -> "false"
)
val client = new CachedSchemaRegistryClient(
                     schemaRegistryURL,
                     maxSchemasToCache
                 )
val deserializer = new KafkaAvroDeserializer(
                         client,
                         specificDeserializerProps.asJava
                   )
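For context on what this deserializer actually consumes: the Confluent serializer frames each value as a single magic byte (0), a big-endian 4-byte schema id, and then the Avro-encoded body; the deserializer uses that id to fetch the writer schema from the registry. A minimal sketch of the framing, with made-up bytes (the 42 here is a hypothetical schema id):

```java
import java.nio.ByteBuffer;

public class WireFormatDemo {
    public static void main(String[] args) {
        // Hypothetical value bytes: magic byte 0x00, schema id 42 (big-endian
        // 4-byte int), followed by the Avro-encoded body.
        byte[] value = {0x00, 0x00, 0x00, 0x00, 0x2a, 0x04};
        ByteBuffer buf = ByteBuffer.wrap(value);
        byte magic = buf.get();      // always 0 in the Confluent wire format
        int schemaId = buf.getInt(); // the id looked up in the schema registry
        System.out.println("magic=" + magic + " schemaId=" + schemaId);
    }
}
```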

The messages are a "container" type; the really interesting part is one of roughly 25 types in a union { A, B, C } msg record field:

record Event {
    timestamp_ms created_at;
    union {
        Online,
        Offline,
        Available,
        Unavailable,
        ...
        ...Failed,
        ...Updated
    } msg;
}

So I'm successfully reading an Array[Byte] and feeding it into the deserializer like this:

val genericRecord = deserializer.deserialize(topic, consumerRecord.value())
                       .asInstanceOf[GenericRecord]
val schema = genericRecord.getSchema()
val msgSchema = schema.getField("msg").schema()

The problem, however, is that I can find no way to discern, discriminate, or "resolve" the "type" of the msg field through the union:

System.out.printf(
    "msg.schema = %s msg.schema.getType = %s\n", 
    msgSchema.getFullName(),  
    msgSchema.getType().name());
=> msg.schema = union msg.schema.getType = union

How can I discriminate the types in this scenario? The Confluent registry knows: these things have names, they have "types", even if I'm treating them as GenericRecords.

My goal here is to know that record.msg is of "type" Online | Offline | Available rather than just knowing it's a union.

Lee Hambley
  • So just to be clear, you are interested in which particular type of message you are receiving, yes? You'd want `type = Online` instead of `type = union`? – fresskoma Jan 21 '20 at 10:05
  • Not clear what you mean by "Automatic deserializer"... You could define your own that accepts the `Array[Byte]` and deserialize to GenericRecord just as you show. Or you could just use the KafkaAvroDeserializer to get GenericRecord, by default, since it already knows how to handle that – OneCricketeer Jan 21 '20 at 10:10
  • Yes, exactly, will update the question to clarify. – Lee Hambley Jan 21 '20 at 10:10
  • Even if you did compile to Java, I'm curious how you would get the type there. Seems to me, you want an ENUM, anyway, not a union – OneCricketeer Jan 21 '20 at 10:14
  • My assumption is (untested) that I'd get a code-generated class for `Event` that had a `getMsg()` accessor, that would give me back a typed class (maybe after some coercion) - but indeed, I'm not sure. FWIW I don't think that Avro supports `enums` for complex types, just for primitives. – Lee Hambley Jan 21 '20 at 10:19
  • I'm curious to know what limitations are stopping you from generating that? And what are your complex types here? Looks like you've just defined "states of being" – OneCricketeer Jan 21 '20 at 13:50
  • No suitable plugin to pull the types in at build time for sbt (one exists for mvn, but we're not using that). This proxy doesn't need to know anything about the messages, except their name so it can route them appropriately. There's an opportunity to learn something here, I hope. – Lee Hambley Jan 21 '20 at 14:17
  • In `msgSchema` you have the schema with all possible union types, not the one which is inside the message, unfortunately. You can get that list by calling `msgSchema.getTypes(): List[Schema.Type]`. I don't remember how a union of records is deserialized into `GenericRecord` unfortunately, but I know you can generate a Java class from the Avro schema and then deserialize into `SpecificRecord[T]`. Then you can simply do `instanceof` checks or sth like that. Consider this an anti-pattern though; maybe 25 nullable fields of specific types would be a better option? – wikp Jan 21 '20 at 14:24

3 Answers


After looking into the implementation of the Avro Java library, I think it's safe to say that this is impossible given the current API. I've found the following way of extracting the types while parsing, using a custom GenericDatumReader subclass, but it needs a lot of polishing before I'd use something like this in production code :D

So here's the subclass:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.ResolvingDecoder;

import java.io.IOException;
import java.util.List;

public class CustomReader<D> extends GenericDatumReader<D> {
    private final GenericData data;
    private Schema actual;
    private Schema expected;

    private ResolvingDecoder creatorResolver = null;
    private final Thread creator;
    private List<Schema> unionTypes;

    // vvv This is the constructor I've modified, added a list of types
    public CustomReader(Schema schema, List<Schema> unionTypes) {
        this(schema, schema, GenericData.get());
        this.unionTypes = unionTypes;
    }

    public CustomReader(Schema writer, Schema reader, GenericData data) {
        this(data);
        this.actual = writer;
        this.expected = reader;
    }

    protected CustomReader(GenericData data) {
        this.data = data;
        this.creator = Thread.currentThread();
    }

    protected Object readWithoutConversion(Object old, Schema expected, ResolvingDecoder in) throws IOException {
        switch (expected.getType()) {
            case RECORD:
                return super.readRecord(old, expected, in);
            case ENUM:
                return super.readEnum(expected, in);
            case ARRAY:
                return super.readArray(old, expected, in);
            case MAP:
                return super.readMap(old, expected, in);
            case UNION:
                // vvv The magic happens here
                Schema type = expected.getTypes().get(in.readIndex());
                unionTypes.add(type);
                return super.read(old, type, in);
            case FIXED:
                return super.readFixed(old, expected, in);
            case STRING:
                return super.readString(old, expected, in);
            case BYTES:
                return super.readBytes(old, expected, in);
            case INT:
                return super.readInt(old, expected, in);
            case LONG:
                return in.readLong();
            case FLOAT:
                return in.readFloat();
            case DOUBLE:
                return in.readDouble();
            case BOOLEAN:
                return in.readBoolean();
            case NULL:
                in.readNull();
                return null;
            default:
                return super.readWithoutConversion(old, expected, in);
        }
    }
}

I've added comments to the code for the interesting parts, as it's mostly boilerplate.
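As background on why in.readIndex() recovers the branch: Avro's binary encoding writes a union value as its zero-based branch index (a zig-zag varint, like any Avro int) followed by the branch's own encoding, so the type information is right there in the bytes. A self-contained sketch of that index decoding, hand-rolled here purely for illustration (Avro's BinaryDecoder does this for you):

```java
public class UnionIndexDemo {
    // Decode one Avro int (zig-zag varint) from a byte array: the same
    // encoding Avro uses for the branch index that prefixes a union value.
    static int readZigZagVarint(byte[] bytes) {
        int n = 0, shift = 0, i = 0, b;
        do {
            b = bytes[i++] & 0xff;
            n |= (b & 0x7f) << shift; // accumulate 7 payload bits per byte
            shift += 7;
        } while ((b & 0x80) != 0);    // high bit set means "more bytes follow"
        return (n >>> 1) ^ -(n & 1);  // undo the zig-zag mapping
    }

    public static void main(String[] args) {
        // 0x04 zig-zag decodes to 2: the writer chose the third union branch.
        System.out.println(readZigZagVarint(new byte[]{0x04}));
    }
}
```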

Then you can use this custom reader like this:

List<Schema> unionTypes = new ArrayList<>();
DatumReader<GenericRecord> datumReader = new CustomReader<GenericRecord>(schema, unionTypes);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(eventFile, datumReader);
GenericRecord event = null;

while (dataFileReader.hasNext()) {
    event = dataFileReader.next(event);
}

System.out.println(unionTypes);

This will print, for each union parsed, the type of that union. Note that you'll have to figure out which element of that list is interesting to you depending on how many unions you have in a record, etc.

Not pretty tbh :D

fresskoma

I was able to come up with a single-use solution after a lot of digging:

val records: ConsumerRecords[String, Array[Byte]] = consumer.poll(100)
for (consumerRecord <- asScalaIterator(records.iterator)) {
  val genericRecord = deserializer.deserialize(topic, consumerRecord.value()).asInstanceOf[GenericRecord]
  val msgSchema = genericRecord.get("msg").asInstanceOf[GenericRecord].getSchema()
  System.out.printf("%s \n", msgSchema.getFullName())
}

This prints com.myorg.SomeSchemaFromTheEnum and works perfectly in my use-case.

The confusing thing is that, because of the use of GenericRecord, .get("msg") returns Object, which in the general case I have no way to safely typecast. In this limited case, I know the cast is safe.
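A slightly safer variant of that cast (a sketch, not from the original answer): guard before casting, since a union branch may also be a non-record type such as null or a string, for which a blind cast would throw. Illustrated here with a plain-Java stand-in class so the snippet runs without the Avro jars; in real code the check would be instanceof GenericRecord and the name would come from getSchema().getFullName():

```java
public class SafeBranchDemo {
    // Stand-in for org.apache.avro.generic.GenericRecord, so this runs
    // without the Avro dependency; fullName plays the role of
    // getSchema().getFullName().
    static class FakeRecord {
        final String fullName;
        FakeRecord(String fullName) { this.fullName = fullName; }
    }

    // Route on the branch's full name when it is a record; fall back otherwise
    // instead of throwing a ClassCastException.
    static String routeKey(Object msg) {
        if (msg instanceof FakeRecord) {
            return ((FakeRecord) msg).fullName;
        }
        return "unroutable:" + (msg == null ? "null" : msg.getClass().getSimpleName());
    }

    public static void main(String[] args) {
        System.out.println(routeKey(new FakeRecord("com.myorg.Online")));
        System.out.println(routeKey("some string branch"));
    }
}
```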

In my limited use-case the solution in the five lines above is suitable, but for a more general solution the answer https://stackoverflow.com/a/59844401/119669 posted by fresskoma (https://stackoverflow.com/users/124257/fresskoma) seems more appropriate.

Whether to use DatumReader or GenericRecord is probably a matter of preference, and of whether the Kafka ecosystem is in mind; working with Avro alone I'd probably prefer a DatumReader solution, but in this instance I can live with having Kafka-esque nomenclature in my code.

Lee Hambley

To retrieve the schema of the value of a field, you can use:

new GenericData().induce(genericRecord.get("msg"))
Thomas Pocreau