
Situation

I'm currently writing a Kafka consumer/producer using Avro and a schema repository.

From what I gather, my options for serializing this data are either Confluent's Avro serializer or Twitter's Bijection.

Bijection looked the most straightforward.

So I want to produce data in the following format: ProducerRecord[String,Array[Byte]], which comes down to [some string ID, serialized GenericRecord].

(Note: I'm going for generic records because this codebase has to handle thousands of schemas that get parsed from JSON/CSV/...)

Question:

The whole reason I serialize with Avro is that you don't need to have a schema in the data itself (like you would with JSON/XML/...).
When checking the data in the topic, however, I see that the whole schema is stored together with the data. Am I doing something fundamentally wrong, is this by design, or should I use the Confluent serializer instead?

Code:

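  import org.apache.avro.Schema
  import org.apache.avro.generic.GenericRecord
  import org.apache.kafka.clients.producer.{Callback, ProducerRecord}
  import com.twitter.bijection.avro.GenericAvroCodecs
  // JsonAvroConverter is assumed to be Allegro's json-avro-converter
  import tech.allegro.schema.json2avro.converter.JsonAvroConverter

  // Parse a JSON string into a GenericRecord that conforms to the given schema
  def jsonStringToAvro(jString: String, schema: Schema): GenericRecord = {
    val converter = new JsonAvroConverter
    // Replace escaped slashes ("\/") with "_" before conversion
    converter.convertToGenericDataRecord(jString.replaceAll("\\\\/", "_").getBytes(), schema)
  }

  // Serialize the record with Bijection's Avro codec
  def serializeAsByteArray(avroRecord: GenericRecord): Array[Byte] =
    GenericAvroCodecs.toBinary[GenericRecord](avroRecord.getSchema).apply(avroRecord)

  // schema comes from a REST call to the schema repository
  val producerRecord = new ProducerRecord[String, Array[Byte]](
    topic,
    myStringKeyGoesHere,
    serializeAsByteArray(jsonStringToAvro(jsonObjectAsStringGoesHere, schema)))

  producer.send(producerRecord, new Callback {...})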
Phantômaxx
Havnar
  • The bijection library doesn't interact with a schema registry, and you're not putting an ID anywhere yourself like the Confluent serializers do. Therefore, the entire schema will be part of the message – OneCricketeer Aug 01 '18 at 13:37
  • Also, what's wrong with `ProducerRecord[String, GenericRecord]`? And put the REST call in the serializer? – OneCricketeer Aug 01 '18 at 13:39
  • Some of the projects could have large volumes, so I figured serialization would net me some performance gains. So, when I would use the Confluent serializer, this would strip the schema from the generic records? – Havnar Aug 01 '18 at 13:43
  • The Kafka Serializer interface is meant to get the byte array for you, though. There's no benefit of writing that logic "in the main method" of your class – OneCricketeer Aug 01 '18 at 13:51

1 Answer


If you look at the Confluent source code, you'll see that the order of operations for interacting with a schema repository is:

  1. Take the schema from the Avro record and compute its ID. Ideally, POST-ing the schema to the repository returns an ID; otherwise, hashing the schema should give you one.
  2. Allocate a ByteBuffer
  3. Write the returned ID to the buffer
  4. Write the Avro object value (excluding the schema) as bytes into the buffer
  5. Send that byte buffer to Kafka
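As a rough illustration of steps 2 to 4, here is a minimal sketch that uses plain Avro rather than Bijection. The object name `SchemaIdFraming` and the `schemaId` parameter are hypothetical; the ID is assumed to have already been obtained from the repository in step 1. The leading zero "magic byte" follows Confluent's wire format and is an assumption if your repository is not Confluent's.

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object SchemaIdFraming {
  // Confluent's wire format starts with a single zero byte (assumption for other repositories)
  private val MagicByte: Byte = 0

  // Encode only the field values; the binary encoder does not write the schema
  def avroBody(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](record.getSchema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  // schemaId is assumed to come from the schema repository (lookup not shown)
  def frame(schemaId: Int, record: GenericRecord): Array[Byte] = {
    val body = avroBody(record)
    ByteBuffer.allocate(1 + 4 + body.length)
      .put(MagicByte)    // 1 byte
      .putInt(schemaId)  // 4 bytes, big-endian
      .put(body)         // Avro-encoded values, no schema
      .array()
  }
}
```

The resulting array can be used as the value of a `ProducerRecord[String, Array[Byte]]`. A consumer then has to read the ID back, fetch the schema for it, and decode the remaining bytes.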

Presently, your Bijection usage will include the schema in the bytes rather than replace it with an ID.
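If the repository is Confluent's Schema Registry (an assumption, since the question only says "schema repository"), the producer can hand the `GenericRecord` to Confluent's serializer and let it do the ID lookup. A minimal sketch of the producer settings follows; the object name `ConfluentProducerConfig` and both parameters are hypothetical placeholders.

```scala
import java.util.Properties

object ConfluentProducerConfig {
  // bootstrapServers and registryUrl are placeholders for your environment
  def props(bootstrapServers: String, registryUrl: String): Properties = {
    val p = new Properties()
    p.put("bootstrap.servers", bootstrapServers)
    p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // Confluent's serializer registers or looks up the schema and writes its ID
    p.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
    p.put("schema.registry.url", registryUrl)
    p
  }
}
```

With these properties, a `KafkaProducer[String, GenericRecord]` can send `ProducerRecord[String, GenericRecord]` values directly, so no manual byte-array step is needed in the calling code.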

OneCricketeer
  • Thanks, I guess I made the wrong assumption where the GenericRecord would just be binary avro data that needed an external schema to be read. – Havnar Aug 01 '18 at 13:54
  • A GenericRecord is like a HashMap. There are still named fields, so it needs some way to parse those and know if they're present – OneCricketeer Aug 01 '18 at 13:56