How to infer avro schema from a kafka topic in Apache Beam KafkaIO

Question

I'm using Apache Beam's kafkaIO to read from a topic that has an avro schema in Confluent schema registry. I'm able to deserialize the message and write to files. But ultimately i want to write to BigQuery. My pipeline isn't able to infer the schema. How do I extract/infer the schema and attach it to the data in the pipeline so that my downstream processes (write to BigQuery) can infer the schema?

Here is the code where I use the schema registry url to set the deserializer and where i read from Kafka:

    consumerConfig.put(
                        AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, 
                        options.getSchemaRegistryUrl());

String schemaUrl = options.getSchemaRegistryUrl().get();
String subj = options.getSubject().get();

ConfluentSchemaRegistryDeserializerProvider<GenericRecord> valDeserializerProvider =
            ConfluentSchemaRegistryDeserializerProvider.of(schemaUrl, subj);

pipeline
        .apply("Read from Kafka",
                KafkaIO
                        .<byte[], GenericRecord>read()
                        .withBootstrapServers(options.getKafkaBrokers().get())
                        .withTopics(Utils.getListFromString(options.getKafkaTopics()))
                        .withConsumerConfigUpdates(consumerConfig)
                        .withValueDeserializer(valDeserializerProvider)
                        .withKeyDeserializer(ByteArrayDeserializer.class)

                        .commitOffsetsInFinalize()
                        .withoutMetadata()

        );

I initially thought that this would be enough for beam to infer the schema, but it does not since hasSchema() returns false.

Any help would be appreciated.

Did you find a working solution to this problem? – Tobias Hermann Jul 01 '21 at 16:09 — Tobias Hermann, Jul 01 '21 at 16:09

score 1 · Answer 1 · answered Jun 26 '20 at 14:24

1

There is ongoing work to support inferring of Avro schema, stored in Confluent Schema Registry, in KafkaIO. Though, it's possible to do now in user pipeline code as well.

answered Jun 26 '20 at 14:24

Alexey Romanenko

1,353
5
11

thanks for the reply. How would i go about it? Getting the schema with an http call and using AvroIO to set the schema? – artofdoe Jun 26 '20 at 15:30
I answered in a separate response. – Alexey Romanenko Jun 30 '20 at 15:48

score 0 · Answer 2 · answered Jun 30 '20 at 15:48

This code probably will work but I have not tested yet.

// Fetch Avro schema from CSR
SchemaRegistryClient registryClient = new CachedSchemaRegistryClient("schema_registry_url", 10);
SchemaMetadata latestSchemaMetadata = registryClient.getLatestSchemaMetadata("schema_name");
Schema avroSchema = new Schema.Parser().parse(latestSchemaMetadata.getSchema());

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);


// Create KafkaIO.Read with Avro schema deserializer
KafkaIO.Read<String, GenericRecord> read = KafkaIO.<String, GenericRecord>read()
    .withBootstrapServers("host:port")
    .withTopic("topic_name")
    .withConsumerConfigUpdates(ImmutableMap.of("schema.registry.url", schemaRegistryUrl))
    .withKeyDeserializer(StringDeserializer.class)
    .withValueDeserializerAndCoder((Class) KafkaAvroDeserializer.class, AvroCoder.of(avroSchema));

// Apply Kafka.Read and set Beam schema based on Avro Schema
p.apply(read)
 .apply(Values.<GenericRecord>create()).setSchema(schema,
    AvroUtils.getToRowFunction(GenericRecord.class, avroSchema),
    AvroUtils.getFromRowFunction(GenericRecord.class))

Then I think you can use BigQueryIO.Write with useBeamSchema().

thanks for this! i tried using `.withValueDeserializerAndCoder((Class extends Deserializer>) KafkaAvroDeserializer.class, AvroCoder.of(avroSchema))`, but that doesn't work, so I tried calling withValueDeserializer and setCoder separately: `.setSchema(AvroUtils.toBeamSchema(avroSchema), TypeDescriptor.of(GenericRecord.class), AvroUtils.getToRowFunction(GenericRecord.class, avroSchema), AvroUtils.getFromRowFunction(GenericRecord.class)) .setCoder(AvroCoder.of(avroSchema))` which gave me the same error — artofdoe, Jun 30 '20 at 19:18

How to infer avro schema from a kafka topic in Apache Beam KafkaIO

2 Answers2

Linked