
To process Avro-encoded messages with Apache Beam's KafkaIO, you pass an instance of ConfluentSchemaRegistryDeserializerProvider as the value deserializer.

A typical example looks like this:

PCollection<KafkaRecord<Long, GenericRecord>> input = pipeline
  .apply(KafkaIO.<Long, GenericRecord>read()
     .withBootstrapServers("kafka-broker:9092")
     .withTopic("my_topic")
     .withKeyDeserializer(LongDeserializer.class)
     // The provider fetches the subject's Avro schema from the registry
     // and derives the value coder from it.
     .withValueDeserializer(
         ConfluentSchemaRegistryDeserializerProvider.of(
             "http://my-local-schema-registry:8081", "my_subject")));

However, some of the Kafka topics that I want to consume carry multiple different subjects (event types), for ordering reasons. Thus, I can't provide one fixed subject name in advance. How can this dilemma be solved?

(My goal is to eventually push these events to the cloud with BigQueryIO.)

Tobias Hermann

1 Answer


You could do multiple reads, one per subject, and then Flatten them.
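
A minimal sketch of that approach, assuming two placeholder subjects subject_a and subject_b on the same topic (broker and registry URLs are reused from the question; the merge uses Beam's PCollectionList and Flatten transforms):

// One read per subject, each with its own schema-registry provider.
PCollection<KafkaRecord<Long, GenericRecord>> subjectA = pipeline
  .apply("ReadSubjectA", KafkaIO.<Long, GenericRecord>read()
     .withBootstrapServers("kafka-broker:9092")
     .withTopic("my_topic")
     .withKeyDeserializer(LongDeserializer.class)
     .withValueDeserializer(
         ConfluentSchemaRegistryDeserializerProvider.of(
             "http://my-local-schema-registry:8081", "subject_a")));

PCollection<KafkaRecord<Long, GenericRecord>> subjectB = pipeline
  .apply("ReadSubjectB", KafkaIO.<Long, GenericRecord>read()
     .withBootstrapServers("kafka-broker:9092")
     .withTopic("my_topic")
     .withKeyDeserializer(LongDeserializer.class)
     .withValueDeserializer(
         ConfluentSchemaRegistryDeserializerProvider.of(
             "http://my-local-schema-registry:8081", "subject_b")));

// Merge the per-subject streams into a single PCollection.
PCollection<KafkaRecord<Long, GenericRecord>> merged =
    PCollectionList.of(subjectA).and(subjectB)
        .apply(Flatten.pCollections());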

robertwb
  • Ah, good idea, thanks. I'll try that and see how the skipped elements are handled (errors?) and how this plays out with the consumer offsets. – Tobias Hermann Jul 01 '21 at 07:33
  • The non-decodable events are skipped, and regarding the offsets, I just set separate values for `group.id` using `.withConsumerConfigUpdates`. Thanks again. :) – Tobias Hermann Jul 01 '21 at 16:35
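
For reference, the per-read group.id override mentioned in the last comment might look like this (the group id string "subject_a-reader" is a placeholder):

KafkaIO.Read<Long, GenericRecord> readA = KafkaIO.<Long, GenericRecord>read()
    .withBootstrapServers("kafka-broker:9092")
    .withTopic("my_topic")
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(
        ConfluentSchemaRegistryDeserializerProvider.of(
            "http://my-local-schema-registry:8081", "subject_a"))
    // Give each per-subject read its own consumer group so that
    // offsets are committed and tracked independently.
    .withConsumerConfigUpdates(
        Map.<String, Object>of("group.id", "subject_a-reader"));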