
I'm trying to create a pipeline that streams data from a Kafka topic to Google's BigQuery. The data in the topic is Avro-encoded.

I call the apply function three times: once to read from Kafka, once to extract the record, and once to write to BigQuery. Here is the main part of the code:

        pipeline
            .apply("Read from Kafka",
                    KafkaIO
                            .<byte[], GenericRecord>read()
                            .withBootstrapServers(options.getKafkaBrokers().get())
                            .withTopics(Utils.getListFromString(options.getKafkaTopics()))
                            .withKeyDeserializer(
                                    ConfluentSchemaRegistryDeserializerProvider.of(
                                            options.getSchemaRegistryUrl().get(),
                                            options.getSubject().get())
                            )
                            .withValueDeserializer(
                                    ConfluentSchemaRegistryDeserializerProvider.of(
                                            options.getSchemaRegistryUrl().get(),
                                            options.getSubject().get()))
                            .withoutMetadata()
            )

            .apply("Extract GenericRecord",
                    MapElements.into(TypeDescriptor.of(GenericRecord.class)).via(KV::getValue)
            )
            .apply(
                    "Write data to BQ",
                    BigQueryIO
                            .<GenericRecord>write()
                            .optimizedWrites()
                            .useBeamSchema()
                            .useAvroLogicalTypes()
                            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                            .withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
                            //Temporary location to save files in GCS before loading to BQ
                            .withCustomGcsTempLocation(options.getGcsTempLocation())
                            .withNumFileShards(options.getNumShards().get())
                            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
                            .withMethod(FILE_LOADS)
                            .withTriggeringFrequency(Utils.parseDuration(options.getWindowDuration().get()))
                            .to(new TableReference()
                                    .setProjectId(options.getGcpProjectId().get())
                                    .setDatasetId(options.getGcpDatasetId().get())
                                    .setTableId(options.getGcpTableId().get()))

            );

When running, I get the following error:

    Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Extract GenericRecord/Map/ParMultiDo(Anonymous).output [PCollection]. Correct one of the following root causes:  No Coder has been manually specified;  you may do so using .setCoder().
      Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.avro.generic.GenericRecord.
      Building a Coder using a registered CoderProvider failed.

How do I set the coder to properly read Avro?

  • `.withKeyDeserializer(ConfluentSchemaRegistryDeserializerProvider.of(options.getSchemaRegistryUrl().get(), options.getSubject().get()))` Is your key type Avro as well? Otherwise you can just use BytesDeserializer (a sketch of this follows below). – autodidacticon Jun 21 '20 at 01:12
  • Does this answer your question? [How to infer avro schema from a kafka topic in Apache Beam KafkaIO](https://stackoverflow.com/questions/62544980/how-to-infer-avro-schema-from-a-kafka-topic-in-apache-beam-kafkaio) – Tlaquetzal Aug 11 '20 at 13:26
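
A minimal sketch of the first comment's suggestion, assuming the Kafka keys are plain bytes rather than Avro. The option names are copied from the question's code; Kafka's built-in `ByteArrayDeserializer` replaces the schema-registry provider on the key side:

    // Variant of the question's read step: keys stay raw bytes,
    // only values go through the Confluent schema registry.
    // Requires org.apache.kafka.common.serialization.ByteArrayDeserializer.
    KafkaIO.<byte[], GenericRecord>read()
            .withBootstrapServers(options.getKafkaBrokers().get())
            .withTopics(Utils.getListFromString(options.getKafkaTopics()))
            .withKeyDeserializer(ByteArrayDeserializer.class)
            .withValueDeserializer(
                    ConfluentSchemaRegistryDeserializerProvider.of(
                            options.getSchemaRegistryUrl().get(),
                            options.getSubject().get()))
            .withoutMetadata();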

1 Answer


There are at least three approaches to this:

  1. Set the coder inline:

         pipeline.apply("Read from Kafka", ....)
                 .apply("Dropping key", Values.create())
                 // schemaOfGenericRecord is the org.apache.avro.Schema of the records
                 .setCoder(AvroCoder.of(schemaOfGenericRecord))
                 .apply("Write data to BQ", ....);

     Note that the key is dropped because it's unused; with this you won't need MapElements any more.

  2. Register the coder in the pipeline's instance of CoderRegistry:

         // genericSchema is the org.apache.avro.Schema of the records
         pipeline.getCoderRegistry()
                 .registerCoderForClass(GenericRecord.class, AvroCoder.of(genericSchema));
  3. Get the coder from the schema registry, via `getCoder(CoderRegistry registry)` on the `ConfluentSchemaRegistryDeserializerProvider` instance (see the sketch below):

https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/kafka/ConfluentSchemaRegistryDeserializerProvider.html#getCoder-org.apache.beam.sdk.coders.CoderRegistry-
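
Putting the third approach together: a minimal sketch, not a drop-in solution, that reuses one `ConfluentSchemaRegistryDeserializerProvider` instance both as the value deserializer and as the source of the coder. It assumes the subject's Avro schema is resolvable from the registry, and `valueProvider` is a name introduced here for illustration; the `....` placeholders stand for the same configuration as in the question:

    // Build the provider once, so the coder is derived from the same
    // registry/subject that deserializes the values.
    ConfluentSchemaRegistryDeserializerProvider<GenericRecord> valueProvider =
            ConfluentSchemaRegistryDeserializerProvider.of(
                    options.getSchemaRegistryUrl().get(),
                    options.getSubject().get());

    pipeline
            .apply("Read from Kafka", ....)   // pass valueProvider to withValueDeserializer
            .apply("Dropping key", Values.create())
            // Coder built from the schema registered for the subject
            .setCoder(valueProvider.getCoder(pipeline.getCoderRegistry()))
            .apply("Write data to BQ", ....);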

  • Thanks for the reply. I tried both approaches and get the following exception: `Exception in thread "main" org.apache.avro.AvroRuntimeException: avro.shaded.com.google.common.util.concurrent.UncheckedExecutionException: org.apache.avro.AvroRuntimeException: Not a Specific class: interface org.apache.avro.generic.GenericRecord` – artofdoe Jun 22 '20 at 20:20
  • You'll want to use `AvroCoder.of(Schema schema)`: https://beam.apache.org/releases/javadoc/2.21.0/org/apache/beam/sdk/coders/AvroCoder.html#of-org.apache.avro.Schema- – autodidacticon Jun 23 '20 at 19:37
  • Sorry for being daft, but I'm not following. What I really want is for the Avro schema associated with the topic to be inferred in the pipeline. I only have the schema registry URL, in the form of `consumerConfig.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, options.getSchemaRegistryUrl());` I'm not sure how to extract the schema and associate it with the data. – artofdoe Jun 23 '20 at 22:34
  • Np. It looks like there is an even better option: https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/kafka/ConfluentSchemaRegistryDeserializerProvider.html#getCoder-org.apache.beam.sdk.coders.CoderRegistry- – autodidacticon Jun 23 '20 at 23:02
  • Actually I have tried that - I posted a similar question here where I use ConfluentSchemaRegistryDeserializerProvider: https://stackoverflow.com/questions/62544980/how-to-use-avro-schema-from-a-kafka-topic-in-apache-beamm-kafkaio – artofdoe Jun 23 '20 at 23:24
  • You used `getCoder` and what happens? – autodidacticon Jun 24 '20 at 02:10
  • Yes, I tried `.setCoder(ConfluentSchemaRegistryDeserializerProvider.of(schemaUrl, subj).getCoder(CoderRegistry.createDefault()))` and got an `IllegalArgumentException`, which is from the missing schema – artofdoe Jun 24 '20 at 02:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216535/discussion-between-autodidacticon-and-artofdoe). – autodidacticon Jun 24 '20 at 02:49