
I'm trying to create an Apache Beam pipeline that reads from a Kafka topic and loads the data into BigQuery. Using Confluent's Schema Registry, I should be able to infer the schema when loading into BigQuery. However, the schema is not being inferred and the load fails.

Below is the entire pipeline code.

    pipeline
        .apply("Read from Kafka",
                KafkaIO
                        .<byte[], GenericRecord>read()
                        .withBootstrapServers("broker-url:9092")
                        .withTopic("beam-in")
                        .withConsumerConfigUpdates(consumerConfig)
                        .withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider.of(schemaRegUrl, subj))
                        .withKeyDeserializer(ByteArrayDeserializer.class)
                        .commitOffsetsInFinalize()
                        .withoutMetadata()

        )
        .apply("Drop Kafka message key", Values.create())
        .apply(
                "Write data to BQ",
                BigQueryIO
                        .<GenericRecord>write()
                        .optimizedWrites()
                        .useBeamSchema()
                        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                        .withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
                        .withCustomGcsTempLocation(StaticValueProvider.of("gs://beam-tmp-load"))
                        .withNumFileShards(10)
                        .withMethod(FILE_LOADS)
                        .withTriggeringFrequency(Utils.parseDuration("10s"))
                        .to(new TableReference()
                                .setProjectId("my-project")
                                .setDatasetId("loaded-data")
                                .setTableId("beam-load-test"))
        );
    return pipeline.run();

When running this I get the following error, which comes from the fact that I'm calling useBeamSchema() while hasSchema() on the input PCollection returns false:

    Exception in thread "main" java.lang.IllegalArgumentException
        at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:127)
        at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expandTyped(BigQueryIO.java:2595)
        at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:2579)
        at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write.expand(BigQueryIO.java:1726)
        at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:542)
        at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:493)
        at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:368)
        at KafkaToBigQuery.run(KafkaToBigQuery.java:159)
        at KafkaToBigQuery.main(KafkaToBigQuery.java:64)
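
For reference, this is a rough, untested sketch of the workaround I'm considering: look up the Avro schema from the Schema Registry myself, convert the GenericRecords to Beam Rows, and attach the Row schema explicitly so that hasSchema() becomes true before the BigQuery write. The helper name toRowsWithSchema is just for illustration.

    import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
    import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.schemas.Schema;
    import org.apache.beam.sdk.schemas.utils.AvroUtils;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.Row;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Untested sketch: give the PCollection<GenericRecord> an explicit Beam schema
    // so that BigQueryIO's useBeamSchema() precondition (hasSchema() == true) passes.
    static PCollection<Row> toRowsWithSchema(
            PCollection<GenericRecord> records, String schemaRegUrl, String subj) throws Exception {
        // Fetch the latest Avro schema registered for the subject (usually "<topic>-value").
        CachedSchemaRegistryClient registryClient = new CachedSchemaRegistryClient(schemaRegUrl, 100);
        SchemaMetadata latest = registryClient.getLatestSchemaMetadata(subj);
        org.apache.avro.Schema avroSchema = new org.apache.avro.Schema.Parser().parse(latest.getSchema());
        Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);

        // Convert each GenericRecord to a Row and attach the Row schema to the output.
        return records
                .apply("To Rows", MapElements
                        .into(TypeDescriptor.of(Row.class))
                        .via((GenericRecord record) -> AvroUtils.toBeamRowStrict(record, beamSchema)))
                .setRowSchema(beamSchema);
    }

The write would then be BigQueryIO.<Row>write().useBeamSchema() on the converted collection, but I haven't verified this end to end.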
  • title seems misleading, you're using Kafka Schema Registry, right ? – vdolez Jul 01 '20 at 11:48
  • Is the subject registered with a schema? ``` https://docs.confluent.io/current/schema-registry/develop/api.html#subjects ``` the subject by default will be 'topic-name'-'value' – autodidacticon Jul 02 '20 at 16:39
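
Regarding the comment about subject registration: a quick, untested sketch of how I'd check what is actually registered, using the Confluent client (the registry URL below is a placeholder; "beam-in-value" is the default subject name for the topic above):

    import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
    import io.confluent.kafka.schemaregistry.client.SchemaMetadata;

    public class CheckSubject {
        public static void main(String[] args) throws Exception {
            // Placeholder URL -- replace with the actual Schema Registry endpoint.
            String schemaRegUrl = "http://schema-registry:8081";
            CachedSchemaRegistryClient client = new CachedSchemaRegistryClient(schemaRegUrl, 100);

            // List every subject the registry knows about.
            System.out.println("Registered subjects: " + client.getAllSubjects());

            // Default subject for a topic's value is "<topic>-value".
            SchemaMetadata latest = client.getLatestSchemaMetadata("beam-in-value");
            System.out.println("Latest schema for beam-in-value: " + latest.getSchema());
        }
    }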

0 Answers