I have a typical KafkaIO-based source that reads Avro-formatted keys and values from a Kafka topic.
PCollection<KafkaRecord<GenericRecord, GenericRecord>> records =
    pipeline.apply(
        "Read from Kafka",
        KafkaIO.<GenericRecord, GenericRecord>read()
            .withConsumerFactoryFn(new SslConsumerFactoryFn(sslConfig))
            .withConsumerConfigUpdates(ImmutableMap.copyOf(consumerConfigUpdates))
            .withBootstrapServers(options.getBootstrapServer())
            .withTopics(topicsList)
            .withKeyDeserializer(
                ConfluentSchemaRegistryDeserializerProvider.of(
                    options.getSchemaRegistryUrl(),
                    options.getInputTopic() + "-key"))
            .withValueDeserializer(
                ConfluentSchemaRegistryDeserializerProvider.of(
                    options.getSchemaRegistryUrl(),
                    options.getInputTopic() + "-value")));
The issue is that the value can be null for some of the events, but the Avro schema doesn't allow null values, so deserialization throws an exception and halts the pipeline. I'm fine with logging the keys of the null-value events and moving on with only the valid records in my pipeline.
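To be concrete, the value schema is a plain record with required fields and no union with null, along these lines (the field names here are only illustrative, not my real schema):

// Illustrative value schema only - a record with required fields and no
// ["null", ...] union, so a null Kafka payload cannot be represented.
Schema valueAvroSchema = new Schema.Parser().parse(
    "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
        + "{\"name\": \"id\", \"type\": \"string\"},"
        + "{\"name\": \"amount\", \"type\": \"double\"}]}");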
I figured I could skip deserialization at the source, filter out the records with null values, and then deserialize only the non-null records.
pipeline
    .apply(
        "Read from Kafka",
        KafkaIO.<byte[], byte[]>read()
            .withConsumerFactoryFn(new SslConsumerFactoryFn(sslConfig))
            .withConsumerConfigUpdates(ImmutableMap.copyOf(consumerConfigUpdates))
            .withBootstrapServers(options.getBootstrapServer())
            .withTopics(topicsList)
            .withKeyDeserializerAndCoder(ByteArrayDeserializer.class, ByteArrayCoder.of())
            .withValueDeserializerAndCoder(ByteArrayDeserializer.class, ByteArrayCoder.of()))
    .apply(
        "filter non-null payloads",
        ParDo.of(new FilterNonNullPayloads(nullPayloadTag))
            .withOutputTags(deserializedAvroOutputTag, TupleTagList.of(nullPayloadTag)))
    .get(deserializedAvroOutputTag)
    .apply(ParDo.of(new DeserializeKeyValueToGenericRecord(keyAvroSchema, valueAvroSchema)))
    .setCoder(KvCoder.of(AvroGenericCoder.of(keyAvroSchema), AvroGenericCoder.of(valueAvroSchema)))
    .apply(
        MapElements
            .into(TypeDescriptors.strings())
            .via(e -> e.toString()))
    .apply(new ConsoleWriter<String>());
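For completeness, the two TupleTags and FilterNonNullPayloads are roughly as follows (a simplified sketch; the real DoFn also logs the keys of the records routed to the null-payload side output):

// Tags used in withOutputTags above.
TupleTag<KafkaRecord<byte[], byte[]>> deserializedAvroOutputTag =
    new TupleTag<KafkaRecord<byte[], byte[]>>() {};
TupleTag<KafkaRecord<byte[], byte[]>> nullPayloadTag =
    new TupleTag<KafkaRecord<byte[], byte[]>>() {};

public class FilterNonNullPayloads
    extends DoFn<KafkaRecord<byte[], byte[]>, KafkaRecord<byte[], byte[]>> {

  private final TupleTag<KafkaRecord<byte[], byte[]>> nullPayloadTag;

  public FilterNonNullPayloads(TupleTag<KafkaRecord<byte[], byte[]>> nullPayloadTag) {
    this.nullPayloadTag = nullPayloadTag;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    KafkaRecord<byte[], byte[]> record = c.element();
    if (record.getKV().getValue() == null) {
      // Null payload: route the record to the side output (and log its key).
      c.output(nullPayloadTag, record);
    } else {
      // Non-null payload: pass through on the main output.
      c.output(record);
    }
  }
}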
The DeserializeKeyValueToGenericRecord is a DoFn as follows:
public class DeserializeKeyValueToGenericRecord
    extends DoFn<KafkaRecord<byte[], byte[]>, KV<GenericRecord, GenericRecord>>
    implements Serializable {

  private static final Logger LOG = LoggerFactory.getLogger(DeserializeKeyValueToGenericRecord.class);

  private final Schema keySchema;
  private final Schema valueSchema;

  public DeserializeKeyValueToGenericRecord(Schema keySchema, Schema valueSchema) {
    this.keySchema = keySchema;
    this.valueSchema = valueSchema;
  }

  @ProcessElement
  public void processElement(
      @Element KafkaRecord<byte[], byte[]> input,
      OutputReceiver<KV<GenericRecord, GenericRecord>> receiver) {
    GenericRecord genericRecordKey = new GenericData.Record(keySchema);
    GenericRecord genericRecordValue = new GenericData.Record(valueSchema);
    // Decode the key bytes into a GenericRecord.
    try {
      org.apache.avro.io.DatumReader<GenericRecord> keyReader =
          new org.apache.avro.generic.GenericDatumReader<>(keySchema);
      org.apache.avro.io.Decoder decoder =
          org.apache.avro.io.DecoderFactory.get().binaryDecoder(input.getKV().getKey(), null);
      genericRecordKey = keyReader.read(null, decoder);
    } catch (Exception e) {
      LOG.error("Key issue - " + e.getMessage());
    }
    // Decode the value bytes into a GenericRecord.
    try {
      org.apache.avro.io.DatumReader<GenericRecord> valueReader =
          new org.apache.avro.generic.GenericDatumReader<>(valueSchema);
      org.apache.avro.io.Decoder valueDecoder =
          org.apache.avro.io.DecoderFactory.get().binaryDecoder(input.getKV().getValue(), null);
      genericRecordValue = valueReader.read(null, valueDecoder);
    } catch (Exception e) {
      LOG.error("Value issue - " + e.getMessage() + " for key " + genericRecordKey.toString());
    }
    receiver.output(KV.of(genericRecordKey, genericRecordValue));
  }
}
The .setCoder(KvCoder.of(AvroGenericCoder.of(keyAvroSchema), AvroGenericCoder.of(valueAvroSchema))) call throws an exception, presumably because I'm not decoding the byte[] properly. I don't fully understand how I should be deserializing the input, or what role setCoder plays here.