Using samples from various sources, I have written the method below (relevant section shown) that pulls parquet-avro messages from Kafka for a test application. Based on the code I could find to get this working (some of which is from http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html), I am using a passed-in schema instead of a schema extracted from the messages themselves. Am I missing something, or could I be extracting the schema from each message instead of needing to pass it in? I am new to all this, so I want to be sure I am doing this the best possible way.
import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.commons.collections.CollectionUtils;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
....
public List<String> awaitAndConsumeParquet(String topic, List<String> fieldValues,
        Schema avroSchema, String field, int minutesTimeout) throws InterruptedException {
    KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList(topic));
    // The injection depends only on the schema, so build it once rather than per record.
    Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(avroSchema);
    List<String> foundValues = new ArrayList<>();
    long startTime = System.currentTimeMillis();
    long elapsedTime;
    while (true) {
        ConsumerRecords<String, byte[]> consumerRecords = consumer.poll(Duration.ofMillis(1000));
        for (ConsumerRecord<String, byte[]> record : consumerRecords) {
            GenericRecord genericRecord = recordInjection.invert(record.value()).get();
            String k = genericRecord.get(field).toString();
            if (fieldValues.contains(k)) {
                foundValues.add(k);
            }
        }
        consumer.commitAsync();
...
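For context on the schema question: `GenericAvroCodecs.toBinary(schema)` wraps plain Avro binary encoding, and plain Avro binary bytes carry only field data, not the schema, so the reader has to supply one. The round trip below sketches the equivalent decode using Avro's own `DatumReader`/`DatumWriter`; the `Demo` schema and `known_item_key` field value are hypothetical, chosen only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTrip {
    // Hypothetical schema, used only for this illustration.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Demo\",\"fields\":"
      + "[{\"name\":\"known_item_key\",\"type\":\"string\"}]}";

    static String roundTrip(String value) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Encode: the resulting bytes contain only the field data, no schema.
        GenericRecord out = new GenericData.Record(schema);
        out.put("known_item_key", value);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(baos, null);
        new GenericDatumWriter<GenericRecord>(schema).write(out, encoder);
        encoder.flush();
        byte[] bytes = baos.toByteArray();

        // Decode: the reader must already know the schema, exactly as with
        // bijection's GenericAvroCodecs.toBinary(schema).
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericRecord in = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        return in.get("known_item_key").toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("abc"));
    }
}
```

If the producer embedded a schema per message (e.g. Avro container-file framing or a schema-registry wire format), the schema could be recovered from the message itself; with bare binary encoding as here, it cannot.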
Caller:
....
TestConsumer tc = new TestConsumer();
tc.setBootstrapServers("localhost:9092");
tc.setKeyDeserializer("org.apache.kafka.common.serialization.StringDeserializer");
tc.setValueDeserializer("org.apache.kafka.common.serialization.ByteArrayDeserializer");
tc.setGroupId("consumerGroup1");
// await expected data - provide jsonpath to use to query expected strings from json
List<String> notFoundMessages = tc.awaitAndConsumeParquet("demo", known_items_list, avroSchema, "known_item_key", 1);
...