
Using samples from various sources, I have written this method (relevant section shown below) that pulls parquet-avro messages from Kafka for a test application. Based on the code I could find to get this working (some of which is from http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html), I am using a passed-in schema instead of a schema extracted from the messages themselves. Am I missing something, or could I be extracting the schema from each message instead of needing to pass it in? I am new to all this, so I want to be sure I am doing this the best possible way.


import com.twitter.bijection.Injection;
import com.twitter.bijection.avro.GenericAvroCodecs;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.commons.collections.CollectionUtils;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

....

  public List<String> awaitAndConsumeParquet(String topic, List<String> fieldValues, Schema avroSchema, String field, int minutesTimeout)
            throws InterruptedException {

        KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList(topic));

        List<String> foundValues = new ArrayList<>();

        // The injection depends only on the schema, so build it once rather than per record.
        Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(avroSchema);

        long startTime = System.currentTimeMillis();
        long elapsedTime;
        while (true) {

            ConsumerRecords<String, byte[]> consumerRecords = consumer.poll(Duration.ofMillis(1000));

            for (ConsumerRecord<String, byte[]> record : consumerRecords) {

                // Decode the binary Avro payload back into a GenericRecord using the passed-in schema.
                GenericRecord genericRecord = recordInjection.invert(record.value()).get();
                String k = genericRecord.get(field).toString();
                if (fieldValues.contains(k)) {
                    foundValues.add(k);
                }
            }

            consumer.commitAsync();

...

Caller:


....
        TestConsumer tc = new TestConsumer();
        tc.setBootstrapServers("localhost:9092");
        tc.setKeyDeserializer("org.apache.kafka.common.serialization.StringDeserializer");
        tc.setValueDeserializer("org.apache.kafka.common.serialization.ByteArrayDeserializer");
        tc.setGroupId("consumerGroup1");


        // await expected data - check that each expected value shows up in the given Avro field
        List<String> notFoundMessages = tc.awaitAndConsumeParquet("demo", known_items_list, avroSchema, "known_item_key", 1);
...
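
For completeness, here is a sketch of the consumer properties those setters are assumed to build up. The mapping from TestConsumer's setters onto the standard ConsumerConfig keys is my assumption; the keys themselves are standard Kafka client configuration:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

        // Assumed translation of the TestConsumer setters above into consumer config.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "consumerGroup1");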
chrismead
  • why use parquet? – thebluephantom Sep 26 '19 at 18:25
    Looks like you're just getting regular Avro messages here. And you have a while (true) loop, so how are you returning a list to the calling method? Plus, storing every message into a list without clearing it is just asking for OOM exceptions – OneCricketeer Sep 27 '19 at 00:34
  • @thebluephantom using parquet is not my choice. I am just working on a test app. – chrismead Sep 27 '19 at 14:35
  • @cricket_007 perhaps i chopped too much off when providing the excerpt. I build up the list for a period of time, then check to see that all expected messages are present. Thanks for the feedback about regular avro and OOM. I will look into this. – chrismead Sep 27 '19 at 14:35
  • @cricket_007 those were just avro and not parquet! doh. thanks again for your comment – chrismead Sep 27 '19 at 17:28
  • As far as getting the schema from the message, I do know it is _possible_, but not with the Bijection library. It requires manually scanning a ByteArrayInputStream, I think, then getting & setting the "reader schema" parameter as the "writer schema" that is embedded within the event – OneCricketeer Sep 27 '19 at 18:24
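
To make that last comment concrete: below is a minimal sketch of reading the writer schema embedded in a message payload. This only works if each message was produced as an Avro data file (container format); plain binary-encoded Avro, which is what Bijection's toBinary produces, does not embed the schema at all, so it would not apply to the code above as written. The helper name is hypothetical:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.ByteArrayInputStream;
import java.io.IOException;

    // Hypothetical helper: decodes one record from record.value() and uses the
    // writer schema embedded in the container-file header, so no schema is passed in.
    static GenericRecord decodeWithEmbeddedSchema(byte[] payload) throws IOException {
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileStream<GenericRecord> stream =
                     new DataFileStream<>(new ByteArrayInputStream(payload), datumReader)) {
            Schema writerSchema = stream.getSchema(); // schema embedded in the header
            System.out.println("writer schema: " + writerSchema);
            return stream.hasNext() ? stream.next() : null;
        }
    }

If the messages are plain binary-encoded Avro, the usual alternatives are passing the schema in (as the code above does) or looking it up from a schema registry.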

0 Answers