
Before sending an Avro GenericRecord to Kafka, a header is inserted like so:

    ProducerRecord<String, byte[]> record = new ProducerRecord<>(topicName, key, message);
    record.headers().add("schema", schema);
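
For context, `message` here is the Avro-encoded payload and `schema` is the writer schema serialized to bytes. A minimal sketch of how those two values might be produced (the `avroRecord` variable is illustrative, not part of the original code):

    // Illustrative: encode the GenericRecord to bytes and capture its schema as JSON bytes.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(avroRecord.getSchema());
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(avroRecord, encoder);
    encoder.flush();
    byte[] message = out.toByteArray();
    byte[] schema = avroRecord.getSchema().toString().getBytes(StandardCharsets.UTF_8);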

Consuming the record:

When using Spark Streaming (DStreams), the header from the ConsumerRecord is intact:

    KafkaUtils.createDirectStream(streamingContext, LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe(topics, kafkaParams)).foreachRDD(rdd -> {
          rdd.foreach(record -> {
            System.out.println(new String(record.headers().headers("schema").iterator().next().value()));
          });
        });

But when using Spark SQL Streaming, the header seems to be missing.

    StreamingQuery query = dataset.writeStream().foreach(new ForeachWriter<Row>() {

      ...

      @Override
      public void process(Row row) {
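        // Row columns follow the Kafka source schema order: key, value, topic, partition, offset, timestamp, timestampType.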
        String topic = (String) row.get(2);
        int partition = (int) row.get(3);
        long offset = (long) row.get(4);
        String key = new String((byte[]) row.get(0));
        byte[] value = (byte[]) row.get(1);

        ConsumerRecord<String, byte[]> record = new ConsumerRecord<String, byte[]>(topic, partition, offset, key,
            value);

        //I need the schema to decode the Avro!

      }
    }).start();

Where can I find the custom header value when using the Spark SQL Streaming approach?

Version:

    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>

UPDATE

I tried 3.0.0-preview2 of spark-sql_2.12 and spark-sql-kafka-0-10_2.12. I added

    .option("includeHeaders", true)

But I still only get these columns from the Row:

    +---+-----+-----+---------+------+---------+-------------+
    |key|value|topic|partition|offset|timestamp|timestampType|
    +---+-----+-----+---------+------+---------+-------------+
Dale Angus

1 Answer


Kafka headers in Structured Streaming are supported only from 3.0: https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html Please look for includeHeaders for more details.
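
A minimal sketch of how that might look in Java, assuming Spark 3.0+ with spark-sql-kafka-0-10 on the classpath; the broker address, the variable names (`spark`, `topicName`), and the Avro-decoding part are assumptions based on the question, not taken from the Spark docs:

    Dataset<Row> dataset = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker address
        .option("subscribe", topicName)
        .option("includeHeaders", true)
        .load();

    StreamingQuery query = dataset.writeStream().foreach(new ForeachWriter<Row>() {
      @Override
      public boolean open(long partitionId, long epochId) { return true; }

      @Override
      public void close(Throwable errorOrNull) { }

      @Override
      public void process(Row row) {
        byte[] value = (byte[]) row.getAs("value");
        // With includeHeaders enabled, "headers" is an array<struct<key:string, value:binary>>.
        List<Row> headers = row.getList(row.fieldIndex("headers"));
        byte[] schemaBytes = null;
        for (Row header : headers) {
          if ("schema".equals(header.getString(0))) {
            schemaBytes = (byte[]) header.get(1);
          }
        }
        if (schemaBytes == null || value == null) {
          return;
        }
        try {
          // Parse the schema carried in the header and decode the Avro payload.
          Schema schema = new Schema.Parser().parse(new String(schemaBytes, StandardCharsets.UTF_8));
          DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
          BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(value, null);
          GenericRecord avroRecord = reader.read(null, decoder);
          System.out.println(avroRecord);
        } catch (IOException e) {
          throw new RuntimeException(e);
        }
      }
    }).start();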

Gabor Somogyi
  • As of 3.0.0-preview2, the Row schema doesn't include the headers yet? I get Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`headers`' given input columns: [offset, value, topic, timestamp, timestampType, partition, key]; – Dale Angus Jun 10 '20 at 04:50
  • Please see the code: https://github.com/apache/spark/blob/bcadd5c3096109878fe26fb0d57a9b7d6fdaa257/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaRecordToRowConverter.scala#L91 – Gabor Somogyi Jun 10 '20 at 07:27
  • One must set `includeHeaders`, otherwise it won't work. – Gabor Somogyi Jun 10 '20 at 07:27
  • Hmm, as I see you've already added `includeHeaders`. I'll take a deeper look at it and fix it if there is a bug inside... – Gabor Somogyi Jun 10 '20 at 07:34
  • Not sure what the issue is on your side, but it's working flawlessly here: https://github.com/gaborgsomogyi/spark/blob/f05e46e7d5362cfe238072d685933a676848b1e8/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchSourceSuite.scala#L1655 When I executed the test, the following entry was produced: `aaaaaaaaaaaa: WrappedArray([a,[B@3fcf9f74], [c,[B@2bd9204b])` Please analyze the difference between your application and the mentioned Spark code. – Gabor Somogyi Jun 10 '20 at 11:42
  • Got it to work! But I encountered the bug described here: https://issues.apache.org/jira/browse/SPARK-30495 – Dale Angus Jun 10 '20 at 14:38