Is it possible to read protobuf messages from Kafka using Spark Structured Streaming?
1 Answer
Approach 1
// Register a UDF that turns the protobuf bytes in the Kafka value column
// into a struct matching the target schema.
sparkSession.udf().register("deserialize", getDeserializer(), schema);

DataStreamReader dataStreamReader = sparkSession.readStream().format("kafka");
for (Map.Entry<String, String> kafkaPropEntry : kafkaProps.entrySet()) {
    dataStreamReader.option(kafkaPropEntry.getKey(), kafkaPropEntry.getValue());
}

Dataset<Row> kafkaRecords = dataStreamReader.load()
    .selectExpr("deserialize(value) as event")
    .select("event.*");
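getDeserializer() is not shown in the answer; the sketch below is one plausible shape for the UDF it could return, assuming an illustrative generated protobuf class MyEvent with id and name fields (the class and field names are assumptions, not part of the original code). The UDF receives the raw Kafka value as byte[] and returns a Row matching the schema passed to register:

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.api.java.UDF1;

// Parses the Kafka value bytes with the generated protobuf class and builds
// a Row whose fields line up with the StructType used at registration time.
public class ProtoDeserializer implements UDF1<byte[], Row> {
    @Override
    public Row call(byte[] value) throws Exception {
        MyEvent event = MyEvent.parseFrom(value); // MyEvent is an assumed protobuf class
        return RowFactory.create(event.getId(), event.getName()); // field list is illustrative
    }
}

// Registration, mirroring the first line of Approach 1:
// sparkSession.udf().register("deserialize", new ProtoDeserializer(), schema);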
Approach 2
final StructType schema = getSchema();

DataStreamReader dataStreamReader = sparkSession.readStream().format("kafka");
for (Map.Entry<String, String> kafkaPropEntry : kafkaProps.entrySet()) {
    dataStreamReader.option(kafkaPropEntry.getKey(), kafkaPropEntry.getValue());
}

// Map each record's value bytes straight to a Row; the cast to MapFunction
// disambiguates the Java overload of map.
Dataset<Row> kafkaRecords = dataStreamReader.load()
    .map((MapFunction<Row, Row>) row -> getOutputRow((byte[]) row.get(VALUE_INDEX)),
         RowEncoder.apply(schema));
Approach 1 has one flaw: the deserialize UDF is invoked once per column of event, i.e. multiple times per record (see https://issues.apache.org/jira/browse/SPARK-17728). Approach 2 avoids this by mapping the protobuf bytes to a Row directly with the map method; a sketch of the helpers it relies on follows.
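getSchema() and getOutputRow() are likewise not shown in the answer; below is a hedged sketch of what they might look like, again assuming the illustrative MyEvent protobuf class with id and name fields. getSchema() mirrors the message's fields as a StructType, and getOutputRow() parses the raw value bytes and copies the fields into a Row in the same order:

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// StructType mirroring the protobuf message's fields (names and types are illustrative).
static StructType getSchema() {
    return new StructType()
            .add("id", DataTypes.LongType)
            .add("name", DataTypes.StringType);
}

// Parses the raw Kafka value and builds a Row in the same field order as getSchema().
static Row getOutputRow(byte[] value) {
    try {
        MyEvent event = MyEvent.parseFrom(value); // MyEvent is an assumed protobuf class
        return RowFactory.create(event.getId(), event.getName());
    } catch (com.google.protobuf.InvalidProtocolBufferException e) {
        throw new RuntimeException("Failed to parse protobuf record", e);
    }
}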

Niket Arora
What is this `getOutputRow()`? – Steephen Sep 22 '20 at 23:59