
I'm using Spark Structured Streaming with a Kafka source and Avro-formatted messages, and the creation of the DataFrame is very slow!

In order to measure the streaming query I have to add an action that evaluates the DAG and lets me time it. With foreachBatch() I persist and count the DataFrame; another way is to write out a file and read the time from the Spark UI. The drawback of both is that I can't isolate the transformation time, because the Kafka read time, the transformation, and the write/persist time all end up in the same stage.
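
For reference, a minimal sketch of the foreachBatch() measurement described above (the function name `timed_batch` and the two-count trick are illustrative, not the exact code used):

```python
import time

def timed_batch(batch_df, batch_id):
    batch_df.persist()
    t0 = time.time()
    n = batch_df.count()   # 1st count: Kafka read + from_avro + flatten + cache write
    t1 = time.time()
    batch_df.count()       # 2nd count: served from the cache only
    t2 = time.time()
    print('batch %d: %d rows, first count %.2fs, cached count %.2fs'
          % (batch_id, n, t1 - t0, t2 - t1))
    batch_df.unpersist()

query = streaming_df.writeStream.foreachBatch(timed_batch).start()
```

Even here the first count bundles the Kafka read and the transformation into one stage, which is exactly the isolation problem.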

NOTE: Implemented in both Scala (2.11) and Python (2.7.3), with the same results.

The implementation is based on the Spark 2.4.3 Avro data source documentation (https://spark.apache.org/docs/latest/sql-data-sources-avro.html). The problem is that the `.select('avro_struct.*')` step is slow.

streaming_df = (streaming_df
    # deserialize the Kafka 'value' bytes using the Avro schema
    .select(self._from_avro('value', json.dumps(avro_schema_string))
            .alias('avro_struct'))
    # flatten the deserialized struct into top-level columns (the slow step)
    .select(F.col('avro_struct.*')))
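
Spark 2.4 only ships `from_avro` for Scala/Java, so `self._from_avro` is presumably a wrapper that calls the Scala function through the JVM gateway, along these lines (an assumption about the helper, not code from the question; requires `--packages org.apache.spark:spark-avro_2.11:2.4.3` on the classpath):

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def from_avro(col, json_format_schema):
    # from_avro lives in the Scala package object org.apache.spark.sql.avro,
    # which is not exposed in PySpark 2.4, so we reach it via py4j.
    sc = SparkContext._active_spark_context
    avro_pkg = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro_pkg, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), json_format_schema))
```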

Dependencies:

  • Spark 2.4.3
  • Kafka 2.11-2.10
  • Scala 2.11.8
    Why don't you use [Monitoring Streaming Queries](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries)? You could also use `ProcessingTime.Once` to measure execution time of a single batch. – Jacek Laskowski Aug 12 '19 at 16:45
  • This `json.dumps(avro_schema_string)` made me scratch my head. Why would one transform a string (not object) into a JSON string again? I hope it's a typo somewhere. – 9000 Aug 12 '19 at 17:03
  • @9000 you're right, `avro_schema_string` is a confusing name because it's actually a dictionary/JSON object; only after `json.dumps(avro_schema_string)` does it become a string. – ggeop Aug 13 '19 at 06:55
  • @JacekLaskowski yes, `ProcessingTime.Once` is a good choice for executing a single batch, but it's just another trigger mode; it doesn't affect the execution time or solve my problem with `.select('avro_struct.*')`. – ggeop Aug 13 '19 at 08:15
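
Following the suggestion in the comments, a sketch of what Monitoring Streaming Queries reports with a single-batch trigger (the `memory` sink and the query name are illustrative choices for benchmarking):

```python
query = (streaming_df.writeStream
         .trigger(once=True)      # run exactly one micro-batch, then stop
         .format('memory')        # lightweight sink for benchmarking
         .queryName('avro_bench')
         .start())
query.awaitTermination()

# durationMs breaks the trigger into phases (getBatch, queryPlanning,
# addBatch, walCommit, ...), but addBatch still lumps the read, the
# transformation, and the write together -- the same isolation problem.
print(query.lastProgress['durationMs'])
```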

0 Answers