
I'm using Spark Structured Streaming with a Kafka source and Avro-formatted messages, and the creation of the DataFrame is very slow!

In order to measure the streaming query I have to add an action that evaluates the DAG and lets me time it. With foreachBatch() I persist and count the DataFrame; another way is to write out a file and read the time from the Spark UI. The drawback of both is that I can't isolate the transformation time, because the Kafka read time, the transformation, and the write/persist time all end up in the same stage.
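
For reference, a minimal sketch of the foreachBatch() measurement described above (the function name `timed_batch` and the two-count trick are illustrative, not the exact code used):

```python
import time

def timed_batch(batch_df, batch_id):
    batch_df.persist()
    t0 = time.time()
    n = batch_df.count()   # 1st count: Kafka read + from_avro + flatten + cache write
    t1 = time.time()
    batch_df.count()       # 2nd count: served from the cache only
    t2 = time.time()
    print('batch %d: %d rows, first count %.2fs, cached count %.2fs'
          % (batch_id, n, t1 - t0, t2 - t1))
    batch_df.unpersist()

query = streaming_df.writeStream.foreachBatch(timed_batch).start()
```

Even here the first count bundles the Kafka read and the transformation into one stage, which is exactly the isolation problem.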

NOTE: Implemented in both Scala (2.11) and Python (2.7.3), with the same results.

The implementation is based on the Spark 2.4.3 Avro data source documentation (https://spark.apache.org/docs/latest/sql-data-sources-avro.html). The problem is that the `.select('avro_struct.*')` step is slow.

streaming_df = (streaming_df
    # deserialize the Kafka 'value' bytes using the Avro schema
    .select(self._from_avro('value', json.dumps(avro_schema_string))
            .alias('avro_struct'))
    # flatten the deserialized struct into top-level columns (the slow step)
    .select(F.col('avro_struct.*')))
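
Spark 2.4 only ships `from_avro` for Scala/Java, so `self._from_avro` is presumably a wrapper that calls the Scala function through the JVM gateway, along these lines (an assumption about the helper, not code from the question; requires `--packages org.apache.spark:spark-avro_2.11:2.4.3` on the classpath):

```python
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def from_avro(col, json_format_schema):
    # from_avro lives in the Scala package object org.apache.spark.sql.avro,
    # which is not exposed in PySpark 2.4, so we reach it via py4j.
    sc = SparkContext._active_spark_context
    avro_pkg = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro_pkg, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), json_format_schema))
```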

Dependencies:

  • Spark 2.4.3
  • Kafka 2.11-2.10
  • Scala 2.11.8
    Why don't you use [Monitoring Streaming Queries](http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries)? You could also use `ProcessingTime.Once` to measure execution time of a single batch. – Jacek Laskowski Aug 12 '19 at 16:45
  • This `json.dumps(avro_schema_string)` made me scratch my head. Why would one transform a string (not object) into a JSON string again? I hope it's a typo somewhere. – 9000 Aug 12 '19 at 17:03
  • @9000 you're right, `avro_schema_string` is a confusing name because it's actually a dictionary/JSON object; only after `json.dumps(avro_schema_string)` does it become a string. – ggeop Aug 13 '19 at 06:55
  • @JacekLaskowski yes, `ProcessingTime.Once` is a good choice for executing a single batch, but it's just another trigger mode; it doesn't affect the execution time or solve my problem with `.select('avro_struct.*')`. – ggeop Aug 13 '19 at 08:15
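
Following the suggestion in the comments, a sketch of what Monitoring Streaming Queries reports with a single-batch trigger (the `memory` sink and the query name are illustrative choices for benchmarking):

```python
query = (streaming_df.writeStream
         .trigger(once=True)      # run exactly one micro-batch, then stop
         .format('memory')        # lightweight sink for benchmarking
         .queryName('avro_bench')
         .start())
query.awaitTermination()

# durationMs breaks the trigger into phases (getBatch, queryPlanning,
# addBatch, walCommit, ...), but addBatch still lumps the read, the
# transformation, and the write together -- the same isolation problem.
print(query.lastProgress['durationMs'])
```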

0 Answers